[OPENAM-6039] Asynchronous queue for OAuth2 Tokens can result in token validation failures Created: 22/May/15  Updated: 20/Nov/16  Resolved: 14/Jul/15

Status: Resolved
Project: OpenAM
Component/s: oauth2
Affects Version/s: 12.0.0
Fix Version/s: 12.0.2, 12.0.3, 13.0.0

Type: Bug
Priority: Major
Reporter: Matt Miller [X] (Inactive)
Assignee: Peter Major [X] (Inactive)
Resolution: Fixed
Votes: 1
Labels: CustomerRFE, EDISON, release-notes
Remaining Estimate: 0h
Time Spent: 13h
Original Estimate: 0h

Attachments: File oauth2.jmx    
Sprint: Sprint 83 - Sustaining, Sprint 84 - Sustaining
Support Ticket IDs:
Verified Version/s:

 Description   

The CTS token store internally uses blocking queues to distribute tasks to the task processors. On operations such as Create, the token store queues the request and returns control to the calling method. This implementation does not work in the following scenario:

If a Create request is executed on server 1 and a read/update request for the same token is made on server 2, it is possible that the request on server 2 is processed before the request on server 1.
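
The ordering problem can be reproduced in a toy model. The following is a minimal single-threaded sketch, not OpenAM's CTS code: `QueuedTokenStore`, `processPending` and the shared map are hypothetical stand-ins for a server's task queue, its task processor and the replicated backing store.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Toy model of one server's view of a queued token store (hypothetical names). */
class QueuedTokenStore {
    private final Map<String, String> sharedStore;           // stands in for the shared backing store
    private final Queue<Runnable> pending = new ArrayDeque<>();

    QueuedTokenStore(Map<String, String> sharedStore) {
        this.sharedStore = sharedStore;
    }

    /** Queues the write and returns immediately; nothing is persisted yet. */
    void create(String tokenId, String value) {
        pending.add(() -> sharedStore.put(tokenId, value));
    }

    /** Reads go straight to the shared store and do not see queued writes. */
    String read(String tokenId) {
        return sharedStore.get(tokenId);
    }

    /** Stands in for the task processor draining this server's queue. */
    void processPending() {
        Runnable task;
        while ((task = pending.poll()) != null) {
            task.run();
        }
    }
}

class RaceDemo {
    public static void main(String[] args) {
        Map<String, String> shared = new HashMap<>();
        QueuedTokenStore server1 = new QueuedTokenStore(shared);
        QueuedTokenStore server2 = new QueuedTokenStore(shared);

        server1.create("token-1", "access-token");            // caller already has its response
        System.out.println(server2.read("token-1"));          // null: the write is still queued
        server1.processPending();
        System.out.println(server2.read("token-1"));          // now visible to server 2
    }
}
```

In this model the client can receive its token from server 1 and present it to server 2 before `processPending` has run, which is the failure window described above.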

Our usage of OAuth involves two steps: 1) get an access token and 2) use the access token to authenticate users. With OpenAM 11 we saw close to zero failures; with OpenAM 12 we are observing close to 1.2-1.3% authentication failures due to the problem described above.

We can't use affinity in this case, as no session token is used in either of the OAuth calls. One fix could be to wait for the operation to be completed by CTS before returning the response to the client, but doing so with the current implementation simply negates the purpose of having queues in the first place. Given that, the correct fix in my opinion would be to replace the async CTS with a sync CTS.



 Comments   
Comment by Andy Hall [ 18/Jun/15 ]

This was a known limitation of the design, alleviated in web scenarios by server affinity in load balancers.

Potential approaches to resolve this include:

  • offer a synchronous create token API - this would wait until the token is persisted, but still relies on replication happening in time.
  • investigate options such as affinity for OAuth2 which would resolve the issue completely.
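
The first approach could look roughly like the sketch below. It is an illustration under assumptions, not the change that was eventually committed: `SyncCreateSketch`, `createAsync` and `createSync` are hypothetical names, and an in-memory map stands in for the persistence layer. As noted in the list, waiting for the local write says nothing about replication to other directory servers.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Sketch of a blocking create layered on top of an asynchronous task queue. */
class SyncCreateSketch {
    private final Map<String, String> store = new ConcurrentHashMap<>();

    // Single worker standing in for a task processor; daemon so it never blocks JVM exit.
    private final ExecutorService taskProcessor = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "task-processor");
        t.setDaemon(true);
        return t;
    });

    /** Queues the write; the future completes once the write has been applied. */
    CompletableFuture<Void> createAsync(String tokenId, String value) {
        return CompletableFuture.runAsync(() -> store.put(tokenId, value), taskProcessor);
    }

    /** Blocks the caller until the queued write is applied or the timeout expires. */
    void createSync(String tokenId, String value, long timeoutMillis) {
        try {
            createAsync(tokenId, value).get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while persisting token", e);
        } catch (ExecutionException | TimeoutException e) {
            throw new IllegalStateException("Token was not persisted in time", e);
        }
    }

    String read(String tokenId) {
        return store.get(tokenId);
    }
}
```

With `createSync`, the response carrying the token is only sent once the write has been applied locally, at the cost of the request thread waiting on the queue.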

Potential short term deployment workaround:

  • configure the LB in active/passive mode to ensure subsequent token calls go back to the same server.

Comment by Peter Major [X] (Inactive) [ 14/Jul/15 ]

Fixed with R14615 and R14616

Comment by Peter Major [X] (Inactive) [ 15/Jul/15 ]

The one pager for this issue can be found at:
https://docs.google.com/document/d/15g4egaAtZu1mEWE4GtHy9S1sVK0JBE9Y840htcr1oMU/edit

Comment by Richard Hruza [ 12/May/16 ]

Verified with: OpenAM 12.0.3-RC2 Build 4dbe218a05 (2016-April-25 17:57)
WildFly Full 9.0.2.Final (WildFly Core 1.0.2.Final) [-Xms64m -Xmx1536m -XX:MaxPermSize=256m]

Summary

12.0.3 = 0 % errors
12.0.1 = 50 % errors

Details

I executed the following requests against both AMs with 150 threads and 50 loops per thread. The script is attached (oauth2.jmx).

1.)
POST http://amqa-clone73.test.forgerock.com:8080/openam/oauth2/access_token
POST data:
client_id=oauth2&client_secret=password&username=demo&password=changeit&grant_type=password&realm=/hello

2.)
http://amqa-clone73.test.forgerock.com:8080/openam/oauth2/tokeninfo?access_token=21d9e007-ca4c-42a7-a613-6069651eec84&realm=/hello 
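
For readers without JMeter, the two requests can be approximated with the JDK HTTP client (Java 11 or later). This is a hedged sketch, not the attached script: `OAuth2FlowSketch` and its methods are hypothetical, the base URL is a parameter, and the regex-based token extraction is a shortcut that assumes the response is JSON with an `access_token` field. Values are URL-encoded, so the realm is sent as `%2Fhello`.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative client for the two test requests; all names here are assumptions. */
class OAuth2FlowSketch {
    private static final Pattern ACCESS_TOKEN =
            Pattern.compile("\"access_token\"\\s*:\\s*\"([^\"]+)\"");

    private static String enc(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    /** Step 1: password grant against /oauth2/access_token. */
    static HttpRequest accessTokenRequest(String baseUrl, String clientId, String clientSecret,
                                          String username, String password, String realm) {
        String form = "client_id=" + enc(clientId)
                + "&client_secret=" + enc(clientSecret)
                + "&username=" + enc(username)
                + "&password=" + enc(password)
                + "&grant_type=password"
                + "&realm=" + enc(realm);
        return HttpRequest.newBuilder(URI.create(baseUrl + "/oauth2/access_token"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(form))
                .build();
    }

    /** Step 2: look the token up via /oauth2/tokeninfo. */
    static HttpRequest tokenInfoRequest(String baseUrl, String accessToken, String realm) {
        URI uri = URI.create(baseUrl + "/oauth2/tokeninfo?access_token=" + enc(accessToken)
                + "&realm=" + enc(realm));
        return HttpRequest.newBuilder(uri).GET().build();
    }

    /** Regex shortcut for reading access_token; a real client would use a JSON parser. */
    static String extractAccessToken(String json) {
        Matcher m = ACCESS_TOKEN.matcher(json);
        if (!m.find()) {
            throw new IllegalArgumentException("No access_token in response");
        }
        return m.group(1);
    }

    /** Runs both steps back to back and returns the HTTP status of the tokeninfo call. */
    static int issueThenValidate(HttpClient client, String baseUrl) throws Exception {
        HttpResponse<String> issued = client.send(
                accessTokenRequest(baseUrl, "oauth2", "password", "demo", "changeit", "/hello"),
                HttpResponse.BodyHandlers.ofString());
        String token = extractAccessToken(issued.body());
        HttpResponse<String> info = client.send(
                tokenInfoRequest(baseUrl, token, "/hello"),
                HttpResponse.BodyHandlers.ofString());
        return info.statusCode();
    }
}
```

Calling `issueThenValidate(HttpClient.newHttpClient(), "http://amqa-clone73.test.forgerock.com:8080/openam")` would issue the same pair of requests as one JMeter loop iteration, provided the host is reachable.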

Setup:

  • configure OAuth2 in /hello realm
  • create an "OAuth 2.0/OpenID Connect Client" agent with cn scope

OpenAM 12.0.3-RC2 Build 4dbe218a05 (2016-April-25 17:57)
# JVM_ARGS="-Xms1024m -Xmx1024m" /opt/apache-jmeter-2.12/bin/jmeter.sh -n -t /opt/riso/stability-testing/oauth2.jmx 
Creating summariser <summary>
Created the tree successfully using /opt/riso/stability-testing/oauth2.jmx
Starting the test @ Thu May 12 15:55:53 BST 2016 (1463064953730)
Waiting for possible shutdown message on port 4445
summary +    975 in   6.1s =  159.2/s Avg:   810 Min:   234 Max:  1471 Err:     0 (0.00%) Active: 150 Started: 150 Finished: 0
summary +   4612 in  31.1s =  148.2/s Avg:   983 Min:   328 Max:  2332 Err:     0 (0.00%) Active: 150 Started: 150 Finished: 0
summary =   5587 in  36.1s =  154.7/s Avg:   953 Min:   234 Max:  2332 Err:     0 (0.00%)
summary +   6776 in    31s =  219.1/s Avg:   664 Min:   274 Max:  1232 Err:     0 (0.00%) Active: 150 Started: 150 Finished: 0
summary =  12363 in  66.1s =  187.0/s Avg:   794 Min:   234 Max:  2332 Err:     0 (0.00%)
summary +   2637 in  12.2s =  215.6/s Avg:   647 Min:     4 Max:  1085 Err:     0 (0.00%) Active: 0 Started: 150 Finished: 150
summary =  15000 in    78s =  193.3/s Avg:   769 Min:     4 Max:  2332 Err:     0 (0.00%)
Tidying up ...    @ Thu May 12 15:57:11 BST 2016 (1463065031498)
... end of run

OpenAM 12.0.1 Build 14322 (2015-June-22 16:03)
# JVM_ARGS="-Xms1024m -Xmx1024m" /opt/apache-jmeter-2.12/bin/jmeter.sh -n -t /opt/riso/stability-testing/oauth2.jmx 
Creating summariser <summary>
Created the tree successfully using /opt/riso/stability-testing/oauth2.jmx
Starting the test @ Thu May 12 16:07:12 BST 2016 (1463065632960)
Waiting for possible shutdown message on port 4445
summary +   1651 in    17s =   97.6/s Avg:  1383 Min:   459 Max:  2758 Err:   751 (45.49%) Active: 150 Started: 150 Finished: 0
summary +   2700 in  32.4s =   83.5/s Avg:  1686 Min:   435 Max:  3488 Err:  1350 (50.00%) Active: 150 Started: 150 Finished: 0
summary =   4351 in  47.1s =   92.4/s Avg:  1571 Min:   435 Max:  3488 Err:  2101 (48.29%)
summary +   4272 in    32s =  134.6/s Avg:  1068 Min:   339 Max:  2349 Err:  2172 (50.84%) Active: 150 Started: 150 Finished: 0
summary =   8623 in    77s =  112.4/s Avg:  1322 Min:   339 Max:  3488 Err:  4273 (49.55%)
summary +   4939 in  31.2s =  158.1/s Avg:   907 Min:   294 Max:  1583 Err:  2477 (50.15%) Active: 150 Started: 150 Finished: 0
summary =  13562 in   107s =  127.0/s Avg:  1171 Min:   294 Max:  3488 Err:  6750 (49.77%)
summary +   1438 in   9.2s =  157.2/s Avg:   911 Min:   375 Max:  1551 Err:   750 (52.16%) Active: 0 Started: 150 Finished: 150
summary =  15000 in   115s =  130.3/s Avg:  1146 Min:   294 Max:  3488 Err:  7500 (50.00%)
Tidying up ...    @ Thu May 12 16:09:08 BST 2016 (1463065748404)
... end of run
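
The totals in both runs can be cross-checked. Assuming each loop iteration sends exactly the two requests listed under Details, 150 threads × 50 loops × 2 requests gives the 15000 samples reported, and 7500 errors out of 15000 is the 50 % quoted in the summary. The sketch below parses a cumulative `summary =` line and recomputes the rate; `JMeterSummary` is a hypothetical helper, not part of JMeter.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Parses the sample and error counts from a cumulative JMeter "summary =" console line. */
class JMeterSummary {
    private static final Pattern TOTAL_LINE =
            Pattern.compile("^summary =\\s+(\\d+)\\s+in\\s+.*?Err:\\s+(\\d+)\\s+\\(");

    final long samples;
    final long errors;

    private JMeterSummary(long samples, long errors) {
        this.samples = samples;
        this.errors = errors;
    }

    static JMeterSummary parse(String line) {
        Matcher m = TOTAL_LINE.matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("Not a cumulative summary line: " + line);
        }
        return new JMeterSummary(Long.parseLong(m.group(1)), Long.parseLong(m.group(2)));
    }

    /** Error rate in percent, recomputed from the raw counts. */
    double errorPercent() {
        return samples == 0 ? 0.0 : 100.0 * errors / samples;
    }

    /** Samples expected from a thread group that sends a fixed number of requests per loop. */
    static long expectedSamples(int threads, int loops, int requestsPerLoop) {
        return (long) threads * loops * requestsPerLoop;
    }
}
```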

Comment by wwwnitinkumar [ 15/Sep/16 ]

Hi
There is one more potential failure scenario. What if, after node 1 generates an OAuth2 authorization code, the request to fetch the access token lands on node 2 before CTS replication of the authorization code has occurred?
Is this race condition fixed too?
Many thanks
Nitin

Generated at Sat Oct 24 05:47:12 UTC 2020 using Jira 7.13.12#713012-sha1:6e07c38070d5191bbf7353952ed38f111754533a.