Affects Version/s: 5.5.1, 6.0.0, 18.104.22.168, 22.214.171.124, 126.96.36.199, 188.8.131.52, 184.108.40.206, 6.5.0, 220.127.116.11, 5.5.2
Sprint:AM Sustaining Sprint 59
Support Ticket IDs:
Needs QA verification:No
Are the reproduction steps defined?:Yes and I used the same an in the description
The are quite a few customer reports of indicating that the AM logs sometimes have a exceptions when trying to delete some CTS tokens.
The impact is that there is no functional impact since the token is supposed to be deleted anyway. The main impact is the logging of these in ERROR level and each logging entry takes 1.5K.
This might be hard but code analysis indicates that one will need to have a few AM and that that there AM may not have time-synchroized (it is even better if some AM is running faster) and the request to AM is not sticky. The premise of the code analysis indicates that the session appears on two or more AM expiry cache (which means at some point in time) the session has been register to these AM.
It is much possibly quicker if one put a delay in CachedTokenReaper where it collects the token that needs to be expired but do not send this to LDAP yet. In the mean time there is activity to update these token idle time. This will cause the issue to happen
From all these one can see the following conditions
1) AM sessions hits to some AM and they are registered to the Token expiry cache (via scheduleTokenDeletion). [Non-sticky request to AM]
2) Maybe some concurrency of session update and running of the Expiry token cleaner. As mentioned the collection of these expiry token happen then but it's possible maybe some of the request are handling the current session but have not written them back to DJ
3) We know the this should only happen to CTS token of type session since they likely would have coreTokenExpiryDate changed and ONLY the Token reaper uses an LDAP assertion to enforce that the value is the same.
1. Try to make sure that AM request is sticky (so no sessionid is recorded in the token
expiry cache). Definitely no round-robin
2. Make sure all clock in AM is synchronized (so that no AM runs earlier) if condition
#1 happens. At least make it not likely another AM1 touches the session it is not so
likely that AM2 cleanup this in the middle AM1 is doing work.
(The code have provided not to return an expired session so making sure at least
a clocks is consistent helps a lot)
3. Ignore this. There is no impact other then the logs printed in ERROR
4. If this is happening if say things are sticky (all in the same single AM), it would then mean
heavy concurrency where some session is updated but the CTS reaper kicks in
(ie: when the REST API is called the session was still not expired but it expired
inside the AM end)....
- well we have the "org.forgerock.services.cts.reaper.cache.pollFrequencyMilliseconds"
so maybe do not run the reaper so frequently...
i) Well the above code does not cater for Clock-drift between AM and also in fact
to avoid concurrency, maybe the code should not cleanup at the time of
currentTimeMillis() but currentTimeMillis-TIME_SKEW (so to make really sure
there is unlikely any race). I would think maybe something like 5 sec.
We do have the org.forgerock.services.cts.reaper.search.gracePeriodMilliseconds (5 mins) .
ii) Another possibly is does the Assertion actually matter or need to be applied?
iii) There is no "remove" the session from the TokenExpiryCache for those non-sticky request.