[OPENDJ-7169] Apparent issue with failover functionality in rest2ldap Created: 04/May/20  Updated: 15/Jul/20  Resolved: 02/Jul/20

Status: Done
Project: OpenDJ
Component/s: core apis
Affects Version/s: 7.0.0
Fix Version/s: 7.0.0

Type: Bug Priority: Blocker
Reporter: Dirk Hogan Assignee: Yannick Lecaillez
Resolution: Not a defect Votes: 0
Labels: None

Epic Link: Bugs 7.0
Story Points: 5
Dev Assignee: Yannick Lecaillez

 Description   

Jake Feasel and I tested a GKE-deployed IDM instance with two DS instances, each configured as both a replication server and a directory server, in an active-passive setup (one specified in primaryLdapServers, one in secondaryLdapServers). Load was generated with JMeter at ~60 user creates per second. Then the pod hosting the primary DS instance was killed. JMeter recorded ~10 failures following the initial pod kill, and another ~10 failures when Kubernetes restored the pod. These failures were reproducible.

JMeter reported a NoHttpResponseException (stack trace below). Note that no logs were recorded corresponding to the failures in IDM. This means that either:

  1. something other than a ResourceException was returned in the Promise returned from the rest2ldap invocations, or an unchecked exception was thrown,
  2. or the calls to rest2ldap simply did not return. 

If it is important to make this distinction, I could wrap the repo-layer invocation for managed user creation in a try-finally: increment an AtomicInteger before the repo-layer invocation, decrement it in the finally block, and then reproduce the issue with Jake. A zero-valued AtomicInteger after the run would indicate an unexpected exception, and a value greater than zero would indicate a repo-layer/rest2ldap call that simply did not return.
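A minimal sketch of the instrumentation proposed above, assuming the repo-layer call can be treated as a blocking call. The class name InFlightTracker and the Supplier standing in for the repo-layer invocation are hypothetical and do not come from the IDM code base. If the real call returns a Promise, the decrement would instead have to run when the Promise completes rather than in the finally block.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical helper: counts repo-layer calls that have started but not yet returned. */
final class InFlightTracker {
    private final AtomicInteger inFlight = new AtomicInteger();

    /** Runs the given call, counting it as in flight until it returns or throws. */
    <T> T track(Supplier<T> repoCall) {
        inFlight.incrementAndGet();
        try {
            return repoCall.get();
        } finally {
            // Runs on normal return and on any exception, checked or unchecked.
            inFlight.decrementAndGet();
        }
    }

    /** Zero means every call returned or threw; above zero means some call never came back. */
    int inFlight() {
        return inFlight.get();
    }

    public static void main(String[] args) {
        InFlightTracker tracker = new InFlightTracker();
        // Stand-in for the managed user create; a real test would call the repo layer here.
        String result = tracker.track(() -> "created");
        System.out.println(result + ", in flight: " + tracker.inFlight());
    }
}
```

Reading the counter after the load test has drained separates the two cases: an unexpected exception still passes through the finally block and leaves the counter at zero, while a call that never returns leaves it above zero.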

 

org.apache.http.NoHttpResponseException: jake.iam.forgeops.com:443 failed to respond
 at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:141)
 at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56)
 at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259)
 at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:286)
 at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:257)
 at org.apache.jmeter.protocol.http.sampler.hc.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:199)
 at org.apache.jmeter.protocol.http.sampler.MeasuringConnectionManager$MeasuredConnection.receiveResponseHeader(MeasuringConnectionManager.java:212)
 at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273)
 at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
 at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:684)
 at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:486)
 at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:835)
 at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
 at org.apache.jmeter.protocol.http.sampler.HTTPHC4Impl.executeRequest(HTTPHC4Impl.java:695)
 at org.apache.jmeter.protocol.http.sampler.HTTPHC4Impl.sample(HTTPHC4Impl.java:454)
 at org.apache.jmeter.protocol.http.sampler.HTTPSamplerProxy.sample(HTTPSamplerProxy.java:74)
 at org.apache.jmeter.protocol.http.sampler.HTTPSamplerBase.sample(HTTPSamplerBase.java:1189)
 at org.apache.jmeter.protocol.http.sampler.HTTPSamplerBase.sample(HTTPSamplerBase.java:1178)
 at org.apache.jmeter.threads.JMeterThread.executeSamplePackage(JMeterThread.java:498)
 at org.apache.jmeter.threads.JMeterThread.processSampler(JMeterThread.java:424)
 at org.apache.jmeter.threads.JMeterThread.run(JMeterThread.java:255)
 at java.lang.Thread.run(Thread.java:748)


 Comments   
Comment by Matthew Swift [ 05/May/20 ]

Flagging as critical for 7.0 since this bug could leave client applications hanging as well as triggering resource leaks.

Comment by Yannick Lecaillez [ 02/Jul/20 ]

It appears that IDM, rest2ldap and the DS SDK operate correctly.

The NoHttpResponseException seems to be caused by the eng-shared nginx-ingress controller. Indeed, once IDM returns a 503 to the ingress, the ingress automatically closes the connection, which might contain more pending requests due to keep-alive/pipelining.

The failing requests reported by JMeter are not even received by IDM.

When the test is run directly against IDM, effectively bypassing the ingress (thanks to port-forwarding), we can see that JMeter receives a few 503 responses during the failover, as expected.

Note that I did not find any evidence on the Internet of this odd behavior of the nginx ingress controller. It is a deduction from purely empirical testing.

Comment by Dirk Hogan [ 13/Jul/20 ]

Yannick Lecaillez Just for my understanding: I was under the impression, when I filed this JIRA, that you and the DS team expected zero failures from DS during the cutover from active->passive and back. It sounds like this impression was incorrect: we should expect a certain, relatively small number of 503 responses. You were concerned about the lack of responses in JMeter and the lack of errors in IDM, and you are hypothesizing that this can be attributed to the way the nginx-ingress handles 503 responses. Am I understanding this correctly?

Comment by Jean-Noël Rouvignac [ 15/Jul/20 ]

"you and the DS team expected zero failures from DS during the cutover from active->passive and back"

Exactly, and this is what Yannick investigated.
There are zero failures at the DS level, so there is no bug in DS.

"It sounds like this impression was incorrect: we should expect a certain, relatively small number of 503 responses."

The impression was correct: there is no bug in DS according to Yannick's investigations.
I am not sure that "we should expect a certain, relatively small number of 503 responses".

It looks like a bug/feature/unexpected behaviour in the nginx ingress controller.
Changing to another implementation of the ingress controller may fix that problem.

There is nothing more that can be done on the DS side, which is why this bug is resolved as "Not a defect".
Maybe something must be changed in the GKE cluster? Maybe the ingress controller must be changed? Maybe something else?
I am not sure how you should proceed from now on.
Warren Strange, have you heard of any such problem in the past? Is there anything that can be done on the forgeops side?

Comment by Dirk Hogan [ 15/Jul/20 ]

Jean-Noël Rouvignac I was responding to Yannick Lecaillez's comment:

"When the test is run directly against IDM, effectively bypassing the ingress (thanks to port-forwarding), we can see that JMeter receives a few 503 responses during the failover, as expected."

Seeing as IDM simply dispatches requests to rest2ldap, where do you see the source of these 503 responses? 

Comment by Yannick Lecaillez [ 15/Jul/20 ]

It is expected that rest2ldap returns a few 503 responses on failover for non-idempotent requests such as add (which is the case here).

The issue was created because there was a suspicion that rest2ldap left some requests unanswered during failover.

Some requests are indeed not answered, but this is due to the nginx controller. When IDM returns a 503 (effectively forwarding the 503 returned by rest2ldap), nginx surprisingly seems to close all connections to that host. That is why everything appears to work fine from the rest2ldap/IDM side while on the client side some requests are not answered: nginx closed the underlying connection as a result of the 503 returned by IDM.
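To illustrate why a few 503 responses are expected for add requests, here is a minimal sketch of a failover policy that retries only idempotent operations. It is not the rest2ldap implementation: FailoverPolicy, Operation, ConnectionLostException and the two Supplier arguments are hypothetical names introduced for this example, and the classification of operations is an assumption. The reasoning is that an add interrupted by a lost connection may already have been applied, so replaying it on the secondary is not safe, whereas a search can be replayed freely.

```java
import java.util.function.Supplier;

/** Hypothetical failover policy: replay idempotent operations, surface 503 for the rest. */
final class FailoverPolicy {
    /** Assumed classification for this sketch: a search is safe to replay, an add is not. */
    enum Operation {
        SEARCH(true), ADD(false);

        final boolean idempotent;

        Operation(boolean idempotent) {
            this.idempotent = idempotent;
        }
    }

    static final int OK = 200;
    static final int SERVICE_UNAVAILABLE = 503;

    /** Hypothetical signal that the connection to the primary was lost mid-request. */
    static final class ConnectionLostException extends RuntimeException {
        private static final long serialVersionUID = 1L;
    }

    /** Returns the HTTP status the client would see for one request. */
    static int execute(Operation op, Supplier<Integer> primary, Supplier<Integer> secondary) {
        try {
            return primary.get();
        } catch (ConnectionLostException e) {
            if (op.idempotent) {
                // Replaying cannot apply the change twice, so fail over transparently.
                return secondary.get();
            }
            // The primary may or may not have applied the request: report the failure.
            return SERVICE_UNAVAILABLE;
        }
    }
}
```

Under this policy only requests in flight at the moment of failover are affected, which is consistent with the small number of 503 responses seen when the ingress is bypassed.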

Generated at Fri Oct 23 08:38:40 UTC 2020 using Jira 7.13.12#713012-sha1:6e07c38070d5191bbf7353952ed38f111754533a.