[OPENDJ-1135] DS sometimes fails to connect to RS after server restart Created: 06/Sep/13  Updated: 22/Jun/20  Resolved: 29/Mar/17

Status: Done
Project: OpenDJ
Component/s: None
Affects Version/s: 4.0.0, 3.5.0, 3.0.0, 2.8.0, 2.6.0
Fix Version/s: 5.5.0, 3.5.2

Type: Bug Priority: Critical
Reporter: Christophe Sovant Assignee: Fabio Pistolesi
Resolution: Fixed Votes: 1
Labels: Verified, release-notes

Attachments: Text File jstack.txt     Zip Archive node1-before-opendj-support-data-20161020-0919.zip     Zip Archive node2-after-opendj-support-data-20161020-0926.zip     Zip Archive node2-before-opendj-support-data-20161020-0922.zip    
Issue Links:
Backport
is backported by OPENDJ-3964 Backport OPENDJ-1135: DS sometimes fa... Done
is backported by OPENDJ-3968 Backport OPENDJ-1135: DS sometimes fa... Done
is backported by OPENDJ-4201 Backport OPENDJ-1135: DS sometimes fa... Done
Relates
relates to OPENDJ-3940 Consider managing replication servers... Dev backlog
is related to OPENDJ-5496 DS fails to reconnect to an RS, disco... Done
is related to OPENDJ-3309 Replication server connection listene... Done
QA Assignee: Viktor Nawrath [X] (Inactive)
Support Ticket IDs:

 Description   

Found using revision 9496.

This problem sometimes appears during the replication functional tests:

  • Back-up server A
  • Add entry on server A
  • Stop server B
  • Restore back-up on server B
  • Start server B

Scenario

  • configure 2 instances (DS+RS in the same jvm)
  • import data in instance1 (1002 entries)
  • enable replication between the 2 instances
  • check replication status (1002 entries expected on both instances)
    o=example	localhost:11444	1002	true	4563	22794	11112	0		false
    o=example	localhost:22444	1002	true	1665	32342	22222	0		false
    
  • backup instance1 (1002 entries)
  • add one entry to instance1
  • check replication status (1003 entries expected on both instances)
    o=example	localhost:11444	1003	true	4563	22794	11112	0		false
    o=example	localhost:22444	1003	true	1665	32342	22222	0		false
    
  • stop instance2
  • restore instance2 with backup from instance1
    [05/Sep/2013:16:26:24 +0200] category=JEB severity=NOTICE msgID=8847445 msg=Restored: 00000000.jdb (size 2709275)
    
  • start instance2
    [05/Sep/2013:16:26:26 +0200] category=EXTENSIONS severity=NOTICE msgID=1507899 msg=Loaded extension from file '/var/tmp/new_repl/opendj-2/lib/extensions/snmp-mib2605.jar' (build 2.7.0-20130904, revision 9496)
    [05/Sep/2013:16:26:26 +0200] category=CORE severity=NOTICE msgID=458886 msg=OpenDJ 2.7.0-20130904 (build 20130904000054Z, R9496) starting up
    [05/Sep/2013:16:26:27 +0200] category=RUNTIME_INFORMATION severity=NOTICE msgID=20381717 msg=Installation Directory:  /var/tmp/new_repl/opendj-2
    [05/Sep/2013:16:26:27 +0200] category=RUNTIME_INFORMATION severity=NOTICE msgID=20381719 msg=Instance Directory:      /var/tmp/new_repl/opendj-2
    [05/Sep/2013:16:26:27 +0200] category=RUNTIME_INFORMATION severity=NOTICE msgID=20381713 msg=JVM Information: 1.6.0_45-b06 by Sun Microsystems Inc., 64-bit architecture, 1847394304 bytes heap size
    [05/Sep/2013:16:26:27 +0200] category=RUNTIME_INFORMATION severity=NOTICE msgID=20381714 msg=JVM Host: csovant-PC, running Linux 3.5.0-39-generic amd64, 8311566336 bytes physical memory size, number of processors available 4
    [05/Sep/2013:16:26:27 +0200] category=RUNTIME_INFORMATION severity=NOTICE msgID=20381715 msg=JVM Arguments: "-Dorg.opends.server.scriptName=start-ds"
    [05/Sep/2013:16:26:28 +0200] category=JEB severity=NOTICE msgID=8847402 msg=The database backend userRoot containing 1002 entries has started
    [05/Sep/2013:16:26:29 +0200] category=EXTENSIONS severity=NOTICE msgID=1507549 msg=DIGEST-MD5 SASL mechanism using a server fully qualified domain name of: localhost
    [05/Sep/2013:16:26:29 +0200] category=SYNC severity=NOTICE msgID=15139020 msg=Replication server RS(32342) started listening for new connections on address 0.0.0.0 port 22222
    [05/Sep/2013:16:26:39 +0200] category=SYNC severity=SEVERE_WARNING msgID=14811344 msg=Directory server DS(31296) was unable to connect to any replication servers for domain "cn=admin data"
    [05/Sep/2013:16:26:44 +0200] category=SYNC severity=SEVERE_ERROR msgID=14942386 msg=Directory server 31296 was attempting to connect to replication server 32342 but has disconnected in handshake phase
    [05/Sep/2013:16:26:49 +0200] category=SYNC severity=SEVERE_WARNING msgID=14811344 msg=Directory server DS(20293) was unable to connect to any replication servers for domain "cn=schema"
    [05/Sep/2013:16:26:49 +0200] category=SYNC severity=SEVERE_ERROR msgID=14942386 msg=Directory server 20293 was attempting to connect to replication server 32342 but has disconnected in handshake phase
    [05/Sep/2013:16:26:59 +0200] category=SYNC severity=SEVERE_WARNING msgID=14811344 msg=Directory server DS(1665) was unable to connect to any replication servers for domain "o=example"
    [05/Sep/2013:16:26:59 +0200] category=PROTOCOL severity=NOTICE msgID=2556180 msg=Started listening for new connections on Administration Connector 0.0.0.0 port 22444
    [05/Sep/2013:16:26:59 +0200] category=PROTOCOL severity=NOTICE msgID=2556180 msg=Started listening for new connections on LDAP Connection Handler 0.0.0.0 port 22389
    [05/Sep/2013:16:26:59 +0200] category=PROTOCOL severity=NOTICE msgID=2556180 msg=Started listening for new connections on LDAPS Connection Handler 0.0.0.0 port 22636
    [05/Sep/2013:16:26:59 +0200] category=CORE severity=NOTICE msgID=458887 msg=The Directory Server has started successfully
    [05/Sep/2013:16:26:59 +0200] category=CORE severity=NOTICE msgID=458891 msg=The Directory Server has sent an alert notification generated by class org.opends.server.core.DirectoryServer (alert type org.opends.server.DirectoryServerStarted, alert ID 458887):  The Directory Server has started successfully
    

    ==> it seems that the DS fails to connect to the RS

  • check replication status (1003 entries expected on both instances)
    o=example	localhost:11444	1003	true	4563	22794	11112	0		false
    o=example	localhost:22444	1002	true	1665	32342	22222			false
    

Traces in the errors log

...
[05/Sep/2013:16:26:59 +0200] category=PROTOCOL severity=NOTICE msgID=2556180 msg=Started listening for new connections on Administration Connector 0.0.0.0 port 22444
[05/Sep/2013:16:26:59 +0200] category=PROTOCOL severity=NOTICE msgID=2556180 msg=Started listening for new connections on LDAP Connection Handler 0.0.0.0 port 22389
[05/Sep/2013:16:26:59 +0200] category=PROTOCOL severity=NOTICE msgID=2556180 msg=Started listening for new connections on LDAPS Connection Handler 0.0.0.0 port 22636
[05/Sep/2013:16:26:59 +0200] category=CORE severity=NOTICE msgID=458887 msg=The Directory Server has started successfully
[05/Sep/2013:16:26:59 +0200] category=CORE severity=NOTICE msgID=458891 msg=The Directory Server has sent an alert notification generated by class org.opends.server.core.DirectoryServer (alert type org.opends.server.DirectoryServerStarted, alert ID 458887):  The Directory Server has started successfully
[05/Sep/2013:16:27:00 +0200] category=SYNC severity=SEVERE_ERROR msgID=14942386 msg=Directory server 1665 was attempting to connect to replication server 32342 but has disconnected in handshake phase
[05/Sep/2013:16:27:11 +0200] category=SYNC severity=SEVERE_ERROR msgID=14942386 msg=Directory server 1665 was attempting to connect to replication server 32342 but has disconnected in handshake phase
[05/Sep/2013:16:27:20 +0200] category=BACKEND severity=NOTICE msgID=9896349 msg=Export task 20130905162720583 started execution
[05/Sep/2013:16:27:20 +0200] category=JEB severity=NOTICE msgID=8847447 msg=Exported 1002 entries and skipped 0 in 0 seconds (average rate 6109.8/sec)
[05/Sep/2013:16:27:20 +0200] category=BACKEND severity=NOTICE msgID=9896350 msg=Export task 20130905162720583 finished execution
[05/Sep/2013:16:27:22 +0200] category=SYNC severity=SEVERE_ERROR msgID=14942386 msg=Directory server 31296 was attempting to connect to replication server 32342 but has disconnected in handshake phase
[05/Sep/2013:16:27:33 +0200] category=SYNC severity=SEVERE_ERROR msgID=14942386 msg=Directory server 20293 was attempting to connect to replication server 32342 but has disconnected in handshake phase
[05/Sep/2013:16:27:44 +0200] category=SYNC severity=SEVERE_ERROR msgID=14942386 msg=Directory server 1665 was attempting to connect to replication server 32342 but has disconnected in handshake phase
[05/Sep/2013:16:27:55 +0200] category=SYNC severity=SEVERE_ERROR msgID=14942386 msg=Directory server 31296 was attempting to connect to replication server 32342 but has disconnected in handshake phase
[05/Sep/2013:16:28:06 +0200] category=SYNC severity=SEVERE_ERROR msgID=14942386 msg=Directory server 20293 was attempting to connect to replication server 32342 but has disconnected in handshake phase
[05/Sep/2013:16:28:12 +0200] category=SYNC severity=SEVERE_ERROR msgID=14942386 msg=Directory server 1665 was attempting to connect to replication server 32342 but has disconnected in handshake phase
[05/Sep/2013:16:28:23 +0200] category=SYNC severity=SEVERE_ERROR msgID=14942386 msg=Directory server 31296 was attempting to connect to replication server 32342 but has disconnected in handshake phase
...

Traces in the replication log

...
[05/Sep/2013:16:25:42 +0200] category=SYNC severity=NOTICE msgID=15138894 msg=The generation ID for domain "o=example" has been reset to 18736508
[05/Sep/2013:16:25:53 +0200] category=SYNC severity=INFORMATION msgID=14680271 msg=Replication server RS(32342) has accepted a connection from directory server DS(31296) for domain "cn=admin data" at localhost/127.0.0.1:55313
[05/Sep/2013:16:25:53 +0200] category=SYNC severity=NOTICE msgID=15138878 msg=Directory server DS(31296) has connected to replication server RS(32342) for domain "cn=admin data" at localhost/127.0.0.1:22222 with generation ID 158123
[05/Sep/2013:16:25:53 +0200] category=SYNC severity=INFORMATION msgID=14680271 msg=Replication server RS(32342) has accepted a connection from directory server DS(20293) for domain "cn=schema" at localhost/127.0.0.1:55316
[05/Sep/2013:16:25:53 +0200] category=SYNC severity=NOTICE msgID=15138878 msg=Directory server DS(20293) has connected to replication server RS(32342) for domain "cn=schema" at localhost/127.0.0.1:22222 with generation ID 8408
[05/Sep/2013:16:25:53 +0200] category=SYNC severity=INFORMATION msgID=14680271 msg=Replication server RS(32342) has accepted a connection from directory server DS(1665) for domain "o=example" at localhost/127.0.0.1:55319
[05/Sep/2013:16:25:53 +0200] category=SYNC severity=NOTICE msgID=15138878 msg=Directory server DS(1665) has connected to replication server RS(32342) for domain "o=example" at localhost/127.0.0.1:22222 with generation ID 18736508
[05/Sep/2013:16:26:03 +0200] category=SYNC severity=INFORMATION msgID=14680270 msg=Replication server RS(32342) has accepted a connection from replication server RS(28142) for domain "cn=schema" at localhost/127.0.0.1:55339
[05/Sep/2013:16:26:10 +0200] category=SYNC severity=MILD_WARNING msgID=14745663 msg=Replication server RS(32342) at localhost/127.0.0.1:22222 has closed the connection to this directory server DS(1665). This directory server will now try to connect to another replication server in order to receive changes for the domain "o=example"
[05/Sep/2013:16:26:29 +0200] category=SYNC severity=NOTICE msgID=15139020 msg=Replication server RS(32342) started listening for new connections on address 0.0.0.0 port 22222
[05/Sep/2013:16:26:30 +0200] category=SYNC severity=INFORMATION msgID=14680270 msg=Replication server RS(32342) has accepted a connection from replication server RS(22794) for domain "cn=admin data" at localhost/127.0.0.1:55447
[05/Sep/2013:16:26:30 +0200] category=SYNC severity=INFORMATION msgID=14680270 msg=Replication server RS(32342) has accepted a connection from replication server RS(22794) for domain "o=example" at localhost/127.0.0.1:55448
[05/Sep/2013:16:26:30 +0200] category=SYNC severity=INFORMATION msgID=14680270 msg=Replication server RS(32342) has accepted a connection from replication server RS(22794) for domain "cn=schema" at localhost/127.0.0.1:55449
[05/Sep/2013:16:27:11 +0200] category=SYNC severity=INFORMATION msgID=14680169 msg=Replication server accepted a connection from localhost/127.0.0.1:55497 to local address 0.0.0.0/0.0.0.0:22222 but the SSL handshake failed. This is probably benign, but may indicate a transient network outage or a misconfigured client application connecting to this replication server. The error was: Remote host closed connection during handshake
[05/Sep/2013:16:27:11 +0200] category=SYNC severity=INFORMATION msgID=14680169 msg=Replication server accepted a connection from localhost/127.0.0.1:55498 to local address 0.0.0.0/0.0.0.0:22222 but the SSL handshake failed. This is probably benign, but may indicate a transient network outage or a misconfigured client application connecting to this replication server. The error was: Remote host closed connection during handshake
[05/Sep/2013:16:27:22 +0200] category=SYNC severity=INFORMATION msgID=14680169 msg=Replication server accepted a connection from localhost/127.0.0.1:55524 to local address 0.0.0.0/0.0.0.0:22222 but the SSL handshake failed. This is probably benign, but may indicate a transient network outage or a misconfigured client application connecting to this replication server. The error was: Remote host closed connection during handshake
[05/Sep/2013:16:27:22 +0200] category=SYNC severity=INFORMATION msgID=14680169 msg=Replication server accepted a connection from localhost/127.0.0.1:55525 to local address 0.0.0.0/0.0.0.0:22222 but the SSL handshake failed. This is probably benign, but may indicate a transient network outage or a misconfigured client application connecting to this replication server. The error was: Remote host closed connection during handshake
[05/Sep/2013:16:27:33 +0200] category=SYNC severity=INFORMATION msgID=14680169 msg=Replication server accepted a connection from localhost/127.0.0.1:55546 to local address 0.0.0.0/0.0.0.0:22222 but the SSL handshake failed. This is probably benign, but may indicate a transient network outage or a misconfigured client application connecting to this replication server. The error was: Remote host closed connection during handshake
[05/Sep/2013:16:27:33 +0200] category=SYNC severity=INFORMATION msgID=14680169 msg=Replication server accepted a connection from localhost/127.0.0.1:55547 to local address 0.0.0.0/0.0.0.0:22222 but the SSL handshake failed. This is probably benign, but may indicate a transient network outage or a misconfigured client application connecting to this replication server. The error was: Remote host closed connection during handshake
...

The jstack output is attached (jstack.txt).



 Comments   
Comment by Christophe Sovant [ 13/Dec/13 ]

I executed the same scenario in a loop using revision 9955 and could not reproduce the problem.

Comment by Christophe Sovant [ 07/Jan/14 ]

This problem reappeared in the functional tests (revision 10082).

Comment by Christophe Sovant [ 04/Feb/14 ]

On the trunk in revision 10240.
This problem occurs every day during the functional tests on Windows.

[03/Feb/2014:08:58:41 +0000] category=EXTENSIONS severity=NOTICE msgID=1507899 msg=Loaded extension from file 'C:\opendj\lib\extensions\snmp-mib2605.jar' (build 2.7.0-20140203, revision 10240)
[03/Feb/2014:08:58:41 +0000] category=CORE severity=NOTICE msgID=458886 msg=OpenDJ 2.7.0-20140203 (build 20140203024124Z, R10240) starting up
[03/Feb/2014:08:58:45 +0000] category=RUNTIME_INFORMATION severity=NOTICE msgID=20381717 msg=Installation Directory: C:\opendj
[03/Feb/2014:08:58:45 +0000] category=RUNTIME_INFORMATION severity=NOTICE msgID=20381719 msg=Instance Directory: C:\opendj
[03/Feb/2014:08:58:45 +0000] category=RUNTIME_INFORMATION severity=NOTICE msgID=20381713 msg=JVM Information: 1.7.0_45-b18 by Oracle Corporation, 64-bit architecture, 954728448 bytes heap size
[03/Feb/2014:08:58:46 +0000] category=RUNTIME_INFORMATION severity=NOTICE msgID=20381714 msg=JVM Host: localhost, running Windows Server 2008 R2 6.1 amd64, 4294500352 bytes physical memory size, number of processors available 2
[03/Feb/2014:08:58:46 +0000] category=RUNTIME_INFORMATION severity=NOTICE msgID=20381715 msg=JVM Arguments: "-Dorg.opends.server.scriptName=start-ds"
[03/Feb/2014:08:58:48 +0000] category=JEB severity=NOTICE msgID=8847402 msg=The database backend userRoot containing 160 entries has started
[03/Feb/2014:08:58:49 +0000] category=EXTENSIONS severity=NOTICE msgID=1507549 msg=DIGEST-MD5 SASL mechanism using a server fully qualified domain name of: localhost
[03/Feb/2014:08:58:54 +0000] category=SYNC severity=NOTICE msgID=15139020 msg=Replication server RS(13567) started listening for new connections on address 0.0.0.0 port 11112
[03/Feb/2014:08:59:09 +0000] category=SYNC severity=SEVERE_WARNING msgID=14811344 msg=Directory server DS(18574) was unable to connect to any replication servers for domain "cn=admin data"
[03/Feb/2014:08:59:24 +0000] category=SYNC severity=SEVERE_WARNING msgID=14811344 msg=Directory server DS(25374) was unable to connect to any replication servers for domain "cn=schema"
[03/Feb/2014:08:59:34 +0000] category=SYNC severity=SEVERE_ERROR msgID=14942386 msg=Directory server 18574 was attempting to connect to replication server 13567 but has disconnected in handshake phase
[03/Feb/2014:08:59:37 +0000] category=SYNC severity=SEVERE_WARNING msgID=14811281 msg=Timed out while trying to acquire the domain lock for domain "cn=schema". The connection attempt from replication server RS(9750) at localhost/100.10.10.10:57053 to this replication server RS(13567) will be aborted. This is probably benign and a result of a simultaneous cross connection attempt
[03/Feb/2014:08:59:38 +0000] category=SYNC severity=NOTICE msgID=15138878 msg=Directory server DS(29360) has connected to replication server RS(26642) for domain "o=example" at localhost/100.10.10.10:33333 with generation ID 2898897
[03/Feb/2014:08:59:38 +0000] category=PROTOCOL severity=NOTICE msgID=2556180 msg=Started listening for new connections on Administration Connector 0.0.0.0 port 11444
[03/Feb/2014:08:59:38 +0000] category=PROTOCOL severity=NOTICE msgID=2556180 msg=Started listening for new connections on LDAP Connection Handler 0.0.0.0 port 11389
[03/Feb/2014:08:59:38 +0000] category=PROTOCOL severity=NOTICE msgID=2556180 msg=Started listening for new connections on LDAPS Connection Handler 0.0.0.0 port 11636
[03/Feb/2014:08:59:38 +0000] category=CORE severity=NOTICE msgID=458887 msg=The Directory Server has started successfully
[03/Feb/2014:08:59:38 +0000] category=CORE severity=NOTICE msgID=458891 msg=The Directory Server has sent an alert notification generated by class org.opends.server.core.DirectoryServer (alert type org.opends.server.DirectoryServerStarted, alert ID 458887): The Directory Server has started successfully
Comment by Chris Ridd [ 15/Sep/16 ]

The jstack shows that the replication server connection listener thread is not present, so nothing new can connect to the RS.

It should be called "Replication server RS(<rs-id>) connection listener on port <rs-port>", and this is logged when the thread starts:

Replication server RS(%d) started listening for new connections on address %s port %d

This may be related to OPENDJ-3309.

Comment by Chris Ridd [ 03/Oct/16 ]

A customer ran with a debug version of the OPENDJ-3309 fix. We observed:

  • the server had three socket close exceptions (one for each replication domain)
  • immediately afterwards, a remote server connected and sent us a StopMsg

I suspect the reconnect is from the server that just had its sockets closed.

Comment by Lee Trujillo [ 20/Oct/16 ]

Reproduced on 2.6.4, 3.0.0, 3.5.0 and 4.0.0 (163b057).

Reproduction steps:

  1. Create 2 Linux VMs.
  2. Install one DS+RS DJ instance on each VM.
  3. Enable replication between the two instances from Node 1.
  4. Add/Mod entries on Node 1 to ensure replication is flowing.
  5. On Node 2 stop-ds but leave the VM running.
  6. On Node 1 stop-ds and shutdown the VM.
  7. On Node 2 start-ds again.
  8. Attempt writes on Node 2.

The RS will start, but all or most of the DS domains, especially dc=example,dc=com, will not be able to connect to it. All subsequent writes to the DS will fail with result=53, and constant "handshake phase" messages are logged to the errors log.

[20/Oct/2016:09:25:30 -0600] category=SYNC severity=NOTICE msgID=204 msg=Replication server RS(2644) started listening for new connections on address 0.0.0.0 port 8989
[20/Oct/2016:09:25:41 -0600] category=SYNC severity=WARNING msgID=208 msg=Directory server DS(25816) was unable to connect to any replication servers for domain "cn=schema"
[20/Oct/2016:09:25:48 -0600] category=SYNC severity=WARNING msgID=208 msg=Directory server DS(3692) was unable to connect to any replication servers for domain "cn=admin data"
[20/Oct/2016:09:25:49 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 25816 was attempting to connect to replication server 2644 but has disconnected in handshake phase
[20/Oct/2016:09:25:55 -0600] category=SYNC severity=WARNING msgID=208 msg=Directory server DS(5688) was unable to connect to any replication servers for domain "dc=example,dc=com"
[20/Oct/2016:09:25:55 -0600] category=PROTOCOL severity=NOTICE msgID=276 msg=Started listening for new connections on Administration Connector 0.0.0.0 port 4444
[20/Oct/2016:09:25:55 -0600] category=PROTOCOL severity=NOTICE msgID=276 msg=Started listening for new connections on LDAPS Connection Handler 0.0.0.0 port 1636
[20/Oct/2016:09:25:55 -0600] category=PROTOCOL severity=NOTICE msgID=276 msg=Started listening for new connections on LDAP Connection Handler 0.0.0.0 port 1389
[20/Oct/2016:09:25:55 -0600] category=CORE severity=NOTICE msgID=135 msg=The Directory Server has started successfully
[20/Oct/2016:09:25:55 -0600] category=CORE severity=NOTICE msgID=139 msg=The Directory Server has sent an alert notification generated by class org.opends.server.core.DirectoryServer (alert type org.opends.server.DirectoryServerStarted, alert ID org.opends.messages.core-135): The Directory Server has started successfully
[20/Oct/2016:09:25:59 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 5688 was attempting to connect to replication server 2644 but has disconnected in handshake phase
[20/Oct/2016:09:26:08 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 5688 was attempting to connect to replication server 2644 but has disconnected in handshake phase
[20/Oct/2016:09:26:16 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 5688 was attempting to connect to replication server 2644 but has disconnected in handshake phase
[20/Oct/2016:09:26:25 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 3692 was attempting to connect to replication server 2644 but has disconnected in handshake phase
[20/Oct/2016:09:26:35 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 3692 was attempting to connect to replication server 2644 but has disconnected in handshake phase
[20/Oct/2016:09:26:44 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 25816 was attempting to connect to replication server 2644 but has disconnected in handshake phase
[20/Oct/2016:09:26:11 -0600] BIND REQ conn=0 op=0 msgID=1 version=3 type=SIMPLE dn="cn=Directory Manager"
[20/Oct/2016:09:26:11 -0600] BIND RES conn=0 op=0 msgID=1 result=0 authDN="cn=Directory Manager,cn=Root DNs,cn=config" etime=13
[20/Oct/2016:09:26:11 -0600] ADD REQ conn=0 op=1 msgID=2 dn="uid=user.10011,ou=People,dc=example,dc=com"
[20/Oct/2016:09:26:11 -0600] ADD RES conn=0 op=1 msgID=2 result=53 message="The Replication is configured for suffix dc=example,dc=com but was not able to connect to any Replication Server" etime=3

Workaround

Starting the offline Node 1 VM always allows the Node 2 DJ to connect to an RS.

  • Restarting the Node 2 DJ instance has no effect.
  • Rebooting the Node 2 VM has no effect.
  • Booting the Node 1 VM alone allows the Node 2 DS instances to connect to their RS. Note: the Node 1 instance was not started.

As soon as Node 1 hits the networking run level on boot, the Node 2 DS instance can connect to the RS...again, the Node 1 DJ instance is not running.

[root@opendj ~]# date; uptime
Thu Oct 20 12:29:00 MDT 2016
 12:29:00 up 0 min,  2 users,  load average: 2.10, 0.52, 0.17

[20/Oct/2016:12:27:41 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 5688 was attempting to connect to replication server 2644 but has disconnected in handshake phase
[20/Oct/2016:12:27:50 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 3692 was attempting to connect to replication server 2644 but has disconnected in handshake phase
[20/Oct/2016:12:27:59 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 3692 was attempting to connect to replication server 2644 but has disconnected in handshake phase
[20/Oct/2016:12:28:08 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 25816 was attempting to connect to replication server 2644 but has disconnected in handshake phase
[20/Oct/2016:12:28:17 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 25816 was attempting to connect to replication server 2644 but has disconnected in handshake phase
[20/Oct/2016:12:28:26 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 3692 was attempting to connect to replication server 2644 but has disconnected in handshake phase
[20/Oct/2016:12:28:29 -0600] category=SYNC severity=NOTICE msgID=62 msg=Directory server DS(3692) has connected to replication server RS(2644) for domain "cn=admin data" at 192.168.0.59:8989 with generation ID 152746
[20/Oct/2016:12:28:29 -0600] category=SYNC severity=NOTICE msgID=62 msg=Directory server DS(25816) has connected to replication server RS(2644) for domain "cn=schema" at 192.168.0.59:8989 with generation ID 8408
[20/Oct/2016:12:28:29 -0600] category=SYNC severity=NOTICE msgID=62 msg=Directory server DS(5688) has connected to replication server RS(2644) for domain "dc=example,dc=com" at 192.168.0.59:8989 with generation ID 19363676

Netstats

The following netstats were taken from Node 2's system:

With both Nodes' DJ instances running:

[root@opendj bin]# netstat -an | grep 8989
tcp6       0      0 :::8989                 :::*                    LISTEN     
tcp6       0      0 192.168.0.59:55868      192.168.0.59:8989       ESTABLISHED
tcp6       0      0 192.168.0.59:8989       192.168.0.58:39732      ESTABLISHED
tcp6       0      0 192.168.0.59:8989       192.168.0.58:39735      ESTABLISHED
tcp6       0      0 192.168.0.59:8989       192.168.0.59:55868      ESTABLISHED
tcp6       0      0 192.168.0.59:8989       192.168.0.58:39734      ESTABLISHED
tcp6       0      0 192.168.0.59:55872      192.168.0.59:8989       ESTABLISHED
tcp6       0      0 192.168.0.59:55864      192.168.0.59:8989       ESTABLISHED
tcp6       0      0 192.168.0.59:8989       192.168.0.59:55864      ESTABLISHED
tcp6       0      0 192.168.0.59:8989       192.168.0.59:55872      ESTABLISHED

After Node 2 DJ stopped:

[root@opendj bin]# netstat -an | grep 8989
tcp6       0      0 192.168.0.59:55868      192.168.0.59:8989       TIME_WAIT  
tcp6       0      0 192.168.0.59:55864      192.168.0.59:8989       TIME_WAIT  

After Node 1 DJ stopped:

[root@opendj bin]# netstat -an | grep 8989
tcp6       0      0 192.168.0.59:55868      192.168.0.59:8989       TIME_WAIT  
tcp6       0      0 192.168.0.59:55864      192.168.0.59:8989       TIME_WAIT  

After Node 1 VM shutdown: no connections.

[root@opendj bin]# netstat -an | grep 8989

After Node 2 DJ instance started

[root@opendj bin]# netstat -an | grep 8989

tcp6       3      0 :::8989                 :::*                    LISTEN     
tcp6     227      0 192.168.0.59:8989       192.168.0.59:56216      ESTABLISHED
tcp6       0      0 192.168.0.59:56217      192.168.0.59:8989       ESTABLISHED
tcp6       0      1 192.168.0.59:47653      192.168.0.58:8989       SYN_SENT   
tcp6       0      0 192.168.0.59:56218      192.168.0.59:8989       ESTABLISHED
tcp6     227      0 192.168.0.59:8989       192.168.0.59:56217      ESTABLISHED
tcp6       1      0 192.168.0.59:8989       192.168.0.59:56209      CLOSE_WAIT 
tcp6     227      0 192.168.0.59:8989       192.168.0.59:56218      ESTABLISHED
tcp6       0      0 192.168.0.59:56216      192.168.0.59:8989       ESTABLISHED
tcp6       0      0 192.168.0.59:56209      192.168.0.59:8989       FIN_WAIT2  

[root@opendj bin]# netstat -an | grep 8989
tcp6       2      0 :::8989                 :::*                    LISTEN     
tcp6       0      0 192.168.0.59:56231      192.168.0.59:8989       ESTABLISHED
tcp6       0      0 192.168.0.59:56227      192.168.0.59:8989       ESTABLISHED
tcp6     227      0 192.168.0.59:8989       192.168.0.59:56230      ESTABLISHED
tcp6       0      1 192.168.0.59:47671      192.168.0.58:8989       SYN_SENT   
tcp6       0      0 192.168.0.59:56230      192.168.0.59:8989       ESTABLISHED
tcp6       0      0 192.168.0.59:8989       192.168.0.59:56227      ESTABLISHED
tcp6     227      0 192.168.0.59:8989       192.168.0.59:56231      ESTABLISHED

[root@opendj bin]# netstat -an | grep 8989
tcp6       5      0 :::8989                 :::*                    LISTEN     
tcp6       0      0 192.168.0.59:56253      192.168.0.59:8989       FIN_WAIT2  
tcp6       0      0 192.168.0.59:56262      192.168.0.59:8989       ESTABLISHED
tcp6     227      0 192.168.0.59:8989       192.168.0.59:56260      ESTABLISHED
tcp6     227      0 192.168.0.59:8989       192.168.0.59:56261      ESTABLISHED
tcp6       0      0 192.168.0.59:56249      192.168.0.59:8989       FIN_WAIT2  
tcp6       0      1 192.168.0.59:47697      192.168.0.58:8989       SYN_SENT   
tcp6     228      0 192.168.0.59:8989       192.168.0.59:56253      CLOSE_WAIT 
tcp6     228      0 192.168.0.59:8989       192.168.0.59:56252      CLOSE_WAIT 
tcp6     227      0 192.168.0.59:8989       192.168.0.59:56262      ESTABLISHED
tcp6       0      0 192.168.0.59:56260      192.168.0.59:8989       ESTABLISHED
tcp6       0      0 192.168.0.59:56261      192.168.0.59:8989       ESTABLISHED
tcp6       0      0 192.168.0.59:56252      192.168.0.59:8989       FIN_WAIT2  
tcp6       1      0 192.168.0.59:8989       192.168.0.59:56249      CLOSE_WAIT 

After Node 1 system boot - Node 1 DJ instance not running:

[root@opendj bin]# netstat -an | grep 8989

tcp6       0      0 :::8989                 :::*                    LISTEN     
tcp6       0      0 192.168.0.59:56310      192.168.0.59:8989       ESTABLISHED
tcp6       0      0 192.168.0.59:56314      192.168.0.59:8989       ESTABLISHED
tcp6       0      0 192.168.0.59:56318      192.168.0.59:8989       ESTABLISHED
tcp6       0      0 192.168.0.59:8989       192.168.0.59:56310      ESTABLISHED
tcp6       0      0 192.168.0.59:8989       192.168.0.59:56314      ESTABLISHED
tcp6       0      0 192.168.0.59:8989       192.168.0.59:56318      ESTABLISHED
[root@opendj bin]# netstat -an | grep 8989
tcp6       0      0 :::8989                 :::*                    LISTEN     
tcp6       0      0 192.168.0.59:56310      192.168.0.59:8989       ESTABLISHED
tcp6       0      0 192.168.0.59:56314      192.168.0.59:8989       ESTABLISHED
tcp6       0      0 192.168.0.59:56318      192.168.0.59:8989       ESTABLISHED
tcp6       0      0 192.168.0.59:8989       192.168.0.59:56310      ESTABLISHED
tcp6       0      0 192.168.0.59:8989       192.168.0.59:56314      ESTABLISHED
tcp6       0      0 192.168.0.59:8989       192.168.0.59:56318      ESTABLISHED

Attached:

System/Instance data before the handshake errors are forced:

  • node1-before-opendj-support-data-20161020-0919.zip
  • node2-before-opendj-support-data-20161020-0922.zip

System/Instance data after the handshake errors are forced:

  • node2-after-opendj-support-data-20161020-0926.zip
Comment by Chris Ridd [ 24/Nov/16 ]

The support cases referenced in this issue are using either DJ 3.0.0 (both standalone, and embedded in AM 13.0.0) or DJ 3.5.0. There may have been some older cases which are hitting this issue as well, but we don't have that data.

Comment by Matthew Swift [ 06/Mar/17 ]

Fabio Pistolesi - please implement your suggested workaround.

Comment by Fabio Pistolesi [ 17/Mar/17 ]

Root cause(s) as far as I see it:

  • The RS tries to connect to all other RSes, but times out when they are unreachable. This is done per domain, so an RS that is unreachable in one domain will be tried again if it is present in another domain.
  • The DS tries to connect to the RS, which is busy, so the connection sits in the listen backlog.
  • When the RS finally gets to accept(), it is too late: the DS has already given up with a timeout.
    The two never connect because the RS is effectively unavailable for (NDom * NUnreachableRS) connection timeouts.

Moreover, connecting to other RSes is done by the same thread that performs accept(), so that thread is always processing inbound connections that expired long ago.
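The unavailability window described in that comment can be sketched numerically. A minimal sketch, with hypothetical timeout values (the actual OpenDJ defaults differ):

```python
def rs_unavailable_window(n_domains: int, n_unreachable_rs: int,
                          connect_timeout_s: float) -> float:
    """Worst-case time the single accept()/connect thread stays blocked:
    each domain retries every unreachable RS, and each attempt blocks for
    the full connect timeout before the thread returns to accept()."""
    return n_domains * n_unreachable_rs * connect_timeout_s


def ds_handshake_succeeds(window_s: float, handshake_timeout_s: float) -> bool:
    # The DS gives up if the RS cannot get to accept() before the DS
    # handshake timeout expires.
    return handshake_timeout_s > window_s


# Example: 3 domains (cn=admin data, cn=schema, o=example), 1 unreachable
# RS, a 10 s connect timeout, and a 5 s DS handshake timeout.
window = rs_unavailable_window(3, 1, 10.0)   # 30.0 seconds blocked
print(ds_handshake_succeeds(window, 5.0))    # the DS times out first
```

This matches the observed behaviour: the DS keeps retrying, but each retry lands in the backlog while the listener thread is stuck in another blocking connect, so the handshake never completes until the remote RS becomes reachable (e.g. when the Node 1 VM boots).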

Comment by Fabio Pistolesi [ 29/Mar/17 ]

Added a blacklist for unreachable RSes: the first time the RS tries to connect to other RSes, any RS whose connection attempt times out is marked as unreachable. Unreachable RSes are not tried again until the list is purged later on.
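A minimal sketch of such a blacklist, with hypothetical names (the actual fix lives in the OpenDJ replication code):

```python
import time


class RsBlacklist:
    """Skip RSes whose connection attempts timed out, until a purge
    interval elapses. Hypothetical sketch of the fix described above."""

    def __init__(self, purge_after_s: float, clock=time.monotonic):
        self._purge_after_s = purge_after_s
        self._clock = clock
        self._unreachable = {}  # RS address -> time it was blacklisted

    def mark_unreachable(self, rs: str) -> None:
        self._unreachable[rs] = self._clock()

    def should_try(self, rs: str) -> bool:
        blacklisted_at = self._unreachable.get(rs)
        if blacklisted_at is None:
            return True
        if self._clock() - blacklisted_at >= self._purge_after_s:
            del self._unreachable[rs]  # purge: give the RS another chance
            return True
        return False
```

With this check before each blocking connect, an unreachable RS costs at most one connect timeout per purge interval instead of one per domain per pass, so the accept() thread gets back to servicing inbound DS connections quickly.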

Comment by Viktor Nawrath [X] (Inactive) [ 07/Aug/17 ]

Verified on opendj-4.1.0 rev: 82e3722f686

Generated at Wed Sep 30 02:10:26 UTC 2020 using Jira 7.13.12#713012-sha1:6e07c38070d5191bbf7353952ed38f111754533a.