[OPENDJ-5496] DS fails to reconnect to an RS, disconnecting in handshake phase, after system restart Created: 24/Sep/18  Updated: 15/Apr/20  Resolved: 25/Sep/18

Status: Done
Project: OpenDJ
Component/s: replication
Affects Version/s: 6.5.0, 6.0.0, 5.5.1, 5.5.0, 3.5.3
Fix Version/s: 6.5.0

Type: Bug Priority: Critical
Reporter: Lee Trujillo Assignee: Fabio Pistolesi
Resolution: Fixed Votes: 0
Labels: Verified, release-notes
Environment:

2 x Linux VM's with 2 x DS+RS's


Attachments: Zip Archive opendj-support-data-20180924-115825.zip     Zip Archive opendj-support-data-20180924-120338.zip     Zip Archive opendj-support-data-20180924-120518.zip    
Issue Links:
Backport
is backported by OPENDJ-5767 Backport OPENDJ-5496: DS fails to rec... Done
is backported by OPENDJ-5768 Backport OPENDJ-5496: DS fails to rec... Done
is backported by OPENDJ-7103 Backport OPENDJ-5496: DS fails to rec... Done
Relates
relates to OPENDJ-1135 DS sometimes fails to connect to RS a... Done
Epic Link: Bugs 6.5
Story Points: 0.5
QA Assignee: Viktor Nawrath [X] (Inactive)
Support Ticket IDs:
Backports: OPENDJ-5767 (3.5.4) OPENDJ-5768 (5.5.3) OPENDJ-7103 (6.0.1)

 Description   

The following exhibits the same symptoms as seen with OPENDJ-1135 when the remote System is stopped and the remaining DS+RS is restarted.

Steps to reproduce are simple. 

  • Setup 2 VM's with a DS+RS on each.
  • Enable basic MMR and initialize one from the other.

Once the above is complete, use the following basic steps.

  1. Stop DS 1
  2. Stop VM 1
  3. Stop DS 2
  4. Start DS 2

Example:

The following uses my own internal instances named lee2 (VM1) and lee3 (VM2).

Step 1) Stop DS+RS1.

Stop the DS+RS1 (on VM1 - lee2) instance at 24/Sep/2018:11:58:52

 

[root@lee2 bin]# ./stop-ds
Stopping Server...
[24/Sep/2018:11:58:52 -0600] category=BACKEND severity=NOTICE msgID=370 msg=The backend userRoot is now taken offline
[24/Sep/2018:11:58:52 -0600] category=CORE severity=NOTICE msgID=203 msg=The Directory Server is now stopped

Step 2) Stop VM1 (which houses DS+RS1)

 

I issued the stop for VM1 just before 24/Sep/2018:12:00:06.  You can see VM1 (lee2) has been added to the "unreachable RS" list on the logging from VM2.

 

[24/Sep/2018:12:00:06 -0600] category=SYNC severity=WARNING msgID=-1 msg=LEE Adding unreachable RS lee2.internal.forgerock.com:8989
[24/Sep/2018:12:00:26 -0600] category=SYNC severity=WARNING msgID=-1 msg=LEE Adding unreachable RS lee2.internal.forgerock.com:8989

 

3) Stop DS+RS2

 

[24/Sep/2018:12:01:36 -0600] category=CORE severity=NOTICE msgID=203 msg=The Directory Server is now stopped

Step 4) Restart DS+RS2.

 

Just like with OPENDJ-1135, on restart, the DS cannot connect to the RS for any of the backends.

 

[24/Sep/2018:12:02:11 -0600] category=CORE severity=NOTICE msgID=134 msg=ForgeRock Directory Services 6.5.0-SNAPSHOT (build 20180924184829, revision number 1a2b1c3c8d722ce56e5fd3ba37923b7122900bcb) starting up
[24/Sep/2018:12:02:11 -0600] category=JVM severity=NOTICE msgID=21 msg=Installation Directory: /opt/opendj
[24/Sep/2018:12:02:11 -0600] category=JVM severity=NOTICE msgID=23 msg=Instance Directory: /opt/opendj
[24/Sep/2018:12:02:11 -0600] category=JVM severity=NOTICE msgID=17 msg=JVM Information: 1.8.0_51-b16 by Oracle Corporation, 64-bit architecture, 1333264384 bytes heap size
[24/Sep/2018:12:02:11 -0600] category=JVM severity=NOTICE msgID=18 msg=JVM Host: lee3.internal.forgerock.com, running Linux 3.10.0-514.el7.x86_64 amd64, 5997084672 bytes physical memory size, number of processors available 4
[24/Sep/2018:12:02:11 -0600] category=JVM severity=NOTICE msgID=19 msg=JVM Arguments: "-Dorg.opends.server.scriptName=start-ds"
[24/Sep/2018:12:02:14 -0600] category=BACKEND severity=NOTICE msgID=513 msg=The database backend userRoot containing 100002 entries has started
[24/Sep/2018:12:02:14 -0600] category=EXTENSIONS severity=NOTICE msgID=221 msg=DIGEST-MD5 SASL mechanism using a server fully qualified domain name of: lee3.internal.forgerock.com
[24/Sep/2018:12:02:15 -0600] category=SYNC severity=NOTICE msgID=204 msg=Replication server RS(23059) started listening for new connections on address 0.0.0.0 port 8989
[24/Sep/2018:12:02:23 -0600] category=SYNC severity=WARNING msgID=208 msg=Directory server DS(23515) was unable to connect to any replication servers for domain "cn=schema"
[24/Sep/2018:12:02:29 -0600] category=SYNC severity=WARNING msgID=208 msg=Directory server DS(31958) was unable to connect to any replication servers for domain "cn=admin data"
[24/Sep/2018:12:02:33 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 23515 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
[24/Sep/2018:12:02:38 -0600] category=SYNC severity=WARNING msgID=208 msg=Directory server DS(18212) was unable to connect to any replication servers for domain "dc=example,dc=com"
[24/Sep/2018:12:02:38 -0600] category=PROTOCOL severity=NOTICE msgID=276 msg=Started listening for new connections on Administration Connector 0.0.0.0 port 4444
[24/Sep/2018:12:02:38 -0600] category=PROTOCOL severity=NOTICE msgID=276 msg=Started listening for new connections on LDAP 0.0.0.0 port 1389
[24/Sep/2018:12:02:38 -0600] category=PROTOCOL severity=NOTICE msgID=276 msg=Started listening for new connections on LDAPS 0.0.0.0 port 1636
[24/Sep/2018:12:02:38 -0600] category=CORE severity=NOTICE msgID=135 msg=The Directory Server has started successfully

 

After the instance starts, the following is logged every 10 seconds.

 

[24/Sep/2018:12:02:42 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 18212 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
[24/Sep/2018:12:02:51 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 31958 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
[24/Sep/2018:12:03:00 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 31958 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
[24/Sep/2018:12:03:09 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 23515 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
[24/Sep/2018:12:03:18 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 23515 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
[24/Sep/2018:12:03:27 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 23515 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
[24/Sep/2018:12:03:36 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 23515 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
[24/Sep/2018:12:03:45 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 18212 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
[24/Sep/2018:12:03:54 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 18212 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)

 

Bonus Step 5) Start VM1 (lee2)

If I restart VM1, the instance on VM2 see's the sockets are available and VM'2 DS (lee2) can now connect to its own DS (see the last 3 lines).

Note: I only started the system for VM1, I did not start the DS+RS on VM1.  This also mirrors the behavior seen with OPENDJ-1135; i.e. as long as the other VM was online, the restart of the second VM's DS instance was always able to reconnect to its own RS.

 

[24/Sep/2018:12:04:03 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 31958 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
[24/Sep/2018:12:04:12 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 31958 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
[24/Sep/2018:12:04:21 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 31958 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
[24/Sep/2018:12:04:28 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 31958 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
[24/Sep/2018:12:04:36 -0600] category=SYNC severity=NOTICE msgID=62 msg=Directory server DS(23515) has connected to replication server RS(23059) for domain "cn=schema" at 172.16.207.111:8989 with generation ID 8408
[24/Sep/2018:12:04:40 -0600] category=SYNC severity=NOTICE msgID=62 msg=Directory server DS(18212) has connected to replication server RS(23059) for domain "dc=example,dc=com" at 172.16.207.111:8989 with generation ID 19363724
[24/Sep/2018:12:04:46 -0600] category=SYNC severity=NOTICE msgID=62 msg=Directory server DS(31958) has connected to replication server RS(23059) for domain "cn=admin data" at 172.16.207.111:8989 with generation ID 161177

 

WORKAROUND:

Don't stop/restart the only remaining DS+RS in the topology until at least one other VM is back online.

 

I've attached 3 Support Extracts.

  • opendj-support-data-20180924-115825.zip: after all is set up and functioning
  • opendj-support-data-20180924-120338.zip: after stopping VM1 and restarting DS+RS2.
  • opendj-support-data-20180924-120518.zip: after starting VM1

 



 Comments   
Comment by Fabio Pistolesi [ 25/Sep/18 ]

Right! The difference in tests is the unreachable host address.

  • In the case of the two VMs, the address is on the same subnet, so when the host is down, the error is ICMP host unreachable, as the ARP address cannot be resolved.
  • In the case of the test, the address is on a different network, so the router has responsibility of forwarding the packet, hence timeout.

The fix is to consider all types of errors when setting the server as unreachable.

Comment by Viktor Nawrath [X] (Inactive) [ 08/Nov/18 ]

Verified using 6.5.0-RC3 f9bb4a1890d

Generated at Sat Oct 24 06:27:52 UTC 2020 using Jira 7.13.12#713012-sha1:6e07c38070d5191bbf7353952ed38f111754533a.