[OPENDJ-7103] Backport OPENDJ-5496: DS fails to reconnect to an RS, disconnecting in handshake phase, after system restart Created: 15/Apr/20  Updated: 16/Apr/20  Resolved: 16/Apr/20

Status: Done
Project: OpenDJ
Component/s: replication
Affects Version/s: 6.5.0, 6.0.0, 5.5.0, 5.5.1, 3.5.3
Fix Version/s: 6.0.1

Type: Bug Priority: Critical
Reporter: Lee Trujillo Assignee: Lee Trujillo
Resolution: Won't Fix Votes: 0
Labels: Verified, release-notes
Environment:

2 x Linux VM's with 2 x DS+RS's


Issue Links:
Backport
is a backport of OPENDJ-5496 DS fails to reconnect to an RS, disco... Done
Story Points: 0.5
Dev Assignee: Lee Trujillo
QA Assignee: Viktor Nawrath [X] (Inactive)

 Description   

The following exhibits the same symptoms as seen with OPENDJ-1135 when the remote System is stopped and the remaining DS+RS is restarted.

Steps to reproduce are simple. 

  • Setup 2 VM's with a DS+RS on each.
  • Enable basic MMR and initialize one from the other.

Once the above is complete, use the following basic steps.

  1. Stop DS 1
  2. Stop VM 1
  3. Stop DS 2
  4. Start DS 2

Example:

The following uses my own internal instances named lee2 (VM1) and lee3 (VM2).

Step 1) Stop DS+RS1.

Stop the DS+RS1 (on VM1 - lee2) instance at 24/Sep/2018:11:58:52

 

[root@lee2 bin]# ./stop-ds
Stopping Server...
[24/Sep/2018:11:58:52 -0600] category=BACKEND severity=NOTICE msgID=370 msg=The backend userRoot is now taken offline
[24/Sep/2018:11:58:52 -0600] category=CORE severity=NOTICE msgID=203 msg=The Directory Server is now stopped

Step 2) Stop VM1 (which houses DS+RS1)

 

I issued the stop for VM1 just before 24/Sep/2018:12:00:06.  You can see VM1 (lee2) has been added to the "unreachable RS" list on the logging from VM2.

 

[24/Sep/2018:12:00:06 -0600] category=SYNC severity=WARNING msgID=-1 msg=LEE Adding unreachable RS lee2.internal.forgerock.com:8989
[24/Sep/2018:12:00:26 -0600] category=SYNC severity=WARNING msgID=-1 msg=LEE Adding unreachable RS lee2.internal.forgerock.com:8989

 

3) Stop DS+RS2

 

[24/Sep/2018:12:01:36 -0600] category=CORE severity=NOTICE msgID=203 msg=The Directory Server is now stopped

Step 4) Restart DS+RS2.

 

Just like with OPENDJ-1135, on restart, the DS cannot connect to the RS for any of the backends.

 

[24/Sep/2018:12:02:11 -0600] category=CORE severity=NOTICE msgID=134 msg=ForgeRock Directory Services 6.5.0-SNAPSHOT (build 20180924184829, revision number 1a2b1c3c8d722ce56e5fd3ba37923b7122900bcb) starting up
[24/Sep/2018:12:02:11 -0600] category=JVM severity=NOTICE msgID=21 msg=Installation Directory: /opt/opendj
[24/Sep/2018:12:02:11 -0600] category=JVM severity=NOTICE msgID=23 msg=Instance Directory: /opt/opendj
[24/Sep/2018:12:02:11 -0600] category=JVM severity=NOTICE msgID=17 msg=JVM Information: 1.8.0_51-b16 by Oracle Corporation, 64-bit architecture, 1333264384 bytes heap size
[24/Sep/2018:12:02:11 -0600] category=JVM severity=NOTICE msgID=18 msg=JVM Host: lee3.internal.forgerock.com, running Linux 3.10.0-514.el7.x86_64 amd64, 5997084672 bytes physical memory size, number of processors available 4
[24/Sep/2018:12:02:11 -0600] category=JVM severity=NOTICE msgID=19 msg=JVM Arguments: "-Dorg.opends.server.scriptName=start-ds"
[24/Sep/2018:12:02:14 -0600] category=BACKEND severity=NOTICE msgID=513 msg=The database backend userRoot containing 100002 entries has started
[24/Sep/2018:12:02:14 -0600] category=EXTENSIONS severity=NOTICE msgID=221 msg=DIGEST-MD5 SASL mechanism using a server fully qualified domain name of: lee3.internal.forgerock.com
[24/Sep/2018:12:02:15 -0600] category=SYNC severity=NOTICE msgID=204 msg=Replication server RS(23059) started listening for new connections on address 0.0.0.0 port 8989
[24/Sep/2018:12:02:23 -0600] category=SYNC severity=WARNING msgID=208 msg=Directory server DS(23515) was unable to connect to any replication servers for domain "cn=schema"
[24/Sep/2018:12:02:29 -0600] category=SYNC severity=WARNING msgID=208 msg=Directory server DS(31958) was unable to connect to any replication servers for domain "cn=admin data"
[24/Sep/2018:12:02:33 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 23515 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
[24/Sep/2018:12:02:38 -0600] category=SYNC severity=WARNING msgID=208 msg=Directory server DS(18212) was unable to connect to any replication servers for domain "dc=example,dc=com"
[24/Sep/2018:12:02:38 -0600] category=PROTOCOL severity=NOTICE msgID=276 msg=Started listening for new connections on Administration Connector 0.0.0.0 port 4444
[24/Sep/2018:12:02:38 -0600] category=PROTOCOL severity=NOTICE msgID=276 msg=Started listening for new connections on LDAP 0.0.0.0 port 1389
[24/Sep/2018:12:02:38 -0600] category=PROTOCOL severity=NOTICE msgID=276 msg=Started listening for new connections on LDAPS 0.0.0.0 port 1636
[24/Sep/2018:12:02:38 -0600] category=CORE severity=NOTICE msgID=135 msg=The Directory Server has started successfully

 

After the instance starts, the following is logged every 10 seconds.

 

[24/Sep/2018:12:02:42 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 18212 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
[24/Sep/2018:12:02:51 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 31958 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
[24/Sep/2018:12:03:00 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 31958 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
[24/Sep/2018:12:03:09 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 23515 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
[24/Sep/2018:12:03:18 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 23515 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
[24/Sep/2018:12:03:27 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 23515 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
[24/Sep/2018:12:03:36 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 23515 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
[24/Sep/2018:12:03:45 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 18212 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
[24/Sep/2018:12:03:54 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 18212 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)

 

Bonus Step 5) Start VM1 (lee2)

If I restart VM1, the instance on VM2 see's the sockets are available and VM'2 DS (lee2) can now connect to its own DS (see the last 3 lines).

Note: I only started the system for VM1, I did not start the DS+RS on VM1.  This also mirrors the behavior seen with OPENDJ-1135; i.e. as long as the other VM was online, the restart of the second VM's DS instance was always able to reconnect to its own RS.

 

[24/Sep/2018:12:04:03 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 31958 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
[24/Sep/2018:12:04:12 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 31958 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
[24/Sep/2018:12:04:21 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 31958 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
[24/Sep/2018:12:04:28 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 31958 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
[24/Sep/2018:12:04:36 -0600] category=SYNC severity=NOTICE msgID=62 msg=Directory server DS(23515) has connected to replication server RS(23059) for domain "cn=schema" at 172.16.207.111:8989 with generation ID 8408
[24/Sep/2018:12:04:40 -0600] category=SYNC severity=NOTICE msgID=62 msg=Directory server DS(18212) has connected to replication server RS(23059) for domain "dc=example,dc=com" at 172.16.207.111:8989 with generation ID 19363724
[24/Sep/2018:12:04:46 -0600] category=SYNC severity=NOTICE msgID=62 msg=Directory server DS(31958) has connected to replication server RS(23059) for domain "cn=admin data" at 172.16.207.111:8989 with generation ID 161177

 

WORKAROUND:

Don't stop/restart the only remaining DS+RS in the topology until at least one other VM is back online.

 

I've attached 3 Support Extracts.

  • opendj-support-data-20180924-115825.zip: after all is set up and functioning
  • opendj-support-data-20180924-120338.zip: after stopping VM1 and restarting DS+RS2.
  • opendj-support-data-20180924-120518.zip: after starting VM1

 



 Comments   
Comment by Lee Trujillo [ 16/Apr/20 ]

There were 8 massive unit test failures that cannot be overcome.  Chris hit these failures in a previous backport attempt.

If a customer hits this issue on 6.0.0, their only option is to upgrade.

Generated at Tue Nov 24 21:17:11 UTC 2020 using Jira 7.13.12#713012-sha1:6e07c38070d5191bbf7353952ed38f111754533a.