-
Type:
Bug
-
Status: Done
-
Priority:
Critical
-
Resolution: Fixed
-
Affects Version/s: 6.5.0, 6.0.0, 5.5.0, 5.5.1, 3.5.3
-
Fix Version/s: 6.5.0
-
Component/s: replication
-
Labels:
-
Environment:2 x Linux VM's with 2 x DS+RS's
-
Epic Link:
-
Story Points:0.5
-
Support Ticket IDs:
The following exhibits the same symptoms as seen with when the remote System is stopped and the remaining DS+RS is restarted.OPENDJ-1135
Steps to reproduce are simple.
- Setup 2 VM's with a DS+RS on each.
- Enable basic MMR and initialize one from the other.
Once the above is complete, use the following basic steps.
- Stop DS 1
- Stop VM 1
- Stop DS 2
- Start DS 2
Example:
The following uses my own internal instances named lee2 (VM1) and lee3 (VM2).
Step 1) Stop DS+RS1.
Stop the DS+RS1 (on VM1 - lee2) instance at 24/Sep/2018:11:58:52
[root@lee2 bin]# ./stop-ds Stopping Server... [24/Sep/2018:11:58:52 -0600] category=BACKEND severity=NOTICE msgID=370 msg=The backend userRoot is now taken offline [24/Sep/2018:11:58:52 -0600] category=CORE severity=NOTICE msgID=203 msg=The Directory Server is now stopped
Step 2) Stop VM1 (which houses DS+RS1)
I issued the stop for VM1 just before 24/Sep/2018:12:00:06. You can see VM1 (lee2) has been added to the "unreachable RS" list on the logging from VM2.
[24/Sep/2018:12:00:06 -0600] category=SYNC severity=WARNING msgID=-1 msg=LEE Adding unreachable RS lee2.internal.forgerock.com:8989 [24/Sep/2018:12:00:26 -0600] category=SYNC severity=WARNING msgID=-1 msg=LEE Adding unreachable RS lee2.internal.forgerock.com:8989
3) Stop DS+RS2
[24/Sep/2018:12:01:36 -0600] category=CORE severity=NOTICE msgID=203 msg=The Directory Server is now stopped
Step 4) Restart DS+RS2.
Just like with , on restart, the DS cannot connect to the RS for any of the backends.OPENDJ-1135
[24/Sep/2018:12:02:11 -0600] category=CORE severity=NOTICE msgID=134 msg=ForgeRock Directory Services 6.5.0-SNAPSHOT (build 20180924184829, revision number 1a2b1c3c8d722ce56e5fd3ba37923b7122900bcb) starting up [24/Sep/2018:12:02:11 -0600] category=JVM severity=NOTICE msgID=21 msg=Installation Directory: /opt/opendj [24/Sep/2018:12:02:11 -0600] category=JVM severity=NOTICE msgID=23 msg=Instance Directory: /opt/opendj [24/Sep/2018:12:02:11 -0600] category=JVM severity=NOTICE msgID=17 msg=JVM Information: 1.8.0_51-b16 by Oracle Corporation, 64-bit architecture, 1333264384 bytes heap size [24/Sep/2018:12:02:11 -0600] category=JVM severity=NOTICE msgID=18 msg=JVM Host: lee3.internal.forgerock.com, running Linux 3.10.0-514.el7.x86_64 amd64, 5997084672 bytes physical memory size, number of processors available 4 [24/Sep/2018:12:02:11 -0600] category=JVM severity=NOTICE msgID=19 msg=JVM Arguments: "-Dorg.opends.server.scriptName=start-ds" [24/Sep/2018:12:02:14 -0600] category=BACKEND severity=NOTICE msgID=513 msg=The database backend userRoot containing 100002 entries has started [24/Sep/2018:12:02:14 -0600] category=EXTENSIONS severity=NOTICE msgID=221 msg=DIGEST-MD5 SASL mechanism using a server fully qualified domain name of: lee3.internal.forgerock.com [24/Sep/2018:12:02:15 -0600] category=SYNC severity=NOTICE msgID=204 msg=Replication server RS(23059) started listening for new connections on address 0.0.0.0 port 8989 [24/Sep/2018:12:02:23 -0600] category=SYNC severity=WARNING msgID=208 msg=Directory server DS(23515) was unable to connect to any replication servers for domain "cn=schema" [24/Sep/2018:12:02:29 -0600] category=SYNC severity=WARNING msgID=208 msg=Directory server DS(31958) was unable to connect to any replication servers for domain "cn=admin data" [24/Sep/2018:12:02:33 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 23515 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe) [24/Sep/2018:12:02:38 -0600] category=SYNC severity=WARNING msgID=208 msg=Directory server DS(18212) was unable to connect to any replication servers for domain "dc=example,dc=com" [24/Sep/2018:12:02:38 -0600] category=PROTOCOL severity=NOTICE msgID=276 msg=Started listening for new connections on Administration Connector 0.0.0.0 port 4444 [24/Sep/2018:12:02:38 -0600] category=PROTOCOL severity=NOTICE msgID=276 msg=Started listening for new connections on LDAP 0.0.0.0 port 1389 [24/Sep/2018:12:02:38 -0600] category=PROTOCOL severity=NOTICE msgID=276 msg=Started listening for new connections on LDAPS 0.0.0.0 port 1636 [24/Sep/2018:12:02:38 -0600] category=CORE severity=NOTICE msgID=135 msg=The Directory Server has started successfully
After the instance starts, the following is logged every 10 seconds.
[24/Sep/2018:12:02:42 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 18212 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe) [24/Sep/2018:12:02:51 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 31958 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe) [24/Sep/2018:12:03:00 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 31958 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe) [24/Sep/2018:12:03:09 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 23515 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe) [24/Sep/2018:12:03:18 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 23515 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe) [24/Sep/2018:12:03:27 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 23515 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe) [24/Sep/2018:12:03:36 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 23515 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe) [24/Sep/2018:12:03:45 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 18212 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe) [24/Sep/2018:12:03:54 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 18212 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
Bonus Step 5) Start VM1 (lee2)
If I restart VM1, the instance on VM2 see's the sockets are available and VM'2 DS (lee2) can now connect to its own DS (see the last 3 lines).
Note: I only started the system for VM1, I did not start the DS+RS on VM1. This also mirrors the behavior seen with ; i.e. as long as the other VM was online, the restart of the second VM's DS instance was always able to reconnect to its own RS.OPENDJ-1135
[24/Sep/2018:12:04:03 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 31958 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe) [24/Sep/2018:12:04:12 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 31958 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe) [24/Sep/2018:12:04:21 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 31958 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe) [24/Sep/2018:12:04:28 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 31958 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe) [24/Sep/2018:12:04:36 -0600] category=SYNC severity=NOTICE msgID=62 msg=Directory server DS(23515) has connected to replication server RS(23059) for domain "cn=schema" at 172.16.207.111:8989 with generation ID 8408 [24/Sep/2018:12:04:40 -0600] category=SYNC severity=NOTICE msgID=62 msg=Directory server DS(18212) has connected to replication server RS(23059) for domain "dc=example,dc=com" at 172.16.207.111:8989 with generation ID 19363724 [24/Sep/2018:12:04:46 -0600] category=SYNC severity=NOTICE msgID=62 msg=Directory server DS(31958) has connected to replication server RS(23059) for domain "cn=admin data" at 172.16.207.111:8989 with generation ID 161177
WORKAROUND:
Don't stop/restart the only remaining DS+RS in the topology until at least one other VM is back online.
I've attached 3 Support Extracts.
- opendj-support-data-20180924-115825.zip: after all is set up and functioning
- opendj-support-data-20180924-120338.zip: after stopping VM1 and restarting DS+RS2.
- opendj-support-data-20180924-120518.zip: after starting VM1
- is backported by
-
OPENDJ-5767 Backport OPENDJ-5496: DS fails to reconnect to an RS, disconnecting in handshake phase, after system restart
-
- Done
-
-
OPENDJ-5768 Backport OPENDJ-5496: DS fails to reconnect to an RS, disconnecting in handshake phase, after system restart
-
- Done
-
-
OPENDJ-7103 Backport OPENDJ-5496: DS fails to reconnect to an RS, disconnecting in handshake phase, after system restart
-
- Done
-
- relates to
-
OPENDJ-1135 DS sometimes fails to connect to RS after server restart
-
- Done
-