[OPENDJ-5496] DS fails to reconnect to an RS, disconnecting in handshake phase, after system restart Created: 24/Sep/18 Updated: 15/Apr/20 Resolved: 25/Sep/18 |
|
Status: | Done |
Project: | OpenDJ |
Component/s: | replication |
Affects Version/s: | 6.5.0, 6.0.0, 5.5.0, 5.5.1, 3.5.3 |
Fix Version/s: | 6.5.0 |
Type: | Bug | Priority: | Critical |
Reporter: | Lee Trujillo | Assignee: | Fabio Pistolesi |
Resolution: | Fixed | Votes: | 0 |
Labels: | Verified, release-notes | ||
Environment: |
2 x Linux VM's with 2 x DS+RS's |
Attachments: |
![]() ![]() ![]() |
||||||||||||||||||||||||
Issue Links: |
|
||||||||||||||||||||||||
Epic Link: | Bugs 6.5 | ||||||||||||||||||||||||
Story Points: | 0.5 | ||||||||||||||||||||||||
QA Assignee: | Viktor Nawrath [X] (Inactive) | ||||||||||||||||||||||||
Support Ticket IDs: | |||||||||||||||||||||||||
Backports: | OPENDJ-5767 (3.5.4) OPENDJ-5768 (5.5.3) OPENDJ-7103 (6.0.1) |
Description |
The following exhibits the same symptoms as seen with Steps to reproduce are simple.
Once the above is complete, use the following basic steps.
Example: The following uses my own internal instances named lee2 (VM1) and lee3 (VM2). Step 1) Stop DS+RS1. Stop the DS+RS1 (on VM1 - lee2) instance at 24/Sep/2018:11:58:52
[root@lee2 bin]# ./stop-ds Stopping Server... [24/Sep/2018:11:58:52 -0600] category=BACKEND severity=NOTICE msgID=370 msg=The backend userRoot is now taken offline [24/Sep/2018:11:58:52 -0600] category=CORE severity=NOTICE msgID=203 msg=The Directory Server is now stopped Step 2) Stop VM1 (which houses DS+RS1)
I issued the stop for VM1 just before 24/Sep/2018:12:00:06. You can see VM1 (lee2) has been added to the "unreachable RS" list on the logging from VM2.
[24/Sep/2018:12:00:06 -0600] category=SYNC severity=WARNING msgID=-1 msg=LEE Adding unreachable RS lee2.internal.forgerock.com:8989 [24/Sep/2018:12:00:26 -0600] category=SYNC severity=WARNING msgID=-1 msg=LEE Adding unreachable RS lee2.internal.forgerock.com:8989
3) Stop DS+RS2
[24/Sep/2018:12:01:36 -0600] category=CORE severity=NOTICE msgID=203 msg=The Directory Server is now stopped Step 4) Restart DS+RS2.
Just like with
[24/Sep/2018:12:02:11 -0600] category=CORE severity=NOTICE msgID=134 msg=ForgeRock Directory Services 6.5.0-SNAPSHOT (build 20180924184829, revision number 1a2b1c3c8d722ce56e5fd3ba37923b7122900bcb) starting up [24/Sep/2018:12:02:11 -0600] category=JVM severity=NOTICE msgID=21 msg=Installation Directory: /opt/opendj [24/Sep/2018:12:02:11 -0600] category=JVM severity=NOTICE msgID=23 msg=Instance Directory: /opt/opendj [24/Sep/2018:12:02:11 -0600] category=JVM severity=NOTICE msgID=17 msg=JVM Information: 1.8.0_51-b16 by Oracle Corporation, 64-bit architecture, 1333264384 bytes heap size [24/Sep/2018:12:02:11 -0600] category=JVM severity=NOTICE msgID=18 msg=JVM Host: lee3.internal.forgerock.com, running Linux 3.10.0-514.el7.x86_64 amd64, 5997084672 bytes physical memory size, number of processors available 4 [24/Sep/2018:12:02:11 -0600] category=JVM severity=NOTICE msgID=19 msg=JVM Arguments: "-Dorg.opends.server.scriptName=start-ds" [24/Sep/2018:12:02:14 -0600] category=BACKEND severity=NOTICE msgID=513 msg=The database backend userRoot containing 100002 entries has started [24/Sep/2018:12:02:14 -0600] category=EXTENSIONS severity=NOTICE msgID=221 msg=DIGEST-MD5 SASL mechanism using a server fully qualified domain name of: lee3.internal.forgerock.com [24/Sep/2018:12:02:15 -0600] category=SYNC severity=NOTICE msgID=204 msg=Replication server RS(23059) started listening for new connections on address 0.0.0.0 port 8989 [24/Sep/2018:12:02:23 -0600] category=SYNC severity=WARNING msgID=208 msg=Directory server DS(23515) was unable to connect to any replication servers for domain "cn=schema" [24/Sep/2018:12:02:29 -0600] category=SYNC severity=WARNING msgID=208 msg=Directory server DS(31958) was unable to connect to any replication servers for domain "cn=admin data" [24/Sep/2018:12:02:33 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 23515 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe) [24/Sep/2018:12:02:38 -0600] category=SYNC severity=WARNING msgID=208 msg=Directory server DS(18212) was unable to connect to any replication servers for domain "dc=example,dc=com" [24/Sep/2018:12:02:38 -0600] category=PROTOCOL severity=NOTICE msgID=276 msg=Started listening for new connections on Administration Connector 0.0.0.0 port 4444 [24/Sep/2018:12:02:38 -0600] category=PROTOCOL severity=NOTICE msgID=276 msg=Started listening for new connections on LDAP 0.0.0.0 port 1389 [24/Sep/2018:12:02:38 -0600] category=PROTOCOL severity=NOTICE msgID=276 msg=Started listening for new connections on LDAPS 0.0.0.0 port 1636 [24/Sep/2018:12:02:38 -0600] category=CORE severity=NOTICE msgID=135 msg=The Directory Server has started successfully
After the instance starts, the following is logged every 10 seconds.
[24/Sep/2018:12:02:42 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 18212 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe) [24/Sep/2018:12:02:51 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 31958 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe) [24/Sep/2018:12:03:00 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 31958 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe) [24/Sep/2018:12:03:09 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 23515 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe) [24/Sep/2018:12:03:18 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 23515 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe) [24/Sep/2018:12:03:27 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 23515 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe) [24/Sep/2018:12:03:36 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 23515 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe) [24/Sep/2018:12:03:45 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 18212 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe) [24/Sep/2018:12:03:54 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 18212 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
Bonus Step 5) Start VM1 (lee2) If I restart VM1, the instance on VM2 see's the sockets are available and VM'2 DS (lee2) can now connect to its own DS (see the last 3 lines). Note: I only started the system for VM1, I did not start the DS+RS on VM1. This also mirrors the behavior seen with
[24/Sep/2018:12:04:03 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 31958 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe) [24/Sep/2018:12:04:12 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 31958 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe) [24/Sep/2018:12:04:21 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 31958 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe) [24/Sep/2018:12:04:28 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 31958 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe) [24/Sep/2018:12:04:36 -0600] category=SYNC severity=NOTICE msgID=62 msg=Directory server DS(23515) has connected to replication server RS(23059) for domain "cn=schema" at 172.16.207.111:8989 with generation ID 8408 [24/Sep/2018:12:04:40 -0600] category=SYNC severity=NOTICE msgID=62 msg=Directory server DS(18212) has connected to replication server RS(23059) for domain "dc=example,dc=com" at 172.16.207.111:8989 with generation ID 19363724 [24/Sep/2018:12:04:46 -0600] category=SYNC severity=NOTICE msgID=62 msg=Directory server DS(31958) has connected to replication server RS(23059) for domain "cn=admin data" at 172.16.207.111:8989 with generation ID 161177
WORKAROUND: Don't stop/restart the only remaining DS+RS in the topology until at least one other VM is back online.
I've attached 3 Support Extracts.
|
Comments |
Comment by Fabio Pistolesi [ 25/Sep/18 ] |
Right! The difference in tests is the unreachable host address.
The fix is to consider all types of errors when setting the server as unreachable. |
Comment by Viktor Nawrath [X] (Inactive) [ 08/Nov/18 ] |
Verified using 6.5.0-RC3 f9bb4a1890d |