Uploaded image for project: 'OpenDJ'
  1. OpenDJ
  2. OPENDJ-5768

Backport OPENDJ-5496: DS fails to reconnect to an RS, disconnecting in handshake phase, after system restart

    Details

    • Type: Bug
    • Status: Done
    • Priority: Critical
    • Resolution: Won't Fix
    • Affects Version/s: 6.5.0, 6.0.0, 5.5.1, 5.5.0
    • Fix Version/s: 5.5.3
    • Component/s: replication
    • Labels:
    • Environment:
      2 x Linux VM's with 2 x DS+RS's
    • Story Points:
      0.5
    • Support Ticket IDs:

      Description

      The following exhibits the same symptoms as seen with OPENDJ-1135 when the remote System is stopped and the remaining DS+RS is restarted.

      Steps to reproduce are simple. 

      • Setup 2 VM's with a DS+RS on each.
      • Enable basic MMR and initialize one from the other.

      Once the above is complete, use the following basic steps.

      1. Stop DS 1
      2. Stop VM 1
      3. Stop DS 2
      4. Start DS 2

      Example:

      The following uses my own internal instances named lee2 (VM1) and lee3 (VM2).

      Step 1) Stop DS+RS1.

      Stop the DS+RS1 (on VM1 - lee2) instance at 24/Sep/2018:11:58:52

       

      [root@lee2 bin]# ./stop-ds
      Stopping Server...
      [24/Sep/2018:11:58:52 -0600] category=BACKEND severity=NOTICE msgID=370 msg=The backend userRoot is now taken offline
      [24/Sep/2018:11:58:52 -0600] category=CORE severity=NOTICE msgID=203 msg=The Directory Server is now stopped

      Step 2) Stop VM1 (which houses DS+RS1)

       

      I issued the stop for VM1 just before 24/Sep/2018:12:00:06.  You can see VM1 (lee2) has been added to the "unreachable RS" list on the logging from VM2.

       

      [24/Sep/2018:12:00:06 -0600] category=SYNC severity=WARNING msgID=-1 msg=LEE Adding unreachable RS lee2.internal.forgerock.com:8989
      [24/Sep/2018:12:00:26 -0600] category=SYNC severity=WARNING msgID=-1 msg=LEE Adding unreachable RS lee2.internal.forgerock.com:8989

       

      3) Stop DS+RS2

       

      [24/Sep/2018:12:01:36 -0600] category=CORE severity=NOTICE msgID=203 msg=The Directory Server is now stopped

      Step 4) Restart DS+RS2.

       

      Just like with OPENDJ-1135, on restart, the DS cannot connect to the RS for any of the backends.

       

      [24/Sep/2018:12:02:11 -0600] category=CORE severity=NOTICE msgID=134 msg=ForgeRock Directory Services 6.5.0-SNAPSHOT (build 20180924184829, revision number 1a2b1c3c8d722ce56e5fd3ba37923b7122900bcb) starting up
      [24/Sep/2018:12:02:11 -0600] category=JVM severity=NOTICE msgID=21 msg=Installation Directory: /opt/opendj
      [24/Sep/2018:12:02:11 -0600] category=JVM severity=NOTICE msgID=23 msg=Instance Directory: /opt/opendj
      [24/Sep/2018:12:02:11 -0600] category=JVM severity=NOTICE msgID=17 msg=JVM Information: 1.8.0_51-b16 by Oracle Corporation, 64-bit architecture, 1333264384 bytes heap size
      [24/Sep/2018:12:02:11 -0600] category=JVM severity=NOTICE msgID=18 msg=JVM Host: lee3.internal.forgerock.com, running Linux 3.10.0-514.el7.x86_64 amd64, 5997084672 bytes physical memory size, number of processors available 4
      [24/Sep/2018:12:02:11 -0600] category=JVM severity=NOTICE msgID=19 msg=JVM Arguments: "-Dorg.opends.server.scriptName=start-ds"
      [24/Sep/2018:12:02:14 -0600] category=BACKEND severity=NOTICE msgID=513 msg=The database backend userRoot containing 100002 entries has started
      [24/Sep/2018:12:02:14 -0600] category=EXTENSIONS severity=NOTICE msgID=221 msg=DIGEST-MD5 SASL mechanism using a server fully qualified domain name of: lee3.internal.forgerock.com
      [24/Sep/2018:12:02:15 -0600] category=SYNC severity=NOTICE msgID=204 msg=Replication server RS(23059) started listening for new connections on address 0.0.0.0 port 8989
      [24/Sep/2018:12:02:23 -0600] category=SYNC severity=WARNING msgID=208 msg=Directory server DS(23515) was unable to connect to any replication servers for domain "cn=schema"
      [24/Sep/2018:12:02:29 -0600] category=SYNC severity=WARNING msgID=208 msg=Directory server DS(31958) was unable to connect to any replication servers for domain "cn=admin data"
      [24/Sep/2018:12:02:33 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 23515 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
      [24/Sep/2018:12:02:38 -0600] category=SYNC severity=WARNING msgID=208 msg=Directory server DS(18212) was unable to connect to any replication servers for domain "dc=example,dc=com"
      [24/Sep/2018:12:02:38 -0600] category=PROTOCOL severity=NOTICE msgID=276 msg=Started listening for new connections on Administration Connector 0.0.0.0 port 4444
      [24/Sep/2018:12:02:38 -0600] category=PROTOCOL severity=NOTICE msgID=276 msg=Started listening for new connections on LDAP 0.0.0.0 port 1389
      [24/Sep/2018:12:02:38 -0600] category=PROTOCOL severity=NOTICE msgID=276 msg=Started listening for new connections on LDAPS 0.0.0.0 port 1636
      [24/Sep/2018:12:02:38 -0600] category=CORE severity=NOTICE msgID=135 msg=The Directory Server has started successfully

       

      After the instance starts, the following is logged every 10 seconds.

       

      [24/Sep/2018:12:02:42 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 18212 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
      [24/Sep/2018:12:02:51 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 31958 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
      [24/Sep/2018:12:03:00 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 31958 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
      [24/Sep/2018:12:03:09 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 23515 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
      [24/Sep/2018:12:03:18 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 23515 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
      [24/Sep/2018:12:03:27 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 23515 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
      [24/Sep/2018:12:03:36 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 23515 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
      [24/Sep/2018:12:03:45 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 18212 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
      [24/Sep/2018:12:03:54 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 18212 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)

       

      Bonus Step 5) Start VM1 (lee2)

      If I restart VM1, the instance on VM2 see's the sockets are available and VM'2 DS (lee2) can now connect to its own DS (see the last 3 lines).

      Note: I only started the system for VM1, I did not start the DS+RS on VM1.  This also mirrors the behavior seen with OPENDJ-1135; i.e. as long as the other VM was online, the restart of the second VM's DS instance was always able to reconnect to its own RS.

       

      [24/Sep/2018:12:04:03 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 31958 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
      [24/Sep/2018:12:04:12 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 31958 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
      [24/Sep/2018:12:04:21 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 31958 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
      [24/Sep/2018:12:04:28 -0600] category=SYNC severity=ERROR msgID=178 msg=Directory server 31958 was attempting to connect to replication server 23059 but has disconnected in handshake phase. Error: SocketException(Broken pipe)
      [24/Sep/2018:12:04:36 -0600] category=SYNC severity=NOTICE msgID=62 msg=Directory server DS(23515) has connected to replication server RS(23059) for domain "cn=schema" at 172.16.207.111:8989 with generation ID 8408
      [24/Sep/2018:12:04:40 -0600] category=SYNC severity=NOTICE msgID=62 msg=Directory server DS(18212) has connected to replication server RS(23059) for domain "dc=example,dc=com" at 172.16.207.111:8989 with generation ID 19363724
      [24/Sep/2018:12:04:46 -0600] category=SYNC severity=NOTICE msgID=62 msg=Directory server DS(31958) has connected to replication server RS(23059) for domain "cn=admin data" at 172.16.207.111:8989 with generation ID 161177

       

      WORKAROUND:

      Don't stop/restart the only remaining DS+RS in the topology until at least one other VM is back online.

       

      I've attached 3 Support Extracts.

      • opendj-support-data-20180924-115825.zip: after all is set up and functioning
      • opendj-support-data-20180924-120338.zip: after stopping VM1 and restarting DS+RS2.
      • opendj-support-data-20180924-120518.zip: after starting VM1

       

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                cjr Chris Ridd
                Reporter:
                cjr Chris Ridd
                Dev Assignee:
                Chris Ridd
                QA Assignee:
                Viktor Nawrath [X] (Inactive)
              • Votes:
                0 Vote for this issue
                Watchers:
                3 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: