Uploaded image for project: 'OpenDJ'
  1. OpenDJ
  2. OPENDJ-1135

DS sometimes fails to connect to RS after server restart

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 4.0.0, 3.5.0, 3.0.0, 2.8.0, 2.6.0
    • Fix Version/s: 5.5.0, 3.5.2
    • Component/s: None
    • Support Ticket IDs:
    • Sprint:
      Sprint 23, Sprint 25, Sprint 26, Sprint 27, OpenDJ Sprint 103

      Description

      Found using revision 9496.

      This problem appears sometimes during the replication functional tests:

      • Back-up server A
      • Add entry on server A
      • Stop server B
      • Restore back-up on server B
      • Start server B

      Scenario

      • configure 2 instances (DS+RS in the same jvm)
      • import data in instance1 (1002 entries)
      • enable replication between the 2 instances
      • check replication status (1002 entries expected on both instances)
        o=example	localhost:11444	1002	true	4563	22794	11112	0		false
        o=example	localhost:22444	1002	true	1665	32342	22222	0		false
        
      • backup instance1 (1002 entries)
      • add one entry to instance1
      • check replication status (1003 entries expected on both instances)
        o=example	localhost:11444	1003	true	4563	22794	11112	0		false
        o=example	localhost:22444	1003	true	1665	32342	22222	0		false
        
      • stop instance2
      • restore instance2 with backup from instance1
        [05/Sep/2013:16:26:24 +0200] category=JEB severity=NOTICE msgID=8847445 msg=Restored: 00000000.jdb (size 2709275)
        
      • start instance2
        [05/Sep/2013:16:26:26 +0200] category=EXTENSIONS severity=NOTICE msgID=1507899 msg=Loaded extension from file '/var/tmp/new_repl/opendj-2/lib/extensions/snmp-mib2605.jar' (build 2.7.0-20130904, revision 9496)
        [05/Sep/2013:16:26:26 +0200] category=CORE severity=NOTICE msgID=458886 msg=OpenDJ 2.7.0-20130904 (build 20130904000054Z, R9496) starting up
        [05/Sep/2013:16:26:27 +0200] category=RUNTIME_INFORMATION severity=NOTICE msgID=20381717 msg=Installation Directory:  /var/tmp/new_repl/opendj-2
        [05/Sep/2013:16:26:27 +0200] category=RUNTIME_INFORMATION severity=NOTICE msgID=20381719 msg=Instance Directory:      /var/tmp/new_repl/opendj-2
        [05/Sep/2013:16:26:27 +0200] category=RUNTIME_INFORMATION severity=NOTICE msgID=20381713 msg=JVM Information: 1.6.0_45-b06 by Sun Microsystems Inc., 64-bit architecture, 1847394304 bytes heap size
        [05/Sep/2013:16:26:27 +0200] category=RUNTIME_INFORMATION severity=NOTICE msgID=20381714 msg=JVM Host: csovant-PC, running Linux 3.5.0-39-generic amd64, 8311566336 bytes physical memory size, number of processors available 4
        [05/Sep/2013:16:26:27 +0200] category=RUNTIME_INFORMATION severity=NOTICE msgID=20381715 msg=JVM Arguments: "-Dorg.opends.server.scriptName=start-ds"
        [05/Sep/2013:16:26:28 +0200] category=JEB severity=NOTICE msgID=8847402 msg=The database backend userRoot containing 1002 entries has started
        [05/Sep/2013:16:26:29 +0200] category=EXTENSIONS severity=NOTICE msgID=1507549 msg=DIGEST-MD5 SASL mechanism using a server fully qualified domain name of: localhost
        [05/Sep/2013:16:26:29 +0200] category=SYNC severity=NOTICE msgID=15139020 msg=Replication server RS(32342) started listening for new connections on address 0.0.0.0 port 22222
        [05/Sep/2013:16:26:39 +0200] category=SYNC severity=SEVERE_WARNING msgID=14811344 msg=Directory server DS(31296) was unable to connect to any replication servers for domain "cn=admin data"
        [05/Sep/2013:16:26:44 +0200] category=SYNC severity=SEVERE_ERROR msgID=14942386 msg=Directory server 31296 was attempting to connect to replication server 32342 but has disconnected in handshake phase
        [05/Sep/2013:16:26:49 +0200] category=SYNC severity=SEVERE_WARNING msgID=14811344 msg=Directory server DS(20293) was unable to connect to any replication servers for domain "cn=schema"
        [05/Sep/2013:16:26:49 +0200] category=SYNC severity=SEVERE_ERROR msgID=14942386 msg=Directory server 20293 was attempting to connect to replication server 32342 but has disconnected in handshake phase
        [05/Sep/2013:16:26:59 +0200] category=SYNC severity=SEVERE_WARNING msgID=14811344 msg=Directory server DS(1665) was unable to connect to any replication servers for domain "o=example"
        [05/Sep/2013:16:26:59 +0200] category=PROTOCOL severity=NOTICE msgID=2556180 msg=Started listening for new connections on Administration Connector 0.0.0.0 port 22444
        [05/Sep/2013:16:26:59 +0200] category=PROTOCOL severity=NOTICE msgID=2556180 msg=Started listening for new connections on LDAP Connection Handler 0.0.0.0 port 22389
        [05/Sep/2013:16:26:59 +0200] category=PROTOCOL severity=NOTICE msgID=2556180 msg=Started listening for new connections on LDAPS Connection Handler 0.0.0.0 port 22636
        [05/Sep/2013:16:26:59 +0200] category=CORE severity=NOTICE msgID=458887 msg=The Directory Server has started successfully
        [05/Sep/2013:16:26:59 +0200] category=CORE severity=NOTICE msgID=458891 msg=The Directory Server has sent an alert notification generated by class org.opends.server.core.DirectoryServer (alert type org.opends.server.DirectoryServerStarted, alert ID 458887):  The Directory Server has started successfully
        

        ==> seems that the RS fails to connect with the DS

      • check replication status (1003 entries expected on both instances)
        o=example	localhost:11444	1003	true	4563	22794	11112	0		false
        o=example	localhost:22444	1002	true	1665	32342	22222			false
        

      Traces in the errors log

      ...
      [05/Sep/2013:16:26:59 +0200] category=PROTOCOL severity=NOTICE msgID=2556180 msg=Started listening for new connections on Administration Connector 0.0.0.0 port 22444
      [05/Sep/2013:16:26:59 +0200] category=PROTOCOL severity=NOTICE msgID=2556180 msg=Started listening for new connections on LDAP Connection Handler 0.0.0.0 port 22389
      [05/Sep/2013:16:26:59 +0200] category=PROTOCOL severity=NOTICE msgID=2556180 msg=Started listening for new connections on LDAPS Connection Handler 0.0.0.0 port 22636
      [05/Sep/2013:16:26:59 +0200] category=CORE severity=NOTICE msgID=458887 msg=The Directory Server has started successfully
      [05/Sep/2013:16:26:59 +0200] category=CORE severity=NOTICE msgID=458891 msg=The Directory Server has sent an alert notification generated by class org.opends.server.core.DirectoryServer (alert type org.opends.server.DirectoryServerStarted, alert ID 458887):  The Directory Server has started successfully
      [05/Sep/2013:16:27:00 +0200] category=SYNC severity=SEVERE_ERROR msgID=14942386 msg=Directory server 1665 was attempting to connect to replication server 32342 but has disconnected in handshake phase
      [05/Sep/2013:16:27:11 +0200] category=SYNC severity=SEVERE_ERROR msgID=14942386 msg=Directory server 1665 was attempting to connect to replication server 32342 but has disconnected in handshake phase
      [05/Sep/2013:16:27:20 +0200] category=BACKEND severity=NOTICE msgID=9896349 msg=Export task 20130905162720583 started execution
      [05/Sep/2013:16:27:20 +0200] category=JEB severity=NOTICE msgID=8847447 msg=Exported 1002 entries and skipped 0 in 0 seconds (average rate 6109.8/sec)
      [05/Sep/2013:16:27:20 +0200] category=BACKEND severity=NOTICE msgID=9896350 msg=Export task 20130905162720583 finished execution
      [05/Sep/2013:16:27:22 +0200] category=SYNC severity=SEVERE_ERROR msgID=14942386 msg=Directory server 31296 was attempting to connect to replication server 32342 but has disconnected in handshake phase
      [05/Sep/2013:16:27:33 +0200] category=SYNC severity=SEVERE_ERROR msgID=14942386 msg=Directory server 20293 was attempting to connect to replication server 32342 but has disconnected in handshake phase
      [05/Sep/2013:16:27:44 +0200] category=SYNC severity=SEVERE_ERROR msgID=14942386 msg=Directory server 1665 was attempting to connect to replication server 32342 but has disconnected in handshake phase
      [05/Sep/2013:16:27:55 +0200] category=SYNC severity=SEVERE_ERROR msgID=14942386 msg=Directory server 31296 was attempting to connect to replication server 32342 but has disconnected in handshake phase
      [05/Sep/2013:16:28:06 +0200] category=SYNC severity=SEVERE_ERROR msgID=14942386 msg=Directory server 20293 was attempting to connect to replication server 32342 but has disconnected in handshake phase
      [05/Sep/2013:16:28:12 +0200] category=SYNC severity=SEVERE_ERROR msgID=14942386 msg=Directory server 1665 was attempting to connect to replication server 32342 but has disconnected in handshake phase
      [05/Sep/2013:16:28:23 +0200] category=SYNC severity=SEVERE_ERROR msgID=14942386 msg=Directory server 31296 was attempting to connect to replication server 32342 but has disconnected in handshake phase
      ...
      

      Traces in the replication log

      ...
      [05/Sep/2013:16:25:42 +0200] category=SYNC severity=NOTICE msgID=15138894 msg=The generation ID for domain "o=example" has been reset to 18736508
      [05/Sep/2013:16:25:53 +0200] category=SYNC severity=INFORMATION msgID=14680271 msg=Replication server RS(32342) has accepted a connection from directory server DS(31296) for domain "cn=admin data" at localhost/127.0.0.1:55313
      [05/Sep/2013:16:25:53 +0200] category=SYNC severity=NOTICE msgID=15138878 msg=Directory server DS(31296) has connected to replication server RS(32342) for domain "cn=admin data" at localhost/127.0.0.1:22222 with generation ID 158123
      [05/Sep/2013:16:25:53 +0200] category=SYNC severity=INFORMATION msgID=14680271 msg=Replication server RS(32342) has accepted a connection from directory server DS(20293) for domain "cn=schema" at localhost/127.0.0.1:55316
      [05/Sep/2013:16:25:53 +0200] category=SYNC severity=NOTICE msgID=15138878 msg=Directory server DS(20293) has connected to replication server RS(32342) for domain "cn=schema" at localhost/127.0.0.1:22222 with generation ID 8408
      [05/Sep/2013:16:25:53 +0200] category=SYNC severity=INFORMATION msgID=14680271 msg=Replication server RS(32342) has accepted a connection from directory server DS(1665) for domain "o=example" at localhost/127.0.0.1:55319
      [05/Sep/2013:16:25:53 +0200] category=SYNC severity=NOTICE msgID=15138878 msg=Directory server DS(1665) has connected to replication server RS(32342) for domain "o=example" at localhost/127.0.0.1:22222 with generation ID 18736508
      [05/Sep/2013:16:26:03 +0200] category=SYNC severity=INFORMATION msgID=14680270 msg=Replication server RS(32342) has accepted a connection from replication server RS(28142) for domain "cn=schema" at localhost/127.0.0.1:55339
      [05/Sep/2013:16:26:10 +0200] category=SYNC severity=MILD_WARNING msgID=14745663 msg=Replication server RS(32342) at localhost/127.0.0.1:22222 has closed the connection to this directory server DS(1665). This directory server will now try to connect to another replication server in order to receive changes for the domain "o=example"
      [05/Sep/2013:16:26:29 +0200] category=SYNC severity=NOTICE msgID=15139020 msg=Replication server RS(32342) started listening for new connections on address 0.0.0.0 port 22222
      [05/Sep/2013:16:26:30 +0200] category=SYNC severity=INFORMATION msgID=14680270 msg=Replication server RS(32342) has accepted a connection from replication server RS(22794) for domain "cn=admin data" at localhost/127.0.0.1:55447
      [05/Sep/2013:16:26:30 +0200] category=SYNC severity=INFORMATION msgID=14680270 msg=Replication server RS(32342) has accepted a connection from replication server RS(22794) for domain "o=example" at localhost/127.0.0.1:55448
      [05/Sep/2013:16:26:30 +0200] category=SYNC severity=INFORMATION msgID=14680270 msg=Replication server RS(32342) has accepted a connection from replication server RS(22794) for domain "cn=schema" at localhost/127.0.0.1:55449
      [05/Sep/2013:16:27:11 +0200] category=SYNC severity=INFORMATION msgID=14680169 msg=Replication server accepted a connection from localhost/127.0.0.1:55497 to local address 0.0.0.0/0.0.0.0:22222 but the SSL handshake failed. This is probably benign, but may indicate a transient network outage or a misconfigured client application connecting to this replication server. The error was: Remote host closed connection during handshake
      [05/Sep/2013:16:27:11 +0200] category=SYNC severity=INFORMATION msgID=14680169 msg=Replication server accepted a connection from localhost/127.0.0.1:55498 to local address 0.0.0.0/0.0.0.0:22222 but the SSL handshake failed. This is probably benign, but may indicate a transient network outage or a misconfigured client application connecting to this replication server. The error was: Remote host closed connection during handshake
      [05/Sep/2013:16:27:22 +0200] category=SYNC severity=INFORMATION msgID=14680169 msg=Replication server accepted a connection from localhost/127.0.0.1:55524 to local address 0.0.0.0/0.0.0.0:22222 but the SSL handshake failed. This is probably benign, but may indicate a transient network outage or a misconfigured client application connecting to this replication server. The error was: Remote host closed connection during handshake
      [05/Sep/2013:16:27:22 +0200] category=SYNC severity=INFORMATION msgID=14680169 msg=Replication server accepted a connection from localhost/127.0.0.1:55525 to local address 0.0.0.0/0.0.0.0:22222 but the SSL handshake failed. This is probably benign, but may indicate a transient network outage or a misconfigured client application connecting to this replication server. The error was: Remote host closed connection during handshake
      [05/Sep/2013:16:27:33 +0200] category=SYNC severity=INFORMATION msgID=14680169 msg=Replication server accepted a connection from localhost/127.0.0.1:55546 to local address 0.0.0.0/0.0.0.0:22222 but the SSL handshake failed. This is probably benign, but may indicate a transient network outage or a misconfigured client application connecting to this replication server. The error was: Remote host closed connection during handshake
      [05/Sep/2013:16:27:33 +0200] category=SYNC severity=INFORMATION msgID=14680169 msg=Replication server accepted a connection from localhost/127.0.0.1:55547 to local address 0.0.0.0/0.0.0.0:22222 but the SSL handshake failed. This is probably benign, but may indicate a transient network outage or a misconfigured client application connecting to this replication server. The error was: Remote host closed connection during handshake
      ...
      

      In attachment the jstack.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                fabiop Fabio Pistolesi
                Reporter:
                csovant Christophe Sovant
              • Votes:
                1 Vote for this issue
                Watchers:
                12 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: