Uploaded image for project: 'OpenDJ'
  1. OpenDJ
  2. OPENDJ-2190

Replicas cannot always keep up with sustained high write throughput

    XMLWordPrintable

    Details

    • Support Ticket IDs:
    • Sprint:
      OpenDJ Sprint 62, OpenDJ Sprint 65, OpenDJ Sprint 68, OpenDJ Sprint 76, OpenDJ Sprint 106

      Description

      Scenario:
      ---------

      • 2 servers replicated
      • run addrate tool on the first server (doing add and delete operations of CTS tokens)
      • the number of entries on the two servers stays stable during 36h approx.
      • after that the number of entries on the second servers increase

      => after 3 days of run, the second server has 50000 entries which have not been deleted

      Another run is in progress to provide more information.

      In the errors log of the second server I noticed some delete operations failing (replay):

      [27/Jun/2015:22:57:35 +0200] DELETE conn=-1 op=640590482 msgID=640590483 dn="coretokenid=989556613,ou=famrecords,ou=openam-session,ou=token,dc=com" type=synchronization result=32 message="Entry coretokenid=989556613,ou=famrecords,ou=openam-session,ou=token,dc=com does not exist in the Directory Server" etime=0
      [27/Jun/2015:22:57:35 +0200] DELETE conn=-1 op=640590493 msgID=640590494 dn="coretokenid=989556591,ou=famrecords,ou=openam-session,ou=token,dc=com" type=synchronization result=32 message="Entry coretokenid=989556591,ou=famrecords,ou=openam-session,ou=token,dc=com does not exist in the Directory Server" etime=0
      [27/Jun/2015:22:57:35 +0200] DELETE conn=-1 op=640590496 msgID=640590497 dn="coretokenid=989556605,ou=famrecords,ou=openam-session,ou=token,dc=com" type=synchronization result=32 message="Entry coretokenid=989556605,ou=famrecords,ou=openam-session,ou=token,dc=com does not exist in the Directory Server" etime=0
      [27/Jun/2015:22:57:35 +0200] DELETE conn=-1 op=640590501 msgID=640590502 dn="coretokenid=989556607,ou=famrecords,ou=openam-session,ou=token,dc=com" type=synchronization result=32 message="Entry coretokenid=989556607,ou=famrecords,ou=openam-session,ou=token,dc=com does not exist in the Directory Server" etime=1
      [27/Jun/2015:22:57:35 +0200] DELETE conn=-1 op=640590502 msgID=640590503 dn="coretokenid=989556612,ou=famrecords,ou=openam-session,ou=token,dc=com" type=synchronization result=32 message="Entry coretokenid=989556612,ou=famrecords,ou=openam-session,ou=token,dc=com does not exist in the Directory Server" etime=1
      

      In the replication log of the second server I noticed the following error:

      [29/Jun/2015:00:18:53 +0200] category=SYNC severity=ERROR msgID=org.opends.messages.replication.32 msg=An unexpected error happened handling connection with Replica DS(27734) for domain "dc=com" AbortedChangelogCursorException: Cursor on log '/tmp/replication/DJ4/opendj/changelogDb/1.dom/19665.server' has been aborted after a purge or a clear (Log.java:1369 Log.java:1364 Log.java:1469 FileReplicaDBCursor.java:107 ReplicaCursor.java:121 CompositeDBCursor.java:169 CompositeDBCursor.java:108 DomainDBCursor.java:43 MessageHandler.java:405 MessageHandler.java:300 ServerHandler.java:934 ServerWriter.java:104).  This connection is going to be closed
      [29/Jun/2015:00:19:03 +0200] category=SYNC severity=ERROR msgID=org.opends.messages.replication.32 msg=An unexpected error happened handling connection with Replica DS(27734) for domain "dc=com" AbortedChangelogCursorException: Cursor on log '/tmp/replication/DJ4/opendj/changelogDb/1.dom/19665.server' has been aborted after a purge or a clear (Log.java:1369 Log.java:1364 Log.java:1469 FileReplicaDBCursor.java:107 ReplicaCursor.java:121 CompositeDBCursor.java:169 CompositeDBCursor.java:108 DomainDBCursor.java:43 MessageHandler.java:405 MessageHandler.java:300 ServerHandler.java:934 ServerWriter.java:104).  This connection is going to be closed
      [29/Jun/2015:00:19:20 +0200] category=SYNC severity=ERROR msgID=org.opends.messages.replication.32 msg=An unexpected error happened handling connection with Replica DS(27734) for domain "dc=com" AbortedChangelogCursorException: Cursor on log '/tmp/replication/DJ4/opendj/changelogDb/1.dom/19665.server' has been aborted after a purge or a clear (Log.java:1369 Log.java:1364 Log.java:1469 FileReplicaDBCursor.java:107 ReplicaCursor.java:121 CompositeDBCursor.java:169 CompositeDBCursor.java:108 DomainDBCursor.java:43 MessageHandler.java:405 MessageHandler.java:300 ServerHandler.java:934 ServerWriter.java:104).  This connection is going to be closed
      [29/Jun/2015:02:00:29 +0200] category=SYNC severity=ERROR msgID=org.opends.messages.replication.32 msg=An unexpected error happened handling connection with Replica DS(27734) for domain "dc=com" AbortedChangelogCursorException: Cursor on log '/tmp/replication/DJ4/opendj/changelogDb/1.dom/19665.server' has been aborted after a purge or a clear (Log.java:1369 Log.java:1364 Log.java:1469 FileReplicaDBCursor.java:107 ReplicaCursor.java:121 CompositeDBCursor.java:169 CompositeDBCursor.java:108 DomainDBCursor.java:43 MessageHandler.java:405 MessageHandler.java:300 ServerHandler.java:934 ServerWriter.java:104).  This connection is going to be closed
      

      Output of dsreplication status:

      $ /tmp/replication/DJ3/opendj/bin/dsreplication status -h morbier.internal.forgerock.com -p 4444 -b dc=com -I admin -w password --script-friendly
       -X -n
      dc=com	morbier.internal.forgerock.com:4444	7	true	19665	23526	8989	0		false
      dc=com	morbier.internal.forgerock.com:4445	49009	true	27734	6879	8990	0		false
      

      Another run is in progress to provide more information.

        Attachments

        1. jstack_30.txt
          110 kB
        2. jstack_58.txt
          131 kB
        3. jstack_cont.txt
          148 kB
        4. replication-replay-contention.jstack
          115 kB

          Issue Links

            Activity

              People

              • Assignee:
                fabiop Fabio Pistolesi
                Reporter:
                csovant Christophe Sovant
              • Votes:
                0 Vote for this issue
                Watchers:
                5 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: