  1. OpenDJ
  2. OPENDJ-4590

Replication: cursor aborted on high write throughput

    Details

    • Type: Bug
    • Status: Done
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 6.0.0, 5.5.0
    • Fix Version/s: 6.0.0
    • Component/s: replication

      Description

      Found using OpenDJ 6.0.0-SNAPSHOT rev bfceb6e10fe

      Scenario
      1. Configure 2 DSRS servers with 7 entries.
      2. Run an add/delete rate test with small entries against the first DSRS instance without limiting the throughput (around 11k/s).
      3. Let the test run for 12 hours.
      4. At the end of the test, the number of entries in the backend has diverged:
      => the first instance contains 7 entries and the second instance contains 10650 entries.
      5. The errors log of the first instance also contains the following errors:

      [14/Dec/2017:07:04:43 +0100] category=SYNC severity=ERROR msgID=302 msg=The replication server 770 can no longer keep up with changes coming from replication server 18099 for base DN dc=com. Some missing changes have been purged by this replication server and the connection will be terminated. The directory servers connected to this replication server may fail-over to another replication server that has not purged the changes that it needs. If there is no replication server containing the missing changes then the directory servers will fail to connect to any replication server and will need to be reinitialized. (Underlying error is: Cursor on log '/external/testuser/tof/pyforge/results/20171213-173219/cts/Cts/DJ1/opendj/changelogDb/1.dom/32149.server' has been aborted after a purge or a clear)
      [14/Dec/2017:07:04:43 +0100] category=SYNC severity=ERROR msgID=181 msg=The connection from this replication server RS(18099) to replication server RS(770) at zola.internal.forgerock.com/172.16.204.56:8990 for domain "dc=com" has failed
      [14/Dec/2017:07:04:47 +0100] category=SYNC severity=ERROR msgID=302 msg=The replication server 770 can no longer keep up with changes coming from replication server 18099 for base DN dc=com. Some missing changes have been purged by this replication server and the connection will be terminated. The directory servers connected to this replication server may fail-over to another replication server that has not purged the changes that it needs. If there is no replication server containing the missing changes then the directory servers will fail to connect to any replication server and will need to be reinitialized. (Underlying error is: Cursor on log '/external/testuser/tof/pyforge/results/20171213-173219/cts/Cts/DJ1/opendj/changelogDb/1.dom/32149.server' has been aborted after a purge or a clear)
      [14/Dec/2017:07:04:55 +0100] category=SYNC severity=ERROR msgID=302 msg=The replication server 770 can no longer keep up with changes coming from replication server 18099 for base DN dc=com. Some missing changes have been purged by this replication server and the connection will be terminated. The directory servers connected to this replication server may fail-over to another replication server that has not purged the changes that it needs. If there is no replication server containing the missing changes then the directory servers will fail to connect to any replication server and will need to be reinitialized. (Underlying error is: Cursor on log '/external/testuser/tof/pyforge/results/20171213-173219/cts/Cts/DJ1/opendj/changelogDb/1.dom/32149.server' has been aborted after a purge or a clear)
      [14/Dec/2017:07:05:07 +0100] category=SYNC severity=ERROR msgID=302 msg=The replication server 770 can no longer keep up with changes coming from replication server 18099 for base DN dc=com. Some missing changes have been purged by this replication server and the connection will be terminated. The directory servers connected to this replication server may fail-over to another replication server that has not purged the changes that it needs. If there is no replication server containing the missing changes then the directory servers will fail to connect to any replication server and will need to be reinitialized. (Underlying error is: Cursor on log '/external/testuser/tof/pyforge/results/20171213-173219/cts/Cts/DJ1/opendj/changelogDb/1.dom/32149.server' has been aborted after a purge or a clear)
      [14/Dec/2017:07:05:10 +0100] category=SYNC severity=ERROR msgID=302 msg=The replication server 770 can no longer keep up with changes coming from replication server 18099 for base DN dc=com. Some missing changes have been purged by this replication server and the connection will be terminated. The directory servers connected to this replication server may fail-over to another replication server that has not purged the changes that it needs. If there is no replication server containing the missing changes then the directory servers will fail to connect to any replication server and will need to be reinitialized. (Underlying error is: Cursor on log '/external/testuser/tof/pyforge/results/20171213-173219/cts/Cts/DJ1/opendj/changelogDb/1.dom/32149.server' has been aborted after a purge or a clear)
      [14/Dec/2017:07:05:13 +0100] category=SYNC severity=ERROR msgID=302 msg=The replication server 770 can no longer keep up with changes coming from replication server 18099 for base DN dc=com. Some missing changes have been purged by this replication server and the connection will be terminated. The directory servers connected to this replication server may fail-over to another replication server that has not purged the changes that it needs. If there is no replication server containing the missing changes then the directory servers will fail to connect to any replication server and will need to be reinitialized. (Underlying error is: Cursor on log '/external/testuser/tof/pyforge/results/20171213-173219/cts/Cts/DJ1/opendj/changelogDb/1.dom/32149.server' has been aborted after a purge or a clear)
      [14/Dec/2017:07:05:16 +0100] category=SYNC severity=ERROR msgID=302 msg=The replication server 770 can no longer keep up with changes coming from replication server 18099 for base DN dc=com. Some missing changes have been purged by this replication server and the connection will be terminated. The directory servers connected to this replication server may fail-over to another replication server that has not purged the changes that it needs. If there is no replication server containing the missing changes then the directory servers will fail to connect to any replication server and will need to be reinitialized. (Underlying error is: Cursor on log '/external/testuser/tof/pyforge/results/20171213-173219/cts/Cts/DJ1/opendj/changelogDb/1.dom/32149.server' has been aborted after a purge or a clear)
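
A long run can produce many of these entries, so it may help to summarize them rather than read them one by one. The following is a minimal sketch, not part of OpenDJ: the function names `parse_error_line` and `summarize` are hypothetical, and the field layout (`[timestamp] category=... severity=... msgID=... msg=...`) is inferred only from the excerpt above. It assumes each log entry is on a single line.

```python
import re
from collections import Counter

# Field layout inferred from the log excerpt in this report (an assumption).
LOG_LINE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]\s+category=(?P<category>\S+)\s+"
    r"severity=(?P<severity>\S+)\s+msgID=(?P<msg_id>\d+)\s+msg=(?P<msg>.*)$"
)
# Matches the "Underlying error" detail reported with msgID=302.
ABORTED_CURSOR = re.compile(r"Cursor on log '(?P<path>[^']+)' has been aborted")


def parse_error_line(line):
    """Return a dict of fields for one errors-log line, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["msg_id"] = int(record["msg_id"])
    cursor = ABORTED_CURSOR.search(record["msg"])
    record["aborted_log"] = cursor.group("path") if cursor else None
    return record


def summarize(lines):
    """Count parsed entries per msgID and collect the distinct aborted changelog files."""
    counts = Counter()
    aborted = set()
    for line in lines:
        record = parse_error_line(line)
        if record is None:
            continue  # skip lines that do not follow the assumed layout
        counts[record["msg_id"]] += 1
        if record["aborted_log"]:
            aborted.add(record["aborted_log"])
    return counts, aborted
```

Passing the lines of the errors log to `summarize` returns how often each msgID occurred and which changelog files had their cursor aborted. For the excerpt above that would be seven entries with msgID=302, one with msgID=181, and a single aborted log file.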
      

      However, if we limit the add/delete throughput to stay below the change number indexer throughput, there is no divergence at the end of the test.
      In this specific run, the change number indexer throughput was around 9k/s.
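
To give a sense of scale for the gap between the two figures, here is a rough back-of-the-envelope sketch. It assumes both rates stay constant for the whole run and that the surplus simply accumulates as unindexed changes; the report states neither, and `backlog_after` is a hypothetical helper, not an OpenDJ function.

```python
def backlog_after(write_rate, indexer_rate, duration_seconds):
    """Changes left unindexed after the run, assuming both rates stay constant."""
    surplus = max(write_rate - indexer_rate, 0)
    return surplus * duration_seconds


write_rate = 11_000      # changes per second, approximate figure from the scenario
indexer_rate = 9_000     # changes per second, approximate figure from this run
duration = 12 * 60 * 60  # 12-hour test, in seconds

print(backlog_after(write_rate, indexer_rate, duration))  # 86400000
```

Under these assumptions the indexer would fall behind by about 2k changes per second, or roughly 86 million changes over 12 hours. When the write rate is at or below the indexer rate, the same calculation gives a backlog of zero, which matches the observation that throttled runs do not diverge.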

              People

              • Assignee: Matthew Swift (matthew)
              • Reporter: Christophe Sovant (csovant)
              • Dev Assignee: Matthew Swift
