Intermittent Replication Failure



    Bug
    Status: Done
    Critical
    Resolution: Fixed
    2.6.0
    2.6.0
    replication


      With replication enabled, I intermittently experience a replication failure with the following symptoms:
      1) I see messages like the following in the error log:
      [03/Apr/2013:08:04:22 -0600] category=SYNC severity=NOTICE msgID=15138964 msg=In replication service ou=readimanager, timeout after 2000 ms waiting for the acknowledgement of the assured update message: <My Data>
      In one incident, I saw 26 of these messages over a 14-minute period.
      In a second incident, I saw 65 of these messages over a 50-second period.

      2) Eventually these messages stopped. In the first incident, every subsequent MODIFY operation then returned a result like:
      [17/Mar/2013:23:19:48 +0800] MODIFY RES conn=820880 op=38 msgID=39 result=80 message="Entry commUniqueId=34b40fd7-fb87-430b-af1e-f8140a62a2f1,ou=devices,ou=ReadiManager cannot be modified because the server failed to obtain a write lock for this entry after multiple attempts" etime=9006
      In the second incident, all subsequent MODIFY operations failed to return at all.

      3) Once the errors were noticed, the server was restarted, and the problems appeared to go away.

      I'm not sure whether it causes the symptoms I am experiencing, but in org.opends.server.replication.service.ReplicationDomain, most accesses to waitingAckMsgs are synchronized on waitingAckMsgs, while one access in the method waitForAckIfAssuredEnabled (line 3409) is not.

      This seems like it could be the cause of the symptoms I am seeing, and is definitely a defect.
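
      To make the concern concrete, here is a minimal sketch of the locking pattern I would expect. It is not the OpenDS source: the class AckTracker, its methods, and the Long/String map types are hypothetical stand-ins, and I am assuming the shared collection behaves like a plain map that is only safe when every access holds the same monitor.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the pending-ack bookkeeping; NOT the OpenDS code. */
final class AckTracker {
    // Assumption: a plain map, safe only if EVERY access holds this monitor.
    private final Map<Long, String> waitingAckMsgs = new HashMap<>();

    /** Record an update that is waiting for an acknowledgement. */
    void register(long changeId, String update) {
        synchronized (waitingAckMsgs) {
            waitingAckMsgs.put(changeId, update);
        }
    }

    /** Called when the ack arrives; wakes any thread blocked in waitForAck. */
    void acknowledge(long changeId) {
        synchronized (waitingAckMsgs) {
            waitingAckMsgs.remove(changeId);
            waitingAckMsgs.notifyAll();
        }
    }

    /** Waits up to timeoutMs; returns true if acked, false on timeout or interrupt. */
    boolean waitForAck(long changeId, long timeoutMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        synchronized (waitingAckMsgs) {
            // The check and the wait happen under the same lock as register/acknowledge.
            while (waitingAckMsgs.containsKey(changeId)) {
                long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
                if (remainingMs <= 0) {
                    waitingAckMsgs.remove(changeId); // timed out: drop the stale entry
                    return false;
                }
                try {
                    waitingAckMsgs.wait(remainingMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    return false;
                }
            }
            return true;
        }
    }

    /** Number of updates still waiting for an ack. */
    int pending() {
        synchronized (waitingAckMsgs) {
            return waitingAckMsgs.size();
        }
    }
}
```

      If any one of these methods touched the map outside the synchronized block, a concurrent put or remove could leave the map in an inconsistent state or cause a waiter to miss its notification, which is the kind of defect I suspect at line 3409.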


        1. incident1.tgz
          612 kB
        2. incident2.tgz
          2.17 MB
        3. incident3_part1.tgz
          6.04 MB
        4. incident3_part2.tgz
          5.73 MB
        5. incident4.tgz
          196 kB
        6. incident6.tgz
          105 kB

