With replication enabled, I intermittently experience a replication failure with the following symptoms:
1) I see messages like the following in the error log:
[03/Apr/2013:08:04:22 -0600] category=SYNC severity=NOTICE msgID=15138964 msg=In replication service ou=readimanager, timeout after 2000 ms waiting for the acknowledgement of the assured update message: <My Data>
In one incident, I saw 26 of these messages over a 14-minute period.
In a second incident, I saw 65 of these messages over a 50-second period.
2) Eventually, these messages stopped, but in the first incident, all subsequent MODIFY operations returned a result like:
[17/Mar/2013:23:19:48 +0800] MODIFY RES conn=820880 op=38 msgID=39 result=80 message="Entry commUniqueId=34b40fd7-fb87-430b-af1e-f8140a62a2f1,ou=devices,ou=ReadiManager cannot be modified because the server failed to obtain a write lock for this entry after multiple attempts" etime=9006
In the second incident, all subsequent MODIFY operations failed to return at all.
3) Once the errors were noticed, the server was restarted, and the problems appeared to go away.
I'm not sure whether it is the cause of the symptoms that I am experiencing, but I notice that in org.opends.server.replication.service.ReplicationDomain, most accesses to waitingAckMsgs are synchronized on waitingAckMsgs, but the one in method waitForAckIfAssuredEnabled, on line 3409, is not.
This seems like it could be the cause of the symptoms I am seeing, and is definitely a defect.
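
For illustration, here is a minimal sketch of the locking pattern in question. It is not the OpenDS source: AckTracker and all of its methods are hypothetical names, the map's key and value types are assumed, and only the field name waitingAckMsgs is taken from the description above. The point is that a map guarded by synchronized blocks is only safe if every access, reads included, holds the same monitor.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the ack bookkeeping; not the OpenDS code. */
class AckTracker {
    // A plain HashMap is only safe if every access holds the same lock.
    // (Key and value types here are assumptions made for the sketch.)
    private final Map<Integer, String> waitingAckMsgs = new HashMap<>();

    /** Records an update that is waiting for an acknowledgement. */
    void register(int changeId, String msg) {
        synchronized (waitingAckMsgs) {
            waitingAckMsgs.put(changeId, msg);
        }
    }

    /** Removes a pending update; returns true if it was pending. */
    boolean acknowledge(int changeId) {
        synchronized (waitingAckMsgs) {
            return waitingAckMsgs.remove(changeId) != null;
        }
    }

    /** Read path: takes the same monitor as the write paths. */
    boolean isPending(int changeId) {
        synchronized (waitingAckMsgs) {
            return waitingAckMsgs.containsKey(changeId);
        }
    }

    int pendingCount() {
        synchronized (waitingAckMsgs) {
            return waitingAckMsgs.size();
        }
    }

    /** Runs concurrent register/acknowledge pairs; returns what is left pending. */
    static int stress(int threads, int perThread) {
        AckTracker tracker = new AckTracker();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int base = t * perThread;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    tracker.register(base + i, "update-" + (base + i));
                    tracker.acknowledge(base + i);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return tracker.pendingCount();
    }
}
```

If any one of these methods skipped its synchronized block, a concurrent put or remove could leave the HashMap in an inconsistent state for that caller. Whether that is what produces the timeouts and the write lock failures reported here remains unconfirmed.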