[OPENDJ-846] Intermittent Replication Failure Created: 05/Apr/13  Updated: 08/Nov/19  Resolved: 20/Jun/13

Status: Done
Project: OpenDJ
Component/s: replication
Affects Version/s: 2.6.0
Fix Version/s: 2.6.0

Type: Bug Priority: Critical
Reporter: dlange Assignee: Ludovic Poitou
Resolution: Fixed Votes: 0
Labels: release-notes

Attachments: File incident1.tgz     File incident2.tgz     File incident3_part1.tgz     File incident3_part2.tgz     File incident4.tgz     File incident6.tgz    
Issue Links:
Relates
is related to OPENDJ-4988 Topology wide inter process deadlock ... Done
is related to OPENDJ-1128 Replication server should not forward... Done
is related to OPENDJ-1043 Worker Thread was interrupted while w... Done
Dev Assignee: Ludovic Poitou
QA Assignee: Gary Williams
Support Ticket IDs:

 Description   

While replicated, I intermittently experience a replication failure with the following symptoms:
1) I see messages like the following in the error log:
[03/Apr/2013:08:04:22 -0600] category=SYNC severity=NOTICE msgID=15138964 msg=In replication service ou=readimanager, timeout after 2000 ms waiting for the acknowledgement of the assured update message: <My Data>
In one incident, I saw 26 of these messages over a 14 minute period.
In a second incident, I saw 65 of these messages over a 50 second period.

2) Eventually, these messages stopped, but in the first incident, all subsequent MODIFY operations returned a result like:
[17/Mar/2013:23:19:48 +0800] MODIFY RES conn=820880 op=38 msgID=39 result=80 message="Entry commUniqueId=34b40fd7-fb87-430b-af1e-f8140a62a2f1,ou=devices,ou=ReadiManager cannot be modified because the server failed to obtain a write lock for this entry after multiple attempts" etime=9006
In the second incident, all subsequent MODIFY operations failed to return at all.

3) Once the errors were noticed, the server was restarted, and the problems appeared to go away.

I'm not sure whether it is the cause of the symptoms I am experiencing, but I notice that in org.opends.server.replication.service.ReplicationDomain, most accesses to waitingAckMsgs are synchronized on waitingAckMsgs, while one access in the method waitForAckIfAssuredEnabled, on line 3409, is not.

This seems like it could be the cause of the symptoms I am seeing, and is definitely a defect.
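
For illustration, here is a minimal, hypothetical Java sketch (invented class and method names, not the actual OpenDJ source) of the kind of inconsistent locking described above: most accesses are guarded by synchronizing on the map itself, but one read path is not, so it can race with a concurrent put/remove.

import java.util.HashMap;
import java.util.Map;

class AssuredUpdateTracker {

    // Maps change identifiers to update messages still awaiting acknowledgement.
    private final Map<String, Object> waitingAckMsgs = new HashMap<String, Object>();

    void registerPendingAck(String changeId, Object updateMsg) {
        synchronized (waitingAckMsgs) {       // guarded access
            waitingAckMsgs.put(changeId, updateMsg);
        }
    }

    void ackReceived(String changeId) {
        synchronized (waitingAckMsgs) {       // guarded access
            waitingAckMsgs.remove(changeId);
        }
    }

    boolean isStillWaiting(String changeId) {
        // Unguarded access, analogous to the one reported in
        // waitForAckIfAssuredEnabled: it can race with a concurrent put/remove
        // and observe the map in an inconsistent state.
        return waitingAckMsgs.containsKey(changeId);
    }
}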



 Comments   
Comment by Ludovic Poitou [ 08/Apr/13 ]

Hi,

The timeout message may indicate contention on the server or the network.
The failure to obtain a write lock could come from various sources, including a resource shortage, but possibly also a defect.

I looked at the code you mention, and while there is clearly some mismanagement of the synchronization around waitingAckMsgs, there is no proof that it is the cause of what you are experiencing.
Even if I were to fix that code, I would have no way to test the changes and verify that they resolve your problem, because based on this bug report I cannot understand or reproduce your configuration.

A defect report should come with a precise scenario that helps us try to reproduce and understand what is going on. Providing the contents of the logs and/or a stack trace taken while the server is not responding properly can also help with diagnosis.

Are you able to produce this data?

Comment by dlange [ 08/Apr/13 ]

I have seen this happen twice. I don't have stack dumps from either of those incidents. I tried writing a test application to spam concurrent modification requests, but that did not seem to reproduce the problem. I will attach the logs from the two incidents.

Comment by dlange [ 08/Apr/13 ]

Incident two. Incident begins at 08:04:22.

Comment by dlange [ 08/Apr/13 ]

Incident 1. Incident starts at 23:01:35.

Comment by dlange [ 12/Apr/13 ]

It happened again today. I am attaching logs and thread dumps from the third incident.

Comment by dlange [ 12/Apr/13 ]

It happened again today. I am attaching logs and thread dumps from the third incident.

Comment by dlange [ 12/Apr/13 ]

It happened again today. I have more data and a better understanding now.
The incident was triggered by network connectivity problems.
Although there are network timeout errors in the error log, the MODIFY operations that they describe did not actually return control to the client when the timeout happened. Instead, control was returned to the client when the operation actually completed. I think OpenDJ stopped responding to SEARCH requests, as well. It is difficult to tell whether that happened, or the client stopped searching because OpenDJ ate all its threads.
After about 15 minutes, the administrator noticed the problem, took thread dumps, and restarted the servers. The thread dump from corpdmanode1 is from 15 minutes after the start of the incident, the others are considerably later.

In the attached logs, you can find the incident at 06:56 GMT -600 on corpdmanode1 and corpdmanode7. Since this appears to be a distributed problem, I attached logs from all the servers.

Comment by dlange [ 12/Apr/13 ]

I notice in the thread dump that each ReplicationServerDomain has a lock that it holds while sending and receiving replication messages. It appears to me that the network problems caused the message-sending code to block waiting for a socket write while holding the lock. This, in turn, caused the message-receiving code to block waiting for the lock:

"Replication server RS(5972) reading from Replication server RS(20660) for domain "cn=admin data" at corpdmanode9.video.polycom.com/10.245.16.70:8989" prio=10 tid=0x0000000048f63800 nid=0x2b9c runnable [0x0000000052390000]
java.lang.Thread.State: RUNNABLE
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(Unknown Source)
at java.net.SocketOutputStream.write(Unknown Source)
at org.opends.server.replication.protocol.SocketSession.publish(SocketSession.java:273)
at org.opends.server.replication.server.ServerHandler.send(ServerHandler.java:1006)
at org.opends.server.replication.server.ReplicationServerDomain.process(ReplicationServerDomain.java:1636)
at org.opends.server.replication.server.ServerHandler.process(ServerHandler.java:941)
at org.opends.server.replication.server.ServerReader.run(ServerReader.java:249)

"Replication server RS(5972) reading from Replication server RS(2628) for domain "cn=admin data" at /10.245.16.71:23007" prio=10 tid=0x0000000048fdf800 nid=0x2b9f waiting on condition [0x0000000052592000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)

  • parking to wait for <0x00002aaab9cce160> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(Unknown Source)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(Unknown Source)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireInterruptibly(Unknown Source)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(Unknown Source)
    at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(Unknown Source)
    at org.opends.server.replication.server.ReplicationServerDomain.lock(ReplicationServerDomain.java:2921)
    at org.opends.server.replication.server.ReplicationServerDomain.processChangeTimeHeartbeatMsg(ReplicationServerDomain.java:3353)
    at org.opends.server.replication.server.ServerHandler.process(ServerHandler.java:955)
    at org.opends.server.replication.server.ServerReader.run(ServerReader.java:257)
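
The following is a minimal sketch of that pattern (hypothetical class and method names, not the OpenDJ implementation): one thread holds the per-domain lock while publishing to a peer's socket, so when that write blocks, any other thread that needs the same lock to process an incoming message parks.

import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.locks.ReentrantLock;

class DomainSession {

    private final ReentrantLock domainLock = new ReentrantLock();
    private final OutputStream peerSocketOut;

    DomainSession(OutputStream peerSocketOut) {
        this.peerSocketOut = peerSocketOut;
    }

    // Called when forwarding an update to a peer replication server.
    void forwardUpdate(byte[] msg) throws IOException {
        domainLock.lock();
        try {
            // If the peer's TCP window has collapsed to zero, this write blocks
            // indefinitely while the domain lock is still held.
            peerSocketOut.write(msg);
            peerSocketOut.flush();
        } finally {
            domainLock.unlock();
        }
    }

    // Called by a reader thread when a heartbeat or update arrives.
    void processIncoming(byte[] msg) {
        domainLock.lock();   // parks here for as long as forwardUpdate() is blocked
        try {
            // ... update per-domain state ...
        } finally {
            domainLock.unlock();
        }
    }
}
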
Comment by dlange [ 16/Apr/13 ]

Here is the replication configuration:
dn: cn=cn=admin data,cn=domains,cn=Multimaster Synchronization,cn=Synchronization Providers,cn=config
ds-cfg-base-dn: cn=admin data
objectClass: ds-cfg-replication-domain
objectClass: top
ds-cfg-server-id: 1386
cn: cn=admin data
ds-cfg-replication-server: corpdmanode1.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode10.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode2.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode3.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode4.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode7.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode8.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode9.video.polycom.com:8989

dn: cn=ou=Supercluster,cn=domains,cn=Multimaster Synchronization,cn=Synchronization Providers,cn=config
ds-cfg-assured-sd-level: 2
ds-cfg-base-dn: ou=Supercluster
objectClass: ds-cfg-replication-domain
objectClass: top
ds-cfg-assured-type: safe-data
ds-cfg-server-id: 28476
cn: ou=Supercluster
ds-cfg-replication-server: corpdmanode1.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode10.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode2.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode3.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode4.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode7.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode8.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode9.video.polycom.com:8989

dn: cn=ou=ReadiManager,cn=domains,cn=Multimaster Synchronization,cn=Synchronization Providers,cn=config
ds-cfg-assured-sd-level: 2
ds-cfg-base-dn: ou=ReadiManager
objectClass: ds-cfg-replication-domain
objectClass: top
ds-cfg-assured-type: safe-data
ds-cfg-server-id: 19419
cn: ou=ReadiManager
ds-cfg-replication-server: corpdmanode1.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode10.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode2.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode3.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode4.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode7.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode8.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode9.video.polycom.com:8989

Comment by dlange [ 18/Apr/13 ]

I reproduced this in a lab environment today, using the following procedure:
1) Install two servers with OpenDJ-2.5.0-Xpress1, and configure them to replicate to each other.
2) Configure assured replication with safe data, and a safe data level of 2.
3) Configure a 100ms delay in packets going from server "carrot" to server "r620-load2".
4) Run the test program (which will be in the attachment; a rough sketch of such a client is included below). The test program drives modify requests to each of the servers, so that each has data to replicate to the other.
5) Configure 10% packet loss between the two servers.

The test program should have run continuously, but in fact it locked up right after the packet loss was configured between the two servers.
So far, I have only been able to reproduce this once; subsequent attempts have failed to reproduce the problem.
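
This is not the attached test program, just a rough sketch of a client along those lines (placeholder hostnames, credentials, and target entry), driving continuous MODIFY requests at both servers with JNDI:

import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.directory.BasicAttribute;
import javax.naming.directory.DirContext;
import javax.naming.directory.ModificationItem;
import javax.naming.ldap.InitialLdapContext;
import javax.naming.ldap.LdapContext;

public class ModifyLoad {

    static LdapContext connect(String url) throws Exception {
        Hashtable<String, String> env = new Hashtable<String, String>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, url);
        env.put(Context.SECURITY_PRINCIPAL, "cn=Directory Manager");   // placeholder bind DN
        env.put(Context.SECURITY_CREDENTIALS, "password");             // placeholder password
        return new InitialLdapContext(env, null);
    }

    public static void main(String[] args) {
        // Placeholder LDAP URLs for the two replicated servers.
        String[] urls = { "ldap://carrot:1389", "ldap://r620-load2:1389" };
        for (final String url : urls) {
            new Thread(new Runnable() {
                public void run() {
                    try {
                        LdapContext ctx = connect(url);
                        String dn = "cn=test entry,ou=testdata";        // placeholder target entry
                        for (long i = 0; ; i++) {
                            ModificationItem[] mods = {
                                new ModificationItem(DirContext.REPLACE_ATTRIBUTE,
                                    new BasicAttribute("description", "update-" + i))
                            };
                            ctx.modifyAttributes(dn, mods);             // one MODIFY per iteration
                        }
                    } catch (Exception e) {
                        // The load stops when a server hangs or the connection fails.
                        e.printStackTrace();
                    }
                }
            }).start();
        }
    }
}
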

I notice the following lines in the thread dumps from the two servers:
"Replication server RS(811) writing to Replication server RS(11268) for domain "ou=testdata" at r620-load2.prx.eng.westminster.polycom.com/10.48.2.105:21494" prio=10 tid=0x00002b2fec377800 nid=0x73cb runnable [0x00002b2fe9103000]
java.lang.Thread.State: RUNNABLE
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(Unknown Source)
at java.net.SocketOutputStream.write(Unknown Source)
at org.opends.server.replication.protocol.TLSSocketSession.publish(TLSSocketSession.java:318)
at org.opends.server.replication.server.ServerWriter.run(ServerWriter.java:197)

"Replication server RS(811) reading from Replication server RS(11268) for domain "ou=testdata" at r620-load2.prx.eng.westminster.polycom.com/10.48.2.105:21494" prio=10 tid=0x000000005c70a000 nid=0x73ca waiting on condition [0x00002b2fe9002000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)

  • parking to wait for <0x00002b2f243d7cd8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(Unknown Source)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(Unknown Source)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(Unknown Source)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(Unknown Source)
    at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(Unknown Source)
    at java.util.concurrent.locks.ReentrantLock.lock(Unknown Source)
    at org.opends.server.replication.protocol.TLSSocketSession.publish(TLSSocketSession.java:314)
    at org.opends.server.replication.protocol.TLSSocketSession.publish(TLSSocketSession.java:298)
    at org.opends.server.replication.server.ServerHandler.sendAck(ServerHandler.java:986)
    at org.opends.server.replication.server.ReplicationServerDomain.processSafeDataUpdateMsg(ReplicationServerDomain.java:714)
    at org.opends.server.replication.server.ReplicationServerDomain.put(ReplicationServerDomain.java:272)
    at org.opends.server.replication.server.ServerHandler.put(ServerHandler.java:1347)
    at org.opends.server.replication.server.ServerReader.run(ServerReader.java:183)

"Replication server RS(11268) writing to Replication server RS(811) for domain "ou=testdata" at carrot.prx.eng.westminster.polycom.com/10.47.17.192:8989" prio=10 tid=0x00007ff7b8007000 nid=0x6dc1 runnable [0x00007ff988fce000]
java.lang.Thread.State: RUNNABLE
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(Unknown Source)
at java.net.SocketOutputStream.write(Unknown Source)
at org.opends.server.replication.protocol.TLSSocketSession.publish(TLSSocketSession.java:318)
at org.opends.server.replication.server.ServerWriter.run(ServerWriter.java:197)

"Replication server RS(11268) reading from Replication server RS(811) for domain "ou=testdata" at carrot.prx.eng.westminster.polycom.com/10.47.17.192:8989" prio=10 tid=0x00007ff7b8006000 nid=0x6dc0 waiting on condition [0x00007ff9890cf000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)

  • parking to wait for <0x000000070a516df0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(Unknown Source)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(Unknown Source)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(Unknown Source)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(Unknown Source)
    at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(Unknown Source)
    at java.util.concurrent.locks.ReentrantLock.lock(Unknown Source)
    at org.opends.server.replication.protocol.TLSSocketSession.publish(TLSSocketSession.java:314)
    at org.opends.server.replication.protocol.TLSSocketSession.publish(TLSSocketSession.java:298)
    at org.opends.server.replication.server.ServerHandler.sendAck(ServerHandler.java:986)
    at org.opends.server.replication.server.ReplicationServerDomain.processSafeDataUpdateMsg(ReplicationServerDomain.java:714)
    at org.opends.server.replication.server.ReplicationServerDomain.put(ReplicationServerDomain.java:272)
    at org.opends.server.replication.server.ServerHandler.put(ServerHandler.java:1347)
    at org.opends.server.replication.server.ServerReader.run(ServerReader.java:183)
Comment by dlange [ 18/Apr/13 ]

logs from incident 4.

Comment by Ludovic Poitou [ 19/Apr/13 ]

Thanks for the details. We will try to reproduce the issue with the provided test client.
I notice that you are using Java 6 (update 20), which is quite old and no longer supported. Is this a company choice, or just the JVM that happened to be installed?

Comment by dlange [ 19/Apr/13 ]

The stable version of our product uses Java 6 (update 20), and the most recent version uses Java 7 (update 11). The first three incidents happened on the stable version, running Java 6 (update 20). Incident 4 actually had one server running each version, because I wasn't paying close attention to what version was installed.

Comment by dlange [ 23/Apr/13 ]

I reproduced the problem twice more today, although I still can't reproduce it consistently. I changed the network impairment so that each server adds 100ms of latency and drops 10% of the packets it receives from the other:
tc qdisc add dev eth0 root handle 1:0 netem delay 100ms
iptables -I INPUT 1 -s 10.47.17.192 -m statistic --mode random --probability 0.1 -j DROP

This time, I took packet captures. They seem to indicate that the TCP window size on both sides of the replication socket gets pushed down to zero and then stays there.

Comment by dlange [ 23/Apr/13 ]

logs and packet captures from incident 6.

Comment by Ludovic Poitou [ 23/Apr/13 ]

Thanks for the data.
I think we are starting to get an understanding of what's going on, which is a good thing on a path to find the root cause and then fix it.
Regards,
Ludo

Comment by dlange [ 24/Apr/13 ]

On closer inspection, I can reproduce the problem repeatably using the method I described in my previous comment. But when I reproduce the problem, restart both servers, and then reproduce the problem again, the worker threads don't get stuck; only the replication threads get stuck.

Similarly, I can reproduce the problem with assured replication disabled, but again the worker threads don't get stuck; only the replication threads do.

Comment by dlange [ 24/Apr/13 ]

I speculate that:
When the replication writer thread hangs in socketWrite0, the other threads get stuck waiting for it, causing the system to hang.
But socketWrite0 does not normally block. It normally only blocks when the writer's send buffer and reader's receive buffer are both full.
Normally, the reader can fall very far behind the writer before the buffers will fill up. So this problem does not manifest.
But when there is enough packet loss on the network over a short duration, the TCP congestion control algorithm decreases the TCP window size all the way to zero.
In this condition, the system behaves as it would with full buffers: the sending writer thread cannot proceed until the receiving reader thread removes data from the buffer.
But the receiving reader thread blocks waiting for its own writer thread, and the system hangs. The condition persists even after the network returns to normal.

I am going to try reproducing the socket issue with a simple test program, independent of OpenDJ.
If that works, I am going to experiment with different buffer sizes and congestion control algorithms in the operating system to see if they make a difference.
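
As a rough illustration of what such a test program might look like (hypothetical, not the actual program): one side writes continuously while the other side accepts the connection but never reads, so once the sender's send buffer and the receiver's receive buffer fill up, write() blocks in socketWrite0, just like the replication writer threads in the dumps above.

import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class BlockedWriteDemo {

    public static void main(String[] args) throws Exception {
        final ServerSocket server = new ServerSocket(0);   // ephemeral local port

        // "Peer" that accepts the connection but never reads, so its receive
        // buffer eventually fills and its advertised TCP window drops to zero.
        new Thread(new Runnable() {
            public void run() {
                try {
                    Socket peer = server.accept();
                    Thread.sleep(Long.MAX_VALUE);          // hold the connection open, read nothing
                } catch (Exception ignored) {
                }
            }
        }).start();

        Socket client = new Socket("localhost", server.getLocalPort());
        OutputStream out = client.getOutputStream();
        byte[] chunk = new byte[64 * 1024];
        long written = 0;
        while (true) {
            out.write(chunk);                              // eventually blocks here in socketWrite0
            written += chunk.length;
            System.out.println("written so far: " + written + " bytes");
        }
    }
}
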

Comment by Ludovic Poitou [ 24/Apr/13 ]

I think I agree with your analysis.
We are discussing and investigating options to prevent the complete system deadlock. Ideally, the server should resume replicating properly after the network returns to normal.

Comment by Ludovic Poitou [ 20/Jun/13 ]

Committed in revision 9015, review CR-1675. Thanks to the reviewers.
