Hi,
The timeout message may indicate contention on the server or the network.
The failure to obtain a write lock could come from various sources, including a resource shortage, but also possibly a defect.
I looked at the code you mention, and while there is clearly some mismanagement of the synchronization around waitingAckMsgs, there is no proof that it is the cause of what you are experiencing.
Even if I were to fix that code, I would have no way to test the changes and verify that they resolve your problem, because based on this bug report I have no way to understand or reproduce your configuration.
A defect report should come with a precise scenario that helps us reproduce and understand what is going on. Providing the contents of the logs and/or a stack trace taken while the server is not responding properly can also help with diagnosis.
Are you able to produce this data?
|
I have seen this happen twice. I don't have stack dumps from either of those incidents. I tried writing a test application to spam concurrent modification requests, but that did not seem to reproduce the problem. I will attach the logs from the two incidents.
|
Incident 2. The incident begins at 08:04:22.
|
Incident 1. The incident starts at 23:01:35.
|
It happened again today. I am attaching logs and thread dumps from the third incident.
|
It happened again today. I have more data and a better understanding now.
The incident was triggered by network connectivity problems.
Although there are network timeout errors in the error log, the MODIFY operations that they describe did not actually return control to the client when the timeout happened. Instead, control was returned to the client only when the operation actually completed. I think OpenDJ stopped responding to SEARCH requests as well, but it is difficult to tell whether that happened or whether the client simply stopped searching because OpenDJ had consumed all of its threads.
After about 15 minutes, the administrator noticed the problem, took thread dumps, and restarted the servers. The thread dump from corpdmanode1 is from 15 minutes after the start of the incident; the others are from considerably later.
In the attached logs, you can find the incident at 06:56 GMT-06:00 on corpdmanode1 and corpdmanode7. Since this appears to be a distributed problem, I have attached logs from all the servers.
|
I notice in the thread dump that each ReplicationServerDomain has a lock that it holds while sending and receiving replication messages. It appears to me that the network problems caused the message-sending code to block waiting for a socket write while holding the lock. This, in turn, caused the message-receiving code to block waiting for the lock:
"Replication server RS(5972) reading from Replication server RS(20660) for domain "cn=admin data" at corpdmanode9.video.polycom.com/10.245.16.70:8989" prio=10 tid=0x0000000048f63800 nid=0x2b9c runnable [0x0000000052390000]
java.lang.Thread.State: RUNNABLE
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(Unknown Source)
at java.net.SocketOutputStream.write(Unknown Source)
at org.opends.server.replication.protocol.SocketSession.publish(SocketSession.java:273)
at org.opends.server.replication.server.ServerHandler.send(ServerHandler.java:1006)
at org.opends.server.replication.server.ReplicationServerDomain.process(ReplicationServerDomain.java:1636)
at org.opends.server.replication.server.ServerHandler.process(ServerHandler.java:941)
at org.opends.server.replication.server.ServerReader.run(ServerReader.java:249)
"Replication server RS(5972) reading from Replication server RS(2628) for domain "cn=admin data" at /10.245.16.71:23007" prio=10 tid=0x0000000048fdf800 nid=0x2b9f waiting on condition [0x0000000052592000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00002aaab9cce160> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(Unknown Source)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(Unknown Source)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireInterruptibly(Unknown Source)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(Unknown Source)
at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(Unknown Source)
at org.opends.server.replication.server.ReplicationServerDomain.lock(ReplicationServerDomain.java:2921)
at org.opends.server.replication.server.ReplicationServerDomain.processChangeTimeHeartbeatMsg(ReplicationServerDomain.java:3353)
at org.opends.server.replication.server.ServerHandler.process(ServerHandler.java:955)
at org.opends.server.replication.server.ServerReader.run(ServerReader.java:257)
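To make the interaction concrete, here is a rough sketch of the pattern I think I am seeing (hypothetical names and simplified code, not the actual OpenDJ classes): the reader thread takes the domain lock and then does a blocking socket write to forward the message, so while that write is stalled by the network, every other thread that needs the same lock parks behind it:

import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the locking pattern, not the real OpenDJ code.
class DomainSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final OutputStream peerSocketOut;   // socket to another replication server

    DomainSketch(OutputStream peerSocketOut) {
        this.peerSocketOut = peerSocketOut;
    }

    // Called by a reader thread for an incoming update message.
    void process(byte[] msg) throws IOException, InterruptedException {
        lock.lockInterruptibly();               // the per-domain lock
        try {
            // Blocking write: if the peer's TCP window is zero, this thread
            // sits in socketWrite0() while still holding the domain lock ...
            peerSocketOut.write(msg);
        } finally {
            lock.unlock();
        }
    }

    // ... so another reader thread handling, e.g., a heartbeat parks here indefinitely.
    void processHeartbeat(byte[] msg) throws InterruptedException {
        lock.lockInterruptibly();
        try {
            // update heartbeat state
        } finally {
            lock.unlock();
        }
    }
}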
|
Here is the replication configuration:
dn: cn=cn=admin data,cn=domains,cn=Multimaster Synchronization,cn=Synchronizatio
n Providers,cn=config
ds-cfg-base-dn: cn=admin data
objectClass: ds-cfg-replication-domain
objectClass: top
ds-cfg-server-id: 1386
cn: cn=admin data
ds-cfg-replication-server: corpdmanode1.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode10.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode2.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode3.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode4.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode7.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode8.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode9.video.polycom.com:8989
dn: cn=ou=Supercluster,cn=domains,cn=Multimaster Synchronization,cn=Synchronizat
ion Providers,cn=config
ds-cfg-assured-sd-level: 2
ds-cfg-base-dn: ou=Supercluster
objectClass: ds-cfg-replication-domain
objectClass: top
ds-cfg-assured-type: safe-data
ds-cfg-server-id: 28476
cn: ou=Supercluster
ds-cfg-replication-server: corpdmanode1.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode10.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode2.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode3.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode4.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode7.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode8.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode9.video.polycom.com:8989
dn: cn=ou=ReadiManager,cn=domains,cn=Multimaster Synchronization,cn=Synchronizat
ion Providers,cn=config
ds-cfg-assured-sd-level: 2
ds-cfg-base-dn: ou=ReadiManager
objectClass: ds-cfg-replication-domain
objectClass: top
ds-cfg-assured-type: safe-data
ds-cfg-server-id: 19419
cn: ou=ReadiManager
ds-cfg-replication-server: corpdmanode1.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode10.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode2.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode3.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode4.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode7.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode8.video.polycom.com:8989
ds-cfg-replication-server: corpdmanode9.video.polycom.com:8989
|
I reproduced this in a lab environment today, using the following procedure:
1) Install two servers with OpenDJ-2.5.0-Xpress1, and configure them to replicate to each other.
2) Configure assured replication with safe data, and a safe data level of 2.
3) Configure a 100ms delay in packets going from server "carrot" to server "r620-load2".
4) Run the test program (which will be in the attachment). The test program drives modify requests to each of the servers, so that each has data to replicate to the other.
5) Configure 10% packet loss between the two servers.
The test program should have run continuously, but in fact it locked up right after the packet loss was configured between the two servers.
So far I have only been able to reproduce this once; subsequent attempts have failed to reproduce the problem.
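The actual test program is in the attachment. For reference, a minimal client along the following lines (plain JNDI, with placeholder host port, credentials, and entry DN) is enough to keep both servers generating changes for each other:

import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.directory.BasicAttribute;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;
import javax.naming.directory.ModificationItem;

// Placeholder port, credentials, and entry DN; the real test program is attached.
public class ModifySpammer {

    public static void main(String[] args) {
        String[] urls = { "ldap://carrot:1389", "ldap://r620-load2:1389" };
        for (final String url : urls) {
            new Thread(new Runnable() {
                public void run() {
                    spam(url);
                }
            }).start();
        }
    }

    private static void spam(String url) {
        try {
            Hashtable<String, String> env = new Hashtable<String, String>();
            env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
            env.put(Context.PROVIDER_URL, url);
            env.put(Context.SECURITY_AUTHENTICATION, "simple");
            env.put(Context.SECURITY_PRINCIPAL, "cn=Directory Manager");
            env.put(Context.SECURITY_CREDENTIALS, "password");
            DirContext ctx = new InitialDirContext(env);

            String dn = "cn=test entry,ou=testdata";
            for (long i = 0; ; i++) {
                ModificationItem[] mods = {
                    new ModificationItem(DirContext.REPLACE_ATTRIBUTE,
                            new BasicAttribute("description", url + " change " + i))
                };
                // Each MODIFY has to be replicated to the other server.
                ctx.modifyAttributes(dn, mods);
            }
        } catch (Exception e) {
            // When a server stops responding, the loop ends up stuck or failing here.
            e.printStackTrace();
        }
    }
}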
I notice the following lines in the thread dumps from the two servers:
"Replication server RS(811) writing to Replication server RS(11268) for domain "ou=testdata" at r620-load2.prx.eng.westminster.polycom.com/10.48.2.105:21494" prio=10 tid=0x00002b2fec377800 nid=0x73cb runnable [0x00002b2fe9103000]
java.lang.Thread.State: RUNNABLE
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(Unknown Source)
at java.net.SocketOutputStream.write(Unknown Source)
at org.opends.server.replication.protocol.TLSSocketSession.publish(TLSSocketSession.java:318)
at org.opends.server.replication.server.ServerWriter.run(ServerWriter.java:197)
"Replication server RS(811) reading from Replication server RS(11268) for domain "ou=testdata" at r620-load2.prx.eng.westminster.polycom.com/10.48.2.105:21494" prio=10 tid=0x000000005c70a000 nid=0x73ca waiting on condition [0x00002b2fe9002000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00002b2f243d7cd8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(Unknown Source)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(Unknown Source)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(Unknown Source)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(Unknown Source)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(Unknown Source)
at java.util.concurrent.locks.ReentrantLock.lock(Unknown Source)
at org.opends.server.replication.protocol.TLSSocketSession.publish(TLSSocketSession.java:314)
at org.opends.server.replication.protocol.TLSSocketSession.publish(TLSSocketSession.java:298)
at org.opends.server.replication.server.ServerHandler.sendAck(ServerHandler.java:986)
at org.opends.server.replication.server.ReplicationServerDomain.processSafeDataUpdateMsg(ReplicationServerDomain.java:714)
at org.opends.server.replication.server.ReplicationServerDomain.put(ReplicationServerDomain.java:272)
at org.opends.server.replication.server.ServerHandler.put(ServerHandler.java:1347)
at org.opends.server.replication.server.ServerReader.run(ServerReader.java:183)
"Replication server RS(11268) writing to Replication server RS(811) for domain "ou=testdata" at carrot.prx.eng.westminster.polycom.com/10.47.17.192:8989" prio=10 tid=0x00007ff7b8007000 nid=0x6dc1 runnable [0x00007ff988fce000]
java.lang.Thread.State: RUNNABLE
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(Unknown Source)
at java.net.SocketOutputStream.write(Unknown Source)
at org.opends.server.replication.protocol.TLSSocketSession.publish(TLSSocketSession.java:318)
at org.opends.server.replication.server.ServerWriter.run(ServerWriter.java:197)
"Replication server RS(11268) reading from Replication server RS(811) for domain "ou=testdata" at carrot.prx.eng.westminster.polycom.com/10.47.17.192:8989" prio=10 tid=0x00007ff7b8006000 nid=0x6dc0 waiting on condition [0x00007ff9890cf000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x000000070a516df0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(Unknown Source)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(Unknown Source)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(Unknown Source)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(Unknown Source)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(Unknown Source)
at java.util.concurrent.locks.ReentrantLock.lock(Unknown Source)
at org.opends.server.replication.protocol.TLSSocketSession.publish(TLSSocketSession.java:314)
at org.opends.server.replication.protocol.TLSSocketSession.publish(TLSSocketSession.java:298)
at org.opends.server.replication.server.ServerHandler.sendAck(ServerHandler.java:986)
at org.opends.server.replication.server.ReplicationServerDomain.processSafeDataUpdateMsg(ReplicationServerDomain.java:714)
at org.opends.server.replication.server.ReplicationServerDomain.put(ReplicationServerDomain.java:272)
at org.opends.server.replication.server.ServerHandler.put(ServerHandler.java:1347)
at org.opends.server.replication.server.ServerReader.run(ServerReader.java:183)
|
Logs from incident 4.
|
Thanks for the details. We will try to reproduce the issue with the provided test client.
I notice that you are using Java 6 (update 20), which is really old and out of support. Is this a company choice, or just the JVM that happened to be installed?
|
The stable version of our product uses Java 6 (update 20), and the most recent version uses Java 7 (update 11). The first three incidents happened on the stable version, running Java 6 (update 20). Incident 4 actually had one server running each version, because I wasn't paying close attention to what version was installed.
|
I reproduced the problem twice more today, although I still can't reproduce it consistently. I changed the network impairment so that each server adds 100 ms of latency and drops 10% of the packets it receives from the other:
tc qdisc add dev eth0 root handle 1:0 netem delay 100ms
iptables -I INPUT 1 -s 10.47.17.192 -m statistic --mode random --probability 0.1 -j DROP
This time, I took packet captures. The captures seem to indicate that the TCP window size on both sides of the replication socket gets pushed down to zero and then stays there.
|
Logs and packet captures from incident 6.
|
Thanks for the data.
I think we are starting to get an understanding of what's going on, which is a good step on the path to finding the root cause and fixing it.
Regards,
Ludo
|
On closer inspection, I can reproduce the problem repeatably using the method I described in my previous comment. However, when I reproduce the problem, restart both servers, and then reproduce it again, the worker threads don't get stuck; only the replication threads do.
Similarly, I can reproduce the problem with assured replication disabled, but again the worker threads don't get stuck; only the replication threads do.
|
I speculate that:
When the replication writer thread hangs in socketWrite0, the other threads get stuck waiting for it, causing the system to hang.
But socketWrite0 does not normally block. It normally only blocks when the writer's send buffer and reader's receive buffer are both full.
Normally, the reader can fall very far behind the writer before the buffers will fill up. So this problem does not manifest.
But when there is enough packet loss on the network over a short duration, the TCP congestion control algorithm decreases the TCP window size all the way to zero.
In this condition, the system behaves as if the buffers were full: the sending writer thread cannot proceed until the receiving reader thread removes data from the buffer.
But the receiving reader thread blocks waiting for its own writer thread, and the system hangs. The condition persists even after the network returns to normal.
I am going to try reproducing the socket issue with a simple test program, independent of OpenDJ.
If that works, I am going to experiment with different buffer sizes and congestion control algorithms in the operating system to see if they make a difference.
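A rough sketch of what I have in mind (standalone, no OpenDJ code, and without simulating packet loss): if both ends of a connection only write and never read, both directions' send and receive buffers fill and both write() calls block indefinitely, which is essentially the write/write hang I suspect the zero windows leave the replication connection in. Small send buffers just make the hang show up quickly:

import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

// Standalone sketch: both ends write and neither reads, so both directions'
// buffers fill and both write() calls end up blocked in socketWrite0.
public class WriteWriteDeadlock {
    public static void main(String[] args) throws Exception {
        final ServerSocket server = new ServerSocket(0);

        Thread accepter = new Thread(new Runnable() {
            public void run() {
                try {
                    Socket s = server.accept();
                    s.setSendBufferSize(8 * 1024);   // small send buffer so the hang appears sooner
                    writeForever(s, "server");
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });
        accepter.start();

        Socket client = new Socket("localhost", server.getLocalPort());
        client.setSendBufferSize(8 * 1024);
        writeForever(client, "client");
    }

    private static void writeForever(Socket s, String name) throws Exception {
        OutputStream out = s.getOutputStream();
        byte[] chunk = new byte[1024];
        long written = 0;
        while (true) {
            out.write(chunk);                        // eventually blocks forever
            written += chunk.length;
            System.out.println(name + " wrote " + written + " bytes");
        }
    }
}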
|
I think I agree with your analysis.
We are discussing and investigating options to prevent the complete system deadlock. Ideally, the server should resume replicating properly after the network returns to normal.
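To give an idea of the kind of option being discussed (a sketch only, with hypothetical names; not a commitment to a particular fix): the blocking socket write could be taken off the threads that hold the domain lock, for example by queueing outgoing messages with a timeout and letting a dedicated writer thread do the actual write, so a stalled connection is detected and dropped instead of wedging the whole domain.

import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only, not the actual change.
class QueueingSender {
    private final BlockingQueue<byte[]> outgoing = new LinkedBlockingQueue<byte[]>(1000);
    private final OutputStream socketOut;

    QueueingSender(OutputStream socketOut) {
        this.socketOut = socketOut;
    }

    // Called by reader/worker threads, possibly while holding the domain lock.
    void send(byte[] msg) throws IOException, InterruptedException {
        if (!outgoing.offer(msg, 5, TimeUnit.SECONDS)) {
            // The peer has not drained anything for a while: give up on this
            // connection instead of blocking forever with the lock held.
            throw new IOException("peer connection appears dead, closing it");
        }
    }

    // Runs on a dedicated writer thread; this is the only place that can block
    // in socketWrite0, and it never holds the domain lock.
    void writerLoop() throws IOException, InterruptedException {
        while (true) {
            socketOut.write(outgoing.take());
        }
    }
}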
|
Committed in revision 9015, review CR-1675. Thanks to the reviewers.