[OPENDJ-4988] Topology wide inter process deadlock during long running replicated update stress tests
Created: 2018-04-10  Updated: 2020-03-03  Resolved: 2018-04-12

Status: Done
Project: OpenDJ
Component/s: replication
Affects Version/s: 5.5.0, 6.0.0
Fix Version/s: 6.0.0

Type: Bug Priority: Blocker
Reporter: Christophe Sovant Assignee: Fabio Pistolesi
Resolution: Fixed Votes: 0
Labels: Verified

Attachments: dj1.jstack, dj2.jstack, rs1.jstack, rs2.jstack
Issue Links:
is backported by OPENDJ-5265 Backport OPENDJ-4988: Topology wide i... Dev backlog
relates to OPENDJ-846 Intermittent Replication Failure Done
relates to OPENDJ-1354 replication threads BLOCKED in pendin... Done
relates to OPENDJ-922 Replication window size is too small ... Done
is related to OPENDJ-934 Changes to RS window-size property re... Done
QA Assignee: Christophe Sovant
Version: 6.0.0  Release date: 2018-05-08  Issue: OPENDJ-4988


Found using OpenDJ 6.0.0 M120.7

We found this issue while running long stress tests (1.5 days) that perform:

  • modify operations on split DS/RS topology
  • add/del operations on standard topology (combined DS/RS)

For instance, in the modify test on the split DS/RS topology, both DS are blocked and no longer accept new operations: the modrate tool output shows a recent throughput of 0 with 0 errors.
There are no errors in the DS or RS logs.

Comment by Matthew Swift [ 2018-04-10 ]

Cause is an inter-process deadlock across the topology:

  1. DJ1: thread reading from RS (1) is indirectly performing a blocking write to the same RS (3):
    1. dc=europe,dc=com listener thread has read an update message from RS1 and is blocked trying to put the update message on the replay queue
    2. all but one of the replay threads are blocked attempting to update the send window
    3. the remaining replay thread is updating the send window and is blocked while attempting to send a WindowMsg to RS1 over the network session. It looks like the session has been closed though (due to socket timeout?)
    4. all but one of the worker threads are blocked on the pending changes queue lock
    5. remaining worker thread is holding the pending changes queue lock, but is blocked trying to push an update message to RS1 over the network.
  2. RS1: thread reading from DS1 is directly performing a blocking write to the same DS (1)
    1. thread reading from DS1 is blocked attempting to send a WindowMsg to DS1. Unlike the DS, the session looks open
    2. server writer, monitor, heartbeat threads are all blocked trying to queue messages to be sent to DS1
    3. RS1 -> DS1 session thread is blocked due to full TCP buffer

The result is a circle of dependencies resulting in a deadlock in the style of the dining philosophers problem. To resolve the problem, we simply need to break the chain, e.g. by finally removing the window support.
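The cycle described above can be reproduced in miniature. The following is a minimal sketch only, not OpenDJ code: the class WindowDeadlockSketch, the methods reader and readersStall, and the use of capacity-1 queues as stand-ins for full TCP buffers are all assumptions made for illustration. Each simulated reader thread must send a WindowMsg on its outbound link before it reads from its inbound link. When the send blocks, neither side ever drains the other's buffer and both stall. The non-blocking variant gives up on the send instead of waiting (a real fix would have to defer or remove the message rather than drop it), which is one hypothetical way of breaking the chain.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class WindowDeadlockSketch {

    /** Hypothetical reader: sends a window message on outbound, then reads one message from inbound. */
    static Thread reader(BlockingQueue<String> inbound, BlockingQueue<String> outbound,
                         boolean blockingAck, CountDownLatch done) {
        Thread t = new Thread(() -> {
            try {
                if (blockingAck) {
                    outbound.put("WindowMsg");   // waits until the peer drains the buffer
                } else {
                    outbound.offer("WindowMsg"); // returns at once, even if the buffer is full
                }
                inbound.take();                  // only reached if the send did not block
                done.countDown();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.setDaemon(true);
        return t;
    }

    /** Returns true if the two readers fail to make progress within the timeout. */
    static boolean readersStall(boolean blockingAck) {
        // Capacity-1 queues, pre-filled, stand in for full buffers in each direction.
        BlockingQueue<String> dsToRs = new ArrayBlockingQueue<>(1);
        BlockingQueue<String> rsToDs = new ArrayBlockingQueue<>(1);
        dsToRs.add("UpdateMsg");
        rsToDs.add("UpdateMsg");

        CountDownLatch done = new CountDownLatch(2);
        Thread ds = reader(rsToDs, dsToRs, blockingAck, done); // DS reads from RS, writes to RS
        Thread rs = reader(dsToRs, rsToDs, blockingAck, done); // RS reads from DS, writes to DS
        ds.start();
        rs.start();
        try {
            boolean finished = done.await(500, TimeUnit.MILLISECONDS);
            return !finished;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            ds.interrupt(); // release any reader still stuck in put()
            rs.interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("blocking send stalls: " + readersStall(true));
        System.out.println("non-blocking send stalls: " + readersStall(false));
    }
}
```

In this toy model the blocking variant stalls because each put() waits on a buffer that only the other, equally blocked, reader can drain. The real topology has more participants (replay threads, worker threads, the pending changes queue lock), but the shape of the dependency cycle is the same.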

Comment by Matthew Swift [ 2018-04-10 ]

Problems like this have been encountered in the past (see the linked issues). I'm not sure why we never removed the windowing support, having effectively disabled it in OPENDJ-922.

Comment by Matthew Swift [ 2018-04-10 ]

It's unclear why replication sessions attempt to send messages even after the session has been closed. I think this is a secondary issue, but it would be a good idea to fix it. It may not need backporting though, so we should use separate commits.

Comment by Matthew Swift [ 2018-04-10 ]

It's not easy to remove the WindowMsg support because we need to support migration of old topologies to 6.0. Old versions of DJ will continue to send window messages and expect responses.

Comment by Christophe Sovant [ 2018-05-21 ]

Verified by running long-duration tests on 6.0.0 RC3 and the final build.
