[OPENDJ-1190] Under rare circumstances, the DS replication recovery thread (RSUpdater) can spin Created: 25/Oct/13  Updated: 08/Nov/19  Resolved: 19/Aug/14

Status: Done
Project: OpenDJ
Component/s: replication
Affects Version/s: 2.6.1, 2.6.0
Fix Version/s: 3.0.0, 2.8.0, 2.6.2

Type: Bug Priority: Major
Reporter: Matthew Swift Assignee: Chris Ridd
Resolution: Fixed Votes: 0
Labels: release-notes

Dev Assignee: Chris Ridd
QA Assignee: Christophe Sovant
Support Ticket IDs:

 Description   

Under rare circumstances the DS replication recovery thread (RSUpdate) can spin, can be started multiple times, will not shutdown willingly when the server is shutting down, and can look for replay operations in the future.

For example, a user recently reported:

During a load test we noticed that even under no load, one cpu core is pegged at 100%. For example cpu7 below.

11:29:42 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
11:29:45 PM all 12.68 0.00 0.13 0.04 0.00 0.00 0.00 0.00 87.15
11:29:45 PM 0 0.67 0.00 0.00 0.00 0.00 0.00 0.00 0.00 99.33
11:29:45 PM 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
11:29:45 PM 2 0.33 0.00 0.00 0.00 0.00 0.00 0.00 0.00 99.67
11:29:45 PM 3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
11:29:45 PM 4 0.33 0.00 0.00 0.00 0.00 0.00 0.00 0.00 99.67
11:29:45 PM 5 0.33 0.00 0.00 0.00 0.00 0.00 0.00 0.00 99.67
11:29:45 PM 6 0.33 0.00 0.00 0.00 0.00 0.00 0.00 0.00 99.67
11:29:45 PM 7 99.34 0.00 0.66 0.00 0.00 0.00 0.00 0.00 0.00

I took some consecutive jstacks of the pid and the only area where the stack was changing was for this thread. Is this something that we shoudl be concerned about?

"Replica DS(15968) missing change publisher for domain "cn=schema"" prio=10 tid=0x00007f77f014a000 nid=0x740e runnable [0x00007f76d35b4000]
java.lang.Thread.State: RUNNABLE
at java.util.HashMap.put(HashMap.java:372)
at java.util.HashSet.add(HashSet.java:200)
at java.util.AbstractCollection.addAll(AbstractCollection.java:305)
at java.util.LinkedHashSet.<init>(LinkedHashSet.java:152)
at org.opends.server.types.AttributeBuilder$SmallSet.addAll(AttributeBuilder.java:929)
at org.opends.server.types.AttributeBuilder.addAll(AttributeBuilder.java:1356)
at org.opends.server.backends.SchemaBackend.getSchemaEntry(SchemaBackend.java:824)
at org.opends.server.backends.SchemaBackend.getSchemaEntry(SchemaBackend.java:663)
at org.opends.server.backends.SchemaBackend.search(SchemaBackend.java:4229)
at org.opends.server.workflowelement.localbackend.LocalBackendSearchOperation.processLocalSearch(LocalBackendSearchOperation.java:266)
at org.opends.server.workflowelement.localbackend.LocalBackendWorkflowElement.execute(LocalBackendWorkflowElement.java:533)
at org.opends.server.core.WorkflowImpl.execute(WorkflowImpl.java:197)
at org.opends.server.core.WorkflowTopologyNode.execute(WorkflowTopologyNode.java:100)
at org.opends.server.core.SearchOperationBasis.run(SearchOperationBasis.java:1459)
at org.opends.server.protocols.internal.InternalClientConnection.processSearch(InternalClientConnection.java:2347)
at org.opends.server.protocols.internal.InternalClientConnection.processSearch(InternalClientConnection.java:2298)
at org.opends.server.replication.plugin.LDAPReplicationDomain.searchForChangedEntries(LDAPReplicationDomain.java:4812)
at org.opends.server.replication.plugin.LDAPReplicationDomain.buildAndPublishMissingChanges(LDAPReplicationDomain.java:4722)
at org.opends.server.replication.plugin.LDAPReplicationDomain$RSUpdater.run(LDAPReplicationDomain.java:440)

This was when recovering the schema backend which is a special case since it does not record ds-sync-hist info so recovery is impossible IIRC. However, we have also seen rare cases where this occurs for normal backends.



 Comments   
Comment by Jean-Noël Rouvignac [ 04/Nov/13 ]

Fixed in trunk.

Even though we could not reproduce, I have added counter measures which will:
This change is linked to the misbehaving which (wrongly):

  • will prevent RSUpdater thread to be started multiple times
  • will allow RSUpdater thread to shutdown faster when the server is shutting down
  • will make RSUpdater thread only look for changes older than "now"
Comment by Jean-Noël Rouvignac [ 04/Nov/13 ]

Fixed in 2.6.1 .

Comment by Chris Ridd [ 15/Jul/14 ]

This is still occurring with the same customer with 2.6.1, but only (it seems) on cn=schema.

Comment by Chris Ridd [ 28/Jul/14 ]

Some updated logging from 2.6.1 shows buildAndPublishMissingChanges() starting with a CSN in June 2013 (over a year ago) for cn=schema, and successfully performing lots of 10 second searches from that point on. This uses lots of CPU, leading the customer to believe the thread is spinning.

So one problem is the ancient start CSN, which it is retrieving from the RS.

Each search returns 0 entries, and is successful. At the end of the do..while loop in buildAndPublishMissingChanges(), lastRetrievedChange is always null, and the while loop loops for ever.

Comment by Matthew Swift [ 29/Jul/14 ]

I suggest investigating why the recovery is starting from June 2013. Why is the ds-sync-state a) not being read or, b) not being written? Why June 2013? Could it be that there was a regression introduced between 2.4.x and 2.6.x and they upgraded around June 2013 thus triggering the regression?

Comment by Chris Ridd [ 14/Aug/14 ]

The ds-sync-state is being written when changes are made. However changes are not published to the RS; when you DbDump the appropriate table in the RS changelogDb the last (and only) change for this malfunctioning server is in June 2013.

So that explains the start date.

The reason the loop doesn't terminate is that every single search is successful but returns 0 results.

It looks like the domain is effectively "stuck" in recovery mode; as the loop never completes any changes made by the user never update the RS. When the server's restarted, the loop just starts again from June 2013.

Ending the loop in a way that cancels recovery mode seemed to help.

Comment by Matthew Swift [ 07/Nov/19 ]

Moved to closed state because the fixVersion has already been released.

Generated at Sat Oct 31 01:54:46 UTC 2020 using Jira 7.13.12#713012-sha1:6e07c38070d5191bbf7353952ed38f111754533a.