[OPENDJ-6683] Divergences in encrypted changelog after upgrade Created: 27/Sep/19  Updated: 06/Dec/19  Resolved: 06/Dec/19

Status: Done
Project: OpenDJ
Component/s: replication, upgrade
Affects Version/s: 7.0.0
Fix Version/s: 7.0.0

Type: Bug Priority: Critical
Reporter: carole forel Assignee: Fabio Pistolesi
Resolution: Duplicate Votes: 0
Labels: None

Issue Links:
duplicates OPENDJ-5519 Change numbers not in sync when gener... Done
is related to OPENDJ-5665 Change numbers not in sync when gener... Dev backlog
Epic Link: Bugs 7.0
Story Points: 20
Dev Assignee: Nicolas Capponi


Found with OpenDJ 7.0.0-SNAPSHOT (35d9249c9cf)

This issue seems to happen a bit randomly, whatever the upgrade path is.

We set up 2 servers, one with data.
We configure replication between them and initialize it.
Then we set confidentiality on for changelogs on both servers:

/DJ_ENCRYPT1/opendj/bin/dsconfig -h openam.example.com -p 4451 -D "cn=myself" -w password -X set-replication-server-prop --provider-name "Multimaster Synchronization" --set confidentiality-enabled:true -n

DJ_ENCRYPT2/opendj/bin/dsconfig -h openam.example.com -p 4452 -D "cn=myself" -w password -X set-replication-server-prop --provider-name "Multimaster Synchronization" --set confidentiality-enabled:true -n

We check replication is working by doing LDAP operations on both servers.
Then we stop and upgrade one server.
We start it again and add some entries to check replication is working fine.
At that time, we compare changelog content and sometimes they are not in sync (even after more than 5minutes wait.)

Content of servers differs, differences can be found at: /root/workspace/OpenDJ-7.0.x/tests_ds_upgrade_paths_je_part1/results/20190926-150751/replication_group3/Upgrade/DJ_ENCRYPT1/opendj/tmp/diff_Encrypted_Replication_Topology_changelog_DJ_ENCRYPT1_DJ_ENCRYPT2.ldif 	
17:24:39.319 	INFO 	dn: changeNumber=10,cn=changelog
changetype: delete

dn: changeNumber=11,cn=changelog
changetype: delete

dn: changeNumber=12,cn=changelog
changetype: delete

dn: changeNumber=13,cn=changelog
changetype: delete

dn: changeNumber=14,cn=changelog
changetype: delete

dn: changeNumber=15,cn=changelog
changetype: delete

dn: changeNumber=16,cn=changelog
changetype: delete

dn: changeNumber=9,cn=changelog
changetype: delete

I have run this test on my laptop and most of the time, it succeeds while it fails everytime on jenkins.

Test is:

./run-pybot.py -s replication_group3.Upgrade -t encrypted_replication_topology opendj

it seems that what is reproducible is the entries that are missing.

Comment by Nicolas Capponi [ 21/Nov/19 ]

I have not succeeded to reproduce this issue on my laptop with a large number of runs.
I'm currently trying to reproduce it on my laptop running the functional test in a docker container, using pyforge-ready image (docker-internal.forgerock.io/forgerock/pyforge-ready:latest). No success so far.

I have also tried to provoke the issue using guess work:

  • deleting manually the offline file in the changelogDB just after the server is stopped and just before it is upgraded to 6.5.0 in order to see if this is the cause of the issue. The test succeeded. The rationale is that a missing offline message could lead to a failure to connect the servers and prevent the replication to work correctly.
  • corrupting the value of the symetric key in the admin data LDIF backend just after the server is stopped and just before it is upgraded to 6.5.0 in order to see if this could provoke an hidden connection error that lead to the issue. The test failed but the connection error is apparent, not hidden, and it does not corresponds to this issue.
Comment by Nicolas Capponi [ 25/Nov/19 ]

Blocked because I'm not able to reproduce it, and thus unable to investigate

Comment by Fabio Pistolesi [ 27/Nov/19 ]

Finally reproduced it on a lab machine. The changelogs seem ok, except the usual problems with cn=admin data. We saw an unrelated error on DJ_ENCRYPT2 after upgrading DJ_ENCRYPT1 about a secret key not being able to be imported.

The changenumber indexer though clearly stops indexing, most likely during upgrade of DJ_ENCRYPT1, but restarting the both servers fixes cn=changelog, both servers have the same content.

Comment by Fabio Pistolesi [ 28/Nov/19 ]

Interestingly, the divergence of the changelog is on the server not upgraded yet, so still on the previous version. The server upgraded to 7.0 works correctly.

Comment by Fabio Pistolesi [ 04/Dec/19 ]

After further digging, we found out we seem to hit a special case of OPENDJ-5519. During upgrade serverIDs for the domains are replaced with the id for cn=admin data, as 7.0 only has a global server-id. When the test fails in 5.5.0, changenumber indexing is not advancing the MCP because trying to process an heartbeat for a disappeared server, which is weird in itself.

As a matter of fact I cannot reproduce when testing upgrade from 6.5.

It looks the bug is not in 7.0.0, even though OPENDJ-5665 is still a problem.

Comment by Fabio Pistolesi [ 06/Dec/19 ]

Caught an interesting point during debugging, the "disappeared replica" is correctly referenced in the list of offline replicas, with CSN 0000016eb7c1118b766500000005 but also as last alive with CSN 0000016eb7c1118c766500000000, which is the problem when in previous DJ versions we sent a heartbeat independently of the other messages.
(the CSNs are 1 ms apart)

Comment by Fabio Pistolesi [ 06/Dec/19 ]

The issue is really in older versions. Suggestion is to tag failing test where divergences in changelog appear in mixed topologies (or after upgrading only one of the servers in the test) as a known issue.

Note that restarting the non ugraded server is a viable workaround.

Generated at Mon Mar 08 12:20:26 UTC 2021 using Jira 7.13.12#713012-sha1:6e07c38070d5191bbf7353952ed38f111754533a.