[OPENDJ-6036] Upgrade: Divergences in changelog in replication topology with confidentiality enabled.
Created: 25/Feb/19  Updated: 17/Jul/20  Resolved: 17/Jul/20

Status: Done
Project: OpenDJ
Component/s: regression, replication, upgrade
Affects Version/s: 7.0.0
Fix Version/s: 7.0.0

Type: Bug Priority: Major
Reporter: carole forel Assignee: carole forel
Resolution: Fixed Votes: 0
Labels: None

Epic Link: Bugs 7.0
Story Points: 1
Dev Assignee: Fabio Pistolesi


Found with rev 0b5093d719d1c87e1c722c94d60b3d12aa0bc758

We set up two 3.5.0 servers with some data.
Then we configure replication:

DJ_ENCRYPT1/opendj/bin/dsreplication enable --host1 localhost --port1 4451 --bindDN1 "cn=myself" --bindPassword1 "password" --replicationPort1 8996 --host2 localhost --port2 4452 --bindDN2 "cn=myself" --bindPassword2 "password" --replicationPort2 8997 -b dc=com -I admin -w password -X -n

DJ_ENCRYPT1/opendj/bin/dsreplication initialize-all -h localhost -p 4451 -b dc=com -I admin -w password -X -n

We enable confidentiality for the changelog on each server.
We add some entries on one server and check that the data is in sync.
Then we stop and upgrade the first server.

We check that replication is working by performing LDAP operations on both servers.
Some operations are missing from the second server's changelog:

The contents of the servers differ; the differences can be found at: /local/GIT/pyforge/results/20190225-164721/replication_group3/Upgrade/DJ_ENCRYPT1/opendj/tmp/diff_Encrypted_Replication_Topology_changelog_DJ_ENCRYPT1_DJ_ENCRYPT2.ldif
dn: changeNumber=10,cn=changelog
changetype: delete

dn: changeNumber=11,cn=changelog
changetype: delete

dn: changeNumber=12,cn=changelog
changetype: delete

dn: changeNumber=13,cn=changelog
changetype: delete

dn: changeNumber=14,cn=changelog
changetype: delete

dn: changeNumber=15,cn=changelog
changetype: delete

dn: changeNumber=16,cn=changelog
changetype: delete

dn: changeNumber=9,cn=changelog
changetype: delete
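The diff above appears to list the deleted cn=changelog entries in string order of their DNs, which is why changeNumber=9 comes after changeNumber=16. The following minimal Python sketch extracts the change numbers from such a diff and sorts them numerically, which makes it easier to see that the missing operations form one contiguous range (9 to 16). It assumes the diff file has exactly the "dn: ... / changetype: delete" layout shown here; the function names are hypothetical and not part of OpenDJ or pyforge.

```python
from __future__ import annotations

import re

# Hypothetical helper: matches lines such as "dn: changeNumber=10,cn=changelog".
_DN_RE = re.compile(r"dn:\s*changeNumber=(\d+),cn=changelog\s*$")


def missing_change_numbers(diff_ldif: str) -> list[int]:
    """Return the changeNumbers the diff marks as deleted, sorted numerically."""
    numbers = []
    current = None
    for line in diff_ldif.splitlines():
        match = _DN_RE.match(line)
        if match:
            # Remember the changeNumber of the entry we are inside.
            current = int(match.group(1))
        elif line.strip() == "changetype: delete" and current is not None:
            numbers.append(current)
            current = None
        elif not line.strip():
            # A blank line ends the current LDIF record.
            current = None
    return sorted(numbers)


def is_contiguous(numbers: list[int]) -> bool:
    """True if the sorted numbers form one unbroken range."""
    return not numbers or numbers == list(range(numbers[0], numbers[-1] + 1))


if __name__ == "__main__":
    # Rebuild the excerpt above, in the order the diff lists it.
    diff = "".join(
        f"dn: changeNumber={n},cn=changelog\nchangetype: delete\n\n"
        for n in [10, 11, 12, 13, 14, 15, 16, 9]
    )
    missing = missing_change_numbers(diff)
    print(missing)  # → [9, 10, 11, 12, 13, 14, 15, 16]
    print(is_contiguous(missing))  # → True
```

Sorting numerically rather than relying on the diff's ordering avoids misreading changeNumber=9 as an isolated gap after 16.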

To reproduce this issue:

./run-pybot.py -n -v -s replication_group3.Upgrade -t Encrypted_Replication_Topology opendj

This is a regression but is not always reproducible.

Comment by Chris Ridd [ 25/Feb/19 ]

The act of enabling replication will destroy all the instance keys used by confidentiality. Are you sure you are doing things in the correct order? What happens if you replicate and then set up confidentiality?

Comment by carole forel [ 25/Feb/19 ]

Actually, only the changelog is encrypted; my description was wrong. Thanks Chris Ridd for pointing that out.

Comment by Fabio Pistolesi [ 27/Feb/19 ]

The out-of-sync server is actually the one not yet upgraded, i.e. 3.5.x. Replication works, though: data is correctly sent to the other DS regardless of which server receives the change. The indexer on 3.5 seems stuck.
I had failures with 3.5.0, but 3.5.3 did not reproduce the problem. My hypothesis is that the backport of OPENDJ-4275 solved it, since it is the only change between 3.5.0 and 3.5.3 that actually touches the replicaDBs. When upgrading, new server IDs are assigned, which requires the cursors on peer servers to be re-initialized.

Comment by Fabio Pistolesi [ 27/Feb/19 ]

The problem seems to be in the previous version, and was fixed (or disappeared) with the released patches.

Comment by Matthew Swift [ 07/Nov/19 ]

Closing because no testing is required.

Comment by carole forel [ 17/Jul/20 ]

It happened again with revs 7.0.0-SNAPSHOT (1db66478396) and OpenDJ 7.0.0-SNAPSHOT (f69c2724125).

Comment by carole forel [ 17/Jul/20 ]

Unfortunately, it still seems to be present in 7.0.0-SNAPSHOT (1db66478396).

Comment by Fabio Pistolesi [ 17/Jul/20 ]

Are you sure this is the same problem? Is the out-of-sync server 3.5.3 or 7.0?

Comment by carole forel [ 17/Jul/20 ]

No, actually I'm not. It seems to happen after both servers have been upgraded: https://ci.forgerock.org/job/OpenDJ-build/job/master/2567//artifact/log-ft-part4-linux.html#s1-s8-s14-t4
I might have concluded too fast (tricked by the random aspect; it happened twice out of 8 runs).
Sorry, I will close this one and open a new one.

Comment by Fabio Pistolesi [ 17/Jul/20 ]

Thanks, carole forel. Note that the tests use 3.5.0 rather than 3.5.3; as mentioned before, 3.5.3 seems to solve the problem.

Generated at Mon Mar 01 10:37:33 UTC 2021 using Jira 7.13.12#713012-sha1:6e07c38070d5191bbf7353952ed38f111754533a.