[OPENDJ-3133] dsreplication status reports M.C. (Missing Changes) when none exist. Created: 16/Jun/16  Updated: 03/Mar/20  Resolved: 21/Nov/16

Status: Done
Project: OpenDJ
Component/s: backends, core server, replication, tools
Affects Version/s: 4.0.0, 3.5.0, 3.0.0
Fix Version/s: 4.0.0

Type: Bug Priority: Critical
Reporter: Lee Trujillo Assignee: Jean-Noël Rouvignac
Resolution: Fixed Votes: 0
Labels: Verified, release-notes

Attachments: Zip Archive OPENDJ-3133-rep-post-divergence.zip     Zip Archive OPENDJ-3133-rep-post-external.zip     Zip Archive OPENDJ-3133-reproduction.zip     Zip Archive debug-data-2016-06-15-152636.zip     Zip Archive debug-data-2016-06-15-152718.zip     File setup_replication.sh    
Issue Links:
Backport
is backported by OPENDJ-3248 Backport OPENDJ-3133: dsreplication s... Done
is backported by OPENDJ-4344 Backport OPENDJ-3133: dsreplication s... Done
Relates
relates to OPENDJ-3337 dsreplication status on a DS shows a ... Done
QA Assignee: Christophe Sovant
Support Ticket IDs:

 Description   

dsreplication status reports Missing Changes when all databases and the changelogDb all have the same data; i.e. no missing changes/data difference.

The following example build, models how an instance can be created from a system "snapshot" (AWS) of a current Master.

  1. Create 2 masters in a Multi-Master replication topology.
    • Each uses an instance.loc to split the cli's and libs from the data
    • instance.loc - /opt/instances/reproduction/masterX/opendjdata
    • Installation Path: /opt/instances/reproduction/masterX/opendj
    • Instance Path: /opt/instances/reproduction/masterX/opendjdata
  2. Stop Master 1
    • Snapshot the whole instance as would happen with AWS' snapshot feature.
    • cp -rp <master> <new master> will suffice.
  3. Start Master 1 and add an entry
    • Check replication status. 0 Missing Changes
  4. Using the snapshotted data above, create a new Master using an instance.loc
    • instance.loc - /opt/instances/reproduction/masterX/opendjdata
    • Remove opendj and all under opendjdata "except" db/
    • Setup with no backend db or suffixes.
    • Create a new backend with the topology suffix. bin/status now shows all entries from Master 1 now available.
    • Enable replication from Master 1 to Master3.
    • Check replication status. 1 Missing Changes

With Master 1 & 2 built out, replication is reporting 0 M.C. (from Master1)

Checking status for opendj.forgerock.com:4444	2016-06-15-152324

opendj; bin/$ date; ./dsreplication status --adminUID admin --adminPasswordFile pass --hostname opendj.forgerock.com --port 4444 --trustAll
Wed Jun 15 15:23:24 MDT 2016
Suffix DN         : Server                    : Entries : Replication enabled : DS ID : RS ID : RS Port (1) : M.C. (2) : A.O.M.C. (3) : Security (4)
------------------:---------------------------:---------:---------------------:-------:-------:-------------:----------:--------------:-------------
dc=example,dc=com : opendj.forgerock.com:4444 : 2003    : true                : 10000 : 20000 : 8989        : 0        :              : false
dc=example,dc=com : opendj.forgerock.com:5444 : 2003    : true                : 31817 : 9865  : 9989        : 0        :              : true

The following starts with enabling replication from Step 4 above.

The Command is -> ./dsreplication enable --adminUID admin --adminPassword password --baseDN dc=example,dc=com --host1 opendj.forgerock.com --port1 4444 --bindDN1 "cn=Directory Manager" --bindPassword1 password --replicationPort1 8989 --host2 opendj.forgerock.com --port2 6444 --bindDN2 "cn=Directory Manager" --bindPassword2 password --replicationPort2 10989 --trustAll --no-prompt

Running command: Wed Jun 15 15:25:53 MDT 2016


Establishing connections ..... Done.
Checking registration information ..... Done.
Updating remote references on server opendj.forgerock.com:4444 ..... Done.
Configuring Replication port on server opendj.forgerock.com:6444 ..... Done.
Updating replication configuration for baseDN dc=example,dc=com on server
opendj.forgerock.com:4444 .....Done.
Updating replication configuration for baseDN dc=example,dc=com on server
opendj.forgerock.com:5444 .....Done.
Updating remote references on server opendj.forgerock.com:5444 ..... Done.
Updating replication configuration for baseDN dc=example,dc=com on server
opendj.forgerock.com:6444 .....Done.
Updating registration configuration on server opendj.forgerock.com:4444 ..... Done.
Updating registration configuration on server opendj.forgerock.com:5444 ..... Done.
Updating registration configuration on server opendj.forgerock.com:6444 ..... Done.
Updating replication configuration for baseDN cn=schema on server
opendj.forgerock.com:4444 .....Done.
Updating replication configuration for baseDN cn=schema on server
opendj.forgerock.com:5444 .....Done.
Updating replication configuration for baseDN cn=schema on server
opendj.forgerock.com:6444 .....Done.
Initializing registration information on server opendj.forgerock.com:6444 with
the contents of server opendj.forgerock.com:4444 .....Done.
Initializing schema on server opendj.forgerock.com:6444 with the contents of
server opendj.forgerock.com:4444 .....Done.

Replication has been successfully enabled.  Note that for replication to work
you must initialize the contents of the base DNs that are being replicated
(use dsreplication initialize to do so).

Check replication (from Master1) after enabling replication from Master 1 to Master 3.

Checking status for opendj.forgerock.com:4444	2016-06-15-152627

opendj; bin/$ date; ./dsreplication status --adminUID admin --adminPasswordFile pass --hostname opendj.forgerock.com --port 4444 --trustAll
Wed Jun 15 15:26:27 MDT 2016
Suffix DN         : Server                    : Entries : Replication enabled : DS ID : RS ID : RS Port (1) : M.C. (2) : A.O.M.C. (3) : Security (4)
------------------:---------------------------:---------:---------------------:-------:-------:-------------:----------:--------------:-------------
dc=example,dc=com : opendj.forgerock.com:4444 : 2003    : true                : 10000 : 20000 : 8989        : 0        :              : false
dc=example,dc=com : opendj.forgerock.com:5444 : 2003    : true                : 31817 : 9865  : 9989        : 0        :              : true
dc=example,dc=com : opendj.forgerock.com:6444 : 2003    : true                : 16699 : 15470 : 10989       : 1        :              : false

All changelog searches show 1 Change.

dn: cn=changelog
objectClass: top
objectClass: container
cn: changelog
subschemaSubentry: cn=schema
hasSubordinates: true
entryDN: cn=changelog

dn: changeNumber=1,cn=changelog
objectClass: top
objectClass: changeLogEntry
changeNumber: 1
changeTime: 20160615212319Z
changeType: add
targetDN: uid=luser,ou=People,dc=example,dc=com
changes:: b2JqZWN0Q2xhc3M6IG9yZ2FuaXphdGlvbmFsUGVyc29uCm9iamVjdENsYXNzOiB0b3AKb2JqZWN0Q2xhc3M6IHBlcnNvbgpvYmplY3RDbGFzczogaW5ldE9yZ1BlcnNvbgp1aWQ6IGx1c2VyCmdpdmVuTmFtZTogTG91CnNuOiBVc2VyCmNuOiBMb3UgVXNlcgp1c2VyUGFzc3dvcmQ6IHtTU0hBfUNrcno4MUJDYmUwRjhNc1lWNnoyanpwck4zT084SzRUVWZ1TFpnPT0KZW50cnlVVUlEOiBkNzE3OGNkNS1iOWY3LTQyZjQtYjIxNC1hMzE5YjdhOWI5MDcKY3JlYXRlVGltZXN0YW1wOiAyMDE2MDYxNTIxMjMxOVoKcHdkQ2hhbmdlZFRpbWU6IDIwMTYwNjE1MjEyMzE5LjUxMVoKY3JlYXRvcnNOYW1lOiBjbj1EaXJlY3RvcnkgTWFuYWdlcixjbj1Sb290IEROcyxjbj1jb25maWcK
subschemaSubentry: cn=schema
numSubordinates: 0
hasSubordinates: false
entryDN: changeNumber=1,cn=changelog
replicationCSN: 0000015555f26f57271000000001
replicaIdentifier: 10000
changeInitiatorsName: cn=Directory Manager,cn=Root DNs,cn=config
targetEntryUUID: d7178cd5-b9f7-42f4-b214-a319b7a9b907
changeLogCookie: dc=example,dc=com:0000015555f26f57271000000001;

# Total number of matching entries: 2

Likewise, each exported backend ldif are exactly the same..again, no missing changes (data).

opendj; reproduction/$ diff ./master2/opendj/bin/m1-example.ldif ./master2/opendj/bin/m1-example.ldif
opendj; reproduction/$ diff ./master2/opendj/bin/m1-example.ldif ./master1/opendj/bin/m2-example.ldif
opendj; reproduction/$ diff ./master2/opendj/bin/m1-example.ldif ./master3/opendj/bin/m3-example.ldif
opendj; reproduction/$ 
opendj; reproduction/$ diff ./master2/opendj/bin/m1-example.ldif ./master1/opendj/bin/m2-example.ldif
opendj; reproduction/$ diff ./master2/opendj/bin/m1-example.ldif ./master3/opendj/bin/m3-example.ldif
opendj; reproduction/$ diff ./master1/opendj/bin/m2-example.ldif ./master3/opendj/bin/m3-example.ldif

Data attached:

I captured the following files within each of the following steps:

num.<file> = Index Num from below.

1.changelogDb.2016-06-15-152228
1.cn=repl.monitor.2016-06-15-152229
1.head-dump.2016-06-15-152230
1.replStatus.2016-06-15-152230
2.changelogDb.2016-06-15-152322
2.cn=repl.monitor.2016-06-15-152322
2.head-dump.2016-06-15-152323
2.replStatus.2016-06-15-152324
3.changelogDb.2016-06-15-152626
3.cn=repl.monitor.2016-06-15-152626
3.head-dump.2016-06-15-152627
3.replStatus.2016-06-15-152627

Index 1: After M1 & M2 are MMR enabled.
Index 2: After restart of M1 (post snapshot) and adding 1 entry to M1 (uid=luser)
Index 3: After adding M3 to replication

-rw-r--r--  1 opendj  opendj     268159 Jun 15 15:26 debug-data-2016-06-15-152636.zip
-rw-r--r--  1 opendj  opendj     268886 Jun 15 15:27 debug-data-2016-06-15-152718.zip

Each of the attached contains a copy of the opendjdata/changelogDb before and after stopping all servers.

debug-data-2016-06-15-152636.zip: taken while the servers were running with a Missing Change.
debug-data-2016-06-15-152718.zip: taken after all servers were stopped.



 Comments   
Comment by Lee Trujillo [ 16/Jun/16 ]

I've been able to replicate this from a current customer issue.

The topology is slightly different in that there are 2 masters and 1 DS only consumer directory.

  1. Setup 2 DS+RS (Master 1 & 2) and 1 DS (Directory1) only.
  2. Backuped Master1
  3. Disabled --disableAll on Directory1
  4. Stopped Directory1
  5. Played changes to Master1.
  6. Restored to Directory1
  7. Started Directory1
  8. Enabled replication from Master 1 to Directory1.

After setup of Master 1 & 2.

Checking status for opendj.forgerock.com:4444	2016-06-16-143047

ls: *.replStatus.*: No such file or directory
opendj; bin/$ date; ./dsreplication status --adminUID admin --adminPasswordFile pass --hostname opendj.forgerock.com --port 4444 --trustAll
Thu Jun 16 14:30:47 MDT 2016
Suffix DN         : Server                    : Entries : Replication enabled : DS ID : RS ID : RS Port (1) : M.C. (2) : A.O.M.C. (3) : Security (4)
------------------:---------------------------:---------:---------------------:-------:-------:-------------:----------:--------------:-------------
dc=example,dc=com : opendj.forgerock.com:4444 : 2002    : true                : 10000 : 20000 : 8989        : 0        :              : false
dc=example,dc=com : opendj.forgerock.com:5444 : 2002    : true                : 20121 : 29832 : 9989        : 0        :              : true

After adding in the DS only "Directory1"

Checking status for opendj.forgerock.com:4444	2016-06-16-143337

opendj; bin/$ date; ./dsreplication status --adminUID admin --adminPasswordFile pass --hostname opendj.forgerock.com --port 4444 --trustAll
Thu Jun 16 14:33:37 MDT 2016
Suffix DN         : Server                    : Entries : Replication enabled : DS ID : RS ID : RS Port (1) : M.C. (2) : A.O.M.C. (3) : Security (4)
------------------:---------------------------:---------:---------------------:-------:-------:-------------:----------:--------------:-------------
dc=example,dc=com : opendj.forgerock.com:4444 : 2002    : true                : 10000 : 20000 : 8989        : 0        :              : false
dc=example,dc=com : opendj.forgerock.com:5444 : 2002    : true                : 20121 : 29832 : 9989        : 0        :              : true
dc=example,dc=com : opendj.forgerock.com:6444 : 2002    : true                : 28962 : (5)   :             : 0        :              : 

After restore on Directory1 and re-enabling replication from Master1 to Directory1

Checking status for opendj.forgerock.com:4444	2016-06-16-143641

opendj; bin/$ date; ./dsreplication status --adminUID admin --adminPasswordFile pass --hostname opendj.forgerock.com --port 4444 --trustAll
Thu Jun 16 14:36:41 MDT 2016
Suffix DN         : Server                    : Entries : Replication enabled : DS ID : RS ID : RS Port (1) : M.C. (2) : A.O.M.C. (3) : Security (4)
------------------:---------------------------:---------:---------------------:-------:-------:-------------:----------:--------------:-------------
dc=example,dc=com : opendj.forgerock.com:4444 : 2002    : true                : 10000 : 20000 : 8989        : 0        :              : false
dc=example,dc=com : opendj.forgerock.com:5444 : 2002    : true                : 20121 : 29832 : 9989        : 0        :              : true
dc=example,dc=com : opendj.forgerock.com:6444 : 2002    : true                : 6995  : (5)   :             : 1        :              : 
./dsreplication enable 
 --host1 opendj.forgerock.com 
 --port1 4444
 --bindDN1 "cn=Directory Manager" 
 --bindPassword1 password 
 --secureReplication1 
 --host2 opendj.forgerock.com 
 --port2 6444 
 --bindDN2 "cn=Directory Manager" 
 --bindPassword2 password 
 --secureReplication2 
 --noReplicationServer2 
 --baseDN dc=example,dc=com 
 --adminUID admin 
 --adminPassword password 
 --no-prompt 
 --noPropertiesFile 
 --trustAll

I've scp'd them to our internal customer server. ~chris.ridd will know how to access it.

-> OPENDJ-3133-debugdata.zip

Debug data while running - debug-data-2016-06-16-144113.zip
Debug data while stopped - debug-data-2016-06-16-144148.zip

-rw-r--r--   1 opendj  opendj  13359775 Jun 16 14:41 debug-data-2016-06-16-144113.zip
-rw-r--r--   1 opendj  opendj  13361286 Jun 16 14:41 debug-data-2016-06-16-144148.zip
Comment by Ludovic Poitou [ 16/Jun/16 ]

Can you try to run dsreplication status against all servers (port 4444, 5444, 6444) ?
I'd like to make sure there is no inconsistent results as well if asking different servers.

Comment by Lee Trujillo [ 29/Jun/16 ]

Attaching scripts that can be used in reproduction.

-rw-r--r--  1 opendj  opendj  11054 Jun 28 18:22 OPENDJ-3133-reproduction.zip

Edit the following scripts and run in the following order

  1. ./setup.master1
  2. ./setup.master2
  3. ./setup.directory1
  4. ./migrate.directory1
  5. ./migrate.master2
  6. ./migrate.master1

Notes:

1) All scripts use a the following parameters. Edit as needed.

The setup script will unzip's the DJ 2.6.0.zip and renames the opendj path to the following instance names.

instance=/opt/instances/master1
instance=/opt/instances/master2
instance=/opt/instances/directory1

This is used once all instances are setup and then with each migration.* script.

ldifFile=/opt/instances/master1-example.ldif
localhostname="opendj.forgerock.com"
...
masterhostname="opendj.forgerock.com"

2) Setup scripts have the following: setup is first done using DJ 2.6.0.

binary='/opt/binaries/OpenDJ-2.6.0.zip'

3) Migration scripts have the following: migration is done using DJ 3.0.0.

binary='/opt/binaries/OpenDJ-3.0.0.zip'

4) The migrate.directory1 and migrate.master2 scripts pauses between the final stop-ds and start-ds.
5) The migrate.master1 script does not pause during restart as the above does.

6) The setup scripts setup DJ 2.6.0 using a basic dsreplication enable and init all followed by an export from Master1 to ldif. This ldif will be used to import into the new DJ 3.0.0 instances.
7) The migration scripts stop the instance, removes it, unzips 3.0.0, executes setup, initializes and enables replication.

Comment by Lee Trujillo [ 29/Jun/16 ]

Used to recreate this issue

Comment by Jean-Noël Rouvignac [ 13/Jul/16 ]

Thanks Lee for your scripts, they are immensely useful.

After running your test, I ran the following commands:

$ target/master1/bin/dsreplication status --adminUID admin --adminPassword password --hostname localhost --port 4444 --trustAll
Suffix DN         : Server         : Entries : Replication enabled : DS ID : RS ID : RS Port (1) : M.C. (2) : A.O.M.C. (3) : Security (4)
------------------:----------------:---------:---------------------:-------:-------:-------------:----------:--------------:-------------
dc=example,dc=com : localhost:4444 : 2002    : true                : 23710 : 10822 : 8989        : 4        :              : true
dc=example,dc=com : localhost:5444 : 2002    : true                : 30164 : 16408 : 9989        : 0        :              : true
dc=example,dc=com : localhost:6444 : 2002    : true                : 17985 : (5)   :             : 1        :              : 

[1] The port used to communicate between the servers whose contents are being
replicated.
[2] The number of changes that are still missing on this server (and that have
been applied to at least one of the other servers).
[3] Age of oldest missing change: the date on which the oldest change that has
not arrived on this server was generated.
[4] Whether the replication communication through the replication port is
encrypted or not.
[5] Server not configured as a replication server (no replication changelog).
[6] Server does not contain replicated data for the suffix.
$ target/master1/bin/dsreplication status --adminUID admin --adminPassword password --hostname localhost --port 4444 --trustAll
Suffix DN         : Server         : Entries : Replication enabled : DS ID : RS ID : RS Port (1) : M.C. (2) : A.O.M.C. (3) : Security (4)
------------------:----------------:---------:---------------------:-------:-------:-------------:----------:--------------:-------------
dc=example,dc=com : localhost:4444 : 2002    : true                : 23710 : 10822 : 8989        : 0        :              : true
dc=example,dc=com : localhost:5444 : 2002    : true                : 30164 : 16408 : 9989        : 0        :              : true
dc=example,dc=com : localhost:6444 : 2002    : true                : 17985 : (5)   :             : 0        :              : 

[1] The port used to communicate between the servers whose contents are being
replicated.
[2] The number of changes that are still missing on this server (and that have
been applied to at least one of the other servers).
[3] Age of oldest missing change: the date on which the oldest change that has
not arrived on this server was generated.
[4] Whether the replication communication through the replication port is
encrypted or not.
[5] Server not configured as a replication server (no replication changelog).
[6] Server does not contain replicated data for the suffix.
jnrouvignac@jnrouvignac-Precision-5510 ~/git/opendj/opendj-server-legacy $ 

As can be seen the results vary, going from seeing missing changes to not seeing any. Further runs show one or the other result.

 

When running the following commands, I can see which server the missing changes are coming from:

# Run against master1
$ target/master1/bin/ldapsearch -p 1389 -D "cn=Directory Manager" -w password -T -b "cn=monitor" "(&(missing-changes=*)(domain-name=dc=example,dc=com)(server-id=*)(cn=*DS*))" domain-name server-id missing-changes approx-older-change-not-synchronized-millis
dn: cn=Connected directory server DS(23710) localhost:45982,cn=Replication server RS(10822) localhost:8989,cn=dc_example_dc_com,cn=Replication,cn=monitor
domain-name: dc=example,dc=com
server-id: 23710
missing-changes: 4

dn: cn=Connected directory server DS(17985) localhost:44552,cn=Connected replication server RS(16408) localhost:9989,cn=Replication server RS(10822) localhost:8989,cn=dc_example_dc_com,cn=Replication,cn=monitor
domain-name: dc=example,dc=com
server-id: 17985
missing-changes: 1

dn: cn=Connected directory server DS(30164) localhost:44402,cn=Connected replication server RS(16408) localhost:9989,cn=Replication server RS(10822) localhost:8989,cn=dc_example_dc_com,cn=Replication,cn=monitor
domain-name: dc=example,dc=com
server-id: 30164
missing-changes: 0
# Run against master2
$ target/master1/bin/ldapsearch -p 2389 -D "cn=Directory Manager" -w password -T -b "cn=monitor" "(&(missing-changes=*)(domain-name=dc=example,dc=com)(server-id=*)(cn=*DS*))" domain-name server-id missing-changes approx-older-change-not-synchronized-millis
dn: cn=Connected directory server DS(17985) localhost:44552,cn=Replication server RS(16408) localhost:9989,cn=dc_example_dc_com,cn=Replication,cn=monitor
domain-name: dc=example,dc=com
server-id: 17985
missing-changes: 0

dn: cn=Connected directory server DS(30164) localhost:44402,cn=Replication server RS(16408) localhost:9989,cn=dc_example_dc_com,cn=Replication,cn=monitor
domain-name: dc=example,dc=com
server-id: 30164
missing-changes: 0

dn: cn=Connected directory server DS(23710) localhost:45982,cn=Connected replication server RS(10822) localhost:8989,cn=Replication server RS(16408) localhost:9989,cn=dc_example_dc_com,cn=Replication,cn=monitor
domain-name: dc=example,dc=com
server-id: 23710
missing-changes: 0
# Run against directory1
$ target/master1/bin/ldapsearch -p 3389 -D "cn=Directory Manager" -w password -T -b "cn=monitor" "(&(missing-changes=*)(domain-name=dc=example,dc=com)(server-id=*)(cn=*DS*))" domain-name server-id missing-changes approx-older-change-not-synchronized-millis

 

So far, I can see two issues:

  1. dsreplication status decides which server to query based on the changing mood of a HashSet (it could stick to the server host+port provided on command line)
  2. one server which has totally different results for missing changes compared to the other server. This exposes the problem on dsreplication status.

Obviously the second problem is the root cause.

Comment by Jean-Noël Rouvignac [ 13/Jul/16 ]

The missing-changes are reported on the migrated master1:

  • for the migrated master1: 4 missing changes
  • for the migrated directory1: 1 missing change

Reminder: the script execution order is the following:

  1. setup.master1
  2. setup.master2
  3. setup.directory1
  4. migrate.directory1
  5. migrate.master2
  6. migrate.master1

So investigations will focus on missing-changes computation in master1.

Comment by Jean-Noël Rouvignac [ 15/Jul/16 ]

Here is what Lee's scripts are doing:

rm -rf install_path
unzip -q opendj-3.0.0.zip
setup
import-ldif
dsreplication enable

This seem to depart a bit from the "usual" way of setting up replication:

  1. Internal initialization:
    dsreplication enable
    dsreplication initialize
    
  2. External initialization:
    dsreplication enable
    dsreplication preexternalinitialization # disables replication
    import-ldif
    dsreplication postexternalinitialization # reenable replication
    

It would be good to check whether one of the "usual" ways is a workaround to this problem.

Comment by Lee Trujillo [ 15/Jul/16 ]

I can confirm, using the pre/post external does not leave me with MC's.

Note the difference and order between the old and new testcases.

Original Testcase:

  1. import-ldif
  2. dsreplication enable
  3. stop-ds --restart.

Pre/Post Testcase:

  1. dsreplication enable
  2. dsreplication pre-external-initialization
  3. import-ldif
  4. dsreplication post-external-initialization
  5. stop-ds --restart.

I changed my scripts to include the pre/post external commands:
OPENDJ-3133-rep-post-external.zip

Comment by Lee Trujillo [ 26/Jul/16 ]

Note on using pre-external and post-external-replication commands before and after an initialization is completed.

There is a difference in how replication progresses when pre/post is used vs when it's not.

  • Without: All changes are played to all other servers. No entry count difference.
  • With: All changes are not played to all other servers. Entry count difference exists.

Starting point in replication

opendj; bin/$ date; ./dsreplication status --adminUID admin --adminPasswordFile pass --hostname opendj.forgerock.com --port 4444 --trustAll
Mon Jul 25 16:05:35 MDT 2016
Suffix DN         : Server                    : Entries : Replication enabled : DS ID : RS ID : RS Port (1) : M.C. (2) : A.O.M.C. (3) : Security (4)
------------------:---------------------------:---------:---------------------:-------:-------:-------------:----------:--------------:-------------
dc=example,dc=com : opendj.forgerock.com:4444 : 2002    : true                : 10000 : 20000 : 8989        : 0        :              : false
dc=example,dc=com : opendj.forgerock.com:5444 : 2002    : true                : 29726 : 24218 : 9989        : 0        :              : true
dc=example,dc=com : opendj.forgerock.com:6444 : 2002    : true                : 26531 : (5)   :             : 0        :              : 

With pre/post external-replication

  • Apply an ADD load to the Main or Secondary Master
  • Disable replication on the 2.6.0 instance
  • Stop the old instance and remove it.
  • Setup the new instance
  • Enable replication
  • Run pre-external-replication
  • Export from the Main Master (taking the current load).
  • Import into this new instance
  • Run post-external-replication
  • Restart the new instance

After Directory 1 (DS only) migration

opendj.forgerock.com:6444

opendj; bin/$ date; ./dsreplication status --adminUID admin --adminPasswordFile pass --hostname opendj.forgerock.com --port 4444 --trustAll
Mon Jul 25 16:08:51 MDT 2016
Suffix DN         : Server                    : Entries : Replication enabled : DS ID : RS ID : RS Port (1) : M.C. (2) : A.O.M.C. (3) : Security (4)
------------------:---------------------------:---------:---------------------:-------:-------:-------------:----------:--------------:-------------
dc=example,dc=com : opendj.forgerock.com:4444 : 2029    : true                : 10000 : 20000 : 8989        : 0        :              : false
dc=example,dc=com : opendj.forgerock.com:5444 : 2023    : true                : 29726 : 24218 : 9989        : 0        :              : true
dc=example,dc=com : opendj.forgerock.com:6444 : 2024    : true                : 5606  : (5)   :             : 0        :              : 

After Master 2 (DS only) migration

opendj.forgerock.com:5444

opendj; bin/$ date; ./dsreplication status --adminUID admin --adminPasswordFile pass --hostname opendj.forgerock.com --port 4444 --trustAll
Mon Jul 25 16:16:58 MDT 2016
Suffix DN         : Server                    : Entries : Replication enabled : DS ID : RS ID : RS Port (1) : M.C. (2) : A.O.M.C. (3) : Security (4)
------------------:---------------------------:---------:---------------------:-------:-------:-------------:----------:--------------:-------------
dc=example,dc=com : opendj.forgerock.com:4444 : 2060    : true                : 10000 : 20000 : 8989        : 0        :              : false
dc=example,dc=com : opendj.forgerock.com:5444 : 2054    : true                : 5051  : 32766 : 9989        : 0        :              : true
dc=example,dc=com : opendj.forgerock.com:6444 : 2048    : true                : 5606  : (5)   :             : 0        :              : 

After Master 1 (DS only) migration

opendj.forgerock.com:4444

When migrating the final master, Master 1, the ADD load must be applied to another server. In this case Master 2. As a consequence of the Entry Counts being off, attempts to add new entries fail with err=68.

Processing ADD request for uid=user.2054,ou=People,dc=example,dc=com
ADD operation failed
Result Code:  68 (Entry Already Exists)
Additional Information:  The entry uid=user.2054,ou=People,dc=example,dc=com cannot be added because an entry with that name already exists
opendj; bin/$ date; ./dsreplication status --adminUID admin --adminPasswordFile pass --hostname opendj.forgerock.com --port 4444 --trustAll
Mon Jul 25 16:25:46 MDT 2016
Suffix DN         : Server                    : Entries : Replication enabled : DS ID : RS ID : RS Port (1) : M.C. (2) : A.O.M.C. (3) : Security (4)
------------------:---------------------------:---------:---------------------:-------:-------:-------------:----------:--------------:-------------
dc=example,dc=com : opendj.forgerock.com:4444 : 2069    : true                : 8981  : 14335 : 8989        : 0        :              : true
dc=example,dc=com : opendj.forgerock.com:5444 : 2074    : true                : 5051  : 32766 : 9989        : 0        :              : true
dc=example,dc=com : opendj.forgerock.com:6444 : 2062    : true                : 5606  : (5)   :             : 0        :              : 

Without pre/post external-replication

  • Apply an ADD load to the Main or Secondary Master
  • Disable replication on the 2.6.0 instance
  • Stop the old instance and remove it.
  • Setup the new instance
  • Enable replication
  • Export from the Main Master (taking the current load).
  • Import into this new instance
  • Restart the new instance

During the migrations of all three instances, replication always shows the Entry Count as the same (up to date).

opendj; bin/$ date; ./dsreplication status --adminUID admin --adminPasswordFile pass --hostname opendj.forgerock.com --port 4444 --trustAll
Mon Jul 25 15:31:25 MDT 2016
clSuffix DN         : Server                    : Entries : Replication enabled : DS ID : RS ID : RS Port (1) : M.C. (2) : A.O.M.C. (3) : Security (4)
------------------:---------------------------:---------:---------------------:-------:-------:-------------:----------:--------------:-------------
dc=example,dc=com : opendj.forgerock.com:4444 : 2002    : true                : 10000 : 20000 : 8989        : 0        :              : false
dc=example,dc=com : opendj.forgerock.com:5444 : 2002    : true                : 29726 : 24218 : 9989        : 0        :              : true
dc=example,dc=com : opendj.forgerock.com:6444 : 2002    : true                : 26531 : (5)   :             : 0        :              : 

opendj; bin/$ date; ./dsreplication status --adminUID admin --adminPasswordFile pass --hostname opendj.forgerock.com --port 4444 --trustAll
Mon Jul 25 15:34:57 MDT 2016
Suffix DN         : Server                    : Entries : Replication enabled : DS ID : RS ID : RS Port (1) : M.C. (2) : A.O.M.C. (3) : Security (4)
------------------:---------------------------:---------:---------------------:-------:-------:-------------:----------:--------------:-------------
dc=example,dc=com : opendj.forgerock.com:4444 : 2029    : true                : 10000 : 20000 : 8989        : 0        :              : false
dc=example,dc=com : opendj.forgerock.com:5444 : 2029    : true                : 29726 : 24218 : 9989        : 0        :              : true
dc=example,dc=com : opendj.forgerock.com:6444 : 2029    : true                : 723   : (5)   :             : 0        :              : 

opendj; bin/$ date; ./dsreplication status --adminUID admin --adminPasswordFile pass --hostname opendj.forgerock.com --port 4444 --trustAll
Mon Jul 25 15:36:43 MDT 2016
Suffix DN         : Server                    : Entries : Replication enabled : DS ID : RS ID : RS Port (1) : M.C. (2) : A.O.M.C. (3) : Security (4)
------------------:---------------------------:---------:---------------------:-------:-------:-------------:----------:--------------:-------------
dc=example,dc=com : opendj.forgerock.com:4444 : 2047    : true                : 10000 : 20000 : 8989        : 0        :              : false
dc=example,dc=com : opendj.forgerock.com:5444 : 2047    : true                : 13134 : 24727 : 9989        : 2        :              : true
dc=example,dc=com : opendj.forgerock.com:6444 : 2047    : true                : 723   : (5)   :             : 0        :              : 

opendj; bin/$ date; ./dsreplication status --adminUID admin --adminPasswordFile pass --hostname opendj.forgerock.com --port 4444 --trustAll
Mon Jul 25 15:40:13 MDT 2016
Suffix DN         : Server                    : Entries : Replication enabled : DS ID : RS ID : RS Port (1) : M.C. (2) : A.O.M.C. (3) : Security (4)
------------------:---------------------------:---------:---------------------:-------:-------:-------------:----------:--------------:-------------
dc=example,dc=com : opendj.forgerock.com:4444 : 2056    : true                : 3757  : 5440  : 8989        : 0        :              : true
dc=example,dc=com : opendj.forgerock.com:5444 : 2056    : true                : 13134 : 24727 : 9989        : 0        :              : true
dc=example,dc=com : opendj.forgerock.com:6444 : 2056    : true                : 723   : (5)   :             : 0        :              : 

Looking at ReplicationServerDomain.java we see the Generation ID is reset and then the changelog is cleared. This is what would cause replication to diverge between the time replication is enabled until finally the post-external-replication command is given.

  /**
   * Sets the provided value as the new in memory generationId.
   * Also clear the changelog databases.
   *
   * @param generationId The new value of generationId.
   * @return The old generation id
   */
  public long changeGenerationId(long generationId)
  {
    synchronized (generationIDLock)
    {
      long oldGenerationId = this.generationId;

      if (this.generationId != generationId)
      {
        clearDbs();

        this.generationId = generationId;
        this.generationIdSavedStatus = false;
      }
      return oldGenerationId;
    }
  }

Attached:

opendj/$ unzip -l OPENDJ-3133-rep-post-divergence.zip
Archive:  OPENDJ-3133-rep-post-divergence.zip
  Length     Date   Time    Name
 --------    ----   ----    ----
    10059  07-25-16 16:06   migrate.directory1
    10103  07-25-16 16:06   migrate.master1
    10115  07-25-16 16:06   migrate.master2
    31697  07-25-16 16:34   migrate-directory1.out
     4805  07-25-16 16:34   migrate-master1-add-failures.out
    33060  07-25-16 16:34   migrate-master1.out
    32981  07-25-16 16:34   migrate-master2.out
 --------                   -------
   132820                   7 files
Comment by Jean-Noël Rouvignac [ 29/Jul/16 ]

I can confirm that the missing changes are actually replica offline messages (messages being sent when a replica goes offline via stop-ds).
Care was taken in the code to avoid them polluting the replica states, but it looks like some support is missing somewhere.
I am investigating further to understand the full dynamics across the topology.

Comment by Jean-Noël Rouvignac [ 29/Jul/16 ]

Running:

# stop ALL the servers in the topology
master1/bin/start-ds
master2/bin/start-ds
directory1/bin/start-ds

# restart ALL the servers in the topology
master1/bin/stop-ds
master2/bin/stop-ds
directory1/bin/stop-ds

... has the following effect on master1:

$ master1/bin/ldapsearch -p 1389 -D "cn=Directory Manager" -w password -T -b "cn=monitor" "(&(missing-changes=*)(domain-name=dc=example,dc=com)(server-id=*)(cn=*DS*))" domain-name server-id missing-changes approx-older-change-not-synchronized-millis
dn: cn=Connected directory server DS(17340) localhost:39516,cn=Replication server RS(17354) localhost:8989,cn=dc_example_dc_com,cn=Replication,cn=monitor
domain-name: dc=example,dc=com
server-id: 17340
missing-changes: 0

dn: cn=Connected directory server DS(8951) localhost:39478,cn=Replication server RS(17354) localhost:8989,cn=dc_example_dc_com,cn=Replication,cn=monitor
domain-name: dc=example,dc=com
server-id: 8951
missing-changes: 0

dn: cn=Connected directory server DS(17899) localhost:51058,cn=Connected replication server RS(4684) localhost:9989,cn=Replication server RS(17354) localhost:8989,cn=dc_example_dc_com,cn=Replication,cn=monitor
domain-name: dc=example,dc=com
server-id: 17899
missing-changes: 0

... and on master2

$ master1/bin/ldapsearch -p 2389 -D "cn=Directory Manager" -w password -T -b "cn=monitor" "(&(missing-changes=*)(domain-name=dc=example,dc=com)(server-id=*)(cn=*DS*))" domain-name server-id missing-changes approx-older-change-not-synchronized-millis
dn: cn=Connected directory server DS(17899) localhost:51058,cn=Replication server RS(4684) localhost:9989,cn=dc_example_dc_com,cn=Replication,cn=monitor
domain-name: dc=example,dc=com
server-id: 17899
missing-changes: 0

dn: cn=Connected directory server DS(17340) localhost:39516,cn=Connected replication server RS(17354) localhost:8989,cn=Replication server RS(4684) localhost:9989,cn=dc_example_dc_com,cn=Replication,cn=monitor
domain-name: dc=example,dc=com
server-id: 17340
missing-changes: 0

dn: cn=Connected directory server DS(8951) localhost:39478,cn=Connected replication server RS(17354) localhost:8989,cn=Replication server RS(4684) localhost:9989,cn=dc_example_dc_com,cn=Replication,cn=monitor
domain-name: dc=example,dc=com
server-id: 8951
missing-changes: 0

 

Said otherwise, the missing-changes problem is gone.
It means this problem is purely "transient", i.e. this problem exists only in-memory (no data related to this problem is persisted on disk).

 

Stopping only one instance does not work because it gets the same information back from another instance in the topology.

Comment by Jean-Noël Rouvignac [ 29/Jul/16 ]

I was wondering if this problem could be due to the mix of 2.6.x with older version, but it is a problem with all versions:

  • upgrading 2.6.4 to 4.0.0-SNAPSHOT: missing changes are reported
  • upgrading 3.0.0 to 4.0.0-SNAPSHOT: missing changes are reported too
Comment by Jean-Noël Rouvignac [ 29/Jul/16 ]

This comment describes a simpler scenario which often recreates the condition, but not always (See attached setup_replication.sh).
It takes a third of the time of the other scenario so it is faster to debug, even in case a rerun is needed.

init

  1. start opendj_0_DSRS: combined DS/RS
  2. start opendj_1_DSRS: combined DS/RS
    1. dsreplication enable with server 0
  3. start opendj_2_DS: DS only
    1. dsreplication enable with server 0
  4. dsreplication initialize-all

no missing-changes

opendj_0_DSRS view of missing-changes:

$ opendj_0_DSRS/bin/ldapsearch -p 1500 -D "cn=Directory Manager" -w admin -T -b "cn=monitor" "(&(missing-changes=*)(domain-name=dc=example,dc=com)(server-id=*)(cn=*DS*))" domain-name server-id missing-changes approx-older-change-not-synchronized-millis'

dn: cn=Connected directory server DS(21154) localhost:54038,cn=Replication se
rver RS(28208) localhost:8900,cn=dc_example_dc_com,cn=Replication,cn=monitor
domain-name: dc=example,dc=com
server-id: 21154
missing-changes: 0

dn: cn=Connected directory server DS(29646) localhost:54972,cn=Replication se
rver RS(28208) localhost:8900,cn=dc_example_dc_com,cn=Replication,cn=monitor
domain-name: dc=example,dc=com
server-id: 29646
missing-changes: 0

dn: cn=Connected directory server DS(2960) localhost:56754,cn=Connected repli
cation server RS(28931) localhost:8901,cn=Replication server RS(28208) localh
ost:8900,cn=dc_example_dc_com,cn=Replication,cn=monitor
domain-name: dc=example,dc=com
server-id: 2960
missing-changes: 0

opendj_1_DSRS view of missing-changes:

$ opendj_1_DSRS/bin/ldapsearch -p 1501 -D "cn=Directory Manager" -w admin -T -b "cn=monitor" "(&(missing-changes=*)(domain-name=dc=example,dc=com)(server-id=*)(cn=*DS*))" domain-name server-id missing-changes approx-older-change-not-synchronized-millis'

dn: cn=Connected directory server DS(21154) localhost:54038,cn=Replication se
rver RS(28208) localhost:8900,cn=dc_example_dc_com,cn=Replication,cn=monitor
domain-name: dc=example,dc=com
server-id: 21154
missing-changes: 0

dn: cn=Connected directory server DS(29646) localhost:54972,cn=Replication se
rver RS(28208) localhost:8900,cn=dc_example_dc_com,cn=Replication,cn=monitor
domain-name: dc=example,dc=com
server-id: 29646
missing-changes: 0

dn: cn=Connected directory server DS(2960) localhost:56754,cn=Connected repli
cation server RS(28931) localhost:8901,cn=Replication server RS(28208) localh
ost:8900,cn=dc_example_dc_com,cn=Replication,cn=monitor
domain-name: dc=example,dc=com
server-id: 2960
missing-changes: 0

restart opendj_1_DSRS

$ opendj_1_DSRS/bin/stop-ds -R

missing-changes

opendj_0_DSRS view of missing-changes:

$ opendj_0_DSRS/bin/ldapsearch -p 1500 -D "cn=Directory Manager" -w admin -T -b "cn=monitor" "(&(missing-changes=*)(domain-name=dc=example,dc=com)(server-id=*)(cn=*DS*))" domain-name server-id missing-changes approx-older-change-not-synchronized-millis'

dn: cn=Connected directory server DS(21154) localhost:54038,cn=Replication se
rver RS(28208) localhost:8900,cn=dc_example_dc_com,cn=Replication,cn=monitor
domain-name: dc=example,dc=com
server-id: 21154
missing-changes: 0

dn: cn=Connected directory server DS(29646) localhost:54972,cn=Replication se
rver RS(28208) localhost:8900,cn=dc_example_dc_com,cn=Replication,cn=monitor
domain-name: dc=example,dc=com
server-id: 29646
missing-changes: 0

dn: cn=Connected directory server DS(2960) localhost:56754,cn=Connected repli
cation server RS(28931) localhost:8901,cn=Replication server RS(28208) localh
ost:8900,cn=dc_example_dc_com,cn=Replication,cn=monitor
domain-name: dc=example,dc=com
server-id: 2960
missing-changes: 0

opendj_1_DSRS view of missing-changes:

$ opendj_1_DSRS/bin/ldapsearch -p 1501 -D "cn=Directory Manager" -w admin -T -b "cn=monitor" "(&(missing-changes=*)(domain-name=dc=example,dc=com)(server-id=*)(cn=*DS*))" domain-name server-id missing-changes approx-older-change-not-synchronized-millis'

dn: cn=Connected directory server DS(21154) localhost:54038,cn=Replication se
rver RS(28208) localhost:8900,cn=dc_example_dc_com,cn=Replication,cn=monitor
domain-name: dc=example,dc=com
server-id: 21154
missing-changes: 0

dn: cn=Connected directory server DS(29646) localhost:54972,cn=Replication se
rver RS(28208) localhost:8900,cn=dc_example_dc_com,cn=Replication,cn=monitor
domain-name: dc=example,dc=com
server-id: 29646
missing-changes: 1

dn: cn=Connected directory server DS(2960) localhost:56754,cn=Connected repli
cation server RS(28931) localhost:8901,cn=Replication server RS(28208) localh
ost:8900,cn=dc_example_dc_com,cn=Replication,cn=monitor
domain-name: dc=example,dc=com
server-id: 2960
missing-changes: 0

dynamic view

opendj_1_DSRS
stop-ds
PendingChanges.putReplicaOfflineMsg() offlineCSN=00000156452dbf847a5600000001

opendj_0_DSRS
ServerReader.run() receives ReplicaOfflineMsg with offlineCSN=00000156452dbf847a5600000001
-> ServerHandler.put(UpdateMsg)
-> ReplicationServerDomain.put(UpdateMsg);
-> MessageHandler(DataServerHandler).updateServerState(UpdateMsg)

opendj_1_DSRS
start-ds

opendj_0_DSRS
ReplicationServerDomain.createLocalTopologyMonitorMsg() MonitorMsg sends LSstate with offlineCSN=00000156452dbf847a5600000001

opendj_1_DSRS
ReplicationDomainMonitor.receiveMonitorDataResponse(ldapServerState) serverId=14639 maxCSNs=ldapServerState=00000156452dbf847a5600000001
-> ReplicationDomainMonitorData.setMaxCsn0(newCSN=00000156452dbf847a5600000001) == true, serverId=31318 currentMaxCSN=null
-> ReplicationDomainMonitorData.setLDAPServerState(14639, 00000156452dbf847a5600000001)
...
ReplicationDomainMonitorData.computeMissingChanges(): Complete monitor data : Missing changes (29136)=diff(00000156452dbf847a5600000001-null)=1=1

Missing changes detected!
Comment by Jean-Noël Rouvignac [ 01/Aug/16 ]

Since missing-changes are computed off the replica/DS ServerState, a fix consists in making sure to not update the ServerState for ReplicaOfflineMsg, but still allow those messages to be fowarded to the topology (to allow changenumber computation to proceed).

Comment by Jean-Noël Rouvignac [ 03/Aug/16 ]

FIXED

Two parts:

  • ensure dsreplication status queries first the replica with the provided host/port
  • ReplicaOfflineMsg where incorrectly updating the server state of replicas (DSs).
    • The fix consists in ensuring the server state are not updated
      while still allowing ReplicaOfflineMsgs to be forwarded to the topology.
    • The proposed change allow ServerWriter.run() to forward ReplicaOfflineMsgs
      to other Replication Servers in the topology.
    • Side-note: here is the call hierarchy showing how the current change still allows ReplicaOfflineMsgs to be forwarded:
      org.opends.server.replication.server.MessageHandler.updateServerState(UpdateMsg)
      org.opends.server.replication.server.MessageHandler.getNextMessage()
      org.opends.server.replication.server.ServerHandler.take()
      org.opends.server.replication.server.ServerWriter.run()
Comment by Christophe Sovant [ 24/Aug/16 ]

Verified in OpenDJ 4.0.0 SNAPSHOT (Aug 09)

Comment by Quentin CASTEL [X] (Inactive) [ 20/Nov/16 ]

modification of the status, in order to migrate the 'Zendesk ID' field to 'Support Ticket ID' field.

Generated at Wed Aug 05 05:10:38 UTC 2020 using Jira 7.13.12#713012-sha1:6e07c38070d5191bbf7353952ed38f111754533a.