[OPENDJ-6536] dsreplication configure hangs with servers upgraded from 5.0.0 to 7.0.0
Created: 16/Aug/19  Updated: 08/Nov/19  Resolved: 19/Aug/19

Status: Done
Project: OpenDJ
Component/s: regression, replication, upgrade
Affects Version/s: 7.0.0
Fix Version/s: Not applicable

Type: Bug Priority: Major
Reporter: carole forel Assignee: Matthew Swift
Resolution: Not a defect Votes: 0
Labels: None

Issue Links:
Relates
Epic Link: Bugs 7.0
Story Points: 0.5
Dev Assignee: Matthew Swift

 Description   

Found with rev b3a29e49c308459023d29668aa4fecd861cd8cb4.
Worked with rev OpenDJ 7.0.0-SNAPSHOT (3a762502b88).

We set up 3 servers by installing them with 5.0.0 and then upgrading them directly to 7.0.0 (what we call "upgrade mode"):

./DJ1/opendj/setup directory-server -h localhost -p 1389 -D "cn=Directory Manager" -w password --adminConnectorPort 4444 -Z 1636 -t je -b dc=com -l /local/GIT/pyforge/PyBot/OpenDJ/testcases/data/replication_startup.ldif   -O --acceptLicense

Then stop the server, copy the 7.0.0 files over the installation, and run the upgrade:
./DJ1/opendj/upgrade -n --acceptLicense --force

We repeat these steps for all 3 servers.

We then configure replication between DJ1 and DJ3, and the command hangs:

DJ1/opendj/bin/dsreplication configure --host1 localhost --port1 4444 --bindDN1 "cn=Directory Manager" --bindPassword1 "password" --replicationPort1 8989 --host2 localhost --port2 4446 --bindDN2 "cn=Directory Manager" --bindPassword2 "password" --replicationPort2 8991 -b dc=com -I admin -w password  -X -n

The replication log shows the following (note the SSL handshake failure):

[16/Aug/2019:15:01:31 +0200] category=SYNC severity=NOTICE msgID=204 msg=Replication server RS(Jan_Farrell) started listening for new connections on address 0.0.0.0 port 8989
[16/Aug/2019:15:01:31 +0200] category=SYNC severity=INFORMATION msgID=207 msg=Replication server RS(Jan_Farrell) has accepted a connection from directory server DS(Jan_Farrell) for domain "dc=com" at /127.0.0.1:52488
[16/Aug/2019:15:01:31 +0200] category=SYNC severity=NOTICE msgID=62 msg=Directory server DS(Jan_Farrell) has connected to replication server RS(Jan_Farrell) for domain "dc=com" at 127.0.0.1:8989 with generation ID 3079061
[16/Aug/2019:15:01:31 +0200] category=SYNC severity=INFORMATION msgID=105 msg=Replication server accepted a connection from /127.0.0.1:52490 to local address 0.0.0.0/0.0.0.0:8989 but the SSL handshake failed. This is probably benign, but may indicate a transient network outage or a misconfigured client application connecting to this replication server. The error was: Received fatal alert: certificate_unknown
...
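To triage runs like this one, a minimal Python sketch can pull the TLS error text out of such log lines. It assumes every entry follows the `[timestamp] category=... severity=... msgID=... msg=...` layout quoted above; `parse_line` and `handshake_failures` are hypothetical helper names, not part of OpenDJ or the PyBot test framework.

```python
import re

# Assumed layout, taken from the replication log lines quoted above:
# [timestamp] category=... severity=... msgID=... msg=...
LOG_LINE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]\s+"
    r"category=(?P<category>\S+)\s+"
    r"severity=(?P<severity>\S+)\s+"
    r"msgID=(?P<msg_id>\d+)\s+"
    r"msg=(?P<msg>.*)$"
)


def parse_line(line):
    """Split one log entry into its fields; return None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def handshake_failures(lines):
    """Return the TLS error text of every 'SSL handshake failed' entry."""
    failures = []
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue  # skip elisions such as "..." or unrelated lines
        if "SSL handshake failed" in entry["msg"]:
            # The underlying TLS error follows the "The error was: " marker.
            _, _, error = entry["msg"].partition("The error was: ")
            failures.append(error)
    return failures
```

Applied to the excerpt above, this yields the single `Received fatal alert: certificate_unknown` error. In Java's TLS stack, "Received fatal alert" means the alert came from the remote peer, i.e. the connecting side did not accept the certificate presented by the replication server.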

To reproduce:

In config.cfg, set the following to trigger upgrade mode:

[OpenDJ]
...
version = ["5.0.0", "7.0.0-SNAPSHOT"]

Then run:
./run-pybot.py -nvs replication_group3.Failover -t Failover_One_Server_Down_(Stopped)_Add_Op opendj


 Comments   
Comment by Matthew Swift [ 16/Aug/19 ]

Replication (or at least configuring it) is not yet supported for upgraded servers. See OPENDJ-6346.

The use case described in the test seems a bit strange, though. Why would you install 6.5 servers, upgrade them, and then configure replication? I was expecting the following two migration/upgrade use cases:

  1. starting with a 6.5 topology, upgrade one of the replicas to 7.0
  2. starting with a 6.5 topology, install a new 7.0 server and configure it to join the 6.5 topology.

Comment by Matthew Swift [ 16/Aug/19 ]

I suggest we flag this as a known issue until we have a better understanding of the 6.5 -> 7.0 migration use cases.

Comment by carole forel [ 16/Aug/19 ]

We have other tests that cover the use cases you mentioned.
In this particular case, we switch all the test suites to upgrade mode, meaning that we run everything with upgraded servers.
This worked until yesterday.
It's not a mixed topology, as all the servers are 7.0.0.

Comment by Matthew Swift [ 16/Aug/19 ]

How many tests are failing, Carole? The new 7.0 security model assumes that replicas share a common CA, so dsreplication no longer copies public certificates.

Comment by Matthew Swift [ 19/Aug/19 ]

Ludovic Poitou agrees that this is not a valid use case and we do not need to support it. The only valid migration use cases are the two described in my comment above.

Therefore, I'm going to close this issue as not a bug, although we'll need to update the functional tests to avoid testing this scenario. Is that OK?

Comment by Matthew Swift [ 07/Nov/19 ]

Moved to closed state because the fixVersion has already been released.

Generated at Mon Mar 01 10:23:39 UTC 2021 using Jira 7.13.12#713012-sha1:6e07c38070d5191bbf7353952ed38f111754533a.