[OPENDJ-4221] same server ID for replication servers Created: 16/Aug/17  Updated: 31/Jul/20

Status: Dev backlog
Project: OpenDJ
Component/s: replication
Affects Version/s: 6.0.0, 5.5.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Ondrej Fuchsik Assignee: Fabio Pistolesi
Resolution: Unresolved Votes: 0
Labels: not-for-release-notes
Environment:

OpenDJ-4.1.0 rev (587cbba8a19)



 Description   

We have this issue randomly in replication and proxy groups:

[15/Aug/2017:04:15:30 +0100] category=SYNC severity=ERROR msgID=54 msg=In Replication server Replication Server 8989 13993: servers Replica DS(17099) for domain "cn=admin data" and Replica DS(17099) for domain "cn=admin data" have the same ServerId : 17099


 Comments   
Comment by Fabio Pistolesi [ 15/Sep/17 ]

The message implies either

  • DS has disconnected, but before RS finished processing disconnect, DS has connected again.
  • DS is trying to re-connect because first connection has timed out, RS is still processing first connection.

Would be interesting to see DS side errors file.

Comment by Fabio Pistolesi [ 22/Sep/17 ]

Trying to Reproduce the problem did not succeed. Moreover it does not seem to appear anymore in daily tests.
Will reopen if and when we have DS side error log.

Comment by Ondrej Fuchsik [ 30/Jan/18 ]

We are hitting the issue in daily tests for version 6.0.0.

Comment by Jean-Noël Rouvignac [ 09/Feb/18 ]

May be fixed in the future by OPENDJ-3973

Comment by Christophe Sovant [ 21/Mar/18 ]

Jean-Noël Rouvignac It seems that this is not the expected related issue.

Anyway we still have the issue on 6.0.0-SNAPSHOT (rev 2ab2c8a04e7)

Comment by Jean-Noël Rouvignac [ 21/Mar/18 ]

Sorry the comment was too laconic, I meant "may be fixed in the future by ...". I have updated the comment to better reflect that.

Comment by Jean-Noël Rouvignac [ 24/Mar/18 ]

How are setup the replication servers?

Is the test using setup replication-server?

Comment by Fabio Pistolesi [ 03/May/18 ]

Is it happening at shutdown? If it is, it might be related to OPENDJ-5002, since the message is logged when an RS processes a start message from a DS.

Comment by Fabio Pistolesi [ 06/Nov/18 ]

I think the root cause, or at least one of them, for this one is pretty simple. During handshake if the DS decides for any reason not to continue the handshake after the RS sent the `TopologyMsg` at end of `DataServerHandler#startFromRemoteDS()`, we have the RS thinking everything is fine and considers the DS connected, while the DS is actually going to reconnect again. Since the handshake failed on DS side none of the auxiliary threads are started so nor heartbeat nor `ChangeTimeHeartbeat` is running. At this point the RS will not receive anything on the socket, causing it to sleep. When the DS reconnects, the RS thinks it is still connected.

Why is a DS not continuing the handshake depends on a case by case basis. I have two traces.

Case 1

Traces come from another bug, we still see an error about "same server ID" :

[16/Oct/2018:09:56:28 +0200] category=SYNC severity=NOTICE msgID=62 msg=Directory server DS(3110) has connected to replication server RS(32279) for domain "uid=Monitor" at 127.0.1.1:8989 with generation ID 12638
[16/Oct/2018:09:56:28 +0200] category=BACKEND severity=NOTICE msgID=515 msg=Exported 165 entries and skipped 0 in 0 seconds (average rate 2171.1/sec)
[16/Oct/2018:09:56:33 +0200] category=SYNC severity=ERROR msgID=119 msg=Directory server DS(24769) encountered an unexpected error while connecting to replication server localhost:8989 for domain "dc=com": SocketTimeoutException: Read timed out (SocketInputStream.java:-2 SocketInputStream.java:116 SocketInputStream.java:171 SocketInputStream.java:141 BufferedInputStream.java:246 BufferedInputStream.java:286 BufferedInputStream.java:345 SessionImpl.java:308 SessionImpl.java:279 ReplicationBroker.java:1041 ReplicationBroker.java:738 ReplicationBroker.java:278 ReplicationDomain.java:2186 LDAPReplicationDomain.java:502 MultimasterReplication.java:308 MultimasterReplication.java:284 MultimasterReplication.java:355 MultimasterReplication.java:91 SynchronizationProviderConfigManager.java:175 SynchronizationProviderConfigManager.java:141 SynchronizationProviderConfigManager.java:83 ...)
[16/Oct/2018:09:56:33 +0200] category=SYNC severity=WARNING msgID=23 msg=Directory server DS(24769) was unable to connect to any of the following replication servers for domain "dc=com": 32279
[16/Oct/2018:09:56:33 +0200] category=PROTOCOL severity=NOTICE msgID=276 msg=Started listening for new connections on Administration Connector 0.0.0.0 port 4444
[16/Oct/2018:09:56:33 +0200] category=PROTOCOL severity=NOTICE msgID=276 msg=Started listening for new connections on LDAP 0.0.0.0 port 1389
[16/Oct/2018:09:56:33 +0200] category=PROTOCOL severity=NOTICE msgID=276 msg=Started listening for new connections on LDAPS 0.0.0.0 port 1636
[16/Oct/2018:09:56:33 +0200] category=CORE severity=NOTICE msgID=135 msg=The Directory Server has started successfully
[16/Oct/2018:09:56:33 +0200] category=CORE severity=NOTICE msgID=139 msg=The Directory Server has sent an alert notification generated by class org.opends.server.core.DirectoryServer (alert type org.opends.server.DirectoryServerStarted, alert ID org.opends.messages.core-135): The Directory Server has started successfully
[16/Oct/2018:09:56:33 +0200] category=SYNC severity=ERROR msgID=54 msg=In Replication server Replication Server 8989 32279: servers Replica DS(24769) for domain "dc=com" and Replica DS(24769) for domain "dc=com" have the same ServerId : 24769
[16/Oct/2018:09:56:34 +0200] category=SYNC severity=ERROR msgID=211 msg=The connection from this replication server RS(32279) to directory server DS(24769) at localhost/127.0.0.1:38338 for domain "dc=com" has failed
[16/Oct/2018:09:56:34 +0200] category=SYNC severity=NOTICE msgID=62 msg=Directory server DS(24769) has connected to replication server RS(32279) for domain "dc=com" at 127.0.1.1:8989 with generation ID 3079061

In here the DS seems not to receive the topology message in the allotted time (weird as it seems) finally giving up (`msgID=23`). In the same second it tries connecting again, making the RS complain about duplicate IDs. It succeeds later on once the RS has detected the closed socket and purged `connectedDSs` via `stopServer()`.
I also think `msgID=211` should read `to this RS from DS` as it is logged by an RS, and an RS does not connect to a DS.

Case 2

I modified the RS not to send a `TopologyMsg` at end of `startFromRemoteDS()`, I got the following.

[06/Nov/2018:16:28:35 +0100] category=SYNC severity=NOTICE msgID=204 msg=Replication server RS(1) started listening for new connections on address 0.0.0.0 port 8989
[06/Nov/2018:16:28:36 +0100] category=SYNC severity=ERROR msgID=119 msg=Directory server DS(1) encountered an unexpected error while connecting to replication server mwood.local:8989 for domain "dc=example,dc=com": ClassCastException: org.opends.server.replication.protocol.ReplicaOfflineMsg cannot be cast to org.opends.server.replication.protocol.TopologyMsg (ReplicationBroker.java:1040 ReplicationBroker.java:737 ReplicationBroker.java:277 ReplicationDomain.java:2186 LDAPReplicationDomain.java:496 MultimasterReplication.java:309 MultimasterReplication.java:285 MultimasterReplication.java:356 MultimasterReplication.java:92 SynchronizationProviderConfigManager.java:175 SynchronizationProviderConfigManager.java:141 SynchronizationProviderConfigManager.java:83 DirectoryServer.java:1365 DirectoryServer.java:4037)
[06/Nov/2018:16:28:36 +0100] category=SYNC severity=WARNING msgID=23 msg=Directory server DS(1) was unable to connect to any of the following replication servers for domain "dc=example,dc=com": 1, 2
[06/Nov/2018:16:28:36 +0100] category=PROTOCOL severity=NOTICE msgID=276 msg=Started listening for new connections on Administration Connector 0.0.0.0 port 4444
[06/Nov/2018:16:28:36 +0100] category=PROTOCOL severity=NOTICE msgID=276 msg=Started listening for new connections on LDAP 0.0.0.0 port 1389
[06/Nov/2018:16:28:36 +0100] category=CORE severity=NOTICE msgID=135 msg=The Directory Server has started successfully
[06/Nov/2018:16:28:36 +0100] category=CORE severity=NOTICE msgID=139 msg=The Directory Server has sent an alert notification generated by class org.opends.server.core.DirectoryServer (alert type org.opends.server.DirectoryServerStarted, alert ID org.opends.messages.core-135): The Directory Server has started successfully
[06/Nov/2018:16:28:36 +0100] category=SYNC severity=ERROR msgID=54 msg=In Replication server Replication Server 8989 1: servers Replica DS(1) for domain "cn=admin data" and Replica DS(1) for domain "cn=admin data" have the same ServerId : 1
[06/Nov/2018:16:28:36 +0100] category=SYNC severity=NOTICE msgID=62 msg=Directory server DS(1) has connected to replication server RS(2) for domain "cn=admin data" at 172.16.203.153:9989 with generation ID 155540
[06/Nov/2018:16:28:36 +0100] category=SYNC severity=ERROR msgID=54 msg=In Replication server Replication Server 8989 1: servers Replica DS(1) for domain "dc=example,dc=com" and Replica DS(1) for domain "dc=example,dc=com" have the same ServerId : 1
[06/Nov/2018:16:28:36 +0100] category=SYNC severity=NOTICE msgID=62 msg=Directory server DS(1) has connected to replication server RS(2) for domain "dc=example,dc=com" at 172.16.203.153:9989 with generation ID 19363629

DS receives a bad message during handshake, causing it to retry later on.
Note that I think there are unclear things in the output as I do not see a log I added when the connection is detected as closed RS side.

Generated at Mon Sep 21 16:22:42 UTC 2020 using Jira 7.13.12#713012-sha1:6e07c38070d5191bbf7353952ed38f111754533a.