[OPENDJ-567] Better support for replication on multi-homed servers Created: 14/Aug/12  Updated: 08/Nov/19  Resolved: 17/Aug/18

Status: Done
Project: OpenDJ
Component/s: replication
Affects Version/s: 2.4.6
Fix Version/s: 6.5.0

Type: New Feature
Priority: Critical
Reporter: Chris Ridd
Assignee: Yannick Lecaillez
Resolution: Fixed
Votes: 2
Labels: None
Environment:

OS has multiple NICs. One machine has 192.168.7.10 and 192.168.7.11, the other has 192.168.7.20 and 192.168.7.21.


Issue Links:
Duplicate
is duplicated by OPENDJ-332 In Replication server Replication Ser... Done
is duplicated by OPENDJ-2163 Listen address is not used for replic... Done
Relates
relates to OPENDJ-4443 Developer task: Use per-process UUID ... Closed
relates to OPENDJ-5359 Replication server has got same serve... Done
is related to OPENDJ-875 Use of hostnames in replication proto... Done
is related to OPENDJ-2606 Clean up and simplify the replication... Dev backlog
Sub-Tasks:
Key          Summary                                   Type      Status  Assignee
OPENDJ-5127  QA Task: Allow separate listen addres...  Sub-task  Closed  Ondrej Fuchsik
OPENDJ-5123  QA Task: Use per-process UUID to uniq...  Sub-task  Closed  Ondrej Fuchsik
OPENDJ-4443  Developer task: Use per-process UUID ...  Sub-task  Closed  Yannick Lecaillez
Epic Link: User Friendly Replication 6.5
Story Points: 14
Dev Assignee: Yannick Lecaillez
QA Assignee: Ondrej Fuchsik
Support Ticket IDs:

 Description   

Each OpenDJ instance is on a separate multi-homed server, and replication causes a large amount of logging:

[14/Aug/2012:14:21:55 +0200] category=SYNC severity=SEVERE_ERROR msgID=14942263 msg=In Replication server Replication Server 50889 18477: replication servers 192.168.7.11:50889 and 192.168.7.10:50889 have the same ServerId : 3225


 Comments   
Comment by Matthew Swift [ 11/Oct/12 ]

I think that this can be resolved in three steps:

  1. remove the check for duplicate serverIds and just assume that two servers with the same serverId are the same server. dsreplication should ensure that a topology does not contain duplicate serverIds; however, duplicates may still occur when an existing replica is cloned and nobody assigns the clone a new serverId
  2. rev the replication protocol version and modify the handshake a bit in order to check for duplicate serverIds. The new handshake would proceed as follows:
    1. on start up, a server generates a UUID
    2. during handshake a server sends its serverId along with its UUID
    3. peer stores the serverId/UUID pair
    4. if there is already a record for the serverId having a different UUID then we have duplicate serverIds - configuration error. Otherwise, we can assume that we are communicating with the same server.
  3. allow users to configure a specific replication server listen address. Do we want to allow for multiple listen addresses?

Update: items 1 and 2 will be addressed in OPENDJ-4443.
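
The duplicate check in step 2 can be sketched in Java as follows. This is a minimal illustration of the serverId/UUID bookkeeping only, not OpenDJ code: the class name PeerRegistry and the method register are hypothetical, and the real handshake carries much more state.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical registry of serverId/UUID pairs seen during handshakes (not OpenDJ code). */
public class PeerRegistry {
    private final Map<Integer, UUID> processIdByServerId = new ConcurrentHashMap<>();

    /**
     * Records the pair announced by a peer during the handshake.
     *
     * @return true if the serverId is new or already mapped to the same UUID,
     *         false if a different process already announced this serverId
     */
    public boolean register(int serverId, UUID processId) {
        UUID known = processIdByServerId.putIfAbsent(serverId, processId);
        return known == null || known.equals(processId);
    }

    public static void main(String[] args) {
        PeerRegistry registry = new PeerRegistry();
        UUID original = UUID.randomUUID(); // generated once, when the server starts
        UUID clone = UUID.randomUUID();    // a cloned replica that kept serverId 3225

        System.out.println(registry.register(3225, original)); // first contact
        System.out.println(registry.register(3225, original)); // same process, e.g. via another NIC
        System.out.println(registry.register(3225, clone));    // duplicate serverId: configuration error
    }
}
```

With this bookkeeping, a second connection from the same process is accepted whatever source address it arrives from, which is the multi-homed case reported in this issue.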

Comment by vins [X] (Inactive) [ 12/Oct/12 ]

We worked around this temporarily at the OS level using iptables (on Linux):
server 1:
/sbin/iptables -t nat -A POSTROUTING -s 192.168.62.231 -j SNAT --to-source 192.168.62.239
/sbin/iptables -A FORWARD -p tcp -o eth0 --dport 8989 -m state --state NEW -j ACCEPT
server 2:
/sbin/iptables -t nat -A POSTROUTING -s 192.168.62.232 -j SNAT --to-source 192.168.62.244
/sbin/iptables -A FORWARD -p tcp -o eth0 --dport 8989 -m state --state NEW -j ACCEPT

Hope this could help someone while waiting for a new opendj release.
Matthew, as a user I vote for adding the step #3 of your solution.

Comment by Chris Ridd [ 21/Mar/13 ]

Another possible workaround is to set up static routes between the replication servers ("/sbin/route add ..."), although that will obviously become harder to manage as the number of RSes increases.

Comment by Chris Ridd [ 27/Jan/16 ]

Part of the replication layer reworking

Comment by Ludovic Poitou [ 14/Apr/17 ]

This comment addresses point 3.
Best practice in middleware is to separate customer-related network traffic from administrative traffic. It should be possible to configure replication to run on its own network (i.e. a replica should be able to listen on one or more specific interfaces). We need to make sure that what is known about each replica includes both its replication access and its client access (to manage referrals).

Comment by Fabio Pistolesi [ 24/May/18 ]

The change will need:

  • replace ds-cfg-replication-port with ds-cfg-replication-listen-address (optionally multi-valued)
  • dsreplication must propose options in the form host:port instead of a simple port, at least for the configure subcommand and possibly for the initialize and initialize-all subcommands
  • take care when using multiple listening sockets, one per address. Consider NIO's channels and selectors?

Should we modify status and dsreplication status output?
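
As an illustration of the "one listening socket per address" point, here is a minimal sketch using java.nio channels and a single selector. It is not the OpenDJ implementation: the class MultiAddressListener and its methods are hypothetical, and the demo binds two ephemeral ports on the loopback address so that it can run on any machine. On a multi-homed host the list would instead hold one address per interface.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical listener: one ServerSocketChannel per listen address, one shared Selector. */
public class MultiAddressListener implements AutoCloseable {
    private final Selector selector = Selector.open();
    private final List<ServerSocketChannel> channels = new ArrayList<>();

    public MultiAddressListener(List<InetSocketAddress> listenAddresses) throws IOException {
        for (InetSocketAddress address : listenAddresses) {
            ServerSocketChannel channel = ServerSocketChannel.open();
            channels.add(channel);
            channel.configureBlocking(false); // required before registering with a selector
            channel.bind(address);
            channel.register(selector, SelectionKey.OP_ACCEPT);
        }
    }

    /** Returns the addresses actually bound (useful when port 0 was requested). */
    public List<InetSocketAddress> boundAddresses() throws IOException {
        List<InetSocketAddress> bound = new ArrayList<>();
        for (ServerSocketChannel channel : channels) {
            bound.add((InetSocketAddress) channel.getLocalAddress());
        }
        return bound;
    }

    /** Waits up to timeoutMillis, then accepts one connection per listening socket that is ready. */
    public List<SocketChannel> acceptPending(long timeoutMillis) throws IOException {
        List<SocketChannel> accepted = new ArrayList<>();
        if (selector.select(timeoutMillis) == 0) {
            return accepted;
        }
        Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
        while (keys.hasNext()) {
            SelectionKey key = keys.next();
            keys.remove();
            if (key.isAcceptable()) {
                SocketChannel client = ((ServerSocketChannel) key.channel()).accept();
                if (client != null) {
                    accepted.add(client);
                }
            }
        }
        return accepted;
    }

    @Override
    public void close() throws IOException {
        for (ServerSocketChannel channel : channels) {
            channel.close();
        }
        selector.close();
    }

    public static void main(String[] args) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        List<InetSocketAddress> requested = Arrays.asList(
                new InetSocketAddress(loopback, 0),
                new InetSocketAddress(loopback, 0));
        try (MultiAddressListener listener = new MultiAddressListener(requested)) {
            List<SocketChannel> clients = new ArrayList<>();
            for (InetSocketAddress bound : listener.boundAddresses()) {
                System.out.println("listening on " + bound);
                clients.add(SocketChannel.open(bound)); // blocking connect
            }
            for (SocketChannel accepted : listener.acceptPending(1000)) {
                System.out.println("accepted on " + accepted.getLocalAddress());
                accepted.close();
            }
            for (SocketChannel client : clients) {
                client.close();
            }
        }
    }
}
```

A single selector thread keeps the accept logic in one place regardless of how many addresses are configured; a thread per blocking ServerSocket would be the simpler alternative when only a few addresses are expected.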

Comment by Mark Craig [ 27/Jun/18 ]

Yannick Lecaillez, With a build from master, I'm seeing this in my test that tries to set listen-address:

not ok 23 set listen address
# (from function `check_expected_status' in file admin-guide/../bats_helper.bash, line 27,
#  in test file admin-guide/replication.bats, line 905)
#   `check_expected_status "0" "$status"' failed
# Failed with exit status 1
# Output: 
# Object Class Violation: Entry cn=replication server,cn=Multimaster
# Synchronization,cn=Synchronization Providers,cn=config cannot be modified
# because the resulting entry would have violated the server schema: Entry
# "cn=replication server,cn=Multimaster Synchronization,cn=Synchronization
# Providers,cn=config" violates the schema because it contains attribute
# "ds-cfg-listen-address" which is not allowed by any of the object classes in
# the entry

What I see in the schema is indeed no ds-cfg-listen-address in the object class:

objectClasses: ( 1.3.6.1.4.1.26027.1.2.64
  NAME 'ds-cfg-replication-server'
  SUP top
  STRUCTURAL
  MUST ds-cfg-listen-port
  MAY ( ds-cfg-replication-server-id $
        ds-cfg-replication-server $
        cn $
        ds-cfg-queue-size $
        ds-cfg-replication-db-directory $
        ds-cfg-disk-low-threshold $
        ds-cfg-disk-full-threshold $
        ds-cfg-replication-purge-delay $
        ds-cfg-group-id $
        ds-cfg-assured-timeout $
        ds-cfg-degraded-status-threshold $
        ds-cfg-weight $
        ds-cfg-monitoring-period $
        ds-cfg-compute-change-number $
        ds-cfg-source-address $
        ds-cfg-cipher-transformation $
        ds-cfg-cipher-key-length $
        ds-cfg-confidentiality-enabled)
  X-ORIGIN 'OpenDS Directory Server' )

My master branch seems to be up to date.

Comment by Jean-Noël Rouvignac [ 02/Jul/18 ]

Mark Craig, the small issue you reported has been fixed by https://stash.forgerock.org/projects/OPENDJ/repos/opendj/commits/88237e65f9079871842d1f3bf93a4584cff309ef

Comment by Ondrej Fuchsik [ 19/Jul/18 ]

Reproduced the problem mentioned in customer ticket using two lab machines (DSRS on each). 

Hosts file content:

172.16.204.138 comtebis
172.16.204.147 beaufortbis

The new feature works, as shown by the netstat output after listen-address is set:

netstat -na | grep 8989
tcp6 0 0 172.16.204.138:8989 :::* LISTEN
...

On the other hand, it does not resolve the reproduced issue, so the customer's issue is probably not solved by this.

NOTE: After a chat with Fabio, it seems that the customer issue is a specific case of a multi-homing and routing problem, which cannot be solved by setting the listen-address alone.

Comment by Jean-Noël Rouvignac [ 23/Jul/18 ]

Ondrej Fuchsik I think OPENDJ-4443 (being worked on) will definitely solve the problem reported by customers.

Comment by Matthew Swift [ 24/Jul/18 ]

Can this issue be closed now that all sub-tasks are completed?

Comment by Jean-Noël Rouvignac [ 24/Jul/18 ]

After discussion, it is unclear whether the customer issue will definitely be fixed by OPENDJ-4443.

Inside ReplicationServerDomain, pay attention to the second check (the host:port comparison that follows the null check):

    public boolean isAlreadyConnectedToRS(ReplicationServerHandler rsHandler) throws LdapException {
        ReplicationServerHandler oldRsHandler = connectedRSs.get(rsHandler.getPeerServerId());
        if (oldRsHandler == null) {
            return false;
        }

        if (oldRsHandler.getPeerServerHostPort().equals(rsHandler.getPeerServerHostPort())) {
            // this is the same server, this means that our ServerStart messages
            // have been sent at about the same time and 2 connections have been established.
            // Silently drop this connection.
            return true;
        }

        // looks like two replication servers have the same serverId
        // log an error message and drop this connection.
        LocalizableMessage message = ERR_DUPLICATE_REPLICATION_SERVER_ID.get(localReplicationServer.getName(),
                                                                             oldRsHandler.getPeerServerHostPort(),
                                                                             rsHandler.getPeerServerHostPort(),
                                                                             rsHandler.getPeerServerId());
        throw newLdapException(OTHER, message);
    }

Maybe it will need to be updated like this:

        if (oldRsHandler.getPeerProcessId().equals(rsHandler.getPeerProcessId())
                // backward compatibility with DJ < 6.5.0
                || oldRsHandler.getPeerServerHostPort().equals(rsHandler.getPeerServerHostPort())) {

Remember we renamed this issue so the naming no longer matches 100% the problem reported by the customer. So keeping the issue open until OPENDJ-4443 is fixed serves as a reminder to test this.
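
The effect of the proposed condition on the multi-homed case can be illustrated with a self-contained sketch. Everything here is hypothetical and stands in for the real handlers: the Peer class, the isSameServer method, and the assumption that a peer older than 6.5.0 reports no process ID (modelled as null). The addresses and port are the ones from the log message in the description.

```java
import java.util.UUID;

/** Hypothetical stand-in for the check proposed above (not OpenDJ code). */
public class SamePeerCheck {

    /** What a handler is assumed to know about its peer. */
    public static final class Peer {
        final String hostPort;
        final UUID processId; // assumption: null when the peer is older than 6.5.0

        public Peer(String hostPort, UUID processId) {
            this.hostPort = hostPort;
            this.processId = processId;
        }
    }

    /** Mirrors the proposed condition: same process ID, or same host:port for older peers. */
    public static boolean isSameServer(Peer existing, Peer incoming) {
        boolean sameProcess = existing.processId != null
                && existing.processId.equals(incoming.processId);
        return sameProcess || existing.hostPort.equals(incoming.hostPort);
    }

    public static void main(String[] args) {
        UUID process = UUID.randomUUID();

        // One multi-homed server seen through two of its addresses.
        Peer viaFirstNic = new Peer("192.168.7.10:50889", process);
        Peer viaSecondNic = new Peer("192.168.7.11:50889", process);
        System.out.println(isSameServer(viaFirstNic, viaSecondNic)); // silently drop the extra connection

        // Two older peers without process IDs: only host:port can be compared.
        Peer legacyA = new Peer("192.168.7.10:50889", null);
        Peer legacyB = new Peer("192.168.7.11:50889", null);
        System.out.println(isSameServer(legacyA, legacyB)); // still reported as a duplicate serverId

        // Two distinct processes sharing a serverId.
        Peer other = new Peer("192.168.7.20:50889", UUID.randomUUID());
        System.out.println(isSameServer(viaFirstNic, other)); // genuine duplicate serverId
    }
}
```

Note that the fallback keeps the old behaviour for older peers, so the spurious duplicate-serverId error would only disappear once both ends of the connection send a process ID.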

Comment by Matthew Swift [ 24/Jul/18 ]

"Remember we renamed this issue so the naming no longer matches 100% the problem reported by the customer. So keeping the issue open until OPENDJ-4443 is fixed serves as a reminder to test this."

I'm sorry, but this is very confusing. This issue is titled "Allow separate listen addresses for replication on multi-homed servers". As far as I can see that work has been completed: DJ now supports multiple listen addresses. It seems that the goal of this issue has changed since it was first created, because the title no longer reflects the support tickets. It would also seem that OPENDJ-4443 is required in order to resolve this issue, in which case we may want to consider converting OPENDJ-4443 to a dev sub-task of this issue.

I suggest that we make the following changes:

  • revert the title change to this issue in order to better reflect the problem described in the support tickets
  • rename OPENDJ-5126 (the dev task) to "Developer task: Allow separate listen addresses for replication on multi-homed servers"
  • convert OPENDJ-4443 to a developer sub-task of this issue.

What do you think?

Comment by carole forel [ 24/Jul/18 ]

I agree this is confusing and I agree with your proposal.

Comment by Jean-Noël Rouvignac [ 24/Jul/18 ]

OK, I am doing it now.

Comment by Matthew Swift [ 24/Jul/18 ]

Thanks Jean-Noel!

Comment by Ondrej Fuchsik [ 17/Aug/18 ]

Resolving this, because all dev, doc and qa tasks are done.

Comment by Ondrej Fuchsik [ 17/Aug/18 ]

Dev, doc and qa tasks are done.
