[OPENDJ-2969] changelogDb could not be read on OpenDJ instance startup Created: 26/Apr/16  Updated: 18/Feb/20  Resolved: 21/Nov/16

Status: Done
Project: OpenDJ
Component/s: backends, core server, replication
Affects Version/s: 3.0.0
Fix Version/s: 4.0.0, 3.5.0

Type: Bug Priority: Critical
Reporter: Lee Trujillo Assignee: Nicolas Capponi
Resolution: Fixed Votes: 0
Labels: Verified, release-notes
Environment:

So far, Linux only.


Attachments: HTML File catch-2969    
Issue Links:
Backport
is backported by OPENDJ-2993 Backport OPENDJ-2969: changelogDb cou... Done
is backported by OPENDJ-3012 Backport OPENDJ-2969: changelogDb cou... Done
Duplicate
is duplicated by OPENDJ-2978 RuntimeException while stressing the ... Done
is duplicated by OPENDJ-2991 ChangelogException while doing modify... Done
QA Assignee: Christophe Sovant
Support Ticket IDs:
Backports: OPENDJ-2993 (Not applicable) OPENDJ-3012 (3.0.1)

 Description   

OpenDJ reports the following error during and after startup of the instance.

msg=The replication server failed to start because the database /opt/app/opendj/changelogDb could not be read...

There are no space issues at the time. Open file limits were not ruled out.

Full log:

[21/Apr/2016:23:00:48 +0200] category=UTIL severity=NOTICE msgID=org.opends.messages.runtime.21 msg=Installation Directory:  /opt/app/opendj
[21/Apr/2016:23:00:48 +0200] category=UTIL severity=NOTICE msgID=org.opends.messages.runtime.23 msg=Instance Directory:      /opt/app/opendj
[21/Apr/2016:23:00:48 +0200] category=UTIL severity=NOTICE msgID=org.opends.messages.runtime.17 msg=JVM Information: 1.8.0_77-b03 by Oracle Corporation, 6
4-bit architecture, 2058354688 bytes heap size
[21/Apr/2016:23:00:48 +0200] category=UTIL severity=NOTICE msgID=org.opends.messages.runtime.18 msg=JVM Host: opendj1.example.com, running Linux 2.6.32-57
3.18.1.el6.x86_64 amd64, 4014825472 bytes physical memory size, number of processors available 2
[21/Apr/2016:23:00:48 +0200] category=UTIL severity=NOTICE msgID=org.opends.messages.runtime.19 msg=JVM Arguments: "-Xms2G", "-Xmx2G", "-Dorg.opends.serve
r.scriptName=start-ds"
[21/Apr/2016:23:00:48 +0200] category=PLUGGABLE severity=NOTICE msgID=org.opends.messages.backend.513 msg=The database backend userRoot containing 55697 e
ntries has started
[21/Apr/2016:23:00:50 +0200] category=EXTENSIONS severity=NOTICE msgID=org.opends.messages.extension.221 msg=DIGEST-MD5 SASL mechanism using a server full
y qualified domain name of: opendj1.example.com
[21/Apr/2016:23:00:50 +0200] category=SYNC severity=ERROR msgID=org.opends.messages.replication.274 msg=The following log '/opt/app/opendj/changelogDb/2.d
om/30966.server' must be released but it is not referenced."
[21/Apr/2016:23:00:50 +0200] category=SYNC severity=ERROR msgID=org.opends.messages.replication.11 msg=The replication server failed to start because the 
database /opt/app/opendj/changelogDb could not be read : Could not get or create replica DB for baseDN 'dc=example,dc=com', serverId '30966', generationId
 '102871'
[21/Apr/2016:23:00:51 +0200] category=SYNC severity=NOTICE msgID=org.opends.messages.replication.204 msg=Replication server RS(17907) started listening fo
r new connections on address 0.0.0.0 port 8989
[21/Apr/2016:23:00:51 +0200] category=SYNC severity=NOTICE msgID=org.opends.messages.replication.62 msg=Directory server DS(28115) has connected to replic
ation server RS(17907) for domain "cn=admin data" at opendj1.example.com/10.10.10.87:8989 with generation ID 162314
[21/Apr/2016:23:00:52 +0200] category=SYNC severity=NOTICE msgID=org.opends.messages.replication.62 msg=Directory server DS(30966) has connected to replic
ation server RS(17907) for domain "dc=example,dc=com" at opendj1.example.com/10.10.10.87:8989 with generation ID 102871
[21/Apr/2016:23:00:52 +0200] category=SYNC severity=NOTICE msgID=org.opends.messages.replication.62 msg=Directory server DS(19144) has connected to replic
ation server RS(17907) for domain "cn=schema" at opendj1.example.com/10.10.10.87:8989 with generation ID 8408
[21/Apr/2016:23:00:52 +0200] category=PROTOCOL severity=NOTICE msgID=org.opends.messages.protocol.276 msg=Started listening for new connections on Adminis
tration Connector 0.0.0.0 port 4444
[21/Apr/2016:23:00:52 +0200] category=PROTOCOL severity=NOTICE msgID=org.opends.messages.protocol.276 msg=Started listening for new connections on LDAP Co
nnection Handler 0.0.0.0 port 51389
[21/Apr/2016:23:00:52 +0200] category=CORE severity=NOTICE msgID=org.opends.messages.core.135 msg=The Directory Server has started successfully
[21/Apr/2016:23:00:52 +0200] category=CORE severity=NOTICE msgID=org.opends.messages.core.139 msg=The Directory Server has sent an alert notification gene
rated by class org.opends.server.core.DirectoryServer (alert type org.opends.server.DirectoryServerStarted, alert ID org.opends.messages.core-135): The Di
rectory Server has started successfully
[21/Apr/2016:23:01:54 +0200] category=SYNC severity=ERROR msgID=org.opends.messages.replication.274 msg=The following log '/opt/app/opendj/changelogDb/2.d
om/30966.server' must be released but it is not referenced."
[21/Apr/2016:23:01:54 +0200] category=SYNC severity=ERROR msgID=org.opends.messages.replication.26 msg=Error trying to use the underlying database. The Replication Server is going to shut down: ChangelogException: Could not get or create replica DB for baseDN 'dc=example,dc=com', serverId '30966', generationId '102871' (ReplicationEnvironment.java:387 FileReplicaDB.java:151 FileReplicaDB.java:129 FileChangelogDB.java:278 FileChangelogDB.java:199 FileChangelogDB.java:812 ReplicationServerDomain.java:513 ReplicationServerDomain.java:337 ServerHandler.java:1140 ServerReader.java:113)
[21/Apr/2016:23:01:54 +0200] category=SYNC severity=WARNING msgID=org.opends.messages.replication.63 msg=Replication server RS(17907) at opendj1.example.com/10.10.10.87:8989 has closed the connection to this directory server DS(30966). This directory server will now try to connect to another replication server in order to receive changes for the domain "dc=example,dc=com"
[21/Apr/2016:23:01:54 +0200] category=SYNC severity=WARNING msgID=org.opends.messages.replication.63 msg=Replication server RS(17907) at opendj1.example.com/10.10.10.87:8989 has closed the connection to this directory server DS(19144). This directory server will now try to connect to another replication server in order to receive changes for the domain "cn=schema"
[21/Apr/2016:23:01:54 +0200] category=SYNC severity=WARNING msgID=org.opends.messages.replication.63 msg=Replication server RS(17907) at opendj1.example.com/10.10.10.87:8989 has closed the connection to this directory server DS(28115). This directory server will now try to connect to another replication server in order to receive changes for the domain "cn=admin data"
[21/Apr/2016:23:01:54 +0200] category=SYNC severity=NOTICE msgID=org.opends.messages.replication.62 msg=Directory server DS(19144) has connected to replication server RS(8454) for domain "cn=schema" at opendj2.example.com/10.10.10.86:8989 with generation ID 8408
[21/Apr/2016:23:01:54 +0200] category=SYNC severity=NOTICE msgID=org.opends.messages.replication.62 msg=Directory server DS(30966) has connected to replication server RS(8454) for domain "dc=example,dc=com" at opendj2.example.com/10.10.10.86:8989 with generation ID 102871
[21/Apr/2016:23:01:54 +0200] category=SYNC severity=NOTICE msgID=org.opends.messages.replication.62 msg=Directory server DS(28115) has connected to replication server RS(8454) for domain "cn=admin data" at opendj2.example.com/10.10.10.86:8989 with generation ID 162314
[21/Apr/2016:23:02:17 +0200] category=SYNC severity=WARNING msgID=org.opends.messages.replication.63 msg=Replication server RS(8454) at opendj2.example.com/10.10.10.86:8989 has closed the connection to this directory server DS(30966). This directory server will now try to connect to another replication server in order to receive changes for the domain "dc=example,dc=com"
[21/Apr/2016:23:02:17 +0200] category=SYNC severity=WARNING msgID=org.opends.messages.replication.63 msg=Replication server RS(8454) at opendj2.example.com/10.10.10.86:8989 has closed the connection to this directory server DS(19144). This directory server will now try to connect to another replication server in order to receive changes for the domain "cn=schema"
[21/Apr/2016:23:02:17 +0200] category=SYNC severity=WARNING msgID=org.opends.messages.replication.208 msg=Directory server DS(30966) was unable to connect to any replication servers for domain "dc=example,dc=com"
[21/Apr/2016:23:02:17 +0200] category=SYNC severity=WARNING msgID=org.opends.messages.replication.208 msg=Directory server DS(19144) was unable to connect to any replication servers for domain "cn=schema"
[21/Apr/2016:23:02:17 +0200] category=SYNC severity=WARNING msgID=org.opends.messages.replication.63 msg=Replication server RS(8454) at opendj2.example.com/10.10.10.86:8989 has closed the connection to this directory server DS(28115). This directory server will now try to connect to another replication server in order to receive changes for the domain "cn=admin data"
[21/Apr/2016:23:02:17 +0200] category=SYNC severity=WARNING msgID=org.opends.messages.replication.208 msg=Directory server DS(28115) was unable to connect to any replication servers for domain "cn=admin data"

Disabling replication, cleaning out the full changelogDb path and enabling replication is enough to clear the issue.



 Comments   
Comment by Matthew Swift [ 02/May/16 ]

Has this issue been reported by a customer? Is there a case ID for it?

There's been a similar issue reported in our forums:

https://forgerock.org/topic/opendj-3-0-replication-issue/

Comment by Matthew Swift [ 02/May/16 ]

In fact there is a case ID. The agile board displays the "caseId" field not "cases". I'll fix it to avoid future confusion.

Comment by Nicolas Capponi [ 10/May/16 ]

Root cause is the same as issue OPENDJ-2978 : trying to read the newest record of a log file which has a file size which is a multiple of the internal block size (= 256)

For example, for one case:

opends/changelogDb/1.dom/3501.server:
total 460
drwxr-xr-x 2 ncapponi ncapponi   4096 May  2 22:54 .
drwxr-xr-x 4 ncapponi ncapponi   4096 Apr 12 17:19 ..
-rw-r--r-- 1 ncapponi ncapponi 462848 May  3 10:12 head.log

head log size is 462848, and 462848/256=1808 so it is a multiple of 256

When the server restart, the newest record is computed from the log file. It hits the issue, so there is an EOF.

I checked this for three cases, and each time the log file which can't be read has a size which is a multiple of 256.

The fix is available in the PR for OPENDJ-2978

Comment by Nicolas Capponi [ 11/May/16 ]

More information on the root cause:

[21/Apr/2016:23:01:54 +0200] category=SYNC severity=ERROR msgID=org.opends.messages.replication.26 msg=Error trying to use the underlying database. The Replication Server is going to shut down: ChangelogException: Could not get or create replica DB for baseDN 'dc=example,dc=com', serverId '30966', generationId '102871' 
(ReplicationEnvironment.java:387 
 FileReplicaDB.java:151 
 FileReplicaDB.java:129 ...)

ReplicationEnvironment.java:387 corresponds to ReplicationEnvironment#getOrCreateReplicaDB(final DN domainDN, final int serverId, final long generationId) method

  • successive calls are Log.openLog(), then Log constructor, then Log#createOrOpenLogFiles(), then Log#openHeadLogFile()
  • in Log#openHeadLogFile() :
    final Record<K,V> newestRecord = head.getNewestRecord();
  • the computation of the newest record fails with an EOF when file size is a multiple of 256

Note that this can happen for any log file, so the failure can happen for a Replica DB as well as for the change number index DB.

Comment by Nicolas Capponi [ 11/May/16 ]

Fixed by commit https://stash.forgerock.org/projects/OPENDJ/repos/opendj/commits/d8a0693ee610f39d28ccaf139869578b08c307ec

Comment by Nicolas Capponi [ 11/May/16 ]

I found a way to reproduce the bug, and found the fix is insufficient because there's another code path that make it fail : code that is veryfing file integrity at startup, ie BlockLogReader#checkLogIsValid() method.

Comment by Nicolas Capponi [ 11/May/16 ]

To reproduce the bug:

  • install two servers with replication enabled
  • loop until one of the head.log file under changelogDb has a size which is a multiple of 256
    • do a modify operation on an entry always on the same server (alternate two values), e.g.
      dn: uid=user.999,ou=People,o=example
      changetype: modify
      replace: mobile
      mobile: 012 (alternate: 123) 
      
  • when a file with a size multiple of 256 has been found, stop the servers and restart. The error occurs immediately.

Note: there is no way to ensure a multiple of 256 will ever appear. It is a good idea to output the rest : file size%256 to see if the same values occurs again and again.

Comment by Christophe Sovant [ 17/May/16 ]

Verified with an automated test (based on script from Nicolas) using OpenDJ 4.0.0 rev 053e00fd901.

Comment by Mark Craig [ 20/May/16 ]

Reopening to apply release-notes label

Comment by Quentin CASTEL [X] (Inactive) [ 20/Nov/16 ]

modification of the status, in order to migrate the 'Zendesk ID' field to 'Support Ticket ID' field.

Comment by Bernhard Thalmayr [ 12/Jan/17 ]

attached a script from Lee Trujillo, which can be used to find out if this bug is hit.

Generated at Sat Nov 28 13:03:40 UTC 2020 using Jira 7.13.12#713012-sha1:6e07c38070d5191bbf7353952ed38f111754533a.