[OPENDJ-5002] 200s timeout when stopping a replication server Created: 17/Apr/18  Updated: 08/Nov/19  Resolved: 24/Apr/18

Status: Done
Project: OpenDJ
Component/s: None
Affects Version/s: 6.0.0
Fix Version/s: 6.5.0

Type: Bug Priority: Major
Reporter: Viktor Nawrath [X] (Inactive) Assignee: Fabio Pistolesi
Resolution: Fixed Votes: 0
Labels: Verified

Attachments: File 5002_script.sh     File 5002_script2.sh     Text File image-2018-04-18-17-43-54-305.png    
Issue Links:
Backport
is backported by OPENDJ-5414 Backport OPENDJ-5002: 200s timeout wh... Done
Relates
is related to OPENDJ-4838 Error in server logs when unconfiguri... Done
Epic Link: Bugs 6.5
Story Points: 1
QA Assignee: Viktor Nawrath [X] (Inactive)
Backports: OPENDJ-5414 (5.5.3)

 Description   

Found with DS 6.0.0-RC1

Sometimes, when we try to stop a replication server, stop-ds reaches its 200s timeout and then reports an argument-parsing error saying that a port number must be specified. This happens randomly in many of our tests, but always while stopping a replication server.

./bin/stop-ds
-- rc --
returned 89, expected 0
-- stdout --
Stopping Server...
The timeout of '200' seconds to start the server has been reached. You can
use the argument '--timeout' to increase this timeout

-- stderr --
An error occurred while parsing the command-line arguments: A port number
must be specified to connect to the server

See "stop-ds --help" to get more usage help


 Comments   
Comment by Viktor Nawrath [X] (Inactive) [ 17/Apr/18 ]

The most consistent way to reproduce it is to run:

./run-pybot.py -c security -s security_issues.opendj4536.Replication -v DJ

a few times in a row.

Comment by Viktor Nawrath [X] (Inactive) [ 18/Apr/18 ]

I attached a script that I used to try to reproduce the issue more easily and more reliably than with pyforge. You just need to set up 2 DSs and set the ports in the script. It configures and unconfigures replication, then restarts both DSs to try to hit the issue, all in a loop.
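For readers without the attachment, the looping part of such a reproduction can be sketched roughly as follows. This is not the attached 5002_script.sh: `run_until_failure` is a hypothetical helper name, and the command it repeats (one configure/unconfigure/restart cycle) is left to the caller as an assumption.

```shell
#!/bin/sh
# Hypothetical helper, not taken from the attached script.
# Usage: run_until_failure MAX_TRIES CMD [ARGS...]
# Runs CMD repeatedly and stops at the first non-zero return code,
# printing the attempt number so the failure rate can be estimated.
run_until_failure() {
    max_tries=$1
    shift
    attempt=1
    while [ "$attempt" -le "$max_tries" ]; do
        rc=0
        "$@" || rc=$?
        if [ "$rc" -ne 0 ]; then
            echo "attempt $attempt: rc=$rc"
            return "$rc"
        fi
        attempt=$((attempt + 1))
    done
    echo "no failure in $max_tries attempts"
    return 0
}
```

A caller would wrap one full cycle in a shell function of its own (for example a hypothetical `one_cycle`) and invoke `run_until_failure 50 one_cycle`; the helper then reports the attempt on which the cycle first returned a non-zero code, such as the 89 shown in the description.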

I went through a few versions of the script, but in the end I found there are 2 general outcomes, depending on whether we provide connection parameters to stop-ds or not:

1) stop-ds without connection parameters

  • we hit the 200s timeout on the stop-ds eventually (~17 tries for me)

2) stop-ds with connection parameters

  • we don't have the 200s timeout, but sometimes the server takes longer to start (I added an ldapsearch loop to check the server is running before we continue)
  • eventually, the server doesn't start at all, and we get stuck in a loop of the ldapsearches
  • both servers are using more and more memory, and that continues even after I kill the script and we do no more operations on the servers
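The readiness check mentioned in outcome 2 can get stuck forever if the server never comes up. A minimal sketch of a bounded version is shown below, assuming a POSIX shell; `wait_until` is a hypothetical helper, not part of the attached script or of the OpenDJ tools, and the probe command is supplied by the caller.

```shell
#!/bin/sh
# Hypothetical helper, not taken from the attached script.
# Usage: wait_until TIMEOUT_SECONDS INTERVAL_SECONDS CMD [ARGS...]
# Re-runs CMD until it succeeds, sleeping INTERVAL_SECONDS between tries.
# Returns 0 as soon as CMD succeeds, or 1 once TIMEOUT_SECONDS have been
# spent waiting, so the caller never loops indefinitely.
wait_until() {
    timeout=$1
    interval=$2
    shift 2
    waited=0
    while ! "$@" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "gave up after ${waited}s" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}
```

The probe would be whatever ldapsearch invocation the script already uses to check that the server is running. With a bound in place, a server that never starts shows up as a clear failure of the loop iteration instead of an endless series of ldapsearches.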

Comment by Viktor Nawrath [X] (Inactive) [ 19/Apr/18 ]

One more update: instead of `stop-ds -R`, I'm now doing a separate stop and start. I was able to hit the 200s timeout on the second run, with this log message on the stderr of stop-ds:

[19/Apr/2018:10:21:33 +0200] category=SYNC severity=WARNING msgID=106 msg=Timed out waiting for monitor data for the domain "cn=schema" from replication server RS(11867)

I'm uploading the updated script I used for that.

Comment by Viktor Nawrath [X] (Inactive) [ 09/Nov/18 ]

Verified using 6.5.0-RC4 6f964a7cb1f

Generated at Mon Mar 01 23:05:07 UTC 2021 using Jira 7.13.12#713012-sha1:6e07c38070d5191bbf7353952ed38f111754533a.