It should be possible for users to detect and be alerted when a replica falls behind the purge delay. However, when a replica rejoins a topology and it is already behind the purge delay, there are no errors or warnings and dsrepl status indicates that the topology is healthy, even though this is not the case. Sometimes the late replica is able to continue to accept updates, but sometimes it is not. The exact behavior seems sensitive to whether the change number index is enabled or not (possibly because replica DBs are purged more aggressively when it is disabled).
Steps to reproduce:
- create a 2-way topology
- disable the change number index
- set the purge delay to 30 seconds
- perform an addrate at 100/s
- observe growth in the changelogs using watch tree -lh changelogdb
- stop second server
- wait for purge delay to expire and observe rotation of changelog on first server as the addrate continues (i.e. wait until it should no longer be possible for the second server to rejoin)
- restart the second server
- observe that the second server seems to work just fine, including in dsrepl status, with no errors in the error or replication logs
- observe that the number of entries under the base entry has significantly diverged between the two servers
I've attached a script which automates steps 1-4 and also enables the non-JSON access loggers in combined mode for easier debugging.
The replica should detect that it cannot be recovered because its ds-sync-state is behind the purge delay of the RS it is connecting to. However, in this case the servers are combined DS/RS, so the RS appears valid to its local DS; the failure should instead occur when connecting to the remote RS.
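As a rough illustration of the check I would expect, here is a minimal Python sketch. It is not the server's actual logic: the function name and parameters are hypothetical, and it assumes the replica's state can be reduced to the timestamp of its newest applied change, whereas the real server would compare against the changes actually retained in the remote changelog.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_behind_purge_delay(
    last_applied_change: datetime,
    purge_delay: timedelta,
    now: Optional[datetime] = None,
) -> bool:
    """Return True if the replica's newest applied change predates the purge window.

    Hypothetical helper: a simplified model in which any change older than
    (now - purge_delay) may already have been purged from the remote changelog.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    # Oldest point in time that the remote RS is still expected to retain.
    oldest_retained = now - purge_delay
    return last_applied_change < oldest_retained
```

With the 30 second purge delay from the steps above, a replica that was stopped for a couple of minutes would be flagged, and that is the point at which an error should be logged and surfaced in dsrepl status.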
At least, something should fail somewhere! It would be nice if this was reported to dsrepl status as well.