There are three replicating DS+RS servers. Client read and write traffic is generally directed at the primary and secondary servers, while the third (backup) server does almost nothing but update its changelogDb.
There is significant write load on the "ou=rprofiles,dc=example,dc=com" domain. Each change is approx 70KB. All three servers have a 24 hour purge-delay.
Under conditions we have not yet identified, the changelogDb on one of the servers (most recently the backup server) starts growing until the disk is nearly full.
The changelogDb directories are approximately 150GB on the primary and secondary servers, and over 1TB on the backup server.
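As a rough sanity check on these sizes, the change rate implied by the ~150GB changelogDb can be worked out from the change size and the purge delay. The sketch below assumes the changelogDb is dominated by rprofiles changes and ignores JE log overhead; neither assumption has been verified, so treat the result as an order-of-magnitude figure only.

```python
# Figures taken from the report; everything else is an assumption.
CHANGE_SIZE_BYTES = 70 * 1000        # approx 70KB per change
PURGE_DELAY_SECONDS = 24 * 60 * 60   # 24 hour purge-delay


def implied_change_rate(changelog_bytes: float) -> float:
    """Changes per second needed to fill changelog_bytes in one purge window."""
    return changelog_bytes / CHANGE_SIZE_BYTES / PURGE_DELAY_SECONDS


healthy_bytes = 150e9   # ~150GB on the primary and secondary servers
backup_bytes = 1e12     # just over 1TB on the backup server (lower bound)

rate = implied_change_rate(healthy_bytes)
excess_ratio = backup_bytes / healthy_bytes

print(f"implied rate: {rate:.1f} changes/s")
print(f"backup is at least {excess_ratio:.1f}x the healthy size")
```

Under these assumptions the healthy servers correspond to roughly 25 changes per second, and the backup server holds at least six to seven times the data that a single 24 hour purge window should retain.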
We dumped the rprofiles changes from all 3 servers:
So it looks as though the checkpointer threads on the backup server are properly trimming old changes. The disparity in sizes is due to the backup server having been running a bit longer than the other servers.
However, the cleaner threads appear to be in serious trouble: they are not catching up.
We observe the following messages frequently in the backup server's changelogDb/je.info.0 file:
Note the high `lnSizeCorrectionFactor` value. The "backlog" seems to be roughly the number of files to clean.
We also see this frequently:
The cause of that second message is not clear. An online backup does *not* access the environment in this way.
Using `DbSpace -r`, we can see that there are a large number of files with very low occupancy. Top 10:
The changelogDb uses only 2 cleaner threads, and this is not presently configurable. However, note that the other 2 servers are able to keep up with this trim load, so I don't believe the number of cleaner threads is the issue.
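To put a number on how far behind the cleaner is, the per-file figures reported by DbSpace can be totalled into a count of low-occupancy files and an estimate of reclaimable space. The following is only a sketch: the `LogFile` type, the `summarize` helper and the sample rows are hypothetical (they are not values from the affected server), and the 50% threshold is assumed to match the JE default for `je.cleaner.minUtilization`.

```python
from typing import Iterable, NamedTuple, Tuple


class LogFile(NamedTuple):
    """One row of per-file utilization (hypothetical representation)."""
    name: str
    size_kb: int
    pct_used: int


def summarize(files: Iterable[LogFile], threshold_pct: int = 50) -> Tuple[int, float]:
    """Return (number of files below threshold, estimated reclaimable KB)."""
    low = [f for f in files if f.pct_used < threshold_pct]
    reclaimable_kb = sum(f.size_kb * (100 - f.pct_used) / 100 for f in low)
    return len(low), reclaimable_kb


# Made-up sample rows for illustration only.
sample = [
    LogFile("00000001", 10000, 2),
    LogFile("00000002", 10000, 90),
    LogFile("00000003", 10000, 10),
]

low_count, reclaimable_kb = summarize(sample)
print(low_count, reclaimable_kb)
```

Running the same summary against the real DbSpace output from all three servers would show whether the backup server's backlog is growing or merely large.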