When an RS starts up, it creates a thread to listen on the replication port. This thread runs a loop which exits when either the server is being shut down or the port is being reconfigured. Each iteration of the loop:
- accepts a new connection,
- does the SSL handshake, and then
- reads a ReplicationMsg.
If the ReplicationMsg is not a ServerStartMsg, ReplServerStartMsg or ServerStartECLMsg, the thread returns silently, and from then on nothing is left to accept connections on the replication port.
Nothing is logged when this happens, which seems like an omission. A message at ERROR level would be appropriate, because the replication service is now crippled.
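To illustrate the behaviour I would expect instead, here is a minimal, self-contained sketch. It is not the actual server code: the class, the `MsgType` enum and `runLoop` are hypothetical stand-ins, and the list of messages replaces the real accept / SSL handshake / read steps. It uses `java.util.logging`, which has no ERROR level, so `SEVERE` stands in for it. The point is only that an unexpected first message is logged and that connection is dropped, while the loop carries on listening.

```java
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

public class ListenLoopSketch {
    private static final Logger LOG = Logger.getLogger("ListenLoopSketch");

    // Hypothetical stand-ins for the message classes named in this report.
    public enum MsgType { SERVER_START, REPL_SERVER_START, SERVER_START_ECL, OTHER }

    // True for the three start messages the listener expects first.
    public static boolean isStartMsg(MsgType type) {
        return type == MsgType.SERVER_START
            || type == MsgType.REPL_SERVER_START
            || type == MsgType.SERVER_START_ECL;
    }

    // Each list element models the first message read on one accepted
    // connection. Returns how many connections were accepted as valid.
    public static int runLoop(List<MsgType> firstMessages) {
        int accepted = 0;
        for (MsgType msg : firstMessages) {
            if (!isStartMsg(msg)) {
                // Assumption: log loudly and keep listening, rather than
                // returning from the thread as described above.
                LOG.log(Level.SEVERE,
                    "Unexpected first message {0} on replication port; closing connection",
                    msg);
                continue;
            }
            accepted++;
        }
        return accepted;
    }
}
```

With this shape, a single bad connection costs one log line and one dropped socket, instead of taking the listener down for the lifetime of the process.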
We only determined this was a problem by comparing jstack output from a working server with that from a broken server. It is not yet clear what message is being received that causes the thread to return.
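As an aside, the same comparison can be made from inside the JVM rather than with jstack. The sketch below is an assumption-laden illustration, not something the server provides: `ListenerThreadCheck` and `hasLiveThread` are hypothetical names, and the thread name fragment to look for has to be taken from a jstack of a working server.

```java
public class ListenerThreadCheck {
    // Returns true if this JVM currently has a live thread whose name
    // contains the given fragment. The fragment is supplied by the caller;
    // no particular listener thread name is assumed here.
    public static boolean hasLiveThread(String nameFragment) {
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.isAlive() && t.getName().contains(nameFragment)) {
                return true;
            }
        }
        return false;
    }
}
```

A check like this would return false on a broken server once the listener thread has returned, which is the same difference the two jstacks showed.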