A customer reported excessive exceptions being logged when the websocket is reconnected, probably after the agent token has expired.
The agent will periodically try to gracefully reconnect its websockets to allow the AM cluster to distribute its notification workload. This is done by creating a new connection before closing the old, so that the connections overlap and no messages are lost.
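The overlap can be sketched as follows. This is a minimal illustration only, not the agent's actual code: FakeConnection, Agent and graceful_reconnect are hypothetical names, and the connection is a stand-in object rather than a real websocket.

```python
import itertools


class FakeConnection:
    """Hypothetical stand-in for a websocket connection."""

    def __init__(self, name):
        self.name = name
        self.open = True

    def close(self):
        self.open = False


class Agent:
    """Hypothetical agent holding one active notification connection."""

    def __init__(self, connect):
        self._connect = connect      # factory that returns a new open connection
        self.active = connect()

    def graceful_reconnect(self):
        # Open the replacement first, so that both connections are open
        # at the same time and no notification falls into a gap.
        replacement = self._connect()
        old, self.active = self.active, replacement
        # Close the old connection only once the new one is in place.
        old.close()
        return old


_ids = itertools.count(1)
agent = Agent(lambda: FakeConnection("conn-%d" % next(_ids)))
first = agent.active
retired = agent.graceful_reconnect()
```

After the call, agent.active is a new open connection and the one that was active before has been closed. The sketch does not cover what happens when the factory fails, which is the case this report is about.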
However, the agent was not handling failures caused by agent token expiry. It would log the connect error whenever there was activity on the websocket, but it would not try to reconnect until there was a lull in activity. If the websocket was busy for an extended period (i.e. the read request never hit its 4 second timeout) then the failure was logged constantly and no reconnect was attempted.
This can be verified by producing a constant stream of notifications and logging out the agent session. The failure to create a new websocket connection is reported constantly while notifications are received, until there is a lull in websocket activity (no notifications are sent for a period of about 4 seconds, which is configurable).
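The behaviour described above can be modelled with a small simulation. It is a simplified sketch under stated assumptions, not the agent's implementation: simulate and the event names "msg" and "timeout" are hypothetical, each "msg" stands for a notification read while the token is expired, and "timeout" stands for a read that timed out (a lull). It also assumes that the reconnect performed on a lull succeeds.

```python
def simulate(events):
    """Model the faulty behaviour for an agent whose token has expired.

    events: iterable of "msg" (a notification was read) or
            "timeout" (the read timed out, i.e. a lull in activity).
    Returns (errors_logged, reconnected).
    """
    errors_logged = 0
    for event in events:
        if event == "msg":
            # Activity on the websocket: the graceful connect is attempted,
            # fails because the token has expired, and the error is logged.
            # No recovery is attempted at this point.
            errors_logged += 1
        elif event == "timeout":
            # Lull: only now does the agent try to reconnect
            # (assumed here to succeed).
            return errors_logged, True
    return errors_logged, False


busy = simulate(["msg"] * 1000)
with_lull = simulate(["msg"] * 3 + ["timeout"] + ["msg"] * 5)
```

With a constant stream of 1000 notifications, busy is (1000, False): one logged error per notification and no reconnect. With a lull after three notifications, with_lull is (3, True): logging stops as soon as the read times out.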
The result of this is fairly dire: the websocket cannot reconnect while there is a constant stream of notifications, and an error message is logged whenever a notification arrives, potentially at a high rate. The rate can be high if an AM site is handling activity from multiple agents, since AM still sends all notifications to each agent. The customer reports that disk space runs out over a long enough period. It is likely that their agent sessions expire because they have been given a limited lifetime in order to prevent the accumulation of agent sessions in CTS.