While running performance tests to try out the new streaming support included in CHF for this release, we quickly (at 50 concurrent users) ran into a thread starvation situation: the client (Gatling in our case, but any user agent would behave the same way) ends up throwing a timeout exception because the connection is stalled.
Here is the scenario in which we discovered this: IG master with the latest Async Http Client (AHC), and 50 concurrent users each receiving a 5MB response per request.
AHC has a thread pool (configured with numberOfWorkers, which defaults to the number of CPUs, so 8 in my scenario) that is dedicated to receiving network data. Once all of a response's headers have been received, we hand off the promise completion to another thread pool so as not to block the worker pool. While the (still incomplete) response is being processed through the chain of promise callbacks, the worker threads continue to place any received data (response content) into the stream.
The other thread pool (the same size as the worker thread pool) is responsible for executing the promise's callbacks, so it eventually ends up executing the final then() in the HttpFrameworkServlet, which reads the entity content and copies it into the ServletOutputStream.
Reading from and writing to the stream are both synchronous, blocking operations.
What happens is the following:
- The 8 hand-off threads are blocked in the Servlet code that reads the response stream, because the stream is empty (not finished, just no more bytes to read at the moment)
- The 8 worker threads are blocked trying to write into a different set of response streams (***Stream.awaitSpace()), because those buffers are full
- Basically, these 2 sets of threads are waiting for each other:
  - worker threads are waiting for hand-off threads to consume some data, in order to free some bytes for the current write operation
  - hand-off threads are waiting for worker threads to write some data into the buffer
- This situation is unlikely to happen for small responses:
  - the hand-off thread consumes the data quickly
  - the hand-off thread is then available to handle the next response
- The bigger the responses are, the longer it takes to read the content, and the longer each hand-off thread stays busy
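To illustrate the circular wait described above, here is a minimal, self-contained Java sketch. It is not IG or AHC code: the class name StarvationDemo, the run() method and its parameters are made up for the illustration, and an ArrayBlockingQueue stands in for the bounded response stream. The submission order is deliberately forced so that the hand-off threads start on responses that no worker thread is currently writing, which makes the stall deterministic rather than load-dependent as it is in the real scenario.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical model of the two pools; names and sizes are assumptions.
public class StarvationDemo {

    /** Returns true if every response was fully read before the timeout. */
    static boolean run(int threadsPerPool, int responses, int chunksPerResponse,
                       int bufferCapacity, long timeoutMs) {
        ExecutorService workers = Executors.newFixedThreadPool(threadsPerPool);
        ExecutorService handOff = Executors.newFixedThreadPool(threadsPerPool);
        // One bounded buffer per response, standing in for the response stream.
        List<BlockingQueue<Integer>> buffers = new ArrayList<>();
        for (int i = 0; i < responses; i++) {
            buffers.add(new ArrayBlockingQueue<>(bufferCapacity));
        }
        CountDownLatch fullyRead = new CountDownLatch(responses);
        try {
            for (int i = 0; i < responses; i++) {
                BlockingQueue<Integer> written = buffers.get(i);
                // Readers are queued in reverse order, so the first readers to
                // run wait on buffers whose writers are still queued.
                BlockingQueue<Integer> read = buffers.get(responses - 1 - i);
                workers.execute(() -> {
                    try {
                        for (int c = 0; c < chunksPerResponse; c++) {
                            written.put(c); // blocks while the buffer is full
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
                handOff.execute(() -> {
                    try {
                        for (int c = 0; c < chunksPerResponse; c++) {
                            read.take(); // blocks while the buffer is empty
                        }
                        fullyRead.countDown();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            return fullyRead.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            // Interrupt any thread still blocked so the JVM can exit.
            workers.shutdownNow();
            handOff.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // 2 threads per pool, 4 large responses: both pools end up blocked.
        System.out.println("2 threads, large: " + run(2, 4, 10, 2, 500));
        // Same responses with 4 threads per pool.
        System.out.println("4 threads, large: " + run(4, 4, 10, 2, 5000));
        // 2 threads per pool, responses that fit entirely in the buffer.
        System.out.println("2 threads, small: " + run(2, 4, 2, 2, 5000));
    }
}
```

In this toy model the first call stalls until the timeout, while the other two complete: with more threads every writer has a matching reader running, and with responses that fit in the buffer the writers never block. The real thresholds in IG depend on buffer sizes and on how connections are spread across worker threads, which this sketch does not attempt to reproduce.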
When this situation happens, there is not much we can do: in IG we would need to restart the route that declares the blocked ClientHandler (if it is declared in config/config.json, that basically means restarting the server).
- Increasing the number of threads removes the deadlock
- For the 50 users / 5MB scenario to pass without deadlock, I went from 8 to 16 workers (and therefore from 8 to 16 hand-off threads as well)
- I can run different scenarios to see if I can come up with some sizing rules