Files
hermes-agent/tools
ErnestHysa eb9bfd3924 fix(T5): replace time.sleep(0.25) with asyncio.sleep in MCP auth reconnect poll
PAIN BEFORE:
Inside _handle_auth_error_and_retry() (a sync function that runs on the MCP
event loop thread), there was a blocking polling loop:

    while time.monotonic() < deadline:
        if srv.session is not None and srv._ready.is_set():
            break
        time.sleep(0.25)   # BLOCKS THE ENTIRE EVENT LOOP

Since _handle_auth_error_and_retry is invoked from tool handlers that run ON
the MCP event loop, time.sleep(0.25) blocked ALL concurrent MCP operations
(including other tools, keepalive heartbeats, OAuth refreshes) for 250ms per
iteration. With a 15-second deadline, worst case = 60 * 250ms = 15 seconds
of fully blocked concurrency.

WHAT WAS FIXED:
Extracted the blocking poll into an async helper _await_ready() that uses
asyncio.sleep(0.25) (non-blocking), and runs it via _run_on_mcp_loop().
_run_on_mcp_loop() properly awaits the coroutine on the event loop without
blocking the caller's thread. Added exception handling around the poll so
stuck reconnects still fall through to the error path.

The sync _handle_auth_error_and_retry now:
1. Fires reconnect signal (threadsafe)
2. Calls _run_on_mcp_loop(_await_ready(), timeout=15) — non-blocking
3. Returns; the event loop handles the polling

File: tools/mcp_tool.py
Lines: _handle_auth_error_and_retry() (~1886-1920)

Found by: exhaustive multi-pass audit (10 strategies, 1901 files, 913K lines)
2026-05-31 00:50:19 -07:00
..