Files
hermes-agent/gateway
Fearvox 4b06c98fe4 fix(gateway): close ResponseStore + dispose unowned adapter on reconnect failure
Three separate code paths in the gateway's platform reconnect loop
leaked file descriptors every retry, exhausting the default 2560-fd
ulimit in ~12 hours of continuous failure and turning the gateway
into a zombie that raises OSError: [Errno 24] on every open() (#37011).

Root cause:
  * APIServerAdapter.__init__ opens a ResponseStore SQLite connection
    that holds 2 fds (db file + WAL sidecar).
  * APIServerAdapter.disconnect() previously only stopped the aiohttp
    web server — the ResponseStore connection was never closed.
  * The reconnect watcher in _platform_reconnect_watcher constructs a
    fresh adapter on every retry attempt. When the connect call fails
    (3 paths: non-retryable error, retryable error, exception during
    connect) the adapter is dropped without ever being installed on
    self.adapters, so nothing else calls its disconnect(). Result: the
    2 ResponseStore fds stay open until GC sweeps the unreachable
    object, which Python's cyclic GC does not do promptly for
    asyncio-bound native handles.

  2 fds × 1 retry × (3600s / 300s backoff cap) ≈ 12 fds/hour.
  2560 fds / 12 fds/hr ≈ 12h to ulimit exhaustion.

Fix:

  * APIServerAdapter.disconnect() now also calls
    self._response_store.close() (with a try/except so a SQLite
    close failure doesn't abort the aiohttp teardown).
  * New module-level helper _dispose_unused_adapter(adapter) in
    gateway/run.py that calls adapter.disconnect() and swallows
    any exception (so half-constructed adapters whose __init__
    crashed don't kill the watcher loop).
  * _platform_reconnect_watcher calls _dispose_unused_adapter() in
    all three failure paths: non-retryable, retryable, and the
    except Exception arm. adapter = None is initialized
    before the try so the except arm can see the partial
    construction.

Tests:

  * New file tests/gateway/test_platform_reconnect_fd_leak.py with
    7 regression tests covering all three failure paths, the
    _dispose_unused_adapter helper (None + raising-disconnect cases),
    and the APIServerAdapter ResponseStore close behavior (success +
    close-exception cases). The _CountingAdapter fixture tracks
    disconnect() invocations and an _open_fds counter that is
    decremented on dispose, so the assertion is the literal
    observable behavior of the leak.

Refs:
  - Closes #37011 (the original fd-leak report)
  - Supersedes #37018, #37110, #37238, #37260, #37394 (7 competing
    open PRs all addressing the same root cause from different angles;
    none of them rebased cleanly against current main, and none
    covered all three failure paths in one fix with regression tests
    for both the watcher and the platform-level close behavior)
2026-06-02 17:27:44 -07:00
..