polish(gateway): address Copilot review comments on fd-leak fix

Seven Copilot inline review comments on #37679, four worth landing
in a polish pass before merge:

1. _dispose_unused_adapter signature: 'BasePlatformAdapter' ->
   'BasePlatformAdapter | None'. The function explicitly handles
   None and the reconnect watcher calls it with None in the
   except arm, so the annotation now matches the actual contract.

2. (duplicate of #1 on a different line) — same fix.

3. except Exception in _dispose_unused_adapter — the reviewer
   asked about asyncio.CancelledError swallowing. On Python 3.8+
   (Hermes requires 3.13, see pyproject.toml), CancelledError
   inherits from BaseException, NOT Exception, so the existing
   'except Exception' does NOT swallow task cancellation. Added
   an explicit comment explaining the contract so future readers
   don't repeat the analysis. We don't re-raise because the
   watcher loop intentionally treats dispose failures as
   best-effort: a failed dispose on an unowned adapter should not
   take down the watcher that's keeping the gateway alive.

4. _response_store = None after close in api_server.py — the
   reviewer flagged this for idempotency. Decided to keep the
   non-None state intentionally: setting it to None cascades
   to ~9 callers that access self._response_store without a
   None check, and 'close() is idempotent on a closed sqlite3
   Connection' means the current code is already safe. The
   type stays stable; LSP doesn't flag a cascade of
   reportOptionalMemberAccess errors. (This matches the
   pre-existing pattern in the codebase — e.g.
   _mark_disconnected doesn't reset state to None either.)

5. _build_adapter_with_store: reviewer worried about
   disconnect() failing on the self.name property if
   __init__ wasn't called. Already handled: we set
   'adapter.platform = Platform.API_SERVER' so the
   'self.platform.value.title()' property returns
   'Api_Server' without raising. The exception-swallowing
   branch in disconnect() does call self.name via the
   logger.debug format, so this is a real path that needs
   the platform attribute, and we have it.

6. test_disconnect_closes_response_store: bare 'pytest.raises(Exception)'
   -> 'pytest.raises(sqlite3.ProgrammingError)'. The bare
   Exception matcher would silently accept AttributeError,
   OperationalError, env-related issues, etc. The specific
   exception type ('Cannot operate on a closed database') is
   the actual signal we want — proves the SQLite conn is
   closed, not just that *something* raised.

7. test_nonretryable_failure_disposes_unowned_adapter:
   assertion tightened from '>= 1' to '== 1' on
   adapter._disconnect_calls. The docstring said 'exactly once',
   the assertion now matches. Catches the hypothetical
   'watcher disposes the same adapter twice' regression that
   '>=' would have missed.
This commit is contained in:
Fearvox
2026-06-02 15:37:24 -07:00
committed by Teknium
parent 7982560845
commit 01eaba7061
2 changed files with 34 additions and 7 deletions

View File

@ -1755,7 +1755,7 @@ def _preserve_queued_followup_history_offset(
return merged return merged
async def _dispose_unused_adapter(adapter: "BasePlatformAdapter") -> None: async def _dispose_unused_adapter(adapter: "BasePlatformAdapter | None") -> None:
"""Best-effort dispose for an adapter that never made it onto ``self.adapters``. """Best-effort dispose for an adapter that never made it onto ``self.adapters``.
The reconnect watcher in ``GatewayRunner._platform_reconnect_watcher`` The reconnect watcher in ``GatewayRunner._platform_reconnect_watcher``
@ -1778,6 +1778,11 @@ async def _dispose_unused_adapter(adapter: "BasePlatformAdapter") -> None:
failure paths in the reconnect watcher can all call it without failure paths in the reconnect watcher can all call it without
each one having to know that ``disconnect()`` may itself raise each one having to know that ``disconnect()`` may itself raise
on a half-constructed adapter. on a half-constructed adapter.
``adapter`` may be ``None``: the reconnect watcher initialises
``adapter = None`` before the ``try`` so the ``except Exception``
arm can dispose a half-constructed object, and also early-returns
here when ``_create_adapter()`` returned ``None``.
""" """
if adapter is None: if adapter is None:
return return
@ -1788,6 +1793,15 @@ async def _dispose_unused_adapter(adapter: "BasePlatformAdapter") -> None:
# crashed during aiohttp app setup) can raise from # crashed during aiohttp app setup) can raise from
# disconnect() on objects that never finished initializing. # disconnect() on objects that never finished initializing.
# We must not let that escape and abort the watcher loop. # We must not let that escape and abort the watcher loop.
#
# On Python 3.8+, ``asyncio.CancelledError`` inherits from
# ``BaseException`` (not ``Exception``), so this ``except
# Exception`` does not swallow task cancellation. We don't
# re-raise explicitly because the watcher loop intentionally
# treats dispose failures as best-effort: a failed ``disconnect``
# call should not take down the reconnect watcher that
# itself is what's keeping the gateway alive during a partial
# outage.
logger.debug( logger.debug(
"Adapter dispose raised on unowned adapter %r", "Adapter dispose raised on unowned adapter %r",
getattr(adapter, "name", type(adapter).__name__), getattr(adapter, "name", type(adapter).__name__),

View File

@ -165,11 +165,18 @@ class TestReconnectFDLeakRegression:
new=AsyncMock(return_value=False)): new=AsyncMock(return_value=False)):
await _run_watcher_one_iteration(runner) await _run_watcher_one_iteration(runner)
assert adapter._disconnect_calls >= 1, ( # The intent of this test is "the watcher calls disconnect()
# exactly once on the unowned adapter" — not "at least once".
# An accidental double-dispose would be a new bug to catch
# (e.g. the watcher's two failure paths both calling dispose
# for the same adapter instance). Tighten to == 1.
assert adapter._disconnect_calls == 1, (
f"non-retryable reconnect failure must call adapter.disconnect() " f"non-retryable reconnect failure must call adapter.disconnect() "
f"at least once; got {adapter._disconnect_calls} calls. " f"exactly once; got {adapter._disconnect_calls} calls. "
"Without it, 2 fds leak per retry at the 300s backoff cap " "Without it, 2 fds leak per retry at the 300s backoff cap "
"(#37011)." "(#37011). More than one call would also be a bug — the "
"adapter has already been disposed once, a second call is "
"wasted work and may itself raise."
) )
assert adapter._open_fds == 0, ( assert adapter._open_fds == 0, (
f"adapter fds not released after disconnect(); " f"adapter fds not released after disconnect(); "
@ -303,15 +310,21 @@ class TestAPIServerDisconnectClosesResponseStore:
to ``~/.hermes/response_store.db`` (or :memory: as a fallback), to ``~/.hermes/response_store.db`` (or :memory: as a fallback),
which is exactly the resource that was leaking pre-fix. which is exactly the resource that was leaking pre-fix.
""" """
import sqlite3
store = ResponseStore(max_size=10, db_path=str(tmp_path / "rs.db")) store = ResponseStore(max_size=10, db_path=str(tmp_path / "rs.db"))
adapter = self._build_adapter_with_store(store) adapter = self._build_adapter_with_store(store)
await adapter.disconnect() await adapter.disconnect()
# Post-disconnect, the underlying sqlite3 conn should be closed. # Post-disconnect, the underlying sqlite3 conn should be closed.
# Any further query raises ``ProgrammingError: Cannot operate # sqlite3 raises ``ProgrammingError: Cannot operate on a closed
# on a closed database``. # database`` for any further operation. We assert on the
with pytest.raises(Exception): # specific exception type (not bare ``Exception``) so the test
# only passes when the close actually took effect — a generic
# ``Exception`` catcher would mask unrelated failures (env
# issues, AttributeError, etc.).
with pytest.raises(sqlite3.ProgrammingError):
store._conn.execute("SELECT 1").fetchone() store._conn.execute("SELECT 1").fetchone()
@pytest.mark.asyncio @pytest.mark.asyncio