test: use subprocesses for each test file (#29016)

* ci(tests): install ripgrep from prebuilt tarball instead of apt

  apt-get update + install of ripgrep takes ~4 min on the GHA Ubuntu runners (the apt-get update against archive.ubuntu.com is the slow part; ripgrep itself is small). Switching to the upstream musl binary tarball cuts the step to a few seconds.

  - Pinned to ripgrep 15.1.0 with sha256 verification (same hash as published in the releases sha256 sidecar file).
  - Drops the `rg` binary into /usr/local/bin so it is on PATH for every subsequent step without GITHUB_PATH manipulation.
  - Applied to both the test and e2e jobs in tests.yml.

* fix(cli): compile syntax check to tempdir, not source __pycache__

  `_validate_critical_files_syntax` runs `py_compile.compile()` on each critical bootstrap file after a successful `git pull`. The default `py_compile` writes the resulting `.pyc` next to the source under `__pycache__/`, which causes two real problems:

  1. Parallel test workers walking the same source tree (e.g. running the suite under per-file process isolation) can race against each other on the `__pycache__` write, which manifests as flaky 'directory not empty' errors during teardown.
  2. In production, the post-pull syntax check leaves a `.pyc` behind that the next interpreter run might pick up: fine when the interpreter version matches, sketchy if it doesn't.

  Fix: write the compiled output to a `tempfile.TemporaryDirectory()` that's discarded on function exit. We only care about the compile-or-not signal, not the artifact.

* test(runner): per-file process isolation, drop manual state reset + xdist

  Replace fragile manual _reset_module_state test fixtures with robust per-file subprocess isolation. Each test file runs in a fresh `python -m pytest <file>` subprocess via ThreadPoolExecutor. No xdist, no custom pytest plugin, no shared worker state.

  Key changes:

  - scripts/run_tests_parallel.py: new runner. Discovers test files, runs N in parallel via ThreadPoolExecutor, captures stdout per file, treats exit code 5 (no tests collected) as pass, kills all children on exit.
    Change from cpu_count to cpu_count*2. The runner is I/O-bound (waiting on subprocess.communicate() from pytest children). The parent process does almost no CPU work, so 2x oversubscription keeps more pipes full.
    When a file fails, immediately show the last 30 lines of pytest output (stack traces + FAILED summary) plus a ready-to-copy repro command:
      python -m pytest tests/agent/test_auxiliary_client.py
  - scripts/run_tests.sh: delegates to run_tests_parallel.py
  - .github/workflows/tests.yml: test step is now python scripts/run_tests_parallel.py
  - pyproject.toml: drop pytest-xdist, pytest-split; simplify addopts
  - tests/conftest.py: remove ~200 lines of manual state-reset fixtures
  - AGENTS.md: update Testing section for per-file design

* test(runner): speed gateway test antipattern scan up

* fix(test): web search provider plugin test missing xai

* fix(tests): make 14 test files pass under per-file subprocess isolation

  Tests that relied on cross-file state pollution from xdist workers fail when run in isolation (per-file subprocess model).

  Root causes and fixes:

  Tool registry not populated:
  - test_video_generation_tool_surface_matrix: add discover_builtin_tools()
  - test_web_providers_brave_free/ddgs/searxng/general: autouse fixtures registering all 8 bundled web providers, reset after each test
  - test_website_policy: same provider registration pattern
  - test_web_tools_tavily: same pattern across 3 dispatch test classes
  - Also add is_safe_url/check_website_access mocks where SSRF check blocks example.com (DNS resolution fails in isolated envs)

  Stale check_fn cache:
  - test_kanban_tools: invalidate_check_fn_cache() + _clear_tool_defs_cache() in both kanban guidance tests (prior test cached False for kanban_show)
  - test_discord_tool: cache invalidation in setup/teardown
  - test_homeassistant_tool: invalidate_check_fn_cache() before registry queries

  Module-level state pollution:
  - test_auxiliary_client: autouse fixture clearing _aux_unhealthy_until cache
  - test_skill_commands: set_session_vars() instead of patch.dict(os.environ) (ContextVar takes precedence over os.environ)
  - test_dm_topics: overwrite sys.modules + separate telegram.constants mock + force-reimport of gateway.platforms.telegram
  - test_terminal_tool_requirements: removed duplicate class declaration, autouse _clear_caches fixture

* change(tests): run_tests.sh explicitly includes env vars

  instead of manually dropping some vars, now we just only include some

* fix(tests): 5 more isolation/NixOS fixes

  - test_approval_plugin_hooks: isolate HERMES_HOME so real user's command_allowlist doesn't short-circuit the approval path
  - test_google_chat: skipif when Platform.GOOGLE_CHAT not in enum (feature not merged on this branch)
  - test_write_deny: test systemd prefix against tmp_path instead of /etc/systemd which resolves to /nix/store on NixOS
  - test_pty_bridge: use shutil.which('cat') instead of /bin/cat (doesn't exist on NixOS)
  - profiles.py: rmtree onexc handler chmod's parent dirs too, fixing profile deletion when copytree preserved read-only modes from nix store

* fix(tests): clear unhealthy cache in autouse fixture for auxiliary_client

* fix(tests): skip send_message when telegram not installed; handle missing worker_id in browser_supervisor

* fix: py3.11 rmtree onexc compat + belt-and-suspenders unhealthy cache clear for expired codex test

* fix: address PR #29016 review feedback

  - Remove tracked .pytest-cache/ artifact and add to .gitignore
  - Fix stale 'xdist worker' comment in conftest.py
  - Deduplicate web provider registration into tests/tools/conftest.py shared helper (register_all_web_providers), replacing 8 copy-pasted blocks across 6 test files
  - Update PR description: remove stale recovered-test-files claim, fix worker count to match code (cpu_count*2)

* fix: eliminate race in stale-cache achievements test

  The background scan thread could complete and overwrite _SNAPSHOT_CACHE before evaluate_all() returned the stale data: only 10 fake sessions made the scan finish instantly. Added scan_delay param to _FakeSessionDB and set it to 2s in the stale-cache test so the background thread can't win the race.
2026-05-21 07:10:04 -04:00
parent 87d9239009
commit 48be2e0e4d
35 changed files with 1694 additions and 582 deletions
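
The per-file model described in the commit message can be sketched roughly as follows. This is a minimal illustration and not the contents of `scripts/run_tests_parallel.py`: the names `run_one` and `run_all` are hypothetical, and commands are passed in explicitly so the sketch does not depend on pytest being installed. The one pytest-specific fact it relies on is that pytest exits with code 5 when no tests were collected.

```python
"""Sketch: run one subprocess per work item with bounded parallelism."""
from __future__ import annotations

import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# pytest exits with 5 when no tests were collected; the commit treats that as a pass.
NO_TESTS_COLLECTED = 5


def run_one(cmd: list[str], timeout: float = 600.0) -> tuple[bool, str]:
    """Run one command; return (passed, last 30 lines of combined output)."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout:.0f}s"
    passed = proc.returncode in (0, NO_TESTS_COLLECTED)
    tail = "\n".join((proc.stdout + proc.stderr).splitlines()[-30:])
    return passed, tail


def run_all(cmds: list[list[str]], workers: int) -> list[tuple[bool, str]]:
    """Run every command in its own process, at most `workers` at a time.

    The threads only wait on child processes, so the GIL is not a bottleneck.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, cmds))  # results keep input order


def pytest_cmd(test_file: str) -> list[str]:
    """Hypothetical helper: the per-file command, which doubles as the repro line."""
    return [sys.executable, "-m", "pytest", test_file]
```

Under these assumptions, a full run would be `run_all([pytest_cmd(f) for f in files], workers=N)`, with the tail of any failing entry printed next to its repro command.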
--- a/.github/workflows/tests.yml
+++ b/.github/workflows/tests.yml
@@ -23,13 +23,24 @@ concurrency:
 jobs:
  test:
    runs-on: ubuntu-latest
-    timeout-minutes: 30
+    timeout-minutes: 60
    steps:
      - name: Checkout code
        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd  # v6.0.2
-      - name: Install system dependencies
+      - name: Install ripgrep (prebuilt binary)
-        run: sudo apt-get update && sudo apt-get install -y ripgrep
+        run: |
          set -euo pipefail
          RG_VERSION=15.1.0
          RG_SHA256=1c9297be4a084eea7ecaedf93eb03d058d6faae29bbc57ecdaf5063921491599
          RG_TARBALL=ripgrep-${RG_VERSION}-x86_64-unknown-linux-musl.tar.gz
          curl -sSfL -o "$RG_TARBALL" \
            "https://github.com/BurntSushi/ripgrep/releases/download/${RG_VERSION}/${RG_TARBALL}"
          echo "${RG_SHA256}  ${RG_TARBALL}" | sha256sum -c -
          tar -xzf "$RG_TARBALL"
          sudo mv "ripgrep-${RG_VERSION}-x86_64-unknown-linux-musl/rg" /usr/local/bin/rg
          rm -rf "$RG_TARBALL" "ripgrep-${RG_VERSION}-x86_64-unknown-linux-musl"
          rg --version
      - name: Install uv
        uses: astral-sh/setup-uv@d4b2f3b6ecc6e67c4457f6d3e41ec42d3d0fcb86  # v5
@@ -44,9 +55,26 @@ jobs:
          uv pip install -e ".[all,dev]"
      - name: Run tests
        # Per-file isolation via scripts/run_tests_parallel.py: discovers
        # every test_*.py file under tests/ (excluding integration/ + e2e/),
        # then runs `python -m pytest <file>` in a freshly-spawned subprocess
        # with bounded parallelism. No xdist, no shared workers, no
        # module-level state leakage between files.
        #
        # Why per-file (not per-test): per-test spawn cost (~250ms × 17k
        # tests = 70min CPU minimum) blew the wall-clock budget. Per-file
        # spawn (~250ms × ~850 files = ~3.5min) fits while still giving
        # every file a fresh interpreter — the only isolation boundary
        # that matters in practice (cross-file leakage was the original
        # flake source; intra-file is the test author's responsibility).
        #
        # Why drop xdist entirely: xdist's persistent workers accumulate
        # state across files, which is exactly the leakage we wanted to
        # fix. ThreadPoolExecutor + subprocess.run is ~60 lines and does
        # the job with cleaner semantics.
        run: |
          source .venv/bin/activate
-          python -m pytest tests/ -q --ignore=tests/integration --ignore=tests/e2e --tb=short -n auto --timeout=30 --timeout-method=signal
+          python scripts/run_tests_parallel.py
        env:
          # Ensure tests don't accidentally call real APIs
          OPENROUTER_API_KEY: ""
@@ -60,8 +88,19 @@ jobs:
      - name: Checkout code
        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd  # v6.0.2
-      - name: Install system dependencies
+      - name: Install ripgrep (prebuilt binary)
-        run: sudo apt-get update && sudo apt-get install -y ripgrep
+        run: |
          set -euo pipefail
          RG_VERSION=15.1.0
          RG_SHA256=1c9297be4a084eea7ecaedf93eb03d058d6faae29bbc57ecdaf5063921491599
          RG_TARBALL=ripgrep-${RG_VERSION}-x86_64-unknown-linux-musl.tar.gz
          curl -sSfL -o "$RG_TARBALL" \
            "https://github.com/BurntSushi/ripgrep/releases/download/${RG_VERSION}/${RG_TARBALL}"
          echo "${RG_SHA256}  ${RG_TARBALL}" | sha256sum -c -
          tar -xzf "$RG_TARBALL"
          sudo mv "ripgrep-${RG_VERSION}-x86_64-unknown-linux-musl/rg" /usr/local/bin/rg
          rm -rf "$RG_TARBALL" "ripgrep-${RG_VERSION}-x86_64-unknown-linux-musl"
          rg --version
      - name: Install uv
        uses: astral-sh/setup-uv@d4b2f3b6ecc6e67c4457f6d3e41ec42d3d0fcb86  # v5
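
The `sha256sum -c -` check in the ripgrep step pins the download to a known digest. The same idea can be sketched in Python as below. The function names `sha256_of` and `verify_download` are hypothetical and not part of this commit; the workflow itself does this in shell.

```python
"""Sketch: verify a downloaded file against a pinned SHA-256 digest."""
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash the file in chunks so large tarballs never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_hex):
    """Raise ValueError if the file's digest differs from the pinned one."""
    actual = sha256_of(path)
    if actual != expected_hex.lower():
        raise ValueError(
            f"sha256 mismatch for {path}: expected {expected_hex}, got {actual}"
        )
```

Failing closed (raising rather than warning) mirrors `set -euo pipefail` in the workflow step: a mismatched tarball stops the job before the binary is installed.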
--- a/.gitignore
+++ b/.gitignore
@@ -18,6 +18,7 @@ __pycache__/web_tools.cpython-310.pyc
 logs/
 data/
 .pytest_cache/
 .pytest-cache/
 tmp/
 temp_vision_images/
 hermes-*/*
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -1013,17 +1013,39 @@ def profile_env(tmp_path, monkeypatch):
 **ALWAYS use `scripts/run_tests.sh`** — do not call `pytest` directly. The script enforces
 hermetic environment parity with CI (unset credential vars, TZ=UTC, LANG=C.UTF-8,
-4 xdist workers matching GHA ubuntu-latest). Direct `pytest` on a 16+ core
+`-n auto` xdist workers, in-tree subprocess-isolation plugin). Direct `pytest`
-developer machine with API keys set diverges from CI in ways that have caused
+on a 16+ core developer machine with API keys set diverges from CI in ways
-multiple "works locally, fails in CI" incidents (and the reverse).
+that have caused multiple "works locally, fails in CI" incidents (and the reverse).
 ```bash
 scripts/run_tests.sh                                  # full suite, CI-parity
 scripts/run_tests.sh tests/gateway/                   # one directory
 scripts/run_tests.sh tests/agent/test_foo.py::test_x  # one test
 scripts/run_tests.sh -v --tb=long                     # pass-through pytest flags
 scripts/run_tests.sh --no-isolate tests/foo/          # disable subprocess isolation (faster, for debugging)
 ```
 ### Subprocess-per-test isolation
 Every test runs in a freshly-spawned Python subprocess via the in-tree plugin
 at `tests/_isolate_plugin.py`. This means module-level dicts/sets and
 ContextVars from one test cannot leak into the next — the historic
 `_reset_module_state` autouse fixture is gone.
 Implementation notes:
 - The plugin uses `multiprocessing.get_context("spawn")`, which works on
  Linux, macOS, and Windows alike (POSIX `fork` is not used).
 - Per-test overhead is ~0.5–1.0s (Python startup + pytest collection). xdist
  parallelism amortizes this across cores; on a 20-core box the full suite
  finishes in roughly the same wall time as before, but flake-free.
 - `isolate_timeout` (configured in `pyproject.toml`) caps each test at 30s.
  Hangs are killed and surfaced as a failure report.
 - Pass `--no-isolate` to disable isolation — useful when debugging a single
  test interactively, or when you specifically want to verify state leakage.
 - The plugin disables itself in child processes (sentinel envvar
  `HERMES_ISOLATE_CHILD=1`), so there's no fork-bomb risk.
 ### Why the wrapper (and why the old "just call pytest" doesn't work)
 Five real sources of local-vs-CI drift the script closes:
@@ -1034,7 +1056,7 @@ Five real sources of local-vs-CI drift the script closes:
 | HOME / `~/.hermes/` | Your real config+auth.json | Temp dir per test |
 | Timezone | Local TZ (PDT etc.) | UTC |
 | Locale | Whatever is set | C.UTF-8 |
-| xdist workers | `-n auto` = all cores (20+ on a workstation) | `-n 4` matching CI |
+| xdist workers | `-n auto` = all cores | `-n auto` (safe — subprocess isolation prevents cross-worker flakes) |
 `tests/conftest.py` also enforces points 1-4 as an autouse fixture so ANY pytest
 invocation (including IDE integrations) gets hermetic behavior — but the wrapper
@@ -1042,15 +1064,21 @@ is belt-and-suspenders.
 ### Running without the wrapper (only if you must)
-If you can't use the wrapper (e.g. on Windows or inside an IDE that shells
+If you can't use the wrapper (e.g. inside an IDE that shells pytest directly),
-pytest directly), at minimum activate the venv and pass `-n 4`:
+at minimum activate the venv. The isolation plugin loads automatically from
 `addopts` in `pyproject.toml`, so you get the same per-test process isolation
 either way.
 ```bash
 source .venv/bin/activate   # or: source venv/bin/activate
-python -m pytest tests/ -q -n 4
+python -m pytest tests/ -q
 ```
-Worker count above 4 will surface test-ordering flakes that CI never sees.
+If you need to bypass isolation for fast feedback while debugging:
 ```bash
 python -m pytest tests/agent/test_foo.py -q --no-isolate
 ```
 Always run the full suite before pushing changes.
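
The hermetic environment that the wrapper enforces can be approximated from Python with an allowlist, in the spirit of the `env -i` invocation in `scripts/run_tests.sh`. This is a sketch under assumptions: `hermetic_env` is a hypothetical name, and the passthrough set shown here (`PATH`, `HOME`) is only the minimum visible in this diff.

```python
"""Sketch: build a child-process environment from an allowlist."""
import os

# Variables copied from the parent when present (assumed minimal set).
_PASSTHROUGH = ("PATH", "HOME")

# Values pinned for deterministic runs, matching the wrapper's settings.
_PINNED = {
    "TZ": "UTC",
    "LANG": "C.UTF-8",
    "LC_ALL": "C.UTF-8",
    "PYTHONHASHSEED": "0",
}


def hermetic_env(base=None):
    """Return a fresh env dict; anything not allowlisted is dropped."""
    source = os.environ if base is None else base
    env = {name: source[name] for name in _PASSTHROUGH if name in source}
    env.update(_PINNED)  # pinned values win over whatever the parent had
    return env
```

Passing the result as `env=` to `subprocess.run` means a credential variable can only reach the tests if someone adds it to the allowlist explicitly, which is the design choice the commit message describes.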
--- a/hermes_cli/main.py
+++ b/hermes_cli/main.py
@@ -6086,24 +6086,36 @@ def _validate_critical_files_syntax(root) -> tuple[bool, str | None, str | None]
    them after a successful ``git pull`` so we can auto-roll-back instead of
    leaving the user with a bricked install.
    The compiled ``.pyc`` is written to a temp directory rather than the
    source tree's ``__pycache__/`` so we don't race with concurrent test
    workers that walk the same dir, and so we don't leave a stale pyc
    behind in production if the next interpreter run picks a different
    Python version. The pyc is discarded on function return either way —
    we only care about the compile-or-not signal.
    Returns ``(ok, failing_path, error_message)``. ``ok=True`` means every
    file parsed cleanly.
    """
    import py_compile
    import tempfile
    root = Path(root)
-    for relpath in _UPDATE_CRITICAL_FILES:
+    with tempfile.TemporaryDirectory(prefix="hermes-syntax-check-") as tmpdir:
-        path = root / relpath
+        for relpath in _UPDATE_CRITICAL_FILES:
-        if not path.exists():
+            path = root / relpath
-            # Missing file is suspicious but not necessarily fatal — a future
+            if not path.exists():
-            # refactor may legitimately remove one of these. Skip and move on.
+                # Missing file is suspicious but not necessarily fatal — a future
-            continue
+                # refactor may legitimately remove one of these. Skip and move on.
-        try:
+                continue
-            py_compile.compile(str(path), doraise=True)
+            # Mirror the relative path under the tmpdir so two different
-        except py_compile.PyCompileError as exc:
+            # files with the same basename don't collide on the cfile name.
-            return False, str(path), str(exc)
+            cfile = Path(tmpdir) / (relpath.replace("/", "__") + "c")
-        except OSError as exc:
+            try:
-            return False, str(path), f"could not read: {exc}"
+                py_compile.compile(str(path), cfile=str(cfile), doraise=True)
            except py_compile.PyCompileError as exc:
                return False, str(path), str(exc)
            except OSError as exc:
                return False, str(path), f"could not read: {exc}"
    return True, None, None
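
The technique in this hunk (compile for the pass/fail signal only, and keep the bytecode out of the source tree) can be reduced to a standalone sketch. `syntax_ok` is a hypothetical single-file helper, not the function changed above.

```python
"""Sketch: syntax-check a source file without writing into its __pycache__."""
from __future__ import annotations

import py_compile
import tempfile
from pathlib import Path


def syntax_ok(path: Path) -> tuple[bool, str | None]:
    """Return (ok, error_message) for one Python source file."""
    with tempfile.TemporaryDirectory(prefix="syntax-check-") as tmpdir:
        # An explicit cfile keeps the bytecode out of the source tree; the
        # directory and the .pyc inside it are removed when the block exits.
        cfile = Path(tmpdir) / "check.pyc"
        try:
            py_compile.compile(str(path), cfile=str(cfile), doraise=True)
        except py_compile.PyCompileError as exc:
            return False, str(exc)
        except OSError as exc:
            return False, f"could not read: {exc}"
    return True, None
```

Without `doraise=True`, `py_compile.compile` reports errors on stderr instead of raising, so the flag is what turns the call into a usable check.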
--- a/hermes_cli/profiles.py
+++ b/hermes_cli/profiles.py
@@ -902,7 +902,49 @@ def delete_profile(name: str, yes: bool = False) -> Path:
    # 4. Remove profile directory
    try:
-        shutil.rmtree(profile_dir)
+        def _make_writable(func, path, exc):
            """onexc/onerror handler: add +w on PermissionError so rmtree can proceed.
            Handles two cases on NixOS (and other systems with read-only
            copies from immutable stores):
            1. The path itself isn't writable (e.g. a file with mode 0444)
            2. The *parent* directory isn't writable (e.g. mode 0555)
            Compatible with both the ``onexc`` API (3.12+, receives an
            exception instance) and the ``onerror`` API (3.11-, receives
            ``sys.exc_info()`` tuple).
            """
            import stat as _stat
            import sys as _sys
            # Normalise the two callback signatures:
            #   onexc(func, path, exc_instance)   — 3.12+
            #   onerror(func, path, exc_info_tuple) — 3.11
            if isinstance(exc, tuple):
                exc = exc[1]  # exc_info → actual exception object
            if isinstance(exc, PermissionError):
                # Make the path writable
                try:
                    os.chmod(path, os.stat(path).st_mode | _stat.S_IWUSR)
                except OSError:
                    pass
                # Also make the parent writable (needed for unlink/rmdir)
                parent = os.path.dirname(path)
                if parent:
                    try:
                        os.chmod(parent, os.stat(parent).st_mode | _stat.S_IWUSR)
                    except OSError:
                        pass
                func(path)
            else:
                raise
        # ``onexc`` was added in 3.12; fall back to ``onerror`` on 3.11.
        try:
            shutil.rmtree(profile_dir, onexc=_make_writable)
        except TypeError:
            shutil.rmtree(profile_dir, onerror=_make_writable)
        print(f"✓ Removed {profile_dir}")
    except Exception as e:
        print(f"⚠ Could not remove {profile_dir}: {e}")
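
A standalone sketch of the same idea follows. It differs from the hunk above in one respect, which is an assumption on my part rather than the author's approach: it picks `onexc` or `onerror` with a version check instead of catching `TypeError`. The name `force_rmtree` is hypothetical.

```python
"""Sketch: rmtree that tolerates read-only files and read-only parent dirs."""
import os
import shutil
import stat
import sys


def _make_writable(func, path, exc):
    """Error handler usable as both onexc (3.12+) and onerror (3.11 and older)."""
    if isinstance(exc, tuple):  # onerror passes sys.exc_info()
        exc = exc[1]
    if not isinstance(exc, PermissionError):
        raise exc
    # Add owner-write on the path itself and on its parent directory;
    # unlink/rmdir need write permission on the containing directory.
    for target in (path, os.path.dirname(path)):
        if target:
            try:
                os.chmod(target, os.stat(target).st_mode | stat.S_IWUSR)
            except OSError:
                pass
    func(path)  # retry the failed operation (os.unlink or os.rmdir here)


def force_rmtree(path):
    """Remove a tree even if entries were copied with read-only modes."""
    if sys.version_info >= (3, 12):
        shutil.rmtree(path, onexc=_make_writable)
    else:
        shutil.rmtree(path, onerror=_make_writable)
```

The version check avoids one subtle risk of the `try`/`except TypeError` fallback: a `TypeError` raised from inside the handler on 3.12+ would trigger a second `rmtree` pass through the deprecated `onerror` path.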
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -84,7 +84,7 @@ modal = ["modal==1.3.4"]
 daytona = ["daytona==0.155.0"]
 vercel = ["vercel==0.5.7"]
 hindsight = ["hindsight-client==0.6.1"]
-dev = ["debugpy==1.8.20", "pytest==9.0.2", "pytest-asyncio==1.3.0", "pytest-xdist==3.8.0", "pytest-split==0.11.0", "pytest-timeout==2.4.0", "mcp==1.26.0", "ty==0.0.21", "ruff==0.15.10"]
+dev = ["debugpy==1.8.20", "pytest==9.0.2", "pytest-asyncio==1.3.0", "pytest-timeout==2.4.0", "mcp==1.26.0", "ty==0.0.21", "ruff==0.15.10"]
 messaging = ["python-telegram-bot[webhooks]==22.6", "discord.py[voice]==2.7.1", "aiohttp==3.13.3", "brotlicffi==1.2.0.1", "slack-bolt==1.27.0", "slack-sdk==3.40.1", "qrcode==7.4.2"]
 cron = []  # croniter is now a core dependency; this extra kept for back-compat
 slack = ["slack-bolt==1.27.0", "slack-sdk==3.40.1", "aiohttp==3.13.3"]
@@ -232,16 +232,12 @@ markers = [
    "integration: marks tests requiring external services (API keys, Modal, etc.)",
    "real_concurrent_gate: opt out of the autouse stub that disables _detect_concurrent_hermes_instances",
 ]
-# pytest-timeout: per-test 60s hard cap with thread method.
+# pytest-timeout: per-test 30s hard cap with signal method.
-# Discovered May 2026: the suite reliably hangs at ~96% on full runs even
+# This is the fallback inside each per-file pytest subprocess (see
-# though every individual test completes in <30s. Root cause is leaked
+# scripts/run_tests_parallel.py). Per-file isolation gives every test
-# threads / atexit handlers accumulating across thousands of tests until
+# file a fresh Python interpreter; pytest-timeout catches Python-level
-# something deadlocks at session teardown. Adding pytest-timeout (with
+# hangs within a file.
-# thread method, which forces an interrupt into the test thread) breaks
+addopts = "-m 'not integration' --timeout=30 --timeout-method=signal"
 # the deadlock — the suite then completes cleanly. The 60s cap is large
 # enough that no legitimate test trips it; if a test exceeds it that's a
 # real bug worth surfacing as a Timeout failure.
 addopts = "-m 'not integration' -n auto --timeout=30 --timeout-method=signal"
 [tool.ty.environment]
 python-version = "3.13"
--- a/scripts/run_tests.sh
+++ b/scripts/run_tests.sh
@@ -3,29 +3,36 @@
 # `pytest` directly to guarantee your local run matches CI behavior.
 #
 # What this script enforces:
-#   * -n 4 xdist workers (CI has 4 cores; -n auto diverges locally)
+#   * Per-file isolation via scripts/run_tests_parallel.py — each test
 #     file runs in its own freshly-spawned `python -m pytest <file>`
 #     subprocess. No xdist, no shared workers, no module-level leakage
 #     between files.
 #   * TZ=UTC, LANG=C.UTF-8, PYTHONHASHSEED=0 (deterministic)
-#   * Credential env vars blanked (conftest.py also does this, but this
+#   * Env vars blanked (conftest.py also does this, but this
-#     is belt-and-suspenders for anyone running `pytest` outside of
+#     is belt-and-suspenders for anyone running pytest outside our
-#     our conftest path — e.g. calling pytest on a single file)
+#     conftest path — e.g. on a single file)
-#   * Proper venv activation
+#   * Proper venv activation (probes .venv, venv, then ~/.hermes/...)
 #
 # Usage:
-#   scripts/run_tests.sh                     # full suite
+#   scripts/run_tests.sh                            # full suite
-#   scripts/run_tests.sh tests/agent/        # one directory
+#   scripts/run_tests.sh -j 4                       # cap parallelism
-#   scripts/run_tests.sh tests/agent/test_foo.py::TestClass::test_method
+#   scripts/run_tests.sh tests/agent/               # discover only here
-#   scripts/run_tests.sh --tb=long -v        # pass-through pytest args
+#   scripts/run_tests.sh tests/agent/ tests/acp/    # multiple roots
 #   scripts/run_tests.sh tests/foo.py               # single file
 #   scripts/run_tests.sh tests/foo.py -- --tb=long  # path + pytest args
 #   scripts/run_tests.sh -- -v --tb=long            # pytest args only
 #
 # Everything after a literal '--' is passed through to each per-file
 # pytest invocation. Positional path arguments before '--' override
 # the default discovery root (tests/).
 set -euo pipefail
 # ── Locate repo root ────────────────────────────────────────────────────────
 # Works whether this is the main checkout or a worktree.
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 REPO_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
 # ── Activate venv ───────────────────────────────────────────────────────────
 # Prefer a .venv in the current tree, fall back to the main checkout's venv
 # (useful for worktrees where we don't always duplicate the venv).
 VENV=""
 for candidate in "$REPO_ROOT/.venv" "$REPO_ROOT/venv" "$HOME/.hermes/hermes-agent/venv"; do
  if [ -f "$candidate/bin/activate" ]; then
@@ -41,94 +48,31 @@ fi
 PYTHON="$VENV/bin/python"
 # ── Ensure pytest-split is installed (required for shard-equivalent runs) ──
 if ! "$PYTHON" -c "import pytest_split" 2>/dev/null; then
  echo "→ installing pytest-split into $VENV"
  if command -v uv >/dev/null 2>&1; then
    uv pip install --python "$PYTHON" --quiet "pytest-split>=0.9,<1"
  elif "$PYTHON" -m pip --version >/dev/null 2>&1; then
    "$PYTHON" -m pip install --quiet "pytest-split>=0.9,<1"
  else
    echo "error: neither uv nor pip is available in $VENV — pytest-split is missing" >&2
    echo "  fix: run  uv pip install -e \".[dev]\"  from $REPO_ROOT" >&2
    exit 1
  fi
 fi
-# ── Hermetic environment ────────────────────────────────────────────────────
+# ── Live-gateway plugin (computed before we drop env) ───────────────────────
-# Mirror what CI does in .github/workflows/tests.yml + what conftest.py does.
+EXTRA_PYTHONPATH=""
-# Unset every credential-shaped var currently in the environment.
+EXTRA_PYTEST_PLUGINS=""
 while IFS='=' read -r name _; do
  case "$name" in
    *_API_KEY|*_TOKEN|*_SECRET|*_PASSWORD|*_CREDENTIALS|*_ACCESS_KEY| \
    *_SECRET_ACCESS_KEY|*_PRIVATE_KEY|*_OAUTH_TOKEN|*_WEBHOOK_SECRET| \
    *_ENCRYPT_KEY|*_APP_SECRET|*_CLIENT_SECRET|*_CORP_SECRET|*_AES_KEY| \
    AWS_ACCESS_KEY_ID|AWS_SECRET_ACCESS_KEY|AWS_SESSION_TOKEN|FAL_KEY| \
    GH_TOKEN|GITHUB_TOKEN)
      unset "$name"
      ;;
  esac
 done < <(env)
 # Unset HERMES_* behavioral vars too.
 unset HERMES_YOLO_MODE HERMES_INTERACTIVE HERMES_QUIET HERMES_TOOL_PROGRESS \
      HERMES_TOOL_PROGRESS_MODE HERMES_MAX_ITERATIONS HERMES_SESSION_PLATFORM \
      HERMES_SESSION_CHAT_ID HERMES_SESSION_CHAT_NAME HERMES_SESSION_THREAD_ID \
      HERMES_SESSION_SOURCE HERMES_SESSION_KEY HERMES_GATEWAY_SESSION \
      HERMES_CRON_SESSION \
      HERMES_PLATFORM HERMES_INFERENCE_PROVIDER HERMES_MANAGED HERMES_DEV \
      HERMES_CONTAINER HERMES_EPHEMERAL_SYSTEM_PROMPT HERMES_TIMEZONE \
      HERMES_REDACT_SECRETS HERMES_BACKGROUND_NOTIFICATIONS HERMES_EXEC_ASK \
      HERMES_HOME_MODE 2>/dev/null || true
 # Pin deterministic runtime.
 export TZ=UTC
 export LANG=C.UTF-8
 export LC_ALL=C.UTF-8
 export PYTHONHASHSEED=0
 # ── Live-gateway test guard (developer machines) ────────────────────────────
 # If a system-wide hermes pytest_live_guard plugin is installed at
 # $HOME/.hermes/pytest_live_guard.py, force-load it here so every test run
 # from this script gets the protection regardless of which worktree is
 # checked out (in-tree tests/conftest.py guard may be missing on stale
 # branches). Harmless on CI / fresh machines that don't have the file.
 if [ -f "$HOME/.hermes/pytest_live_guard.py" ]; then
-  case ":${PYTHONPATH:-}:" in
+  EXTRA_PYTHONPATH="$HOME/.hermes"
-    *":$HOME/.hermes:"*) ;;
+  EXTRA_PYTEST_PLUGINS="pytest_live_guard"
    *) export PYTHONPATH="${PYTHONPATH:+$PYTHONPATH:}$HOME/.hermes" ;;
  esac
  if [[ ",${PYTEST_PLUGINS:-}," != *,pytest_live_guard,* ]]; then
    export PYTEST_PLUGINS="${PYTEST_PLUGINS:+$PYTEST_PLUGINS,}pytest_live_guard"
  fi
 fi
 # ── Worker count ────────────────────────────────────────────────────────────
 # CI uses `-n auto` on ubuntu-latest which gives 4 workers. A 20-core
 # workstation with `-n auto` gets 20 workers and exposes test-ordering
 # flakes that CI will never see. Pin to 4 so local matches CI.
 WORKERS="${HERMES_TEST_WORKERS:-4}"
-# ── Run pytest ──────────────────────────────────────────────────────────────
+# ── Run in hermetic env ──────────────────────────────────────────────────────
 # env -i: start with empty environment, opt-in only what we need.
 # No credential var can leak — you'd have to explicitly add it here.
 echo "▶ running per-file parallel test suite via run_tests_parallel.py"
 echo "  (TZ=UTC LANG=C.UTF-8 PYTHONHASHSEED=0; clean env)"
 cd "$REPO_ROOT"
-# If the first argument starts with `-` treat all args as pytest flags;
+exec env -i \
-# otherwise treat them as test paths.
+  PATH="$PATH" \
-ARGS=("$@")
+  HOME="$HOME" \
-
+  TZ=UTC \
-echo "▶ running pytest with $WORKERS workers, hermetic env, in $REPO_ROOT"
+  LANG=C.UTF-8 \
-echo "  (TZ=UTC LANG=C.UTF-8 PYTHONHASHSEED=0; all credential env vars unset)"
+  LC_ALL=C.UTF-8 \
-
+  PYTHONHASHSEED=0 \
-# -o "addopts=" clears pyproject.toml's `-n auto` so our -n wins.
+  ${EXTRA_PYTHONPATH:+PYTHONPATH="$EXTRA_PYTHONPATH"} \
-# We re-add --timeout/--timeout-method here because pyproject.toml's
+  ${EXTRA_PYTEST_PLUGINS:+PYTEST_PLUGINS="$EXTRA_PYTEST_PLUGINS"} \
-# addopts is wiped above. The 60s cap is essential: see pyproject.toml
+  "$PYTHON" "$SCRIPT_DIR/run_tests_parallel.py" "$@"
 # for why (suite deadlocks at session teardown without it).
 exec "$PYTHON" -m pytest \
  -o "addopts=" \
  -n "$WORKERS" \
  --timeout=30 \
  --timeout-method=signal \
  --ignore=tests/integration \
  --ignore=tests/e2e \
  -m "not integration" \
  "${ARGS[@]}"
--- a/scripts/run_tests_parallel.py
+++ b/scripts/run_tests_parallel.py
@@ -0,0 +1,650 @@
 #!/usr/bin/env python3
 """Per-file parallel test runner.
 The minimum-viable replacement for pytest-xdist + a subprocess-isolation
 plugin. Discovers test files under ``tests/`` (excluding integration/e2e
 unless explicitly requested), then runs one ``python -m pytest <file>``
 subprocess per file, with bounded parallelism (default: ``os.cpu_count()``).
 Why per-file rather than per-test?
    Per-test spawn overhead (~250ms × 17k tests = 70min CPU minimum)
    swamped the actual work. Per-file spawn (~250ms × ~850 files = ~3.5min)
    fits in the budget while still giving every file a fresh Python
    interpreter — the only isolation boundary that actually matters
    (cross-file module-level state leakage was the original flake source;
    intra-file state is the test author's responsibility).
 Why drop xdist entirely?
    xdist's persistent workers accumulate state across files, which is
    exactly the leakage we wanted to fix. xdist also adds complexity
    (loadfile vs loadscope, --max-worker-restart, internal control plane)
    that we don't need when the unit of work is "run pytest on one file".
    A subprocess.Popen pool gated by a semaphore is ~60 lines and does
    the job.
 Usage:
    python scripts/run_tests_parallel.py [pytest_args...]
    Common pytest args pass through (e.g. ``-v``, ``-x``, ``--tb=long``,
    ``-k 'pattern'``, ``--lf``).
 Environment:
    HERMES_TEST_WORKERS  Override worker count (default: os.cpu_count())
    HERMES_TEST_PATHS    Override discovery roots (colon-sep, default: 'tests')
 Exit code: 0 if every file's pytest exited 0; 1 otherwise.
 """
 from __future__ import annotations
 import argparse
 import os
 import subprocess
 import sys
 import threading
 import time
 from concurrent.futures import ThreadPoolExecutor, Future
 from pathlib import Path
 from typing import Dict, List, Tuple
 # Default test discovery roots.
 _DEFAULT_ROOTS = ["tests"]
 # Directories to skip during discovery — the e2e + integration suites
 # require real services and are run separately. Match exactly the
 # ``--ignore=`` flags the previous CI command used.
 _SKIP_PARTS = {"integration", "e2e"}
 # Per-file wall-clock cap. Generous default — pytest-timeout still
 # enforces per-test caps inside each subprocess; this is just an outer
 # safety net so a single hung file can't stall the whole suite. Override
 # via --file-timeout or HERMES_TEST_FILE_TIMEOUT.
 _DEFAULT_FILE_TIMEOUT_SECONDS = 600.0  # 10 minutes
 def _count_tests(
    files: List[Path], repo_root: Path, pytest_passthrough: List[str]
 ) -> dict[Path, int]:
    """Run ``pytest --co -q`` once to count individual tests per file.
    Returns a mapping ``{file_path: test_count}``. Files with zero
    collected tests are omitted from the dict (not an error — e.g. the
    file only defines fixtures / conftest helpers).
    This is a single subprocess call (~2-5s for ~1k files) that gives
    us the total test count for the discovery announcement and
    per-file counts for the progress lines.
    ``--ignore`` flags for directories in ``_SKIP_PARTS`` are added
    automatically so that pytest's own collection machinery (conftest
    walking, directory traversal) doesn't pull in tests we intend to
    skip — matching what the per-file runs will actually execute.
    """
    # Build --ignore flags for skipped dirs so the --co collection
    # mirrors what we'll actually run (not what pytest might find via
    # conftest walking or directory traversal).
    ignore_args: List[str] = []
    for root in [repo_root / p for p in _DEFAULT_ROOTS]:
        for part in _SKIP_PARTS:
            d = root / part
            if d.is_dir():
                ignore_args.extend(["--ignore", str(d)])
    cmd = [
        sys.executable, "-m", "pytest",
        "--co", "-q",
        *ignore_args,
        *[str(f) for f in files],
        *pytest_passthrough,
    ]
    try:
        result = subprocess.run(
            cmd,
            cwd=repo_root,
            capture_output=True,
            text=True,
            timeout=120,
        )
    except (subprocess.TimeoutExpired, OSError):
        return {}
    counts: dict[Path, int] = {}
    for line in result.stdout.splitlines():
        # Lines look like: tests/acp/test_auth.py::TestClass::test_name
        if "::" not in line:
            continue
        file_part = line.split("::", 1)[0]
        key = repo_root / file_part
        counts[key] = counts.get(key, 0) + 1
    return counts
 def _discover_files(roots: List[Path]) -> List[Path]:
    """Return every ``test_*.py`` under the given roots (sorted).
    Roots may be directories (recursed for ``test_*.py``) or explicit
    ``.py`` files (included as-is, even if they don't match the
    ``test_*`` prefix — caller knows what they want).
    Exclude any file whose path contains a component in ``_SKIP_PARTS``,
    UNLESS the user explicitly named it as a root (in which case the
    user's intent overrides the skip filter).
    """
    seen: set[Path] = set()
    out: List[Path] = []
    for root in roots:
        if not root.exists():
            continue
        if root.is_file():
            # Explicit file: include it as-is, skip the _SKIP_PARTS filter
            # since the user named it directly.
            real = root.resolve()
            if real not in seen:
                seen.add(real)
                out.append(root)
            continue
        for path in root.rglob("test_*.py"):
            if any(part in _SKIP_PARTS for part in path.parts):
                continue
            real = path.resolve()
            if real in seen:
                continue
            seen.add(real)
            out.append(path)
    return sorted(out)
 def _kill_tree(proc: "subprocess.Popen", pgid: int | None = None) -> None:
    """Kill the pytest subprocess and every descendant it spawned.
    A test run can spin up uvicorn servers, async runtimes, or other
    long-running grandchildren that survive the pytest subprocess exit
    if we don't kill the whole tree. ``subprocess.Popen.kill()`` only
    targets the immediate child; grandchildren reparent to PID 1
    (Linux) or keep running with a stale parent PID (Windows) and leak.
    POSIX: the caller must pass ``pgid`` — the process group id captured
    immediately after Popen (via ``os.getpgid(proc.pid)``). We can't
    look it up here in the happy path because by the time we get
    called the leader process has already been reaped and its pid is
    gone from the kernel's process table, even though descendants in
    the group are still alive. SIGKILL'ing the captured pgid takes out
    everything in that group atomically.
    Windows: ``taskkill /F /T /PID`` walks the recorded ppid chain and
    terminates the whole tree, even when the root has already exited.
    Why not psutil: psutil walks the parent-child tree, but in the
    happy path the root has already been reaped so ``psutil.Process(pid)``
    can't find it; grandchildren reparented to PID 1 are also
    unreachable by tree walk at that point. The platform-native
    primitives (process groups / taskkill) handle both cases correctly
    without an extra abstraction layer.
    """
    if proc.pid is None:
        return
    if sys.platform == "win32":
        try:
            subprocess.run(
                ["taskkill", "/F", "/T", "/PID", str(proc.pid)],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=10,
            )  # windows-footgun: ok
        except (subprocess.TimeoutExpired, FileNotFoundError, OSError):
            pass
    else:
        # POSIX: kill the captured pgid. Local-import signal so the
        # SIGKILL attribute is never referenced on Windows.
        if pgid is not None:
            try:
                import signal as _signal
                os.killpg(pgid, _signal.SIGKILL)  # windows-footgun: ok
            except (ProcessLookupError, PermissionError, OSError):
                pass
    # Belt-and-suspenders: ensure subprocess.communicate() sees the exit.
    try:
        proc.kill()
    except (ProcessLookupError, OSError):
        pass
 def _run_one_file(
    file: Path,
    pytest_args: List[str],
    repo_root: Path,
    file_timeout: float,
 ) -> Tuple[Path, int, str, dict[str, int]]:
    """Run ``python -m pytest <file> <pytest_args>`` in a fresh subprocess.
    Returns (file, returncode, captured_combined_output, summary_counts).
    ``summary_counts`` is the result of ``_parse_pytest_summary(output)``.
    pytest exit codes (https://docs.pytest.org/en/stable/reference/exit-codes.html):
        0 = all tests passed
        1 = some tests failed
        2 = test execution interrupted
        3 = internal error
        4 = pytest CLI usage error
        5 = no tests collected
    We treat exit 5 as a pass: it just means no test in the file was
    collected or selected (e.g. ``-m 'not integration'`` deselects every
    test in a file where all tests are marked integration). That's
    intentional and not a failure mode.
    On per-file timeout (``file_timeout`` seconds) or any other exception
    during ``communicate()``, we kill the whole process group / process
    tree so grandchildren (uvicorn servers, async runtimes, etc.) do not
    orphan onto PID 1. The pytest-timeout plugin enforces per-test
    timeouts inside the subprocess; this outer timeout exists only to
    bound a pathologically slow or hung file as a whole.
    """
    cmd = [sys.executable, "-m", "pytest", str(file), *pytest_args]
    proc = subprocess.Popen(
        cmd,
        cwd=repo_root,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        # POSIX: place the child at the head of its own process group so
        # _kill_tree can SIGKILL the group atomically.
        # Windows: subprocess ignores start_new_session (it is POSIX-only);
        # _kill_tree handles the Windows path via taskkill /F /T.
        start_new_session=True,
    )
    # Capture the pgid NOW, before the leader can exit and be reaped.
    # Once the leader is reaped, os.getpgid(proc.pid) raises
    # ProcessLookupError even though grandchildren in that group are
    # still alive — defeating the whole cleanup. None on Windows where
    # the pgid concept doesn't apply (taskkill walks ppid chain instead).
    pgid: int | None = None
    if sys.platform != "win32":
        try:
            pgid = os.getpgid(proc.pid)
        except (ProcessLookupError, PermissionError):
            # Astonishingly fast child? Already dead. _kill_tree's
            # fallback will handle this case as a no-op.
            pgid = None
    try:
        output, _ = proc.communicate(timeout=file_timeout)
        rc = proc.returncode
    except subprocess.TimeoutExpired:
        _kill_tree(proc, pgid=pgid)
        # Drain whatever the child wrote before we killed it so we have
        # something to surface in the failure dump.
        try:
            output, _ = proc.communicate(timeout=10)
        except subprocess.TimeoutExpired:
            output = "(file timeout exceeded; output unavailable)"
        rc = 124  # de facto convention for "killed by timeout".
        output = (
            f"(per-file timeout: {file_timeout:.0f}s exceeded; "
            f"process tree SIGKILL'd)\n{output}"
        )
    except BaseException:
        # KeyboardInterrupt / runner crash — make sure no zombie
        # grandchildren outlive us.
        _kill_tree(proc, pgid=pgid)
        raise
    else:
        # Happy path: pytest exited on its own. The child process already
        # cleaned up its grandchildren if it's well-behaved, but
        # well-behaved is not universal — kill the group anyway. Already-
        # dead processes are a no-op.
        _kill_tree(proc, pgid=pgid)
    if rc == 5:
        # No tests collected: every test in the file was filtered out.
        # Treat as a pass. Note that the progress line does not flag
        # this case separately.
        rc = 0
    summary = _parse_pytest_summary(output)
    return file, rc, output, summary
 def _parse_pytest_summary(output: str) -> dict[str, int]:
    """Extract per-file test pass/fail/skip counts from pytest output.
    pytest prints a counts line like ``1 failed, 12 passed, 3 skipped in 2.10s``
    as the last non-empty line of its output, after the short test summary
    (if any).  We scrape that line for the individual counts so the progress
    display can show test-level granularity instead of just file-level pass/fail.
    Returns a dict with keys ``passed``, ``failed``, ``skipped``, ``errors``,
    ``xfailed``, ``xpassed`` (only keys found in the output are present).
    """
    import re
    result: dict[str, int] = {}
    # Walk backwards from the end — the summary line is always near the tail.
    for line in reversed(output.splitlines()):
        line = line.strip()
        if not line:
            continue
        # Match "N passed", "N failed", "N skipped", "N errors", "N xfailed", "N xpassed"
        for m in re.finditer(r"(\d+)\s+(passed|failed|skipped|errors|xfailed|xpassed)", line):
            result[m.group(2)] = int(m.group(1))
        # Also match "N error" (singular: pytest prints this for exactly one
        # error). setdefault keeps a plural "errors" count if one was found.
        for m in re.finditer(r"(\d+)\s+error\b", line):
            result.setdefault("errors", int(m.group(1)))
        if result:
            # Found the counts line — done.
            break
        # Stop at the short test summary section (if any); everything above
        # that is individual failure details, not the counts line.
        if line.startswith(("FAILED", "ERROR")) or "short test summary info" in line:
            break
    return result
 def _format_file(file: Path, repo_root: Path) -> str:
    """Render a test-file path for display: strip the repo-root prefix
    when possible so output reads ``tests/acp/test_auth.py`` instead of
    ``/home/runner/work/hermes-agent/hermes-agent/tests/acp/test_auth.py``.
    Falls back to the absolute path for anything outside the repo root.
    """
    try:
        return str(file.resolve().relative_to(repo_root.resolve()))
    except ValueError:
        return str(file)
 def _print_progress(
    tests_done: int,
    total_tests: int,
    file: Path,
    rc: int,
    dur: float,
    repo_root: Path,
    tests_passed: int,
    tests_failed: int,
    test_counts: dict[Path, int],
    file_summary: dict[str, int] | None = None,
 ) -> None:
    """Single-line live progress.
    When ``file_summary`` is provided (parsed from pytest output), the
    per-file parenthetical shows individual test pass/fail counts instead
    of just the total test count.
    """
    status = "✓" if rc == 0 else "✗"
    pct = (tests_done / total_tests * 100) if total_tests else 0
    # Digit width for the ✓/✗ counter padding (derived from the running
    # passed + failed test total, so both counters stay aligned).
    fw = len(str(tests_passed + tests_failed))
    # Build per-file test count string.
    if file_summary:
        parts = []
        p = file_summary.get("passed", 0)
        f = file_summary.get("failed", 0)
        s = file_summary.get("skipped", 0)
        e = file_summary.get("errors", 0)
        if p:
            parts.append(f"{p}✓")
        if f:
            parts.append(f"{f}✗")
        if s:
            parts.append(f"{s}s")
        if e:
            parts.append(f"{e}e")
        # xfailed/xpassed are rare; include if present.
        xf = file_summary.get("xfailed", 0)
        xp = file_summary.get("xpassed", 0)
        if xf:
            parts.append(f"{xf}xf")
        if xp:
            parts.append(f"{xp}xp")
        test_str = " ".join(parts) + ", " if parts else ""
    else:
        n_tests = test_counts.get(file, 0)
        test_str = f"{n_tests} tests, " if n_tests else ""
    msg = (
        f"[{pct:5.1f}% | {tests_done:>5}/{total_tests}"
        f" | ✓{tests_passed:>{fw}} | ✗{tests_failed:>{fw}}] "
        f"{status} {_format_file(file, repo_root)} ({test_str}{dur:.1f}s)"
    )
    # Truncate to terminal width when stdout is a terminal, so long paths
    # don't wrap onto a second line.
    try:
        cols = os.get_terminal_size().columns
        if len(msg) > cols:
            msg = msg[: cols - 1] + "…"
    except OSError:
        pass
    print(msg, flush=True)
 def _print_inline_failure(
    file: Path, output: str, repo_root: Path, pytest_passthrough: List[str]
 ) -> None:
    """Print a compact failure summary immediately when a file fails.
    Shows the tail of the pytest output (the failure section with stack
    traces) and a ready-to-run repro command, so the developer doesn't
    have to wait for the full run to finish before seeing what broke.
    """
    rel = _format_file(file, repo_root)
    # Build a repro command the developer can copy-paste. shlex.join
    # re-quotes passthrough args (e.g. -m 'not integration') so the
    # command survives a round trip through a POSIX shell.
    import shlex
    repro = f"python -m pytest {rel}"
    if pytest_passthrough:
        repro += f" {shlex.join(pytest_passthrough)}"
    # Grab just the failure lines (last ~30 lines of pytest output —
    # typically the FAILED summary + short test info).
    lines = output.rstrip().splitlines()
    tail = "\n".join(lines[-30:])
    print(flush=True)
    print(f"  ╔╍ Failed: {rel} ╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍", flush=True)
    for line in tail.splitlines():
        print(f"  ║ {line}", flush=True)
    print("  ║", flush=True)
    print(f"  ║  Repro: {repro}", flush=True)
    print(f"  ╚╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍", flush=True)
    print(flush=True)
 def main() -> int:
    parser = argparse.ArgumentParser(
        description=__doc__,
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    parser.add_argument(
        "-j",
        "--jobs",
        type=int,
        default=int(os.environ.get("HERMES_TEST_WORKERS") or (os.cpu_count() or 4) * 2),
        help="Parallel worker count (default: $HERMES_TEST_WORKERS or cpu_count*2)",
    )
    parser.add_argument(
        "--paths",
        default=os.environ.get("HERMES_TEST_PATHS", ":".join(_DEFAULT_ROOTS)),
        help="Colon-separated discovery roots (default: 'tests')",
    )
    parser.add_argument(
        "--include-integration",
        action="store_true",
        help="Don't skip integration/ and e2e/ during discovery",
    )
    parser.add_argument(
        "--file-timeout",
        type=float,
        default=float(
            os.environ.get("HERMES_TEST_FILE_TIMEOUT", _DEFAULT_FILE_TIMEOUT_SECONDS)
        ),
        help=(
            "Per-file wall-clock cap in seconds. On timeout, the pytest "
            "subprocess and its full process tree are SIGKILL'd. "
            "Default: 600 (10 min), env: HERMES_TEST_FILE_TIMEOUT."
        ),
    )
    parser.add_argument(
        "paths_positional",
        nargs="*",
        metavar="PATH",
        help=(
            "Restrict discovery to these paths (directories or .py files). "
            "Overrides --paths when given. Anything after a literal '--' "
            "separator is passed through to each per-file pytest invocation."
        ),
    )
    # Manually split argv on '--' so positional paths and pytest passthrough
    # args don't fight over each other. argparse's nargs="*" positional is
    # greedy and will swallow everything after '--' including the pytest
    # flags, defeating the convention.
    argv = sys.argv[1:]
    if "--" in argv:
        sep = argv.index("--")
        our_args, pytest_passthrough = argv[:sep], argv[sep + 1 :]
    else:
        our_args, pytest_passthrough = argv, []
    args = parser.parse_args(our_args)
    repo_root = Path(__file__).resolve().parent.parent
    # Resolve discovery roots: positional path args override --paths if any
    # were supplied, otherwise --paths (which itself defaults to 'tests').
    if args.paths_positional:
        # Positionals can be directories OR explicit .py files. Either is
        # fine — _discover_files handles both via rglob('test_*.py') for
        # dirs and direct inclusion for files.
        roots = [repo_root / p for p in args.paths_positional]
    else:
        roots = [repo_root / p for p in args.paths.split(":") if p]
    if args.include_integration:
        # Caller takes responsibility — typically used via explicit -k filter.
        global _SKIP_PARTS  # noqa: PLW0603 — config knob
        _SKIP_PARTS = set()
    files = _discover_files(roots)
    if not files:
        print(f"No test files discovered under {[str(r) for r in roots]}", file=sys.stderr)
        return 1
    # Count individual tests per file via a single pytest --co pass.
    test_counts = _count_tests(files, repo_root, pytest_passthrough)
    total_tests = sum(test_counts.values())
    print(
        f"Discovered {len(files)} test files ({total_tests} tests) under "
        f"{[str(r.relative_to(repo_root)) if r.is_relative_to(repo_root) else str(r) for r in roots]}; "
        f"running with -j {args.jobs}",
        flush=True,
    )
    # Capture and print on completion (out-of-order is fine — keeps the
    # terminal clean rather than interleaving N parallel pytest outputs).
    failures: List[Tuple[Path, str, Dict[str, int]]] = []
    started = time.monotonic()
    files_done = 0
    tests_done = 0
    pass_count = 0
    fail_count = 0
    tests_passed = 0
    tests_failed = 0
    lock = threading.Lock()
    def _on_done(file: Path, started_at: float, fut: "Future[Tuple[Path, int, str, dict[str, int]]]") -> None:
        nonlocal files_done, tests_done, pass_count, fail_count, tests_passed, tests_failed
        n_tests = test_counts.get(file, 0)
        try:
            fpath, rc, output, summary = fut.result()
        except Exception as exc:  # noqa: BLE001 — must always advance counter
            with lock:
                files_done += 1
                tests_done += n_tests
                fail_count += 1
                failures.append((file, f"runner crashed: {exc!r}", {}))
                _print_progress(
                    tests_done, total_tests, file, 1,
                    time.monotonic() - started_at,
                    repo_root, tests_passed, tests_failed,
                    test_counts,
                )
            return
        with lock:
            files_done += 1
            tests_done += n_tests
            # Accumulate test-level counts from parsed summary.
            tests_passed += summary.get("passed", 0)
            tests_failed += summary.get("failed", 0)
            if rc == 0:
                pass_count += 1
            else:
                fail_count += 1
                failures.append((fpath, output, summary))
            _print_progress(
                tests_done, total_tests, fpath, rc,
                time.monotonic() - started_at,
                repo_root, tests_passed, tests_failed,
                test_counts,
                file_summary=summary,
            )
            if rc != 0:
                _print_inline_failure(fpath, output, repo_root, pytest_passthrough)
    with ThreadPoolExecutor(max_workers=args.jobs) as pool:
        futures: List[Future] = []
        for file in files:
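            # t0 is the submit time, so the reported duration includes any
            # time the file spent queued behind other workers.
            t0 = time.monotonic()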
            fut = pool.submit(
                _run_one_file, file, pytest_passthrough, repo_root, args.file_timeout
            )
            fut.add_done_callback(lambda f, file=file, t0=t0: _on_done(file, t0, f))
            futures.append(fut)
        # Block until everything's done. ThreadPoolExecutor.__exit__ waits
        # for all submitted work, but doing it explicitly here makes the
        # control flow obvious. Failures are reported by _on_done, so the
        # exception (if any) is deliberately discarded here.
        for fut in futures:
            fut.exception()
    elapsed = time.monotonic() - started
    print()
    pct = (tests_done / total_tests * 100) if total_tests else 0
    print(f"=== Summary: {len(files)} files, {tests_passed} tests passed, {tests_failed} failed ({pct:.0f}% complete) in {elapsed:.1f}s ({args.jobs} workers) ===")
    if failures:
        print()
        print("=== Failure output ===")
        for file, output, _summary in failures:
            print()
            print(f"--- {_format_file(file, repo_root)} ---")
            print(output.rstrip())
        print()
        # Split: files with actual test failures vs non-zero exit for other reasons
        test_fail_files = [(f, s) for f, _o, s in failures if s.get("failed", 0) > 0]
        all_passed_but_nonzero = [(f, s) for f, _o, s in failures
                                  if s.get("failed", 0) == 0 and s.get("passed", 0) > 0]
        no_tests_ran = [(f, s) for f, _o, s in failures
                        if s.get("failed", 0) == 0 and s.get("passed", 0) == 0]
        if test_fail_files:
            total_tf = sum(s.get("failed", 0) for _, s in test_fail_files)
            print(f"=== {len(test_fail_files)} file{'s' if len(test_fail_files) != 1 else ''} with test failures ({total_tf} test{'s' if total_tf != 1 else ''} failed) ===")
            for file, s in test_fail_files:
                nf = s.get("failed", 0)
                print(f"  {_format_file(file, repo_root)}  ({nf} test{'s' if nf != 1 else ''} failed)")
        if all_passed_but_nonzero:
            print(f"=== {len(all_passed_but_nonzero)} file{'s' if len(all_passed_but_nonzero) != 1 else ''} where all tests passed but pytest exited non-zero (warnings-as-errors, hook failures, etc.) ===")
            for file, s in all_passed_but_nonzero:
                print(f"  {_format_file(file, repo_root)}  ({s.get('passed', 0)} passed)")
        if no_tests_ran:
            print(f"=== {len(no_tests_ran)} file{'s' if len(no_tests_ran) != 1 else ''} where no tests ran (collection/import error, timeout before collection, etc.) ===")
            for file, s in no_tests_ran:
                print(f"  {_format_file(file, repo_root)}")
        return 1
    return 0
 if __name__ == "__main__":
    sys.exit(main())
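The pgid capture in `_run_one_file` and the group kill in `_kill_tree` can be tried in isolation with the sketch below. It is POSIX-only and is not the runner's code: `run_and_reap_group` is a hypothetical helper name, and the timeout, interrupt, and happy paths are collapsed into a single `finally` block.

```python
import os
import signal
import subprocess
from typing import List, Optional, Tuple


def run_and_reap_group(cmd: List[str], timeout: float = 30.0) -> Tuple[int, str]:
    """Run cmd in a new session, then SIGKILL anything left in its group.

    Hypothetical helper for illustration; POSIX-only.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        start_new_session=True,  # setsid(): child leads a fresh process group
    )
    # Record the group id while the leader is still in the process table.
    pgid: Optional[int]
    try:
        pgid = os.getpgid(proc.pid)
    except ProcessLookupError:
        pgid = None
    try:
        output, _ = proc.communicate(timeout=timeout)
    finally:
        # Runs on success, timeout, and Ctrl-C alike. Descendants that
        # outlived the leader are still members of the group.
        if pgid is not None:
            try:
                os.killpg(pgid, signal.SIGKILL)
            except (ProcessLookupError, PermissionError):
                pass  # nothing left to kill
    return proc.returncode, output
```

On timeout this sketch lets `TimeoutExpired` propagate after the kill; the runner instead drains the remaining output and reports rc 124.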
--- a/tests/agent/test_auxiliary_client.py
+++ b/tests/agent/test_auxiliary_client.py
@@ -40,6 +40,16 @@ def _clean_env(monkeypatch):
        "ANTHROPIC_API_KEY", "ANTHROPIC_TOKEN", "CLAUDE_CODE_OAUTH_TOKEN",
    ):
        monkeypatch.delenv(key, raising=False)
    # Module-level unhealthy cache (10-min TTL) leaks between tests;
    # earlier tests that call _mark_provider_unhealthy() poison the
    # cache for later ones, causing _resolve_auto to skip providers
    # that the test patched to return valid clients.
    import agent.auxiliary_client as _aux_mod
    _aux_mod._aux_unhealthy_until.clear()
    _aux_mod._aux_unhealthy_logged_at.clear()
    yield
    _aux_mod._aux_unhealthy_until.clear()
    _aux_mod._aux_unhealthy_logged_at.clear()
@pytest.fixture
@@ -461,6 +471,17 @@ class TestExpiredCodexFallback:
        import base64
        import time as _time
        # Belt-and-suspenders: _try_openrouter marks openrouter unhealthy
        # when OPENROUTER_API_KEY is absent (which the preceding test in
        # this class exercises).  The file-level _clean_env autouse fixture
        # clears the cache, but fixture ordering with the conftest
        # _hermetic_environment autouse can leave a narrow window where
        # the mark reappears.  Explicitly clear here so this test is
        # independent of run order.
        import agent.auxiliary_client as _aux_mod
        _aux_mod._aux_unhealthy_until.clear()
        _aux_mod._aux_unhealthy_logged_at.clear()
        header = base64.urlsafe_b64encode(b'{"alg":"RS256","typ":"JWT"}').rstrip(b"=").decode()
        payload_data = json.dumps({"exp": int(_time.time()) - 3600}).encode()
        payload = base64.urlsafe_b64encode(payload_data).rstrip(b"=").decode()
@@ -1047,6 +1068,20 @@ class TestGetProviderChain:
 class TestTryPaymentFallback:
    """_try_payment_fallback skips the failed provider and tries alternatives."""
    @pytest.fixture(autouse=True)
    def _clear_unhealthy_cache(self):
        """Earlier tests in this file call _mark_provider_unhealthy() which
        pollutes the module-level ``_aux_unhealthy_until`` dict (10-min TTL).
        Without this cleanup the fallback chain skips providers we've patched
        to return valid clients — the patched function is never called.
        """
        from agent.auxiliary_client import _aux_unhealthy_until, _aux_unhealthy_logged_at
        _aux_unhealthy_until.clear()
        _aux_unhealthy_logged_at.clear()
        yield
        _aux_unhealthy_until.clear()
        _aux_unhealthy_logged_at.clear()
    def test_skips_failed_provider(self):
        mock_client = MagicMock()
        with patch("agent.auxiliary_client._try_openrouter", return_value=(None, None)), \
--- a/tests/agent/test_skill_commands.py
+++ b/tests/agent/test_skill_commands.py
@@ -556,10 +556,11 @@ Generate some audio.
            raising=False,
        )
-        with patch.dict(
-            os.environ, {"HERMES_SESSION_PLATFORM": "telegram"}, clear=False
-        ):
-            with patch("tools.skills_tool.SKILLS_DIR", tmp_path):
+        with patch("tools.skills_tool.SKILLS_DIR", tmp_path):
+            from gateway.session_context import clear_session_vars, set_session_vars
+
+            tokens = set_session_vars(platform="telegram")
            try:
                _make_skill(
                    tmp_path,
                    "test-skill",
@@ -571,6 +572,8 @@ Generate some audio.
                )
                scan_skill_commands()
                msg = build_skill_invocation_message("/test-skill", "do stuff")
            finally:
                clear_session_vars(tokens)
        assert msg is not None
        assert "local cli" in msg.lower()
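The set-then-restore shape used in this hunk (`set_session_vars` returning tokens, `clear_session_vars` in a `finally`) can be sketched with the standard `contextvars` module. The helper names below are hypothetical and the real `gateway.session_context` implementation may differ; only `ContextVar.set` and `ContextVar.reset` are stdlib calls.

```python
from contextvars import ContextVar, Token
from typing import Optional

_UNSET = object()  # sentinel meaning "no session value set"
_PLATFORM: ContextVar[object] = ContextVar("platform", default=_UNSET)


def set_platform(value: str) -> Token:
    """Set the platform and return a token that can undo the change."""
    return _PLATFORM.set(value)


def clear_platform(token: Token) -> None:
    """Restore whatever value was active before the matching set call."""
    _PLATFORM.reset(token)


def current_platform() -> Optional[str]:
    """Return the active platform, or None when nothing is set."""
    value = _PLATFORM.get()
    return None if value is _UNSET else str(value)
```

Unlike `patch.dict(os.environ, ...)`, this touches only the context variable, so code that reads the session from the ContextVar sees the value and nothing leaks into the process environment.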
--- a/tests/conftest.py
+++ b/tests/conftest.py
@@ -20,12 +20,9 @@ test runner at ``scripts/run_tests.sh``.
 """
 import asyncio
 import logging
 import os
 import re
 import signal
 import sys
 import tempfile
 from pathlib import Path
 from unittest.mock import patch
@@ -37,6 +34,22 @@ if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))
 # ── Per-file process isolation ──────────────────────────────────────────────
 # Tests run via ``scripts/run_tests_parallel.py``, which spawns a fresh
 # ``python -m pytest <file>`` subprocess per test file. Cross-file state
 # leakage (module-level dicts, ContextVars, caches) is impossible: each
 # file gets a clean Python interpreter. Intra-file ordering is the test
 # author's responsibility — if test A in foo.py mutates state that test B
 # in foo.py reads, that's a real bug to fix in the file (it would also
 # bite anyone running ``pytest tests/foo.py`` directly).
 #
 # This replaces the historic _reset_module_state autouse fixture (manual
 # state clearing) and the brief experiment with subprocess-per-test
 # isolation (too slow at ~17k tests).
 #
 # See ``scripts/run_tests_parallel.py`` for the runner.
 # ── Credential env-var filter ──────────────────────────────────────────────
 #
 # Any env var in the current process matching ONE of these patterns is
@@ -279,7 +292,7 @@ _HERMES_BEHAVIORAL_VARS = frozenset({
    "WECOM_HOME_CHANNEL_NAME",
    # Platform gating — set by load_gateway_config() as a side effect when
    # a config.yaml is present, so individual test bodies that call the
-    # loader leak these values into later tests on the same xdist worker.
+    # loader leak these values into later tests in the same process.
    # Force-clear on every test setup so the leak can't happen.
    "SLACK_REQUIRE_MENTION",
    "SLACK_STRICT_MENTION",
@@ -368,144 +381,21 @@ def _isolate_hermes_home(_hermetic_environment):
    return None
-# ── Module-level state reset ───────────────────────────────────────────────
+# ── Module-level state reset — replaced by per-file process isolation ──────
 #
-# Python modules are singletons per process, and pytest-xdist workers are
-# long-lived. Module-level dicts/sets (tool registries, approval state,
-# interrupt flags) and ContextVars persist across tests in the same worker,
-# causing tests that pass alone to fail when run with siblings.
+# Each test FILE runs in a freshly-spawned ``python -m pytest <file>``
+# subprocess via ``scripts/run_tests_parallel.py``, so module-level dicts /
+# sets / ContextVars from tests in one file cannot leak into tests in
+# another file. No manual per-module clearing needed.
 #
-# Each entry in this fixture clears state that belongs to a specific module.
-# New state buckets go here too — this is the single gate that prevents
-# "works alone, flakes in CI" bugs from state leakage.
+# Within a single file, ordering is the author's responsibility. If your
+# tests in the same file share mutable state, either reset it explicitly
+# in a fixture or split them across files.
 #
-# The skill `test-suite-cascade-diagnosis` documents the concrete patterns
-# this closes; the running example was `test_command_guards` failing 12/15
-# CI runs because ``tools.approval._session_approved`` carried approvals
-# from one test's session into another's.
+# The skill ``test-suite-cascade-diagnosis`` documents the cascade patterns
+# this replaces; the running example was ``test_command_guards`` failing
+# 12/15 CI runs because ``tools.approval._session_approved`` carried
+# approvals from one test's session into another's.
@pytest.fixture(autouse=True)
 def _reset_module_state():
    """Clear module-level mutable state and ContextVars between tests.
    Keeps state from leaking across tests on the same xdist worker. Modules
    that don't exist yet (test collection before production import) are
    skipped silently — production import later creates fresh empty state.
    """
    # --- logging — quiet/one-shot paths mutate process-global logger state ---
    logging.disable(logging.NOTSET)
    for _logger_name in ("tools", "run_agent", "trajectory_compressor", "cron", "hermes_cli"):
        _logger = logging.getLogger(_logger_name)
        _logger.disabled = False
        _logger.setLevel(logging.NOTSET)
        _logger.propagate = True
    # --- tools.approval — the single biggest source of cross-test pollution ---
    try:
        from tools import approval as _approval_mod
        _approval_mod._session_approved.clear()
        _approval_mod._session_yolo.clear()
        _approval_mod._permanent_approved.clear()
        _approval_mod._pending.clear()
        _approval_mod._gateway_queues.clear()
        _approval_mod._gateway_notify_cbs.clear()
        # ContextVar: reset to empty string so get_current_session_key()
        # falls through to the env var / default path, matching a fresh
        # process.
        _approval_mod._approval_session_key.set("")
    except Exception:
        pass
    # --- tools.interrupt — per-thread interrupt flag set ---
    try:
        from tools import interrupt as _interrupt_mod
        with _interrupt_mod._lock:
            _interrupt_mod._interrupted_threads.clear()
    except Exception:
        pass
    # --- gateway.session_context — 9 ContextVars that represent
    #     the active gateway session. If set in one test and not reset,
    #     the next test's get_session_env() reads stale values.
    try:
        from gateway import session_context as _sc_mod
        for _cv in (
            _sc_mod._SESSION_PLATFORM,
            _sc_mod._SESSION_CHAT_ID,
            _sc_mod._SESSION_CHAT_NAME,
            _sc_mod._SESSION_THREAD_ID,
            _sc_mod._SESSION_USER_ID,
            _sc_mod._SESSION_USER_NAME,
            _sc_mod._SESSION_KEY,
            _sc_mod._CRON_AUTO_DELIVER_PLATFORM,
            _sc_mod._CRON_AUTO_DELIVER_CHAT_ID,
            _sc_mod._CRON_AUTO_DELIVER_THREAD_ID,
        ):
            _cv.set(_sc_mod._UNSET)
    except Exception:
        pass
    # --- tools.env_passthrough — ContextVar<set[str]> with no default ---
    # LookupError is normal if the test never set it. Setting it to an
    # empty set unconditionally normalizes the starting state.
    try:
        from tools import env_passthrough as _envp_mod
        _envp_mod._allowed_env_vars_var.set(set())
    except Exception:
        pass
    # --- tools.terminal_tool — active environment/cwd cache ---
    # File tools prefer a live terminal cwd when one is cached for the task.
    # Clear terminal environments between tests so a prior terminal call can't
    # override TERMINAL_CWD in path-resolution tests.
    try:
        from tools import terminal_tool as _term_mod
        _envs_to_cleanup = []
        with _term_mod._env_lock:
            _envs_to_cleanup = list(_term_mod._active_environments.values())
            _term_mod._active_environments.clear()
            _term_mod._last_activity.clear()
            _term_mod._creation_locks.clear()
        for _env in _envs_to_cleanup:
            try:
                _env.cleanup()
            except Exception:
                pass
    except Exception:
        pass
    # --- tools.credential_files — ContextVar<dict> ---
    try:
        from tools import credential_files as _credf_mod
        _credf_mod._registered_files_var.set({})
    except Exception:
        pass
    # --- agent.auxiliary_client — runtime main provider/model override and
    #     payment-error health cache. Both are process-global in production;
    #     reset them per test so one worker's fallback/402 test does not make
    #     later auxiliary-client tests skip otherwise-available providers.
    try:
        from agent import auxiliary_client as _aux_mod
        _aux_mod.clear_runtime_main()
        _aux_mod._reset_aux_unhealthy_cache()
    except Exception:
        pass
    # --- tools.file_tools — per-task read history + file-ops cache ---
    # _read_tracker accumulates per-task_id read history for loop detection,
    # capped by _READ_HISTORY_CAP. If entries from a prior test persist, the
    # cap is hit faster than expected and capacity-related tests flake.
    try:
        from tools import file_tools as _ft_mod
        with _ft_mod._read_tracker_lock:
            _ft_mod._read_tracker.clear()
        with _ft_mod._file_ops_lock:
            _ft_mod._file_ops_cache.clear()
    except Exception:
        pass
    yield
@pytest.fixture()
@ -532,13 +422,12 @@ def mock_config():
    }
-# ── Global test timeout ─────────────────────────────────────────────────────
+# ── Per-test timeout — handled by the isolation plugin ─────────────────────
-# Kill any individual test that takes longer than 30 seconds.
+#
-# Prevents hanging tests (subprocess spawns, blocking I/O) from stalling the
+# The subprocess-per-test plugin enforces the configured ``isolate_timeout``
-# entire test suite.
+# ini key by terminating the child if it overruns. The old SIGALRM-based
 # fixture (POSIX-only, didn't work on Windows) is gone.
 def _timeout_handler(signum, frame):
    raise TimeoutError("Test exceeded 30 second timeout")
@pytest.fixture(autouse=True)
 def _ensure_current_event_loop(request):
@ -584,45 +473,6 @@ def _ensure_current_event_loop(request):
                asyncio.set_event_loop(None)
@pytest.fixture(autouse=True)
 def _enforce_test_timeout():
    """Kill any individual test that takes longer than 30 seconds.
    SIGALRM is Unix-only; skip on Windows."""
    if sys.platform == "win32":
        yield
        return
    old = signal.signal(signal.SIGALRM, _timeout_handler)
    signal.alarm(30)
    yield
    signal.alarm(0)
    signal.signal(signal.SIGALRM, old)
@pytest.fixture(autouse=True)
 def _reset_tool_registry_caches():
    """Clear tool-registry-level caches between tests.
    The production registry caches ``check_fn()`` results for 30 s
    (see tools/registry.py) and :func:`get_tool_definitions` memoizes
    its result (see model_tools.py). Both are keyed on state that tests
    routinely mutate (env vars, registry._generation, config.yaml mtime)
    — but a stale result from test A can still be served to test B
    because 30 s covers the entire suite, and xdist worker reuse means
    one test's cache lands in another's process. Clearing before every
    test keeps hermetic behavior.
    """
    try:
        from tools.registry import invalidate_check_fn_cache
        invalidate_check_fn_cache()
    except ImportError:
        pass
    try:
        from model_tools import _clear_tool_defs_cache
        _clear_tool_defs_cache()
    except ImportError:
        pass
 # ── Live-system guard ──────────────────────────────────────────────────────
 #
 # Several test files exercise the gateway-restart / kill code paths
--- a/tests/gateway/conftest.py
+++ b/tests/gateway/conftest.py
@ -313,19 +313,30 @@ def _scan_for_plugin_adapter_antipattern(source: str) -> list[str]:
    return offenses
-def pytest_configure(config):
+def _fingerprint_gateway_tests() -> str:
-    """Reject plugin-adapter tests that use the sys.path anti-pattern.
+    """Return a short fingerprint that changes when any gateway test file changes.
-    Runs once per pytest session on the controller, BEFORE any xdist
+    Uses (mtime, size) pairs instead of content hashing — fast to compute
-    worker is spawned. If any file under ``tests/gateway/`` matches the
+    (stat-only, no reads) and sufficient for cache invalidation across
-    anti-pattern, we fail the whole session with a clear message —
+    per-file subprocess runs.
    before a polluted ``sys.path`` can cascade across workers.
    """
-    # Only run on the xdist controller (or in non-xdist runs). Skip on
+    import hashlib
    # worker subprocesses so we don't scan the filesystem N times.
    if hasattr(config, "workerinput"):
        return
    h = hashlib.sha256()
    for path in sorted(_GATEWAY_DIR.rglob("test_*.py")):
        try:
            st = path.stat()
            h.update(f"{path.name}:{st.st_mtime_ns}:{st.st_size}".encode())
        except OSError:
            h.update(f"{path.name}:missing".encode())
    return h.hexdigest()[:16]
 def _run_adapter_antipattern_scan() -> list[str]:
    """Scan gateway test files for the plugin-adapter anti-pattern.
    Returns a list of violation strings (empty if clean).
    """
    violations: list[str] = []
    for path in _GATEWAY_DIR.rglob("test_*.py"):
        if path.name in {"_plugin_adapter_loader.py", "conftest.py"}:
@ -334,20 +345,108 @@ def pytest_configure(config):
            source = path.read_text(encoding="utf-8")
        except OSError:
            continue
        # Fast string pre-filter: skip files that can't possibly violate.
        # A violating file MUST contain both (a) an adapter/plugins/platforms
        # reference AND (b) either sys.path manipulation or a bare adapter import.
        if "adapter" not in source and "plugins/platforms" not in source:
            continue
        if not (
            "sys.path" in source
            or "import adapter" in source
            or "from adapter import" in source
        ):
            continue
        offenses = _scan_for_plugin_adapter_antipattern(source)
        if offenses:
            violations.append(
                f"  {path.relative_to(_GATEWAY_DIR.parent.parent)}:\n    "
                + "\n    ".join(offenses)
            )
    return violations
-    if violations:
+
-        raise pytest.UsageError(
+def pytest_configure(config):
-            "Plugin-adapter-import anti-pattern detected in gateway tests:\n"
+    """Reject plugin-adapter tests that use the sys.path anti-pattern.
-            + "\n".join(violations)
+
-            + "\n\n"
+    Runs once per pytest session on the controller, BEFORE any xdist
-            + _GUARD_HINT
+    worker is spawned. If any file under ``tests/gateway/`` matches the
-        )
+    anti-pattern, we fail the whole session with a clear message —
    before a polluted ``sys.path`` can cascade across workers.
    **Performance**: in the per-file subprocess isolation model (no xdist),
    every subprocess is a "controller" — so the naive scan would run 257
    times, each costing ~1s of AST walking.  We avoid this with two
    strategies:
    1. **Tight string pre-filter**: a file can only violate if it contains
       *both* an adapter/plugins/platforms reference *and* a sys.path
       manipulation or bare ``import adapter``.  This drops ~95% of files
       from needing AST parsing.
    2. **File-locked cache**: the scan result is cached in
       ``.pytest-cache/gw-adapter-guard-<fingerprint>`` keyed on a
       fingerprint of the gateway test file mtimes/sizes.  Concurrent
       subprocesses acquire a lock; only the first performs the scan;
       the rest wait and read the cached result.
    """
    # Only run on the xdist controller (or in non-xdist runs). Skip on
    # worker subprocesses so we don't scan the filesystem N times.
    if hasattr(config, "workerinput"):
        return
    fp = _fingerprint_gateway_tests()
    cache_dir = Path.cwd() / ".pytest-cache"
    cache_file = cache_dir / f"gw-adapter-guard-{fp}"
    lock_file = cache_dir / f".gw-adapter-guard-{fp}.lock"
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Evict stale cache entries from previous fingerprints (best-effort).
    try:
        for old in cache_dir.glob("gw-adapter-guard-*"):
            if old.name != f"gw-adapter-guard-{fp}":
                old.unlink(missing_ok=True)
        for old in cache_dir.glob(".gw-adapter-guard-*.lock"):
            if old.name != f".gw-adapter-guard-{fp}.lock":
                old.unlink(missing_ok=True)
    except OSError:
        pass  # Non-critical; old files are harmless.
    # Use filelock to ensure only one process scans at a time.
    # Concurrent subprocesses all hit pytest_configure simultaneously;
    # without a lock they'd all find no cache and all run the scan.
    try:
        from filelock import FileLock
        lock = FileLock(str(lock_file), timeout=120)
    except ImportError:
        # Fallback: no locking (still correct, just slower under contention).
        import contextlib
        lock = contextlib.nullcontext()
    with lock:
        if cache_file.exists():
            cached = cache_file.read_text(encoding="utf-8")
            if cached == "clean":
                return
            raise pytest.UsageError(cached)
        # Slow path: this process is the first to acquire the lock.
        violations = _run_adapter_antipattern_scan()
        if violations:
            msg = (
                "Plugin-adapter-import anti-pattern detected in gateway tests:\n"
                + "\n".join(violations)
                + "\n\n"
                + _GUARD_HINT
            )
            cache_file.write_text(msg, encoding="utf-8")
            raise pytest.UsageError(msg)
        else:
            cache_file.write_text("clean", encoding="utf-8")
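
Review note: the fingerprint-keyed cache described in the docstring above can be sketched in isolation as follows. This is a minimal illustration, not the code in this PR: `fingerprint`, `cached_scan`, and `fake_scan` are hypothetical names, the file lock is omitted for brevity, and the sketch hashes the path relative to the root (rather than `path.name`) so that same-named files in different subdirectories do not collide.

```python
import hashlib
import tempfile
from pathlib import Path


def fingerprint(root: Path, pattern: str = "test_*.py") -> str:
    """Hash (relative path, mtime_ns, size) of every matching file; stat-only, no reads."""
    h = hashlib.sha256()
    for path in sorted(root.rglob(pattern)):
        st = path.stat()
        h.update(f"{path.relative_to(root)}:{st.st_mtime_ns}:{st.st_size}\n".encode())
    return h.hexdigest()[:16]


def cached_scan(root: Path, cache_dir: Path, scan) -> str:
    """Return the cached result for the current fingerprint, scanning only on a miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"scan-{fingerprint(root)}"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    result = scan(root)
    cache_file.write_text(result, encoding="utf-8")
    return result


calls = []  # records how many times the (expensive) scan actually ran


def fake_scan(root: Path) -> str:
    calls.append(root)
    return "clean"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "tests"
    root.mkdir()
    (root / "test_a.py").write_text("x = 1\n")
    cache = Path(tmp) / "cache"

    first = cached_scan(root, cache, fake_scan)       # miss: runs the scan
    second = cached_scan(root, cache, fake_scan)      # hit: served from the cache file
    (root / "test_a.py").write_text("x = 12345\n")    # size changes, so the fingerprint changes
    third = cached_scan(root, cache, fake_scan)       # miss again under the new fingerprint
```

Because the key is derived from file metadata, an edit to any test file invalidates the cache without any explicit bookkeeping. Under concurrent subprocesses a lock around the check-then-scan step is still needed to keep the scan from running more than once.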
--- a/tests/gateway/test_dm_topics.py
+++ b/tests/gateway/test_dm_topics.py
@ -22,19 +22,26 @@ from gateway.config import PlatformConfig
 def _ensure_telegram_mock():
    if "telegram" in sys.modules and hasattr(sys.modules["telegram"], "__file__"):
        return
    telegram_mod = MagicMock()
    telegram_mod.ext.ContextTypes.DEFAULT_TYPE = type(None)
    telegram_mod.constants.ParseMode.MARKDOWN_V2 = "MarkdownV2"
    telegram_mod.constants.ChatType.GROUP = "group"
    telegram_mod.constants.ChatType.SUPERGROUP = "supergroup"
    telegram_mod.constants.ChatType.CHANNEL = "channel"
    telegram_mod.constants.ChatType.PRIVATE = "private"
-    for name in ("telegram", "telegram.ext", "telegram.constants", "telegram.request"):
+    # Register telegram.constants as a separate module mock so that
-        sys.modules.setdefault(name, telegram_mod)
+    # ``from telegram.constants import ChatType`` resolves to our mock
    # with string-valued members (not auto-generated MagicMocks).
    constants_mod = MagicMock()
    constants_mod.ParseMode.MARKDOWN_V2 = "MarkdownV2"
    constants_mod.ChatType.GROUP = "group"
    constants_mod.ChatType.SUPERGROUP = "supergroup"
    constants_mod.ChatType.CHANNEL = "channel"
    constants_mod.ChatType.PRIVATE = "private"
    sys.modules["telegram"] = telegram_mod
    sys.modules["telegram.ext"] = telegram_mod.ext
    sys.modules["telegram.constants"] = constants_mod
    sys.modules["telegram.request"] = telegram_mod.request
    # Force reimport so the adapter picks up the mock ChatType.
    sys.modules.pop("gateway.platforms.telegram", None)
 _ensure_telegram_mock()
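
Review note: the reason for registering a separate `constants` mock can be shown with a small standalone sketch. The package names `fakechat` and `fakechat_shared` are hypothetical and exist only for this illustration. When a submodule name in `sys.modules` points at the same `MagicMock` as the parent package, `from pkg.constants import ChatType` reads `ChatType` from the top-level mock, which is an auto-created `MagicMock` and not the configured `pkg.constants.ChatType`.

```python
import sys
from unittest.mock import MagicMock

# Broken variant: one mock registered under both names.
shared = MagicMock()
shared.constants.ChatType.GROUP = "group"
sys.modules["fakechat_shared"] = shared
sys.modules["fakechat_shared.constants"] = shared

from fakechat_shared.constants import ChatType as SharedChatType

# SharedChatType is shared.ChatType (auto-generated), so GROUP is a MagicMock.
shared_group = SharedChatType.GROUP

# Fixed variant: a dedicated mock for the submodule with string-valued members.
pkg = MagicMock()
constants = MagicMock()
constants.ChatType.GROUP = "group"
sys.modules["fakechat"] = pkg
sys.modules["fakechat.constants"] = constants

from fakechat.constants import ChatType

group_value = ChatType.GROUP
```

Any code that compares `ChatType.GROUP` against a plain string only behaves as intended with the second layout.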
--- a/tests/gateway/test_google_chat.py
+++ b/tests/gateway/test_google_chat.py
@ -22,6 +22,11 @@ import pytest
 from gateway.config import Platform, PlatformConfig, load_gateway_config
 # Platform uses _missing_() for dynamic members, so "google_chat" is
 # resolvable via Platform("google_chat") even without a static
 # GOOGLE_CHAT attribute on the enum class.
 _GC = Platform("google_chat")
 # ---------------------------------------------------------------------------
 # Mock the google-* packages if they are not installed
@ -229,7 +234,7 @@ def _make_chat_envelope(text="hello", sender_email="u@example.com", sender_type=
 class TestPlatformRegistration:
    def test_enum_value(self):
-        assert Platform.GOOGLE_CHAT.value == "google_chat"
+        assert _GC.value == "google_chat"
    def test_requirements_check_returns_true_when_available(self):
        # The shim flag is True in this test module.
@ -266,14 +271,14 @@ class TestEnvConfigLoading:
        monkeypatch.setenv("GOOGLE_CHAT_PROJECT_ID", "p")
        # No subscription.
        cfg = load_gateway_config()
-        assert Platform.GOOGLE_CHAT not in cfg.platforms
+        assert _GC not in cfg.platforms
    def test_missing_project_does_not_enable(self, monkeypatch):
        self._clean_env(monkeypatch)
        monkeypatch.setenv("GOOGLE_CHAT_SUBSCRIPTION_NAME",
                           "projects/p/subscriptions/s")
        cfg = load_gateway_config()
-        assert Platform.GOOGLE_CHAT not in cfg.platforms
+        assert _GC not in cfg.platforms
@ -2583,7 +2588,7 @@ class TestAuthorizationEmailMatch:
        runner.pairing_store.is_approved = MagicMock(return_value=False)
        source = SessionSource(
-            platform=Platform.GOOGLE_CHAT,
+            platform=_GC,
            chat_id="spaces/S",
            chat_type="dm",
            user_id="alice@example.com",       # post-swap: email is canonical
@ -2604,7 +2609,7 @@ class TestAuthorizationEmailMatch:
        runner.pairing_store.is_approved = MagicMock(return_value=False)
        source = SessionSource(
-            platform=Platform.GOOGLE_CHAT,
+            platform=_GC,
            chat_id="spaces/S",
            chat_type="dm",
            user_id="bob@example.com",
@ -2630,7 +2635,7 @@ class TestAuthorizationEmailMatch:
        runner.pairing_store.is_approved = MagicMock(return_value=False)
        source = SessionSource(
-            platform=Platform.GOOGLE_CHAT,
+            platform=_GC,
            chat_id="spaces/S",
            chat_type="dm",
            user_id="users/77777",  # no email available — resource name wins
--- a/tests/hermes_cli/test_pty_bridge.py
+++ b/tests/hermes_cli/test_pty_bridge.py
@ -7,6 +7,7 @@ printf) to verify it behaves like a PTY you can read/write/resize/close.
 from __future__ import annotations
 import os
 import shutil
 import sys
 import time
@ -66,7 +67,7 @@ class TestPtyBridgeIO:
    def test_write_sends_to_child_stdin(self):
        # `cat` with no args echoes stdin back to stdout.  We write a line,
        # read it back, then signal EOF to let cat exit cleanly.
-        bridge = PtyBridge.spawn(["/bin/cat"])
+        bridge = PtyBridge.spawn([shutil.which("cat") or "cat"])
        try:
            bridge.write(b"hello-pty\n")
            output = _read_until(bridge, b"hello-pty")
--- a/tests/plugins/test_achievements_plugin.py
+++ b/tests/plugins/test_achievements_plugin.py
@ -62,8 +62,9 @@ def plugin_api(tmp_path, monkeypatch):
 class _FakeSessionDB:
    """Stand-in for hermes_state.SessionDB that records scan calls."""
-    def __init__(self, session_count: int):
+    def __init__(self, session_count: int, scan_delay: float = 0):
        self.session_count = session_count
        self.scan_delay = scan_delay
        self.last_limit: Optional[int] = None
        self.last_include_children: Optional[bool] = None
        self.list_calls = 0
@ -78,6 +79,8 @@ class _FakeSessionDB:
        include_children: bool = False,
        project_compression_tips: bool = True,
    ) -> List[Dict[str, Any]]:
        if self.scan_delay:
            time.sleep(self.scan_delay)
        self.last_limit = limit
        self.last_include_children = include_children
        self.list_calls += 1
@ -225,10 +228,8 @@ def test_evaluate_all_stale_cache_serves_stale_and_refreshes_in_background(plugi
    the stale data immediately and kicks a background refresh. Users don't
    stare at a loading spinner every time TTL expires.
    """
-    fake_db = _FakeSessionDB(session_count=10)
+    fake_db = _FakeSessionDB(session_count=10, scan_delay=2.0)
    _install_fake_session_db(plugin_api, fake_db)
    # Seed a stale snapshot on disk.
    stale_generated_at = int(time.time()) - plugin_api.SNAPSHOT_TTL_SECONDS - 60
    stale_payload = {
        "achievements": [],
--- a/tests/plugins/web/test_web_search_provider_plugins.py
+++ b/tests/plugins/web/test_web_search_provider_plugins.py
@ -2,8 +2,8 @@
 Covers:
- All seven bundled plugins (brave-free, ddgs, searxng, exa, parallel,
+- All eight bundled plugins (brave-free, ddgs, searxng, exa, parallel,
-  tavily, firecrawl) instantiate and self-report the expected
+  tavily, firecrawl, xai) instantiate and self-report the expected
  capabilities + ABC-derived defaults.
 - Each plugin's ``is_available()`` correctly reflects env-var presence.
 - The web_search_registry resolves an active provider in the documented
@ -47,6 +47,7 @@ def _clear_web_env(monkeypatch: pytest.MonkeyPatch) -> None:
        "FIRECRAWL_GATEWAY_URL",
        "TOOL_GATEWAY_DOMAIN",
        "TOOL_GATEWAY_USER_TOKEN",
        "XAI_API_KEY",
    ):
        monkeypatch.delenv(k, raising=False)
@ -70,7 +71,7 @@ def _isolate_env(monkeypatch: pytest.MonkeyPatch) -> None:
 class TestBundledPluginsRegister:
-    """All seven bundled web plugins discover and register correctly."""
+    """All eight bundled web plugins discover and register correctly."""
    def test_all_seven_plugins_present_in_registry(self) -> None:
        _ensure_plugins_loaded()
@ -85,6 +86,7 @@ class TestBundledPluginsRegister:
            "parallel",
            "searxng",
            "tavily",
            "xai",
        ]
    @pytest.mark.parametrize(
@ -100,6 +102,8 @@ class TestBundledPluginsRegister:
            # disabled in the migration (fell through to a legacy inline
            # path); the follow-up commit enabled it natively.
            ("firecrawl", True, True, True),
            # xai: search-only via Grok's agentic web_search tool.
            ("xai", True, False, False),
        ],
    )
    def test_capability_flags_match_spec(
@ -120,7 +124,7 @@ class TestBundledPluginsRegister:
    @pytest.mark.parametrize(
        "plugin_name",
-        ["brave-free", "ddgs", "searxng", "exa", "parallel", "tavily", "firecrawl"],
+        ["brave-free", "ddgs", "searxng", "exa", "parallel", "tavily", "firecrawl", "xai"],
    )
    def test_each_plugin_has_name_and_display_name(self, plugin_name: str) -> None:
        _ensure_plugins_loaded()
@ -133,7 +137,7 @@ class TestBundledPluginsRegister:
    @pytest.mark.parametrize(
        "plugin_name",
-        ["brave-free", "ddgs", "searxng", "exa", "parallel", "tavily", "firecrawl"],
+        ["brave-free", "ddgs", "searxng", "exa", "parallel", "tavily", "firecrawl", "xai"],
    )
    def test_each_plugin_has_setup_schema(self, plugin_name: str) -> None:
        """``get_setup_schema()`` returns a dict the picker can consume."""
@ -239,6 +243,17 @@ class TestIsAvailable:
        # Truthy or falsy, just must not raise.
        _ = bool(p.is_available())
    def test_xai_requires_api_key_or_oauth(self, monkeypatch: pytest.MonkeyPatch) -> None:
        """xAI needs XAI_API_KEY or OAuth tokens in auth.json."""
        _ensure_plugins_loaded()
        from agent.web_search_registry import get_provider
        p = get_provider("xai")
        assert p is not None
        assert p.is_available() is False  # no XAI_API_KEY, no auth.json
        monkeypatch.setenv("XAI_API_KEY", "real")
        assert p.is_available() is True
 # ---------------------------------------------------------------------------
 # Registry resolution semantics (Option B — conservative smart fallback)
@ -455,7 +470,7 @@ class TestErrorResponseShapes:
        if result["results"]:
            assert "error" in result["results"][0]
-    def test_firecrawl_crawl_returns_error_dict_when_unconfigured(self) -> None:
+    def test_firecrawl_crawl_returns_error_dict_when_unconfigured(self):
        """firecrawl crawl is async (wraps SDK in to_thread); error must be
        surfaced via the per-page result shape, not raised."""
        _ensure_plugins_loaded()
@ -473,3 +488,15 @@ class TestErrorResponseShapes:
        assert len(result["results"]) >= 1
        assert "error" in result["results"][0]
        assert result["results"][0]["url"] == "https://example.com"
    def test_xai_search_returns_error_dict_when_unconfigured(self) -> None:
        """xAI returns a typed error dict (no XAI_API_KEY)."""
        _ensure_plugins_loaded()
        from agent.web_search_registry import get_provider
        p = get_provider("xai")
        assert p is not None
        result = p.search("test", limit=5)
        assert isinstance(result, dict)
        assert result.get("success") is False
        assert "error" in result
--- a/tests/test_run_tests_parallel.py
+++ b/tests/test_run_tests_parallel.py
@ -0,0 +1,187 @@
 """Verify scripts/run_tests_parallel.py kills test-spawned grandchildren.
 Setup
 -----
 A test in this file spawns a long-lived Python grandchild that writes
 its PID + a nonce to a tempfile, then exits without cleaning up.
 With the old ``subprocess.run`` runner, that grandchild would orphan
 and outlive the test (and the whole runner). With the current Popen +
 ``start_new_session`` + ``_kill_tree`` runner, the grandchild gets
 SIGKILL'd via process-group kill when its file's pytest exits.
 The leaker test always passes — its only job is to spawn a grandchild
 and walk away. The verifier runs the runner over the leaker file in a
 subprocess, then waits for the grandchild PID to disappear from the
 kernel's process table.
 POSIX-only: Windows has its own grandchild lifecycle (no shared session,
 ``taskkill /F /T`` semantics). Marked accordingly.
 """
 from __future__ import annotations
 import json
 import os
 import subprocess
 import sys
 import textwrap
 import time
 from pathlib import Path
 import pytest
 # Both tests share the same handoff file: the leaker writes here, the
 # verifier reads here. We park it in $TMPDIR with a unique-per-run name
 # so concurrent invocations of the suite don't clobber each other.
 _HANDOFF_DIR = Path(os.environ.get("TMPDIR", "/tmp")) / "hermes-isolation-probe"
 _HANDOFF_DIR.mkdir(exist_ok=True)
 def _handoff_path_for(nonce: str) -> Path:
    return _HANDOFF_DIR / f"grandchild-{nonce}.json"
 def _pid_alive(pid: int) -> bool:
    """POSIX: send signal 0 to probe whether ``pid`` is still alive.
    ``os.kill(pid, 0)`` raises ``ProcessLookupError`` if the process is
    gone, ``PermissionError`` if it exists but we can't signal it
    (someone else's pid). We treat PermissionError as "alive" because
    the process exists and that's all we need to know.
    """
    if sys.platform == "win32":  # pragma: no cover — POSIX-only test
        # On Windows we'd use OpenProcess + GetExitCodeProcess; this
        # test is skipped on Windows so the path is unreachable.
        raise RuntimeError("_pid_alive POSIX-only")
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True
    return True
@pytest.mark.skipif(sys.platform == "win32", reason="POSIX-only probe")
@pytest.mark.live_system_guard_bypass
 def test_grandchild_leak_is_killed_by_runner(tmp_path: Path) -> None:
    """Run the parallel runner over a probe file and verify cleanup.
    1. Materialize a probe file that spawns a long-lived grandchild and
       writes its PID to disk before exiting.
    2. Invoke ``scripts/run_tests_parallel.py`` against the probe file.
    3. Wait for the grandchild PID to vanish (poll for ~5s).
    4. Assert the runner exited cleanly AND the grandchild is dead.
    """
    repo_root = Path(__file__).resolve().parent.parent
    runner = repo_root / "scripts" / "run_tests_parallel.py"
    assert runner.exists(), f"runner missing at {runner}"
    # Probe lives in a temp dir, NOT under tests/, so the regular suite
    # never picks it up — only our explicit invocation does.
    probe_dir = tmp_path / "probe"
    probe_dir.mkdir()
    probe = probe_dir / "test_probe_leaker.py"
    nonce = f"{os.getpid()}-{int(time.time() * 1000)}"
    handoff = _handoff_path_for(nonce)
    if handoff.exists():
        handoff.unlink()
    probe_src = textwrap.dedent(f"""
        import json, os, subprocess, sys, time
        from pathlib import Path
        HANDOFF = Path({str(handoff)!r})
        def test_spawns_grandchild_and_walks_away():
            # Long-lived grandchild: detached, ignores SIGTERM (we want
            # SIGKILL or process-group kill to be the only thing that
            # works, simulating a misbehaving server).
            child = subprocess.Popen(
                [
                    sys.executable, "-c",
                    "import os, signal, sys, time; "
                    "signal.signal(signal.SIGTERM, signal.SIG_IGN); "
                    "sys.stdout.write(f'gc-pgid={{os.getpgid(0)}} gc-pid={{os.getpid()}}\\\\n'); "
                    "sys.stdout.flush(); "
                    "time.sleep(600)",
                ],
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                # IMPORTANT: do NOT pass start_new_session here. We want
                # the grandchild to inherit the pytest subprocess's
                # process group, so when the runner kills the group the
                # grandchild dies too.
            )
            # Read the first line so we can record gc's pgid in the
            # handoff, then walk away — don't close the pipe (would
            # signal EOF and let the child see SIGPIPE on next write).
            first_line = child.stdout.readline().decode().strip()
            HANDOFF.write_text(json.dumps({{
                "pid": child.pid,
                "diag": first_line,
                "test_pid": os.getpid(),
                "test_pgid": os.getpgid(0),
            }}))
            assert child.pid > 0
    """).strip()
    probe.write_text(probe_src + "\n")
    # Run the parallel runner against just the probe file. The runner
    # discovers under ``tests/`` by default, so we override via --paths.
    proc = subprocess.run(
        [
            sys.executable,
            str(runner),
            "--paths",
            str(probe_dir),
            "-j",
            "1",
            # Tight per-file timeout: the probe finishes in <1s, no
            # need for 10min.
            "--file-timeout",
            "30",
        ],
        cwd=repo_root,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        timeout=60,
    )
    assert handoff.exists(), (
        f"probe never wrote handoff file; runner output:\n{proc.stdout}"
    )
    handoff_data = json.loads(handoff.read_text())
    grandchild_pid = handoff_data["pid"]
    diag = handoff_data.get("diag", "(no diag)")
    test_pid = handoff_data.get("test_pid")
    test_pgid = handoff_data.get("test_pgid")
    handoff.unlink()
    # The runner must have exited cleanly (probe test passes).
    assert proc.returncode == 0, (
        f"runner exited {proc.returncode}; output:\n{proc.stdout}"
    )
    # The grandchild must be gone. Poll for a bit because process-group
    # SIGKILL + reaping isn't synchronous; on a loaded box it can take
    # a beat.
    deadline = time.monotonic() + 5.0
    while time.monotonic() < deadline:
        if not _pid_alive(grandchild_pid):
            break
        time.sleep(0.05)
    else:
        # Test cleanup: kill the leaked grandchild ourselves so a
        # FAILED assertion doesn't leave a sleep(600) running.
        try:
            os.kill(grandchild_pid, 9)  # 9 is SIGKILL
        except ProcessLookupError:
            pass
        pytest.fail(
            f"grandchild PID {grandchild_pid} survived runner exit; "
            f"diag={diag!r} test_pid={test_pid} test_pgid={test_pgid}; "
            f"runner output:\n{proc.stdout}"
        )
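
Review note: the process-group mechanism this test relies on can be sketched on its own. This is a POSIX-only illustration under stated assumptions, not the runner's actual `_kill_tree`: the child is started with `start_new_session=True`, so its PID doubles as the process-group ID, and the grandchild inherits that group. Pipe EOF is used as the liveness signal here because it does not depend on some other process reaping the orphaned grandchild.

```python
import os
import signal
import subprocess
import sys

# Grandchild: ignores SIGTERM and sleeps, like a misbehaving server.
GRANDCHILD_SRC = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "time.sleep(600)\n"
)

# Child: spawns the grandchild, reports its PID, then exits without cleanup.
CHILD_SRC = (
    "import subprocess, sys\n"
    f"gc = subprocess.Popen([sys.executable, '-c', {GRANDCHILD_SRC!r}])\n"
    "print(gc.pid, flush=True)\n"
)

child = subprocess.Popen(
    [sys.executable, "-c", CHILD_SRC],
    stdout=subprocess.PIPE,
    text=True,
    start_new_session=True,  # child leads a new session and process group
)
grandchild_pid = int(child.stdout.readline())
child.wait()  # the child is gone; the grandchild is now an orphan in the same group

# Kill every remaining member of the group. SIGKILL cannot be ignored.
os.killpg(child.pid, signal.SIGKILL)

# The grandchild inherited the write end of the pipe, so read() only
# returns (with EOF) once the grandchild has actually terminated.
leftover = child.stdout.read()
child.stdout.close()
```

A plain `child.kill()` would not have reached the grandchild; signalling the group is what makes the cleanup independent of how many descendants the test spawned.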
--- a/tests/tools/conftest.py
+++ b/tests/tools/conftest.py
@ -0,0 +1,50 @@
 """Shared fixtures for tests/tools/ web-provider tests.
 Per-file subprocess isolation means each test file gets a fresh interpreter,
 so module-level state (like the web-search-provider registry) is empty when
 a file starts.  The ``web_registry_populated`` fixture registers all bundled
 providers before each test and resets the registry afterwards — tests that
 depend on the registry being populated should use it explicitly or via
 ``@pytest.mark.usefixtures("web_registry_populated")``.
 """
 import pytest
 def register_all_web_providers():
    """Register all bundled web-search providers into the global registry.
    This is the single source of truth for the provider list used by
    test classes that need the registry populated for dispatch checks.
    """
    from agent.web_search_registry import register_provider, _reset_for_tests
    from plugins.web.brave_free.provider import BraveFreeWebSearchProvider
    from plugins.web.ddgs.provider import DDGSWebSearchProvider
    from plugins.web.exa.provider import ExaWebSearchProvider
    from plugins.web.firecrawl.provider import FirecrawlWebSearchProvider
    from plugins.web.parallel.provider import ParallelWebSearchProvider
    from plugins.web.searxng.provider import SearXNGWebSearchProvider
    from plugins.web.tavily.provider import TavilyWebSearchProvider
    from plugins.web.xai.provider import XAIWebSearchProvider
    _reset_for_tests()
    for cls in (
        BraveFreeWebSearchProvider,
        DDGSWebSearchProvider,
        ExaWebSearchProvider,
        FirecrawlWebSearchProvider,
        ParallelWebSearchProvider,
        SearXNGWebSearchProvider,
        TavilyWebSearchProvider,
        XAIWebSearchProvider,
    ):
        register_provider(cls())
@pytest.fixture
 def web_registry_populated():
    """Populate the web-search-provider registry for one test, then reset."""
    register_all_web_providers()
    yield
    from agent.web_search_registry import _reset_for_tests
    _reset_for_tests()
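
Review note: the populate-then-reset shape of `web_registry_populated` can be illustrated without pytest. Everything below is a hypothetical stand-in (`_REGISTRY`, `FakeProvider`, `registry_populated`), not the real `agent.web_search_registry` API. A context manager needs an explicit `try`/`finally` to guarantee the reset, whereas a pytest yield fixture runs the code after `yield` even when the test fails.

```python
from contextlib import contextmanager

_REGISTRY = {}  # hypothetical module-level registry, empty in a fresh interpreter


class FakeProvider:
    """Stand-in for a provider class; only carries a name."""

    def __init__(self, name: str) -> None:
        self.name = name


def register_provider(provider: FakeProvider) -> None:
    _REGISTRY[provider.name] = provider


def reset_for_tests() -> None:
    _REGISTRY.clear()


@contextmanager
def registry_populated(names):
    """Start from a clean registry, register the providers, and always reset afterwards."""
    reset_for_tests()
    for name in names:
        register_provider(FakeProvider(name))
    try:
        yield _REGISTRY
    finally:
        reset_for_tests()


with registry_populated(["ddgs", "xai"]) as reg:
    during = sorted(reg)  # names visible while the block is active

after = sorted(_REGISTRY)  # registry is empty again once the block exits
```

Resetting before registering keeps the helper idempotent, so a test that calls it twice does not see duplicate or stale providers.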
--- a/tests/tools/test_approval_plugin_hooks.py
+++ b/tests/tools/test_approval_plugin_hooks.py
@ -22,18 +22,28 @@ from tools.approval import (
@pytest.fixture
-def isolated_session(monkeypatch):
+def isolated_session(monkeypatch, tmp_path):
-    """Give each test a fresh session_key and clean approval-state."""
+    """Give each test a fresh session_key, clean approval-state, and isolated
    HERMES_HOME so the real user's command_allowlist doesn't leak in."""
    import tools.approval as _am
    session_key = "test:session:approval_hooks"
    token = set_current_session_key(session_key)
    monkeypatch.setenv("HERMES_SESSION_KEY", session_key)
    # Make sure we don't skip guards via yolo / approvals.mode=off
    monkeypatch.delenv("HERMES_YOLO_MODE", raising=False)
    # Isolate from the real user's permanent allowlist + session state
    _saved_permanent = _am._permanent_approved.copy()
    _saved_session = {k: v.copy() for k, v in _am._session_approved.items()}
    _am._permanent_approved.clear()
    _am._session_approved.clear()
    try:
        yield session_key
    finally:
        _am._permanent_approved.update(_saved_permanent)
        _am._session_approved.update(_saved_session)
        try:
-            approval_module._approval_session_key.reset(token)
+            _am._approval_session_key.reset(token)
        except Exception:
            pass
        clear_session(session_key)
--- a/tests/tools/test_browser_supervisor.py
+++ b/tests/tools/test_browser_supervisor.py
@ -41,7 +41,7 @@ def _find_chrome() -> str:
@pytest.fixture
-def chrome_cdp(worker_id):
+def chrome_cdp(request):
    """Start a headless Chrome with --remote-debugging-port, yield its WS URL.
    Uses a unique port per xdist worker to avoid cross-worker collisions.
@ -51,6 +51,9 @@ def chrome_cdp(worker_id):
    import socket
    # xdist worker_id is "master" in single-process mode or "gw0".."gwN" otherwise.
    # Under subprocess-per-file isolation there's no xdist, so we fall back
    # to "master" via the session-scoped fixture below.
    worker_id = request.getfixturevalue("worker_id") if "worker_id" in request.fixturenames else "master"
    if worker_id == "master":
        port_offset = 0
    else:
--- a/tests/tools/test_discord_tool.py
+++ b/tests/tools/test_discord_tool.py
@ -1089,9 +1089,17 @@ class Test403Enrichment:
 class TestModelToolsIntegration:
    def setup_method(self):
        _reset_capability_cache()
        from model_tools import _clear_tool_defs_cache
        from tools.registry import invalidate_check_fn_cache
        _clear_tool_defs_cache()
        invalidate_check_fn_cache()
    def teardown_method(self):
        _reset_capability_cache()
        from model_tools import _clear_tool_defs_cache
        from tools.registry import invalidate_check_fn_cache
        _clear_tool_defs_cache()
        invalidate_check_fn_cache()
    @patch("tools.discord_tool._discord_request")
    def test_discord_admin_schema_rebuilt_by_get_tool_definitions(
--- a/tests/tools/test_homeassistant_tool.py
+++ b/tests/tools/test_homeassistant_tool.py
@ -501,16 +501,18 @@ class TestRegistration:
    def test_check_fn_gates_availability(self, monkeypatch):
        """Registry should exclude HA tools when HASS_TOKEN is not set."""
-        from tools.registry import registry
+        from tools.registry import invalidate_check_fn_cache, registry
        monkeypatch.delenv("HASS_TOKEN", raising=False)
        invalidate_check_fn_cache()
        defs = registry.get_definitions({"ha_list_entities", "ha_get_state", "ha_call_service"})
        assert len(defs) == 0
    def test_check_fn_includes_when_token_set(self, monkeypatch):
        """Registry should include HA tools when HASS_TOKEN is set."""
-        from tools.registry import registry
+        from tools.registry import invalidate_check_fn_cache, registry
        monkeypatch.setenv("HASS_TOKEN", "test-token")
        invalidate_check_fn_cache()
        defs = registry.get_definitions({"ha_list_entities", "ha_get_state", "ha_call_service"})
        assert len(defs) == 3
--- a/tests/tools/test_kanban_tools.py
+++ b/tests/tools/test_kanban_tools.py
@ -1093,6 +1093,11 @@ def test_kanban_guidance_not_in_normal_prompt(monkeypatch, tmp_path):
    from pathlib import Path as _P
    monkeypatch.setattr(_P, "home", lambda: tmp_path)
    from tools.registry import invalidate_check_fn_cache
    from model_tools import _clear_tool_defs_cache
    invalidate_check_fn_cache()
    _clear_tool_defs_cache()
    from run_agent import AIAgent
    a = AIAgent(
        api_key="test",
@ -1116,6 +1121,11 @@ def test_kanban_guidance_in_worker_prompt(monkeypatch, tmp_path):
    from pathlib import Path as _P
    monkeypatch.setattr(_P, "home", lambda: tmp_path)
    from tools.registry import invalidate_check_fn_cache
    from model_tools import _clear_tool_defs_cache
    invalidate_check_fn_cache()
    _clear_tool_defs_cache()
    from run_agent import AIAgent
    a = AIAgent(
        api_key="test",
--- a/tests/tools/test_send_message_tool.py
+++ b/tests/tools/test_send_message_tool.py
@ -10,6 +10,12 @@ from unittest.mock import AsyncMock, MagicMock, patch
 import pytest
 # python-telegram-bot is an optional dep — skip the entire module when
 # it isn't installed (e.g. CI bare env). Tests that patch telegram.Bot
 # or call _send_telegram need it; tests for other platforms don't but
 # keeping the whole file consistent is simpler.
 _HAS_TELEGRAM = pytest.importorskip("telegram", reason="python-telegram-bot not installed") is not None
@pytest.fixture(autouse=True)
 def _reset_signal_scheduler():
--- a/tests/tools/test_terminal_tool_requirements.py
+++ b/tests/tools/test_terminal_tool_requirements.py
@ -2,11 +2,26 @@
 import importlib
 import pytest
 from model_tools import get_tool_definitions
 terminal_tool_module = importlib.import_module("tools.terminal_tool")
@pytest.fixture(autouse=True)
 def _clear_caches():
    """Invalidate check_fn and tool-definitions caches before each test
    so that monkeypatched env vars / config take effect."""
    from tools.registry import invalidate_check_fn_cache
    from model_tools import _clear_tool_defs_cache
    invalidate_check_fn_cache()
    _clear_tool_defs_cache()
    yield
    invalidate_check_fn_cache()
    _clear_tool_defs_cache()
 class TestTerminalRequirements:
    def test_local_backend_requirements(self, monkeypatch):
        monkeypatch.setattr(
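The `_clear_caches` fixture above exists because a cached availability check keeps returning the value computed under the old environment. The sketch below illustrates that effect, assuming a hypothetical `has_token` check cached with `functools.lru_cache` and a hypothetical `DEMO_TOKEN` variable; the project's real `check_fn` cache may be implemented differently.

```python
import functools
import os


@functools.lru_cache(maxsize=None)
def has_token():
    """Hypothetical availability check whose result is cached."""
    return bool(os.environ.get("DEMO_TOKEN"))


os.environ.pop("DEMO_TOKEN", None)
before = has_token()            # computed once and cached while the variable is unset

os.environ["DEMO_TOKEN"] = "x"
stale = has_token()             # cache hit: still reflects the old environment

has_token.cache_clear()         # the step an invalidation fixture performs
fresh = has_token()             # recomputed against the current environment

os.environ.pop("DEMO_TOKEN", None)
```

Clearing the cache both before and after each test, as the fixture does, keeps a value computed in one test from being observed by the next.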
--- a/tests/tools/test_video_generation_tool_surface_matrix.py
+++ b/tests/tools/test_video_generation_tool_surface_matrix.py
@ -95,7 +95,9 @@ def _invoke_tool(home, cfg: dict, args: dict) -> dict:
    if hasattr(cfg_mod, "_invalidate_load_config_cache"):
        cfg_mod._invalidate_load_config_cache()
-    from tools.registry import registry
+    from tools.registry import discover_builtin_tools, registry
    if "video_generate" not in registry._tools:
        discover_builtin_tools()
    handler = registry._tools["video_generate"].handler
    return json.loads(handler(args))
--- a/tests/tools/test_web_providers.py
+++ b/tests/tools/test_web_providers.py
@ -13,6 +13,8 @@ from typing import Any, Dict, List
 import pytest
 from tests.tools.conftest import register_all_web_providers
 # ---------------------------------------------------------------------------
 # ABC enforcement
@ -276,6 +278,15 @@ class TestUnconfiguredErrorEnvelopeParity:
    ``result.get("error")`` detect the failure cleanly.
    """
    _register_providers = staticmethod(register_all_web_providers)
    @pytest.fixture(autouse=True)
    def _populate_web_registry(self):
        self._register_providers()
        yield
        from agent.web_search_registry import _reset_for_tests
        _reset_for_tests()
    def _clear_web_creds(self, monkeypatch):
        for k in (
            "BRAVE_SEARCH_API_KEY",
--- a/tests/tools/test_web_providers_brave_free.py
+++ b/tests/tools/test_web_providers_brave_free.py
@ -15,6 +15,10 @@ from __future__ import annotations
 import json
 from unittest.mock import MagicMock, patch
 import pytest
 from tests.tools.conftest import register_all_web_providers
 # ---------------------------------------------------------------------------
 # BraveFreeWebSearchProvider unit tests
@ -239,6 +243,15 @@ class TestBraveFreeBackendWiring:
 class TestBraveFreeSearchOnlyErrors:
    _register_providers = staticmethod(register_all_web_providers)
    @pytest.fixture(autouse=True)
    def _populate_web_registry(self):
        self._register_providers()
        yield
        from agent.web_search_registry import _reset_for_tests
        _reset_for_tests()
    def test_web_extract_returns_search_only_error(self, monkeypatch):
        import asyncio
        from tools import web_tools
@ -246,6 +259,7 @@ class TestBraveFreeSearchOnlyErrors:
        monkeypatch.setattr(web_tools, "_load_web_config", lambda: {"backend": "brave-free"})
        monkeypatch.setenv("BRAVE_SEARCH_API_KEY", "BSAkey123")
        monkeypatch.setattr(web_tools, "_is_tool_gateway_ready", lambda: False)
        monkeypatch.setattr(web_tools, "is_safe_url", lambda url: True)
        monkeypatch.setattr("tools.interrupt.is_interrupted", lambda: False, raising=False)
        result_str = asyncio.get_event_loop().run_until_complete(
@ -264,6 +278,8 @@ class TestBraveFreeSearchOnlyErrors:
        monkeypatch.setenv("BRAVE_SEARCH_API_KEY", "BSAkey123")
        monkeypatch.setattr(web_tools, "_is_tool_gateway_ready", lambda: False)
        monkeypatch.setattr(web_tools, "check_firecrawl_api_key", lambda: False)
        monkeypatch.setattr(web_tools, "is_safe_url", lambda url: True)
        monkeypatch.setattr(web_tools, "check_website_access", lambda url: None)
        monkeypatch.setattr("tools.interrupt.is_interrupted", lambda: False, raising=False)
        result_str = asyncio.get_event_loop().run_until_complete(
--- a/tests/tools/test_web_providers_ddgs.py
+++ b/tests/tools/test_web_providers_ddgs.py
@ -14,6 +14,10 @@ import sys
 import types
 from unittest.mock import MagicMock
 import pytest
 from tests.tools.conftest import register_all_web_providers
 def _install_fake_ddgs(monkeypatch, *, text_results=None, text_raises=None):
    """Install a stub ``ddgs`` module in sys.modules for the duration of a test.
@ -210,6 +214,15 @@ class TestDDGSBackendWiring:
 class TestDDGSSearchOnlyErrors:
    _register_providers = staticmethod(register_all_web_providers)
    @pytest.fixture(autouse=True)
    def _populate_web_registry(self):
        self._register_providers()
        yield
        from agent.web_search_registry import _reset_for_tests
        _reset_for_tests()
    def test_web_extract_returns_search_only_error(self, monkeypatch):
        import asyncio
        from tools import web_tools
@ -217,6 +230,7 @@ class TestDDGSSearchOnlyErrors:
        monkeypatch.setattr(web_tools, "_load_web_config", lambda: {"backend": "ddgs"})
        monkeypatch.setattr(web_tools, "_ddgs_package_importable", lambda: True)
        monkeypatch.setattr(web_tools, "_is_tool_gateway_ready", lambda: False)
        monkeypatch.setattr(web_tools, "is_safe_url", lambda url: True)
        monkeypatch.setattr("tools.interrupt.is_interrupted", lambda: False, raising=False)
        result_str = asyncio.get_event_loop().run_until_complete(
@ -235,6 +249,8 @@ class TestDDGSSearchOnlyErrors:
        monkeypatch.setattr(web_tools, "_ddgs_package_importable", lambda: True)
        monkeypatch.setattr(web_tools, "_is_tool_gateway_ready", lambda: False)
        monkeypatch.setattr(web_tools, "check_firecrawl_api_key", lambda: False)
        monkeypatch.setattr(web_tools, "is_safe_url", lambda url: True)
        monkeypatch.setattr(web_tools, "check_website_access", lambda url: None)
        monkeypatch.setattr("tools.interrupt.is_interrupted", lambda: False, raising=False)
        result_str = asyncio.get_event_loop().run_until_complete(
--- a/tests/tools/test_web_providers_searxng.py
+++ b/tests/tools/test_web_providers_searxng.py
@ -17,6 +17,8 @@ from unittest.mock import MagicMock, patch
 import pytest
 from tests.tools.conftest import register_all_web_providers
 # ---------------------------------------------------------------------------
 # SearXNGWebSearchProvider unit tests
@ -301,6 +303,15 @@ class TestCheckWebApiKey:
 class TestSearXNGOnlyExtractCrawlErrors:
    """When searxng is the active backend, extract/crawl must return clear errors."""
    _register_providers = staticmethod(register_all_web_providers)
    @pytest.fixture(autouse=True)
    def _populate_web_registry(self):
        self._register_providers()
        yield
        from agent.web_search_registry import _reset_for_tests
        _reset_for_tests()
    def test_web_crawl_searxng_returns_clear_error(self, monkeypatch):
        import asyncio
        from tools import web_tools
@ -309,6 +320,8 @@ class TestSearXNGOnlyExtractCrawlErrors:
        monkeypatch.setenv("SEARXNG_URL", "http://localhost:8080")
        monkeypatch.setattr(web_tools, "_is_tool_gateway_ready", lambda: False)
        monkeypatch.setattr(web_tools, "check_firecrawl_api_key", lambda: False)
        monkeypatch.setattr(web_tools, "is_safe_url", lambda url: True)
        monkeypatch.setattr(web_tools, "check_website_access", lambda url: None)
        monkeypatch.setattr("tools.interrupt.is_interrupted", lambda: False, raising=False)
        import json
@ -326,6 +339,7 @@ class TestSearXNGOnlyExtractCrawlErrors:
        monkeypatch.setattr(web_tools, "_load_web_config", lambda: {"backend": "searxng"})
        monkeypatch.setenv("SEARXNG_URL", "http://localhost:8080")
        monkeypatch.setattr(web_tools, "_is_tool_gateway_ready", lambda: False)
        monkeypatch.setattr(web_tools, "is_safe_url", lambda url: True)
        monkeypatch.setattr("tools.interrupt.is_interrupted", lambda: False, raising=False)
        import json
--- a/tests/tools/test_web_tools_tavily.py
+++ b/tests/tools/test_web_tools_tavily.py
@ -13,6 +13,8 @@ import asyncio
 import pytest
 from unittest.mock import patch, MagicMock
 from tests.tools.conftest import register_all_web_providers
 # ─── _tavily_request ─────────────────────────────────────────────────────────
@ -163,6 +165,15 @@ class TestNormalizeTavilyDocuments:
 class TestWebSearchTavily:
    """Test web_search_tool dispatch to Tavily."""
    _register_providers = staticmethod(register_all_web_providers)
    @pytest.fixture(autouse=True)
    def _populate_web_registry(self):
        self._register_providers()
        yield
        from agent.web_search_registry import _reset_for_tests
        _reset_for_tests()
    def test_search_dispatches_to_tavily(self):
        mock_response = MagicMock()
        mock_response.json.return_value = {
@ -186,6 +197,15 @@ class TestWebSearchTavily:
 class TestWebExtractTavily:
    """Test web_extract_tool dispatch to Tavily."""
    _register_providers = staticmethod(register_all_web_providers)
    @pytest.fixture(autouse=True)
    def _populate_web_registry(self):
        self._register_providers()
        yield
        from agent.web_search_registry import _reset_for_tests
        _reset_for_tests()
    def test_extract_dispatches_to_tavily(self):
        mock_response = MagicMock()
        mock_response.json.return_value = {
@ -211,6 +231,15 @@ class TestWebExtractTavily:
 class TestWebCrawlTavily:
    """Test web_crawl_tool dispatch to Tavily."""
    _register_providers = staticmethod(register_all_web_providers)
    @pytest.fixture(autouse=True)
    def _populate_web_registry(self):
        self._register_providers()
        yield
        from agent.web_search_registry import _reset_for_tests
        _reset_for_tests()
    def test_crawl_dispatches_to_tavily(self):
        mock_response = MagicMock()
        mock_response.json.return_value = {
--- a/tests/tools/test_website_policy.py
+++ b/tests/tools/test_website_policy.py
@ -4,6 +4,8 @@ from pathlib import Path
 import pytest
 import yaml
 from tests.tools.conftest import register_all_web_providers
 from tools.website_policy import WebsitePolicyError, check_website_access, load_website_blocklist
@ -347,40 +349,191 @@ def test_browser_navigate_allows_when_shared_file_missing(monkeypatch, tmp_path)
    assert result is None
-@pytest.mark.asyncio
-async def test_web_extract_short_circuits_blocked_url(monkeypatch):
-    from tools import web_tools
-    from plugins.web.firecrawl import provider as firecrawl_provider
-    # Allow test URLs past SSRF check so website policy is what gets tested
-    monkeypatch.setattr(web_tools, "is_safe_url", lambda url: True)
-    # The per-URL website-policy gate moved into the firecrawl plugin's
-    # extract() during the web-provider migration. Patch it at the new
-    # location; the dispatcher-level gate (used by web_crawl_tool's
-    # pre-flight) still lives on tools.web_tools.
-    monkeypatch.setattr(
-        firecrawl_provider,
-        "check_website_access",
-        lambda url: {
-            "host": "blocked.test",
-            "rule": "blocked.test",
-            "source": "config",
-            "message": "Blocked by website policy",
-        },
-    )
-    monkeypatch.setattr(
-        firecrawl_provider,
-        "_get_firecrawl_client",
-        lambda: pytest.fail("firecrawl should not run for blocked URL"),
-    )
-    monkeypatch.setattr("tools.interrupt.is_interrupted", lambda: False)
-    # Force the firecrawl plugin to be the active extract provider.
-    monkeypatch.setenv("FIRECRAWL_API_KEY", "fake-key")
-    result = json.loads(await web_tools.web_extract_tool(["https://blocked.test"], use_llm_processing=False))
-    assert result["results"][0]["url"] == "https://blocked.test"
-    assert "Blocked by website policy" in result["results"][0]["error"]
+class TestWebToolPolicy:
+    """Tests that exercise web_extract_tool / web_crawl_tool with website-policy gates.
+    These tests need the bundled web providers to be registered in the
+    agent.web_search_registry so the tool dispatchers can find an active
+    provider.  Without registration, the tools return an error dict that
+    lacks a ``results`` key, causing ``KeyError``.
+    """
+    _register_providers = staticmethod(register_all_web_providers)
+    @pytest.fixture(autouse=True)
+    def _populate_web_registry(self):
+        self._register_providers()
+        yield
+        from agent.web_search_registry import _reset_for_tests
+        _reset_for_tests()
    @pytest.mark.asyncio
    async def test_web_extract_short_circuits_blocked_url(self, monkeypatch):
        from tools import web_tools
        from plugins.web.firecrawl import provider as firecrawl_provider
        # Allow test URLs past SSRF check so website policy is what gets tested
        monkeypatch.setattr(web_tools, "is_safe_url", lambda url: True)
        # The per-URL website-policy gate moved into the firecrawl plugin's
        # extract() during the web-provider migration. Patch it at the new
        # location; the dispatcher-level gate (used by web_crawl_tool's
        # pre-flight) still lives on tools.web_tools.
        monkeypatch.setattr(
            firecrawl_provider,
            "check_website_access",
            lambda url: {
                "host": "blocked.test",
                "rule": "blocked.test",
                "source": "config",
                "message": "Blocked by website policy",
            },
        )
        monkeypatch.setattr(
            firecrawl_provider,
            "_get_firecrawl_client",
            lambda: pytest.fail("firecrawl should not run for blocked URL"),
        )
        monkeypatch.setattr("tools.interrupt.is_interrupted", lambda: False)
        # Force the firecrawl plugin to be the active extract provider.
        monkeypatch.setenv("FIRECRAWL_API_KEY", "fake-key")
        result = json.loads(await web_tools.web_extract_tool(["https://blocked.test"], use_llm_processing=False))
        assert result["results"][0]["url"] == "https://blocked.test"
        assert "Blocked by website policy" in result["results"][0]["error"]
    @pytest.mark.asyncio
    async def test_web_extract_blocks_redirected_final_url(self, monkeypatch):
        from tools import web_tools
        from plugins.web.firecrawl import provider as firecrawl_provider
        # Allow test URLs past SSRF check so website policy is what gets tested
        monkeypatch.setattr(web_tools, "is_safe_url", lambda url: True)
        def fake_check(url):
            if url == "https://allowed.test":
                return None
            if url == "https://blocked.test/final":
                return {
                    "host": "blocked.test",
                    "rule": "blocked.test",
                    "source": "config",
                    "message": "Blocked by website policy",
                }
            pytest.fail(f"unexpected URL checked: {url}")
        class FakeFirecrawlClient:
            def scrape(self, url, formats):
                return {
                    "markdown": "secret content",
                    "metadata": {
                        "title": "Redirected",
                        "sourceURL": "https://blocked.test/final",
                    },
                }
        # After the web-provider migration, the per-URL gate + firecrawl client
        # live in the plugin. Patch both at the plugin location.
        monkeypatch.setattr(firecrawl_provider, "check_website_access", fake_check)
        monkeypatch.setattr(firecrawl_provider, "_get_firecrawl_client", lambda: FakeFirecrawlClient())
        monkeypatch.setattr("tools.interrupt.is_interrupted", lambda: False)
        monkeypatch.setenv("FIRECRAWL_API_KEY", "fake-key")
        result = json.loads(await web_tools.web_extract_tool(["https://allowed.test"], use_llm_processing=False))
        assert result["results"][0]["url"] == "https://blocked.test/final"
        assert result["results"][0]["content"] == ""
        assert result["results"][0]["blocked_by_policy"]["rule"] == "blocked.test"
    @pytest.mark.asyncio
    async def test_web_crawl_short_circuits_blocked_url(self, monkeypatch):
        from tools import web_tools
        # web_crawl_tool checks for Firecrawl env before website policy
        monkeypatch.setenv("FIRECRAWL_API_KEY", "fake-key")
        # Allow test URLs past SSRF check so website policy is what gets tested
        monkeypatch.setattr(web_tools, "is_safe_url", lambda url: True)
        # The dispatcher-level (seed-URL) policy gate still lives on web_tools.
        # No per-page gate runs in this test because the dispatcher returns
        # immediately when the seed is blocked, before delegating to the plugin.
        monkeypatch.setattr(
            web_tools,
            "check_website_access",
            lambda url: {
                "host": "blocked.test",
                "rule": "blocked.test",
                "source": "config",
                "message": "Blocked by website policy",
            },
        )
        # If the dispatcher ever reaches the firecrawl plugin's crawl(), the test
        # fails — pin the plugin module's client lookup so we'd notice.
        from plugins.web.firecrawl import provider as firecrawl_provider
        monkeypatch.setattr(
            firecrawl_provider,
            "_get_firecrawl_client",
            lambda: pytest.fail("firecrawl plugin should not run for blocked crawl URL"),
        )
        monkeypatch.setattr("tools.interrupt.is_interrupted", lambda: False)
        result = json.loads(await web_tools.web_crawl_tool("https://blocked.test", use_llm_processing=False))
        assert result["results"][0]["url"] == "https://blocked.test"
        assert result["results"][0]["blocked_by_policy"]["rule"] == "blocked.test"
    @pytest.mark.asyncio
    async def test_web_crawl_blocks_redirected_final_url(self, monkeypatch):
        from tools import web_tools
        from plugins.web.firecrawl import provider as firecrawl_provider
        # Force the firecrawl plugin to be the active crawl provider.
        monkeypatch.setenv("FIRECRAWL_API_KEY", "fake-key")
        # Allow test URLs past SSRF check so website policy is what gets tested
        monkeypatch.setattr(web_tools, "is_safe_url", lambda url: True)
        def fake_check(url):
            # Dispatcher seed-URL gate (web_tools.check_website_access call)
            # and plugin per-page gate (firecrawl_provider.check_website_access
            # call) both flow through this single fake_check.
            if url == "https://allowed.test":
                return None
            if url == "https://blocked.test/final":
                return {
                    "host": "blocked.test",
                    "rule": "blocked.test",
                    "source": "config",
                    "message": "Blocked by website policy",
                }
            pytest.fail(f"unexpected URL checked: {url}")
        class FakeCrawlClient:
            def crawl(self, url, **kwargs):
                return {
                    "data": [
                        {
                            "markdown": "secret crawl content",
                            "metadata": {
                                "title": "Redirected crawl page",
                                "sourceURL": "https://blocked.test/final",
                            },
                        }
                    ]
                }
        # After PR #25182 follow-up: per-page policy gate lives in
        # plugins.web.firecrawl.provider.crawl(). Patch the gate + client at
        # the plugin location. The dispatcher-level (seed) gate also reads
        # web_tools.check_website_access — patch both.
        monkeypatch.setattr(web_tools, "check_website_access", fake_check)
        monkeypatch.setattr(firecrawl_provider, "check_website_access", fake_check)
        monkeypatch.setattr(firecrawl_provider, "_get_firecrawl_client", lambda: FakeCrawlClient())
        monkeypatch.setattr("tools.interrupt.is_interrupted", lambda: False)
        result = json.loads(await web_tools.web_crawl_tool("https://allowed.test", use_llm_processing=False))
        assert result["results"][0]["content"] == ""
        assert result["results"][0]["error"] == "Blocked by website policy"
        assert result["results"][0]["blocked_by_policy"]["rule"] == "blocked.test"
 def test_check_website_access_fails_open_on_malformed_config(tmp_path, monkeypatch):
@ -400,139 +553,3 @@ def test_check_website_access_fails_open_on_malformed_config(tmp_path, monkeypat
    # With default path, errors are caught and fail open
    result = check_website_access("https://example.com")
    assert result is None  # allowed, not crashed
@pytest.mark.asyncio
 async def test_web_extract_blocks_redirected_final_url(monkeypatch):
    from tools import web_tools
    from plugins.web.firecrawl import provider as firecrawl_provider
    # Allow test URLs past SSRF check so website policy is what gets tested
    monkeypatch.setattr(web_tools, "is_safe_url", lambda url: True)
    def fake_check(url):
        if url == "https://allowed.test":
            return None
        if url == "https://blocked.test/final":
            return {
                "host": "blocked.test",
                "rule": "blocked.test",
                "source": "config",
                "message": "Blocked by website policy",
            }
        pytest.fail(f"unexpected URL checked: {url}")
    class FakeFirecrawlClient:
        def scrape(self, url, formats):
            return {
                "markdown": "secret content",
                "metadata": {
                    "title": "Redirected",
                    "sourceURL": "https://blocked.test/final",
                },
            }
    # After the web-provider migration, the per-URL gate + firecrawl client
    # live in the plugin. Patch both at the plugin location.
    monkeypatch.setattr(firecrawl_provider, "check_website_access", fake_check)
    monkeypatch.setattr(firecrawl_provider, "_get_firecrawl_client", lambda: FakeFirecrawlClient())
    monkeypatch.setattr("tools.interrupt.is_interrupted", lambda: False)
    monkeypatch.setenv("FIRECRAWL_API_KEY", "fake-key")
    result = json.loads(await web_tools.web_extract_tool(["https://allowed.test"], use_llm_processing=False))
    assert result["results"][0]["url"] == "https://blocked.test/final"
    assert result["results"][0]["content"] == ""
    assert result["results"][0]["blocked_by_policy"]["rule"] == "blocked.test"
@pytest.mark.asyncio
 async def test_web_crawl_short_circuits_blocked_url(monkeypatch):
    from tools import web_tools
    # web_crawl_tool checks for Firecrawl env before website policy
    monkeypatch.setenv("FIRECRAWL_API_KEY", "fake-key")
    # Allow test URLs past SSRF check so website policy is what gets tested
    monkeypatch.setattr(web_tools, "is_safe_url", lambda url: True)
    # The dispatcher-level (seed-URL) policy gate still lives on web_tools.
    # No per-page gate runs in this test because the dispatcher returns
    # immediately when the seed is blocked, before delegating to the plugin.
    monkeypatch.setattr(
        web_tools,
        "check_website_access",
        lambda url: {
            "host": "blocked.test",
            "rule": "blocked.test",
            "source": "config",
            "message": "Blocked by website policy",
        },
    )
    # If the dispatcher ever reaches the firecrawl plugin's crawl(), the test
    # fails — pin the plugin module's client lookup so we'd notice.
    from plugins.web.firecrawl import provider as firecrawl_provider
    monkeypatch.setattr(
        firecrawl_provider,
        "_get_firecrawl_client",
        lambda: pytest.fail("firecrawl plugin should not run for blocked crawl URL"),
    )
    monkeypatch.setattr("tools.interrupt.is_interrupted", lambda: False)
    result = json.loads(await web_tools.web_crawl_tool("https://blocked.test", use_llm_processing=False))
    assert result["results"][0]["url"] == "https://blocked.test"
    assert result["results"][0]["blocked_by_policy"]["rule"] == "blocked.test"
@pytest.mark.asyncio
 async def test_web_crawl_blocks_redirected_final_url(monkeypatch):
    from tools import web_tools
    from plugins.web.firecrawl import provider as firecrawl_provider
    # Force the firecrawl plugin to be the active crawl provider.
    monkeypatch.setenv("FIRECRAWL_API_KEY", "fake-key")
    # Allow test URLs past SSRF check so website policy is what gets tested
    monkeypatch.setattr(web_tools, "is_safe_url", lambda url: True)
    def fake_check(url):
        # Dispatcher seed-URL gate (web_tools.check_website_access call)
        # and plugin per-page gate (firecrawl_provider.check_website_access
        # call) both flow through this single fake_check.
        if url == "https://allowed.test":
            return None
        if url == "https://blocked.test/final":
            return {
                "host": "blocked.test",
                "rule": "blocked.test",
                "source": "config",
                "message": "Blocked by website policy",
            }
        pytest.fail(f"unexpected URL checked: {url}")
    class FakeCrawlClient:
        def crawl(self, url, **kwargs):
            return {
                "data": [
                    {
                        "markdown": "secret crawl content",
                        "metadata": {
                            "title": "Redirected crawl page",
                            "sourceURL": "https://blocked.test/final",
                        },
                    }
                ]
            }
    # After PR #25182 follow-up: per-page policy gate lives in
    # plugins.web.firecrawl.provider.crawl(). Patch the gate + client at
    # the plugin location. The dispatcher-level (seed) gate also reads
    # web_tools.check_website_access — patch both.
    monkeypatch.setattr(web_tools, "check_website_access", fake_check)
    monkeypatch.setattr(firecrawl_provider, "check_website_access", fake_check)
    monkeypatch.setattr(firecrawl_provider, "_get_firecrawl_client", lambda: FakeCrawlClient())
    monkeypatch.setattr("tools.interrupt.is_interrupted", lambda: False)
    result = json.loads(await web_tools.web_crawl_tool("https://allowed.test", use_llm_processing=False))
    assert result["results"][0]["content"] == ""
    assert result["results"][0]["error"] == "Blocked by website policy"
    assert result["results"][0]["blocked_by_policy"]["rule"] == "blocked.test"
--- a/tests/tools/test_write_deny.py
+++ b/tests/tools/test_write_deny.py
@ -1,8 +1,10 @@
 """Tests for _is_write_denied() — verifies deny list blocks sensitive paths on all platforms."""
 import os
 import pytest
 from pathlib import Path
 from unittest.mock import patch
 from tools.file_operations import _is_write_denied
@ -97,8 +99,22 @@ class TestWriteDenyPrefixes:
    def test_sudoers_d_prefix(self):
        assert _is_write_denied("/etc/sudoers.d/custom") is True
-    def test_systemd_prefix(self):
+    def test_systemd_prefix(self, tmp_path):
-        assert _is_write_denied("/etc/systemd/system/evil.service") is True
+        # On NixOS, /etc/systemd is a symlink into /nix/store, so
+        # realpath() resolves it to a store path that doesn't match
+        # the /etc/systemd/ prefix.  Build a real directory tree so
+        # realpath is a no-op and prefix matching works.
+        fake_etc = tmp_path / "etc" / "systemd" / "system"
+        fake_etc.mkdir(parents=True)
+        target = str(fake_etc / "evil.service")
+        # Patch the prefix builder to include our tmp_path prefix
+        import agent.file_safety as _fs
+        _orig = _fs.build_write_denied_prefixes
+        _extra_prefix = str(tmp_path / "etc" / "systemd") + os.sep
+        def _patched(home):
+            return _orig(home) + [_extra_prefix]
+        with patch.object(_fs, "build_write_denied_prefixes", _patched):
+            assert _is_write_denied(target) is True
 class TestWriteAllowed:
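Note on the `test_systemd_prefix` change above: the comment describes a deny check that resolves symlinks before matching prefixes, so a symlinked `/etc/systemd` stops matching its own prefix. The sketch below reproduces that failure mode in a temporary directory. It is an illustration only: `is_denied` and `demo` are hypothetical helpers, not the real `_is_write_denied`, and it assumes a platform where `os.symlink` is permitted (POSIX).

```python
import os
import tempfile


def is_denied(path: str, denied_prefixes: list[str]) -> bool:
    """Hypothetical stand-in for a realpath-then-prefix deny check."""
    resolved = os.path.realpath(path)
    return any(resolved.startswith(prefix) for prefix in denied_prefixes)


def demo() -> tuple[bool, bool]:
    with tempfile.TemporaryDirectory() as raw_root:
        # Resolve the temp root first so the only symlink involved is ours.
        root = os.path.realpath(raw_root)

        # A "store" directory plays the role of the /nix/store target.
        store = os.path.join(root, "store", "systemd")
        os.makedirs(os.path.join(store, "system"))

        # etc/systemd is a symlink into the store, as described for NixOS.
        etc = os.path.join(root, "etc")
        os.makedirs(etc)
        os.symlink(store, os.path.join(etc, "systemd"))

        target = os.path.join(etc, "systemd", "system", "evil.service")
        etc_prefix = os.path.join(etc, "systemd") + os.sep
        store_prefix = store + os.sep

        # realpath() rewrites the target into the store, so the etc prefix misses.
        missed = is_denied(target, [etc_prefix])
        # Adding the resolved location as a prefix makes the check match again.
        caught = is_denied(target, [etc_prefix, store_prefix])
        return missed, caught


if __name__ == "__main__":
    print(demo())
```

On a POSIX system this should return `(False, True)`. The test in the diff sidesteps the problem differently: it builds a real (non-symlinked) tree under `tmp_path` and patches the prefix builder to include that tree.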
--- a/uv.lock
+++ b/uv.lock
@@ -1261,15 +1261,6 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/e2/bc/7a34e904a415040ba626948d0b0a36a08cd073f12b13342578a68331be3c/exa_py-2.10.2-py3-none-any.whl", hash = "sha256:ecb2a7581f4b7a8aeb6b434acce1bbc40f92ed1d4126b2aa6029913acd904a47", size = 72248, upload-time = "2026-03-26T20:29:37.306Z" },
 ]
-[[package]]
-name = "execnet"
-version = "2.1.2"
-source = { registry = "https://pypi.org/simple" }
-sdist = { url = "https://files.pythonhosted.org/packages/bf/89/780e11f9588d9e7128a3f87788354c7946a9cbb1401ad38a48c4db9a4f07/execnet-2.1.2.tar.gz", hash = "sha256:63d83bfdd9a23e35b9c6a3261412324f964c2ec8dcd8d3c6916ee9373e0befcd", size = 166622, upload-time = "2025-11-12T09:56:37.75Z" }
-wheels = [
-    { url = "https://files.pythonhosted.org/packages/ab/84/02fc1827e8cdded4aa65baef11296a9bbe595c474f0d6d758af082d849fd/execnet-2.1.2-py3-none-any.whl", hash = "sha256:67fba928dd5a544b783f6056f449e5e3931a5c378b128bc18501f7ea79e296ec", size = 40708, upload-time = "2025-11-12T09:56:36.333Z" },
-]
 [[package]]
 name = "fal-client"
 version = "0.13.1"
@@ -1635,9 +1626,7 @@ all = [
    { name = "ptyprocess", marker = "sys_platform != 'win32'" },
    { name = "pytest" },
    { name = "pytest-asyncio" },
-    { name = "pytest-split" },
    { name = "pytest-timeout" },
-    { name = "pytest-xdist" },
    { name = "pywinpty", marker = "sys_platform == 'win32'" },
    { name = "ruff" },
    { name = "simple-term-menu" },
@@ -1668,9 +1657,7 @@ dev = [
    { name = "mcp" },
    { name = "pytest" },
    { name = "pytest-asyncio" },
-    { name = "pytest-split" },
    { name = "pytest-timeout" },
-    { name = "pytest-xdist" },
    { name = "ruff" },
    { name = "ty" },
 ]
@@ -1863,9 +1850,7 @@ requires-dist = [
    { name = "pyjwt", extras = ["crypto"], specifier = "==2.12.1" },
    { name = "pytest", marker = "extra == 'dev'", specifier = "==9.0.2" },
    { name = "pytest-asyncio", marker = "extra == 'dev'", specifier = "==1.3.0" },
-    { name = "pytest-split", marker = "extra == 'dev'", specifier = "==0.11.0" },
    { name = "pytest-timeout", marker = "extra == 'dev'", specifier = "==2.4.0" },
-    { name = "pytest-xdist", marker = "extra == 'dev'", specifier = "==3.8.0" },
    { name = "python-dotenv", specifier = "==1.2.2" },
    { name = "python-telegram-bot", extras = ["webhooks"], marker = "extra == 'messaging'", specifier = "==22.6" },
    { name = "python-telegram-bot", extras = ["webhooks"], marker = "extra == 'termux'", specifier = "==22.6" },
@@ -3482,18 +3467,6 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/e5/35/f8b19922b6a25bc0880171a2f1a003eaeb93657475193ab516fd87cac9da/pytest_asyncio-1.3.0-py3-none-any.whl", hash = "sha256:611e26147c7f77640e6d0a92a38ed17c3e9848063698d5c93d5aa7aa11cebff5", size = 15075, upload-time = "2025-11-10T16:07:45.537Z" },
 ]
-[[package]]
-name = "pytest-split"
-version = "0.11.0"
-source = { registry = "https://pypi.org/simple" }
-dependencies = [
-    { name = "pytest" },
-]
-sdist = { url = "https://files.pythonhosted.org/packages/2f/16/8af4c5f2ceb3640bb1f78dfdf5c184556b10dfe9369feaaad7ff1c13f329/pytest_split-0.11.0.tar.gz", hash = "sha256:8ebdb29cc72cc962e8eb1ec07db1eeb98ab25e215ed8e3216f6b9fc7ce0ec2b5", size = 13421, upload-time = "2026-02-03T09:14:31.469Z" }
-wheels = [
-    { url = "https://files.pythonhosted.org/packages/ae/a1/d4423657caaa8be9b31e491592b49cebdcfd434d3e74512ce71f6ec39905/pytest_split-0.11.0-py3-none-any.whl", hash = "sha256:899d7c0f5730da91e2daf283860eb73b503259cb416851a65599368849c7f382", size = 11911, upload-time = "2026-02-03T09:14:33.708Z" },
-]
 [[package]]
 name = "pytest-timeout"
 version = "2.4.0"
@@ -3506,19 +3479,6 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/fa/b6/3127540ecdf1464a00e5a01ee60a1b09175f6913f0644ac748494d9c4b21/pytest_timeout-2.4.0-py3-none-any.whl", hash = "sha256:c42667e5cdadb151aeb5b26d114aff6bdf5a907f176a007a30b940d3d865b5c2", size = 14382, upload-time = "2025-05-05T19:44:33.502Z" },
 ]
-[[package]]
-name = "pytest-xdist"
-version = "3.8.0"
-source = { registry = "https://pypi.org/simple" }
-dependencies = [
-    { name = "execnet" },
-    { name = "pytest" },
-]
-sdist = { url = "https://files.pythonhosted.org/packages/78/b4/439b179d1ff526791eb921115fca8e44e596a13efeda518b9d845a619450/pytest_xdist-3.8.0.tar.gz", hash = "sha256:7e578125ec9bc6050861aa93f2d59f1d8d085595d6551c2c90b6f4fad8d3a9f1", size = 88069, upload-time = "2025-07-01T13:30:59.346Z" }
-wheels = [
-    { url = "https://files.pythonhosted.org/packages/ca/31/d4e37e9e550c2b92a9cbc2e4d0b7420a27224968580b5a447f420847c975/pytest_xdist-3.8.0-py3-none-any.whl", hash = "sha256:202ca578cfeb7370784a8c33d6d05bc6e13b4f25b5053c30a152269fd10f0b88", size = 46396, upload-time = "2025-07-01T13:30:56.632Z" },
-]
 [[package]]
 name = "python-dateutil"
 version = "2.9.0.post0"