The compact "<n>|content" gutter from #35368 is now the sole behavior.
Removes the HERMES_READ_GUTTER=padded escape hatch and its env lookup —
no legacy fixed-width path to maintain. Padding was pure token overhead
(~48% more tokens than bare content, ~16% more than compact) with no
measured accuracy gain in the original A/B.
- file_operations.py: drop env lookup + os import; gutter always f"{i}|{line}"
- tests: drop the padded env-override test; compact assertions retained
Relative paths in write_file/patch could resolve against the agent PROCESS cwd
instead of the terminal's working directory. In a git-worktree session with a
stale TERMINAL_CWD='.' (a relative base), early edits silently landed in the
MAIN checkout, verified there, and reported success — while the agent inspected
the worktree and saw nothing, misreading it as the patch tool no-op'ing.
- _resolve_base_dir(): resolution base is now ALWAYS absolute. A relative
TERMINAL_CWD is anchored to the process cwd once, deterministically, instead
of being left to resolve()-time cwd. Live terminal cwd stays authoritative.
- write_file/patch pass the resolved absolute path to the shell FileOps layer
so the tool layer and shell layer can't disagree about which file is edited.
- Responses now report the absolute resolved_path and files_modified, so a
wrong-cwd mismatch is visible on the first call.
- _path_resolution_warning(): emits a _warning when a relative path resolves
OUTSIDE the live terminal cwd (e.g. a worktree session writing into main).
Validation: 11 new unit tests + 43 live E2E assertions (worktree routing,
mid-session cd, V4A patches, divergence warning, absolute paths, consecutive
patches); 466 existing file/path/terminal tests green.
browser_console(expression="document.body") returned the cryptic CDP error
"Object reference chain is too long" instead of a usable result.
With returnByValue=true, Chrome deep-serializes the eval result; for a live
DOM Node/NodeList/Window that serialization overruns CDP's recursion guard
and fails the whole call with a protocol-level error (not a JS exception),
which _browser_eval surfaced raw.
- browser_supervisor.evaluate_runtime: on that specific error, retry once
with returnByValue=false so Chrome returns the node's description string —
the same graceful path already used for document.querySelector() results.
- browser_tool._browser_eval (CLI subprocess fallback): the subprocess can't
retry, so convert the reference-chain error into actionable guidance
(extract a primitive / use JSON.stringify) instead of leaking it raw.
No expression rewriting — normal evals (1+41 -> 42) are untouched.
read_file's gutter used a fixed-width zero/space-padded prefix
(" 1|content"). The padding is pure token overhead: measured with
cl100k on real Hermes source, the padded gutter costs ~48% more tokens
than bare content and ~16% more than a compact "<n>|content" gutter,
because the leading spaces tokenize into extra tokens on every line.
Switched the default to the compact "<n>|content" form. An A/B
(Sonnet 4.6 via OpenRouter, 2 passes, 4-task battery, every claim
verified against ground truth) showed:
- padded : 4/4 PASS both passes
- compact : 4/4 PASS both passes ← keeps line-referencing + patch
- none : 3/4 PASS both passes ← dropping numbers entirely made
the model hand-count lines and answer off-by-one (33 vs 34)
So we keep the line numbers (the model genuinely uses them to reference
lines) but drop the wasteful padding — capturing ~14% of the read-token
cost with zero measured accuracy change. Dropping numbers entirely
(the larger 33% saving) is rejected: it regresses line-referencing.
patch/fuzzy_match never consumed the gutter (they match old_string text
and compute char offsets internally), so editing is unaffected. No
downstream parser keys on the fixed-width columns. HERMES_READ_GUTTER=
padded restores the legacy format for anyone relying on alignment.
Tests: updated the 3 format assertions to the compact gutter; added an
env-override test for the legacy padded format. 209 file-tool tests green.
Some Windows editors prepend an invisible UTF-8 BOM (U+FEFF) to text
files. We had no awareness of it, so: read_file surfaced a phantom
U+FEFF as the first character; patch matches against the true first
line could miss; and a write/patch round-trip silently stripped the
marker, changing the file's byte signature.
Now:
- read_file / read_file_raw strip a single leading BOM so the model
never sees it (only on the first chunk — the marker lives at byte 0).
- patch_replace strips the BOM before fuzzy-matching (so an exact
first-line match works) and its post-write verification compares
BOM-stripped content.
- write_file restores the BOM when the original file had one and the
new content doesn't, mirroring the existing line-ending preservation
(detect on disk via a cheap `head -c 3` probe or reuse pre_content,
re-prepend across the edit). Guards against double-BOM.
Mid-content U+FEFF is left alone (it's data there, not a file marker).
Tests: TestBomHandling (real LocalEnvironment) — read-strips, raw-read
strips, write preserves, no-BOM-when-original-had-none, no-double-BOM,
patch round-trip preserves, patch matches first line through a BOM,
plus helper unit tests. 208 file-tool tests green.
The no-home-channel error for send_message derived the env var name
generically as <PLATFORM>_HOME_CHANNEL, producing EMAIL_HOME_CHANNEL for
the email platform. But gateway/config.py reads EMAIL_HOME_ADDRESS, so a
user following the error's guidance would set a variable that is never
consulted. Add a per-platform override map so the email hint names the
variable actually read; all other platforms keep the generic hint.
When using send_message with the email platform, valid email addresses
like user@example.com were not recognized as explicit targets by
_parse_target_ref(). This caused the function to return (None, None,
False), forcing the system into channel-name resolution which has no
way to resolve a raw email address, resulting in 'No home channel set
for email' errors.
Add _EMAIL_TARGET_RE pattern and email platform handler in
_parse_target_ref() so email addresses are treated as explicit targets
and routed directly without requiring a home target configuration.
The orphan reaper for stdio MCP subprocesses only tracked the direct child
PID spawned by ``stdio_client`` (e.g. ``openclaw mcp serve``). When that
wrapper itself spawned a helper (``claude mcp serve``) and then exited, the
helper reparented to ``systemd --user`` and survived shutdown.
The MCP SDK already spawns stdio children with ``start_new_session=True``,
so the wrapper is its own pgroup leader and same-pgroup descendants are
reachable via ``killpg``. Capture the pgid at spawn time and reap via
``killpg(pgid, sig)`` so reparented grandchildren are reaped alongside the
direct child, even after the wrapper itself exits. Falls back to per-pid
``os.kill`` on Windows or when no pgid was recorded.
Fixes part 2 (orphan ``claude mcp serve``) of #23799. Part 1 (per-invocation
respawn) was confirmed by the reporter to be an environmental artifact, not
a code bug.
* Inspired by Claude Code: /compress here [N] — boundary-aware 'summarize up to here'
Adds a user-chosen compression boundary to the existing /compress command.
/compress here [N] summarizes everything except the most recent N exchanges
(default 2), which are preserved verbatim — letting the user pick the
compression boundary instead of relying on the automatic token-budget heuristic.
Inspired by Claude Code's Rewind 'Summarize up to here' action (v2.1.139,
Week 20, May 2026): https://code.claude.com/docs/en/whats-new/2026-w20
- hermes_cli/partial_compress.py: pure split/parse helpers + seam-alternation
guard (shared by CLI and gateway).
- cli.py / gateway/run.py: route 'here [N]' / '--keep N' to partial compression;
compress only the head, re-append the verbatim tail through the seam guard.
- Preserves message-flow role alternation (seam guard merges any illegal
user->user / assistant->assistant adjacency).
- Reuses the existing _compress_context session-rotation/lock machinery — no
changes to the compression core.
- Bare /compress (full) and /compress <focus> behavior unchanged.
Tests: 12 helper unit tests + 5 CLI integration tests + E2E (interleaved
tool-call transcript, degenerate/multimodal seams, real handler path).
* fix(file-tools): make write_file/patch atomic (temp-file + rename)
write_file streamed content straight into the target via `cat > path`, so
a crash, SIGKILL, or truncated pipe mid-write left the file half-written
and corrupt. patch_replace routes through write_file, so it shared the flaw.
Now writes stream into a temp file in the SAME directory and `mv` it over
the target — a real same-filesystem rename, which is atomic on POSIX and on
every terminal backend (local/docker/ssh/modal). A failed write leaves the
original byte-intact and leaks no temp file. The existing file's mode is
preserved across the swap (stat + chmod, GNU/BSD), and content still rides
stdin so there's no ARG_MAX limit. A trap cleans the temp on any error path.
Tests: added TestAtomicWrite (real LocalEnvironment, no mocks) covering
inode-change-on-overwrite, mode preservation, failed-write-leaves-original,
no-temp-leak, special chars, and patch routing. Updated two mocks in
test_file_operations.py that keyed on the literal `cat >` write command to
key on the stdin_data behavioral signal instead. 200 file-tool tests green.
The cherry-picked fix's onerror handler chmod'd only the failing path, but
unlinking a child requires write permission on its PARENT directory. On a true
Nix-store copy (r-xr-xr-x dirs + files) rmtree still failed. Now chmod the
parent dir as well before retrying.
Also rewrites the regression test: the original asserted the helper FAILS on a
read-only dir (documenting the limitation), which is the wrong success criterion.
Split into two tests — restore succeeds on a full read-only tree (real Nix case),
and manifest is preserved when removal genuinely cannot proceed (monkeypatched).
Two related bugs in tools/skills_sync.py affecting Nix-store and
immutable-package installs:
**#34972 — reset_bundled_skill corrupts manifest on rmtree failure:**
The function deleted the manifest entry BEFORE attempting rmtree. If
rmtree failed (read-only files from Nix store), the function returned
early — leaving the skill in a manifest-less limbo state where future
syncs silently skip it forever.
Fix: reorder steps — attempt rmtree FIRST, only delete manifest entry
after rmtree succeeds. If rmtree fails, nothing is changed.
**#34860 — stale .bak directories after sync:**
sync_skills() called shutil.rmtree(backup, ignore_errors=True) which
silently failed on read-only files, leaving persistent .bak dirs.
Fix: add _rmtree_writable() helper that makes files writable via an
onerror callback before retrying removal. Used in both sync_skills()
backup cleanup and reset_bundled_skill().
Fixes#34972Fixes#34860
_download_image() wrapped every download attempt in a blanket
`except Exception` and retried 3x with 2s/4s/8s backoff regardless of
cause. A 404/403 image URL would never resolve on retry, so it just
burned up to 6s of wall-clock + extra GETs before failing — inflating
latency for a deterministic failure (issue #32296, umbrella #35114).
Add _is_retryable_download_error(): 4xx client errors (except 429),
website-policy PermissionError, and too-large/SSRF ValueError now raise
on the first attempt. 429, 5xx, and unclassified network errors stay
retryable. Removed the now-unreachable fall-through branch since the
loop always returns on success or re-raises on the final/terminal attempt.
Starlette < 1.0.1 is affected by CVE-2026-48710 ("BadHost", CWE-444).
The HTTP Host header was not validated before being used to rebuild
`request.url`, so a malformed Host could make `request.url.path` desync
from the raw ASGI path the router actually dispatched. Middleware and
endpoints that apply path-based authorization off `request.url` (rather
than `scope["path"]`) can therefore be bypassed.
Hermes pulls Starlette transitively, never directly:
- [web] -> fastapi==0.133.1 (starlette>=0.40.0, no upper bound)
- [mcp] -> mcp==1.26.0 + sse-starlette (starlette>=0.27 / >=0.49.1)
- [computer-use] -> mcp==1.26.0
- [dev] -> mcp==1.26.0
A fresh resolve landed starlette 0.52.1 — vulnerable. With no upper
bound on the transitive specs, pip/uv could resolve any pre-1.0.1
release on a fresh install.
Fix: pin starlette==1.0.1 directly in every extra that exposes a
Starlette-backed server surface, regenerate uv.lock (only starlette
moves: 0.52.1 -> 1.0.1, hash-verified), and mirror the pin in the
lazy-install map (tools/lazy_deps.py `tool.dashboard`) so `hermes`
on-demand dashboard installs can't re-resolve a vulnerable version.
1.0.1 is the advisory's named fix floor and the oldest patched release
(more bake time than 1.1.0/1.2.0, which are days old); it satisfies
every carrier constraint and our requires-python>=3.11.
Scope note: this is a dependency-level fix complementing the
application-layer Host-header validator added in #34162
(`hermes_cli/web_server.py` `_is_accepted_host`). Defense in depth at
both the framework and app layers.
Guards: two invariant tests in tests/test_packaging_metadata.py assert
every server-surface extra pins starlette and that pyproject + uv.lock
both resolve >= the 1.0.1 CVE floor — a dropped pin or stale lock fails
in CI instead of shipping the bypass.
Closes#35067
asyncio.gather(return_exceptions=True) captures CancelledError as a
BaseException value. The previous isinstance(result, Exception) check
missed CancelledError, silently dropping it without logging.
Since Python 3.9, CancelledError is a BaseException subclass (not
Exception). This one-line change ensures all failure types from MCP
server connections are properly logged.
FixesNousResearch/hermes-agent#34443
* docs(code-execution): document HERMES_* env narrowing + passthrough workaround
The execute_code sandbox-child env scrub (108397726, #27303) deliberately
dropped the broad HERMES_ prefix passthrough, keeping only an operational
4-var allowlist (HERMES_HOME/PROFILE/CONFIG/ENV). A script that relied on a
non-secret HERMES_* var (HERMES_BASE_URL, HERMES_KANBAN_DB, HERMES_*_WEBHOOK,
or a plugin-defined one) now sees it unset in the child.
Document the behavior change and the two recovery routes (terminal.env_passthrough
in config.yaml, or required_environment_variables in skill frontmatter), plus
the debug log line that surfaces the drop for diagnosis.
* fix(stt,tts): restore mistralai — 2.4.8 is clean, ban lifted
PyPI quarantined mistralai on 2026-05-12 after the malicious 2.4.6
release (Mini Shai-Hulud worm). 2.4.6 has since been removed from the
registry and clean releases resumed (2.4.7 2026-05-25, 2.4.8 2026-05-28).
This rolls back the blanket runtime ban so Voxtral STT + TTS work again,
following the restoration checklist the repo left in pyproject.toml.
Verified against the real SDK: 2.4.8 keeps the import path the code uses
(from mistralai.client import Mistral) and the audio.transcriptions.complete
/ audio.speech.complete surfaces.
Changes:
- pyproject.toml: re-add mistral extra pinned to mistralai==2.4.8; left
OUT of [all] per the 2026-05-12 lazy-install policy (one quarantined
release must not break fresh installs). uv.lock regenerated.
- tools/lazy_deps.py: add stt.mistral / tts.mistral entries so the SDK
lazy-installs on first use (matches edge / elevenlabs).
- tools/transcription_tools.py: restore explicit-provider gate
(_HAS_MISTRAL + key) and auto-detect entry (local>groq>openai>mistral>xai);
_transcribe_mistral lazy-installs before import.
- tools/tts_tool.py: dispatcher routes back to _generate_mistral_tts;
_import_mistral_client lazy-installs the SDK.
- hermes_cli/tools_config.py, hermes_cli/web_server.py: un-hide Mistral
from the TTS provider picker and dashboard STT options.
- hermes_cli/security_advisories.py: KEEP the shai-hulud-2026-05 advisory
(module policy forbids removal) — it is scoped to 2.4.6 only, so it
still warns anyone with the poisoned build cached and never fires on
2.4.8. Summary note updated to reflect the un-quarantine.
- tests: revert the disabled-behavior assertions added by the ban commit
back to routing/positive expectations; add mistral to the
lazy-installable-extras-excluded-from-[all] contract.
Reported by @SkYNewZ (#34503).
Validation: 189 targeted STT/TTS/lazy_deps/metadata tests pass; E2E with
the real mistralai 2.4.8 SDK routes both STT and TTS to mistral.
A GitHub tap can ship a repo-root skills.sh.json (the published skills.sh
schema) declaring category groupings. The Skills Hub now reads it at index
time and uses each grouping title as the skill's category label, instead of
the tag-derived guess. Generic: any tap that ships the file gets real
categorization — NVIDIA's groupings (Inference AI, Decision Optimization,
GPU Development, etc.) flow through automatically.
- GitHubSource: _get_skillsh_groupings() fetches+caches the sidecar per repo;
_parse_skillsh_groupings() flattens it to {skill_name: title};
_list_skills_in_repo() stamps meta.extra['category']; _meta_to_dict now
serializes extra so the category survives the index cache round-trip.
- extract-skills.py: prefers extra['category'] over the tag heuristic and
exempts sidecar categories from the small-category to Other collapse.
- Docs + 12 tests.
NVIDIA/skills is now a default trusted tap in the Hermes Skills Hub —
discoverable, browsable, searchable, and auto-updating through the same
pipeline that already serves OpenAI, Anthropic, and HuggingFace skills.
Rebased onto current main.
* docs(code-execution): document HERMES_* env narrowing + passthrough workaround
The execute_code sandbox-child env scrub (108397726, #27303) deliberately
dropped the broad HERMES_ prefix passthrough, keeping only an operational
4-var allowlist (HERMES_HOME/PROFILE/CONFIG/ENV). A script that relied on a
non-secret HERMES_* var (HERMES_BASE_URL, HERMES_KANBAN_DB, HERMES_*_WEBHOOK,
or a plugin-defined one) now sees it unset in the child.
Document the behavior change and the two recovery routes (terminal.env_passthrough
in config.yaml, or required_environment_variables in skill frontmatter), plus
the debug log line that surfaces the drop for diagnosis.
* fix: drain thread no longer crashes on fd-less stdout streams
The _wait_for_process drain thread called proc.stdout.fileno()
unconditionally. ProcessHandle implementations whose stdout is not
backed by a real OS fd (iterator-style in-memory streams, mock procs)
raised 'list_iterator' object has no attribute 'fileno' (or
'fileno() returned a non-integer' from select.select), killing the
daemon thread and silently losing all process output.
Resolve the fd defensively at the top of _drain; when stdout has no
usable integer fileno, fall back to draining it as an iterable (the
legacy 'for line in proc.stdout' contract). The real subprocess /
os.pipe-backed select() fast path is unchanged.
Vulture + per-symbol verification (whole-repo grep incl. tests, string
literals, getattr, decorator/registry/argparse dispatch) confirmed each of
these has zero callers anywhere — not reachable via any dynamic-dispatch path,
not referenced by tests, not re-exported.
Removed:
- acp_adapter/tools.py: _build_patch_mode_content
- agent/anthropic_adapter.py: read_claude_managed_key (diagnostics-only, never called)
- agent/bedrock_adapter.py: get_bedrock_model_ids
- agent/browser_registry.py: get_active_browser_provider
- agent/chat_completion_helpers.py: _take_request_client (x2 nested closures, never invoked)
- gateway/platforms/weixin.py: _rewrite_headers_for_weixin, _rewrite_table_block_for_weixin
- hermes_cli/banner.py: _skin_branding
- hermes_cli/debug.py: _delete_hint
- hermes_cli/gateway.py: _setup_email, _setup_sms, _setup_yuanbao
(platform keys absent from the _builtin_setup_fn dispatch dict; handled by
the _setup_standard_platform fallback)
- hermes_cli/kanban_db.py: set_max_runtime, active_run
- hermes_cli/kanban_diagnostics.py: severity_of_highest, _latest_clean_event_ts
- hermes_cli/main.py: _build_provider_choices, cmd_portal
(portal subcommand is wired via portal_cli.add_parser, not this wrapper)
- hermes_cli/model_switch.py: CustomAutoResult (orphaned by the switch_model() extraction)
- hermes_cli/models.py: format_model_pricing_table, fetch_nous_account_tier
- hermes_cli/portal_cli.py: _nous_portal_base_url
- hermes_cli/proxy/server.py: handle_models_fallback (defined but never registered on the router)
- tools/computer_use/cua_backend.py: _parse_element, _is_arm_mac
- tools/file_operations.py: _get_safe_write_root (prod uses the imported
agent.file_safety.get_safe_write_root directly)
- tools/skills_tool.py: _load_category_description
Also dropped two imports left unused by the removals:
- tools/file_operations.py: get_safe_write_root alias
- tools/computer_use/cua_backend.py: import platform
Pure deletion: -551 LOC. No behavior change. Test files covering the edited
modules pass (640/640); the broader suite's pre-existing/env-dependent
failures reproduce unchanged on origin/main.
Web search/extract dispatch read agent.web_search_registry before plugin
discovery had run, so in any process that hadn't imported model_tools.py
(subprocess agent runs, delegate children, standalone scripts) the registry
was empty: get_provider('firecrawl') returned None and the dispatcher emitted
the misleading 'No web extract provider configured' error even with
web.extract_backend set and FIRECRAWL_API_KEY exported.
Adds an idempotent _ensure_web_plugins_loaded() helper (mirrors
tools.browser_tool._ensure_browser_plugins_loaded) and calls it at the top of
both the web_search_tool and web_extract_tool dispatch sites before the
registry lookup.
Fixes#27580.
Co-authored-by: briandevans <252620095+briandevans@users.noreply.github.com>
The fast-path decision (native routing + provider allowlist OR
supports_vision override) lived inline in vision_analyze and was copied
into browser_vision. Extract it to _should_use_native_vision_fast_path()
so both tools share one source of truth.
- vision_tools: gate logic now one helper; vision_analyze calls it in 3 lines
- browser_tool: thin envelope decoration over the shared helper, not a copy
- browser_vision typed Union[str, Dict] to match its real return shape
- tests slimmed to target the override path + text-mode-wins invariant
Follow-up mitigation for the #27303 env-scrub tightening. Dropping the
broad HERMES_ prefix in favor of a 4-var operational allowlist is correct
hardening, but a sandbox script that imports a repo module reading a
non-allowlisted HERMES_* var at import time would otherwise see it
silently unset. _scrub_child_env now emits a one-shot debug log naming the
dropped non-secret HERMES_* vars and pointing at the env_passthrough
opt-in escape hatch. Secret-shaped vars are never named in the log.
Tests: dropped vars are logged + env_passthrough named; no log when
nothing is dropped; secret vars excluded from the diagnostic.
Wrap both execute_code RPC threads (local UDS + remote file-RPC) with propagate_context_to_thread so gateway sessions no longer fall into check_dangerous_command's non-interactive auto-approve branch and the CLI approval prompt stays reachable. Add check_execute_code_guard: one-shot fail-closed approval of the whole script in gateway/ask/cron-deny before the child spawns (skips isolated backends; command-string built only past the early returns). Drop the broad HERMES_ env passthrough for an explicit operational allowlist plus DSN/WEBHOOK secret substrings, and update the POSIX-equivalence oracle.
Refs #4146, #27303, #30882, #33057
Worker threads that dispatch Hermes tools started with an empty contextvars.Context and no thread-local approval/sudo callbacks. Add tools/thread_context.propagate_context_to_thread factoring that capture/install/clear lifecycle (mirrors the GHSA-qg5c-hvr5-hjgr pattern), and refactor agent/tool_executor onto it so the security-critical logic lives in one audited place. Update the contextvar-propagation source guard for the new call shape.
Refs #33057
Tool Search read its catalog from the global registry (get_tool_definitions
with no toolset scope = 'start with everything'), so a restricted-toolset
session — subagent, kanban worker, curated gateway session — could:
1. tool_search the entire process registry, not just its granted tools, and
2. tool_call any registered plugin/MCP tool it was never given, because
registry.dispatch() has no enabled_tools gate for non-execute_code tools.
A scoped session (enabled_toolsets=['mcp-github']) reported total_available=26
and successfully invoked an out-of-scope plugin tool via tool_call.
Fix:
- handle_function_call gains enabled_toolsets/disabled_toolsets; the bridge
dispatch scopes get_tool_definitions to them (also stops polluting the
process-global _last_resolved_tool_names with out-of-scope tools, which
leaked into execute_code's sandbox-tool fallback).
- A defense-in-depth gate rejects any tool_call'd name not in the scoped
deferrable catalog.
- tool_executor's unwrap (both concurrent + sequential paths) enforces the
same scope before dispatch, since it unwraps tool_call -> underlying name
and bypasses the bridge branch. New _tool_search_scoped_names() helper,
cached per-agent on registry generation + toolset scope.
- New scoped_deferrable_names() helper in tool_search.py shared by both sites.
Tests: 4 new regression tests in TestRegression_ToolsetScoping (scoped
catalog, out-of-scope tool_call rejection, no global pollution, helper).
Adds Tool Search, a structured-tools progressive-disclosure layer that
replaces MCP and non-core plugin tools in the model-visible tools array
with three bridge tools (tool_search / tool_describe / tool_call) when
the deferrable surface would consume more than a configurable percentage
of the active model's context window. Core Hermes tools are never deferred.
Default mode is 'auto' with a 10% context threshold, so small toolsets
pay no overhead. Set tools.tool_search.enabled to 'on' to force or 'off'
to disable.
Design carefully reflects the OpenClaw production failure modes
documented in the openclaw-tool-search-report:
- Core tools never defer (toolsets._HERMES_CORE_TOOLS). Addresses the
'tools silently missing from isolated cron turns' regression class
(openclaw#84141) by construction: there is no code path that can
drop a core tool.
- Catalog is stateless across turns — rebuilt from the live tool-defs
list on every assembly. No session-keyed Map that can drift out of
sync with the registry.
- tool_call unwraps the bridge call before any hook fires, so plugin
pre/post hooks, guardrails, approval flows, and the activity feed
all see the underlying tool name, not the bridge (addresses
openclaw#85588 and the verbose-mode complaint on openclaw#79823).
- The unwrap happens in both the parallel and sequential paths of
agent/tool_executor.py and also in handle_function_call, so direct
callers (sandboxed code, eval harnesses) are covered too.
- Bridge tools cannot invoke each other (recursion guard) and cannot
invoke core tools (those must be called directly).
- Tools mode only — no JS-sandbox code-mode. Keeps the surface small.
- Token estimation via cheap char/4 heuristic; precision isn't needed
for the threshold decision.
Files:
- tools/tool_search.py — new module (BM25 retrieval, classification,
threshold gate, bridge dispatch, unwrap helper).
- tests/tools/test_tool_search.py — 35 tests including the OpenClaw
#84141 regression guard.
- model_tools.py — wires assembly into _compute_tool_definitions as the
final step, adds skip_tool_search_assembly kwarg so the bridge can
see the real catalog, dispatches the three bridge tools.
- agent/tool_executor.py — unwraps tool_call in both parallel and
sequential parsing loops so checkpointing, guardrails, plugin hooks,
and tool-progress callbacks all observe the underlying tool name.
- hermes_cli/config.py — DEFAULT_CONFIG['tools']['tool_search'] block.
- website/docs/user-guide/features/tool-search.md — user docs.
Validation:
- 35/35 new tests pass.
- Existing tool/registry/model_tools/config/coercion/executor tests
(82 + 74 + small adjacents) green.
- Live E2E: 20 fake MCP tools registered, get_tool_definitions returns
3 bridges, tool_search returns top 3 hits, tool_describe returns
full schema, tool_call dispatches to the real underlying handler
and the underlying result is what the model sees.
- Reserved-name recursion guard verified live.
- Core-tool refusal via tool_call verified live.
Scopes the AWS_SDK subprocess strip down from the full AWS credential chain
to just AWS_BEARER_TOKEN_BEDROCK — the only Hermes-managed *inference* secret
(analogous to OPENAI_API_KEY). The general AWS credential chain
(AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / AWS_SESSION_TOKEN / AWS_PROFILE
/ config + role pointers) is intentionally left inheritable.
Why: per SECURITY.md §3.2 the local terminal is the user's trusted operator
shell. Hard-blocklisting the general chain would (a) regress *every* user who
runs aws/terraform/cdk/boto3 in the agent terminal — not just Bedrock users,
since PROVIDER_REGISTRY is iterated unconditionally at import — and (b) be
unrecoverable, because env_passthrough.py refuses to re-allow anything in
_HERMES_PROVIDER_ENV_BLOCKLIST (GHSA-rhgp-j443-p4rf). The narrow strip closes
the reported leak (opencode enumerating the Bedrock catalog off the leaked
bearer token) with no capability loss.
Keeps zapabob's self-healing auth_type=="aws_sdk" mechanism so any future
SDK-cred provider is covered automatically.
Tests: bearer token stripped + general chain preserved (no-regression guard),
on both the runtime strip path and the blocklist-membership path.
Co-authored-by: zapabob <1920071390@campus.ouj.ac.jp>
The dispatcher watchdog (release_stale_claims) reads tasks.last_heartbeat_at
to decide whether to reclaim a running task. The agent maintains its own
in-process `_last_activity_ts` for every chunk/tool result, but those
liveness ticks never reach the board unless the model explicitly calls
the `kanban_heartbeat` tool — so a worker actively executing a long run
without tool-level heartbeats can be reclaimed mid-flight as 'stale',
returning the task to ready and orphaning the in-flight worker's progress.
Fix: in `_touch_activity` (the canonical 'we just did work' hook in
run_agent.py), call a new `heartbeat_current_worker_from_env` helper
in `tools/kanban_tools.py` that:
- No-ops outside dispatcher-spawned worker context (no HERMES_KANBAN_TASK).
- Rate-limited to one DB write per 60s (runtime activity ticks too often
to faithfully mirror; we just need the watchdog to see liveness).
- Best-effort: never raises. heartbeat_claim + heartbeat_worker calls are
individually try/except'd; any DB error logs at debug and returns.
- Uses worker env identity: HERMES_KANBAN_TASK + HERMES_KANBAN_RUN_ID +
HERMES_KANBAN_CLAIM_LOCK (all pinned by the dispatcher at spawn time).
- No durable note on auto-heartbeats — that's reserved for the explicit
`kanban_heartbeat` tool which carries a model-supplied note.
The explicit `kanban_heartbeat` tool stays available unchanged for
workers that want to attach a note or pre-emptively extend a claim
across a known-long single tool call.
Co-authored-by: faisfamilytravel <223516181+faisfamilytravel@users.noreply.github.com>
The mechanical ruff prune in the previous commit removed several names that
`appear` unused inside their defining module but are external test/runtime
anchors:
run_agent
OpenAI, _SafeWriter
get_tool_definitions, handle_function_call, check_toolset_requirements
estimate_request_tokens_rough
DEFAULT_AGENT_IDENTITY, build_context_files_prompt,
build_environment_hints, build_nous_subscription_prompt
_is_destructive_command, _extract_parallel_scope_path, _paths_overlap,
_append_subdir_hint_to_multimodal, _trajectory_normalize_msg
tools/web_tools
Firecrawl, _get_firecrawl_client
These get accessed via four channels that are invisible to ruff's
in-module usage analysis:
1. `mock.patch('module.name', ...)` in tests — resolves the attribute
lazily, so `pytest --collect-only` passes even when the name is
gone, but every test using the patch fails at runtime with
AttributeError.
2. `from run_agent import X` in production siblings (agent/transports
/codex.py, etc.).
3. The `_ra().X` indirection pattern in agent/system_prompt.py et al.
— explicitly documented ("Many tests patch('run_agent.load_soul_md')")
to preserve the patch contract.
4. `from tools.web_tools import _get_firecrawl_client` in tests.
Each re-added import carries an explicit `# noqa: F401` with a comment
naming the channel, so future cleanup passes won't strip them again.
Remove unused imports (F401) and duplicate/shadowed import
redefinitions (F811) across the codebase using ruff's safe
autofixes. No behavioral changes -- imports only.
- ~1400 safe autofixes applied across 644 files (net -1072 lines)
- __init__.py re-exports preserved (excluded from F401 removal so
public re-export surfaces stay intact)
- Re-exports that are imported or monkeypatched by tests but look
unused in their defining module are kept with explicit # noqa:
F401 (gateway/run.py load_dotenv; run_agent re-exports from
agent.message_sanitization, agent.context_compressor,
agent.retry_utils, agent.prompt_builder, agent.process_bootstrap,
agent.codex_responses_adapter)
- Unsafe F841 (unused-variable) fixes deliberately skipped -- those
can change behavior when the RHS has side effects
- ruff lints remain disabled in pyproject.toml (only PLW1514 is
selected); this is a one-time cleanup, not a config change
Verification:
- python -m compileall: clean
- pytest --collect-only: all 27161 tests collect (zero import errors)
- core entry points import clean (run_agent, model_tools, cli,
toolsets, hermes_state, batch_runner, gateway)
- static scan: every name any test imports directly from an edited
module still resolves
* fix(codex): surface error code in Responses 'failed' status errors
When a Codex Responses turn ends with status=failed, the response carries
the failure details under `response.error` as
`{code, message, param, ...}`. The previous extractor pulled only
`message`, so users seeing a rate-limit failure got a bare "Slow down"
string indistinguishable from a generic stream truncation; an
internal_error with empty message degraded to a dict dump
("{'code': 'internal_error', 'message': ''}").
Extract a `_format_responses_error()` helper that:
- prefixes `code` when both code and message are present
(e.g. 'rate_limit_exceeded: Slow down')
- falls back to the bare `code` when message is empty
- accepts both dict and attribute-style payloads (SDK and JSON-RPC paths)
- preserves the prior status-only fallback when no error payload exists
Apply the same helper at the sibling site in
`codex_app_server_session.run_turn()` so codex-CLI subprocess turn
failures get the same treatment.
Tests:
- 8 new unit tests for `_format_responses_error` covering both shapes,
empty/missing fields, non-string fields, and the status-only fallback.
- 2 regression tests on `_normalize_codex_response` for failed status
with and without a code, asserting the exact RuntimeError message.
- All 3603 tests in tests/agent/ pass.
Adapted from anomalyco/opencode#28757.
* feat(prompt): universal task-completion guidance + local Python toolchain probe
Two cross-model failure modes get a single-line answer in the cached
system prompt. Both gated by config (default on), both add zero overhead
when not needed, both verified via real AIAgent prompt builds.
## What changed
`TASK_COMPLETION_GUIDANCE` — short prompt block applied to ALL models.
Targets two failure modes observed on a real Sarasota real-estate build
task: (1) Opus stopped after writing an 85-byte stub and gave a prose
response with finish_reason=stop on call #3 of 90; (2) DeepSeek pushed
through a PEP-668 wall, then returned fabricated listings instead of
admitting the blocker. Both behaviors are model-family-agnostic, so the
guidance lives outside the existing tool_use_enforcement gate (~192
tokens, paid once per session via prefix cache).
`tools/env_probe.py` — local Python toolchain probe. Detects
python3/pip/uv/PEP-668 state and emits ONE short line in the system
prompt when something is non-default. Emits NOTHING when the env is
clean (zero token cost for normal users). Skipped entirely for remote
terminal backends (docker/modal/ssh) — they have their own probe.
Example output on a broken environment (the actual case):
Python toolchain: python3=3.11.15 (no pip module),
python=missing (use python3), pip→python3.12 (mismatch),
PEP 668=yes (use venv or uv).
## Config
Both flags live under `agent.` in config.yaml, default True:
agent:
task_completion_guidance: true # universal "finish the job" block
environment_probe: true # local Python toolchain hints
Neither addition required a `_config_version` bump — deep-merge fills
defaults in for existing user configs.
## Validation
| Test surface | Result |
|---|---|
| tests/tools/test_env_probe.py | 10/10 pass (probe unit) |
| tests/run_agent/test_run_agent.py — new classes | 8/8 pass (integration) |
| TestToolUseEnforcementConfig | 17/17 pass (no regression) |
| TestBuildSystemPrompt | 9/9 pass (no regression) |
| TestInvalidateSystemPrompt | 2/2 pass (no regression) |
| tests/agent/test_prompt_builder.py | 124/124 pass (no regression) |
| tests/hermes_cli/ | 5662/5662 pass (config defaults) |
| E2E AIAgent build (broken env) | Both blocks present, 2,178 chars |
| E2E AIAgent build (clean env) | 771-char net overhead, env probe silent |
Salvages #24490 by @liuhao1024 against current main.
The Docker daemon will silently auto-create a directory at the host
path of any `-v <host>:<container>` bind mount when the host path
doesn't exist. In Docker-in-Docker setups (where the outer host's
real credential file isn't visible inside the agent's parent
container), this leaves a directory at the credential mount source —
and the inner `docker run` then refuses to mount a directory over a
file destination with exit 125.
Add defensive shape guards to all three mount loops in
DockerEnvironment.__init__:
* credentials (expected: file) — skip + warn on directory or missing
* skills (expected: dir) — skip + warn when not a directory
* cache (expected: dir) — skip + warn when not a directory
Failed mounts surface as WARN logs rather than crashing the container
start. Existing well-formed sources mount unchanged.
The original PR's branch was on a pre-container-reuse-rework base
(May 12) and conflicted with the post-May-28 driver work (label
tagging, container reuse, orphan reaper). Reconstructed the same
intent on current main; the three guard blocks slot cleanly into
`tools/environments/docker.py` around the existing mount loops.
Three new tests pinned in `tests/tools/test_docker_environment.py`:
directory-source skip, missing-source skip, valid-file mounts. Test-
first regression verification: reverted just the production code to
`origin/main` and confirmed the new tests fail with
`'deleted_token.json' is contained here: /root/.hermes/...` — the
fixed code makes them pass. Full file passes (54/54).
Closes#24490
Co-authored-by: liuhao1024 <11816344+liuhao1024@users.noreply.github.com>
When HOME=/root (Docker containers) and the process runs as unprivileged
user (hermes, uid 10000), Path.home() / '.modal.toml' raises PermissionError
because /root/ is inaccessible. This crashes the dashboard /api/skills endpoint.
Catch PermissionError/OSError and treat as 'no config file'. Env vars still
take priority (tested).
Fixes#33525
NVIDIA's verified skills catalog (https://github.com/NVIDIA/skills) ships
NVIDIA-signed skills for CUDA-X, AIQ, cuOpt, cuPyNumeric, DeepStream, NeMo,
NemoClaw and the Skill Card Generator — each bundle carrying a detached
`skill.oms.sig` signature, a governance `skill-card.md`, and `evals/`. The
sync pipeline drops any skill missing those artifacts before publishing.
Changes:
- tools/skills_hub.py: add NVIDIA/skills to GitHubSource.DEFAULT_TAPS so
it lights up in `hermes skills browse`, `hermes skills search <q>`, the
twice-daily skills-index build, and the docs-site Skills Hub page
(https://hermes-agent.nousresearch.com/docs/skills) automatically.
- tools/skills_guard.py: add NVIDIA/skills to TRUSTED_REPOS so installs
resolve to trust_level="trusted" (looser install policy than community).
- website/scripts/extract-skills.py: map the `github` source id to a
friendly "NVIDIA" pill label for the docs hub page.
- website/src/pages/skills/index.tsx: register the NVIDIA pill (green
#76b900) and slot it into SOURCE_ORDER after HuggingFace.
- website/docs/user-guide/features/skills.md (+ zh-Hans i18n): document
the new default tap and the expanded trusted-repos list.
- tests/tools/test_skills_guard.py: assert NVIDIA/skills resolves to
"trusted" (including the skills-sh-wrapped form).
- tests/tools/test_skills_hub.py: invariant — every TRUSTED_REPOS entry
must be reachable via GitHubSource.DEFAULT_TAPS (prevents future
trusted repos from being declared but never browseable).
Validation:
- Live GitHub fetch: `src.fetch('NVIDIA/skills/skills/aiq-deploy')` pulled
17 files including SKILL.md (13 KB), skill-card.md, skill.oms.sig, and
the full references/ + evals/ tree. trust_level="trusted".
- Live inspect resolved name, description, and trust correctly.
- All 193 existing skills_guard + skills_hub tests still pass.
When docker.sock is mounted (common Docker Compose pattern), the agent
can restart/stop/kill containers without user approval. hermes gateway
restart is already protected, but docker restart, docker stop,
docker kill, and their docker compose equivalents were not.
This caused repeated self-termination: the agent ran docker restart
hermes, killed its own container, Docker restarted it (restart policy),
and the agent resumed the same session — creating a restart loop.
Added patterns mirror the existing gateway lifecycle protection:
- docker compose restart/stop/kill/down
- docker restart/stop/kill
Co-authored-by: Sarbai <sarbai@users.noreply.github.com>
Commit 4 made cleanup_vm() default to force_remove=True, which was wrong:
cleanup_vm() is called from AIAgent.close() (TUI session close at
tui_gateway/server.py:2991, gateway session teardown at gateway/run.py:3569)
and from per-turn cleanup (agent/chat_completion_helpers.py:1517). All
three are session-lifecycle events that should honor persist mode, not
explicit user-initiated teardown.
Ben reported the symptom: container shared between multiple TUI sessions
(good) but killed as soon as any session closed (bad). With force_remove=True
as the default, every `session.close` JSON-RPC tore down the container.
The fix is to flip cleanup_vm()'s force_remove default back to False.
The kwarg still exists for future explicit-teardown paths (`/reset`-style
flows, "destroy my sandbox" commands) that haven't been wired up yet.
Two new unit tests pin the behavior:
* `test_cleanup_vm_default_honors_persist_mode` — asserts
`cleanup_vm(task_id)` does neither docker stop nor docker rm on a
persist-mode container (the regression Ben caught).
* `test_cleanup_vm_force_remove_tears_down_persist_container` —
asserts the kwarg still flows through the runtime-signature-inspection
plumbing to the backend's cleanup().
E2E verified against real Docker (in addition to all 17 existing checks):
✓ Default cleanup_vm() leaves persist-mode container running
✓ cleanup_vm(force_remove=True) removed the container
Refs #20561
The first iteration of this PR did docker stop on every cleanup in
persist mode (only skipping docker rm). Ben caught this as
contradicting the documented "ONE long-lived container shared across
sessions" semantics: stopping the container on every Hermes /quit kills
any background processes inside (npm watchers, pytest watchers,
long-running scripts) — exactly the case persist mode is supposed to
protect.
This commit splits the cleanup paths cleanly:
* **Persist mode (default)** — cleanup() is a NO-OP for the
container. Container stays running, processes survive, next Hermes
process attaches via the existing label probe in ~ms instead of
waiting for docker start. Resource reclamation happens via the
orphan reaper at next startup (2 × lifetime_seconds threshold), which
covers the SIGKILL / OOM / abandoned-laptop cases.
* **Opt-out mode (persist_across_processes=False)** — unchanged:
docker stop + docker rm -f on cleanup as before.
* **Explicit teardown** — new cleanup(force_remove=True) kwarg
overrides persist mode and tears the container down unconditionally.
cleanup_vm(task_id) now defaults to force_remove=True since
it's the user-driven reset path (called from AIAgent.close(),
/reset-style flows, and the idle reaper's per-turn cleanup).
The idle reaper in _cleanup_inactive_envs calls env.cleanup()
directly with no kwargs, so idle persist-mode envs are no-op'd — the
container survives the in-process pop and the next tool call re-probes
via labels. No state leak: _container_id is still cleared on the
in-process handle.
E2E verified against real Docker:
✓ Container is still running after cleanup()
✓ Background process (sleep loop) survived cleanup()
✓ Filesystem state preserved across cleanup()
✓ In-process container_id cleared (next __init__ will re-probe)
✓ Background process visible from reused env (no docker start happened)
✓ force_remove=True removed the container even in persist mode
✓ cleanup_vm() removed the container (defaults to force_remove=True)
Test changes:
* Replaces `test_cleanup_with_persist_only_stops_no_rm` with
`test_cleanup_with_persist_is_noop_for_container` — asserts neither
stop nor rm runs in persist mode, and the in-process handle is
cleared so re-probe works.
* Adds `test_cleanup_force_remove_stops_and_rms_even_in_persist_mode`
— covers the new kwarg.
* Updates `test_cleanup_uses_subprocess_run_not_detached_shell` and
`test_wait_for_cleanup_after_cleanup_returns_true` to pass
`force_remove=True` so they actually exercise the docker code path
(default no-op would trivially pass).
cleanup_vm() forwards `force_remove` only to backends whose cleanup()
accepts the kwarg (currently just DockerEnvironment) via runtime
signature inspection — Modal/Daytona/SSH `cleanup()` signatures are
unchanged.
Refs #20561
The cleanup-fix in the previous commit handles the graceful-exit leak: a
Hermes process that runs ``atexit`` will now actually wait on the docker
stop/rm worker thread, so containers either survive (persist mode) or are
fully removed (opt-out mode) by the time the interpreter exits.
But ``atexit`` doesn't fire on SIGKILL, OOM-kill, or terminal-window
close. Containers from those exits stay parked with no surviving Python
process to reuse or remove them, so they accumulate until the operator
intervenes with ``docker rm -f``. The cleanup-fix doesn't help this class
— there's no live cleanup() to fix.
This commit adds the safety net: a startup orphan reaper that runs once
per Hermes process and removes long-Exited hermes-labeled containers
that the prior commit couldn't reach.
Implementation:
* New ``reap_orphan_containers()`` in ``tools/environments/docker.py``.
Filters: ``label=hermes-agent=1`` + ``status=exited`` + (optional)
``label=hermes-profile=<current>``. Per-container ``docker inspect``
parses ``State.FinishedAt`` (with nanosecond-precision trimming for
Python's microsecond-bound ``fromisoformat``); containers older than
the threshold get ``docker rm -f``'d. The ``status=exited`` filter is
load-bearing — a running container may belong to a sibling Hermes
process whose reuse path will pick it up; killing it would crash the
sibling mid-command. Single-container failures are logged and the
sweep continues to the next candidate.
* New ``_maybe_reap_docker_orphans()`` helper in
``tools/terminal_tool.py``. Wired into ``_create_environment()`` for
``env_type == "docker"``. Gated by:
- ``terminal.docker_orphan_reaper: true`` (default; opt-out for
operators running multiple Hermes processes in the same profile
who don't trust the conservative defaults)
- ``_docker_orphan_reaper_ran`` module flag with double-checked
locking — parallel subagents and RL rollouts don't trigger N
concurrent docker ps storms
- Age threshold = ``2 × TERMINAL_LIFETIME_SECONDS`` with a 60s floor
(so ``TERMINAL_LIFETIME_SECONDS=0`` doesn't race the user's own
setup)
- Profile scoping — a research profile NEVER reaps the default
profile's stragglers
- Exception swallow — a janitor failure must never block container
creation
* New config ``terminal.docker_orphan_reaper`` wired through all four
config-bridge sites (cli.py, gateway/run.py, hermes_cli/config.py,
tests/conftest.py) and pinned by
``test_docker_orphan_reaper_is_bridged_everywhere``.
Coverage:
* 9 new unit tests in test_docker_environment.py — happy path, recent-
container sparing, profile scoping, unparseable-timestamp safety,
docker-ps-failure handling, partial-failure continuation, nanosecond
timestamp parsing, zero-value FinishedAt rejection.
* 6 new integration tests in test_docker_orphan_reaper_integration.py
— once-per-process gate, disable-flag respected, lifetime doubling
with 60s floor, current-profile filter wiring, exception swallow.
* 1 new bridge-invariant regression test.
Closes#20561 (combined with the two prior commits on this branch).
The Docker backend docs claim "Single persistent container — ONE long-
lived container shared across sessions, /new, /reset, and delegate_task
subagents. Stopped/removed on shutdown." In practice the code only
honored that contract within a single Python process via the in-memory
\`_active_environments[task_id]\` cache. Every \`hermes chat\` invocation
spawned a fresh \`hermes-<hex>\` container; older containers piled up in
\`Exited\` state and accumulated until manual \`docker rm\` (issue #20561).
Three root causes, all addressed by this commit:
1. No cross-process container discovery.
2. \`cleanup()\` used fire-and-forget \`subprocess.Popen("... &", shell=True)\`
which raced with parent-process exit — when Python exited promptly the
detached shell child got killed mid-\`docker stop\`, leaving stopped
containers behind.
3. The \`docker rm\` step in cleanup was gated on \`not self._persistent\`
(the bind-mount-persistence flag). Default config sets
\`container_persistent: true\`, so the default happy path skipped \`rm\`
entirely — even when the user explicitly didn't want cross-process
reuse, containers leaked.
Fix:
* Add \`DockerEnvironment.__init__(persist_across_processes=True)\`. When
true, init probes
\`docker ps -a --filter label=hermes-agent=1
--filter label=hermes-task-id=<task>
--filter label=hermes-profile=<profile>\`
and reuses a matching container (running → attach; stopped →
\`docker start\` → attach; \`docker start\` failure → fall through to a
fresh \`docker run\`). Multiple matches prefer the running one, with the
stragglers left for the orphan reaper (next commit) to clean up.
* Rewrite \`cleanup()\`. Uses \`subprocess.run(..., timeout=30)\` on a
daemon \`threading.Thread\`, not the racy \`Popen(... &)\`. The
\`_persistent\` guard is dropped on the \`rm\` step — \`rm\` now runs
whenever \`persist_across_processes\` is false, regardless of the
bind-mount-persistence setting. The leak class is gone in all
combinations.
* Add \`wait_for_cleanup(timeout)\`. \`tools/terminal_tool.py\`'s atexit
hook calls this on every active env, blocking up to 15s for the
cleanup thread before interpreter exit. Without this, \`hermes /quit\`
raced the daemon-thread teardown and dropped the stop/rm work.
* New config \`terminal.docker_persist_across_processes\` (default
\`true\` — restores the documented contract). Set \`false\` for hard
per-process isolation. Wired through all four config-bridge sites
(cli.py env_mappings, gateway/run.py _terminal_env_map,
hermes_cli/config.py _config_to_env_sync, tests/conftest.py env-strip
list); regression-pinned by
\`test_docker_persist_across_processes_is_bridged_everywhere\` matching
the existing pattern for docker_run_as_host_user / docker_env.
Reuse intentionally does NOT compare image / mounts / resources — only
the labels. Operators changing those settings should set
\`docker_persist_across_processes: false\` (or \`docker rm -f\` the
labeled container) to force a fresh start. This keeps the probe cheap
and the failure mode obvious.
Coverage: 12 new unit tests in tests/tools/test_docker_environment.py
covering reuse paths (running, stopped, fallback, opt-out, duplicate
preference) and cleanup behavior (persist-mode no-rm, opt-out always-rm,
no-Popen, wait_for_cleanup semantics, partial-init safety). Plus one
config-bridge regression pin.
Refs #20561
Issue #20561 (Docker containers accumulate) needs a way to identify
hermes-created containers from the outside — both for the orphan reaper
(a follow-up commit) and for operators triaging `docker ps -a | grep
hermes-` after a SIGKILL leaves stragglers. The previous `hermes-<hex>`
name prefix was the only signal, which broke down under cross-process
reuse (planned) and against any custom `--name` someone might pass via
`docker_extra_args`.
This commit adds three labels at `docker run` time:
--label hermes-agent=1 # global sweep target
--label hermes-task-id=<sanitized> # per-task reuse key
--label hermes-profile=<sanitized> # per-profile isolation key
Values are sanitized to `[A-Za-z0-9_.-]` and truncated to 63 chars so the
label round-trips cleanly through `docker ps --filter label=key=value`.
Empty or non-string inputs collapse to "unknown" rather than producing
an unqueryable empty value.
No behavior change: the labels are pure metadata. The follow-up commits
in this PR (cleanup-fix + orphan reaper) are what use them.
Refs #20561