hermes-agent

Author	SHA1	Message	Date
Ben	cae6b5486f	feat(dashboard): always enable embedded chat; remove dashboard --tui flag The dashboard's embedded Chat surface (/chat, /api/ws, /api/pty) was gated behind `hermes dashboard --tui` / HERMES_DASHBOARD_TUI=1. The desktop app and the dashboard's own Chat tab both drive the agent over the /api/ws + /api/pty WebSockets, so a dashboard started without the flag would pass the /api/status health check but slam the chat WebSocket shut with WS code 4403 — the app connects, reports "ready", and chat stays dead. This was the root cause behind multiple user reports of the desktop app failing to connect to a self-hosted gateway/dashboard, and it bit Docker and host installs alike. Make the embedded chat unconditional: - web_server.py: _DASHBOARD_EMBEDDED_CHAT_ENABLED defaults to True; drop the embedded_chat parameter and the runtime reassignment from start_server(). The WS gates still read the constant (now always true) so the seam — and its "rejects when disabled" contract test — stays meaningful. - main.py: remove the `--tui` argument from the dashboard subparser and the `embedded_chat = args.tui or HERMES_DASHBOARD_TUI==1` derivation. - web/: isDashboardEmbeddedChatEnabled() returns true unconditionally; drop the deprecated __HERMES_DASHBOARD_TUI__ alias and the dead LEGACY_TUI_RE scrape in the vite dev-token plugin. - apps/desktop/electron/main.cjs: drop `--tui` from the spawned dashboardArgs (it would now error with "unrecognized arguments: --tui") and the redundant HERMES_DASHBOARD_TUI env injection. - Docker: no s6 run-script change needed — the script never passed --tui; the HERMES_DASHBOARD_TUI env var is now simply a no-op, so the image works out of the box with no extra var. - Docs: remove every dashboard --tui / HERMES_DASHBOARD_TUI reference across the CLI reference, env-var reference, docker/desktop/web-dashboard guides, in-app tips, and the zh-Hans translations. The terminal `hermes --tui` / HERMES_TUI references are intentionally left untouched. Tests: 270 passing across web_server, dashboard lifecycle, host-header, auth-gate, and docker-override-scripts suites.	2026-06-04 03:03:35 -07:00
Ben Barclay	e2ea648a08	test(docker): make tty-passthrough probe robust to container boot-log noise (#38665 ) `test_tty_passthrough_to_container` asserted `int(numeric_lines[0]) > 0` where `numeric_lines` was every `.isdigit()` token in the FULL PTY stream — but the container's s6 boot output (cont-init diagnostics, the preinit `uid=0 ... egid=0` line, skills-sync summaries like `Done: 90 new, 0 updated, 0 unchanged. 90 total bundled.`) is written to the same PTY before the `tput cols` probe runs. So the test was really asserting on "the first number anywhere in the boot log", which passed only by luck on whatever that first digit happened to be. Any PR that shifts boot output flips the first digit to a stray `0` and breaks the test with `assert 0 > 0` — even when TTY passthrough is working perfectly (`tput cols` returns the right value). This is a latent landmine for every Docker PR that changes boot output (e.g. adding a bundled dependency changes the skills-sync counts). Fix: emit the probe result behind a unique marker (`HERMES_TTY_COLS=<cols>` / `HERMES_TTY_COLS=NO_TTY`) and parse only the marked value, ignoring all boot-log noise. The test's real intent — verify `docker run -t` delivers a real TTY with a positive column count — is preserved (NO_TTY and non-numeric values still fail). Verified against a real build, adversarially: - Built an image with extra boot output (the markdown core-dep change from #38649, which is what surfaced this) so the OLD logic grabs a stray `0` -> reproduced `assert 0 > 0` locally. - The hardened test PASSES against that same image, and against a clean image. `tput cols` correctly returns 123 in both.	2026-06-04 13:19:13 +10:00
Ben Barclay	9272e4019a	fix(docker): point TUI launcher at prebuilt bundle via HERMES_TUI_DIR (#37923 ) The embedded dashboard Chat tab dies on hosted images with a 502 / "[session ended]": the PTY child's `hermes --tui` spawn runs a runtime `npm install` that fails. Root cause: the root package-lock.json describes the WHOLE npm monorepo workspace set (root + web + ui-tui + apps/), but the image only installs root/web/ui-tui — apps/ (the desktop app) is never `npm install`ed here, and its deps hoist into the shared root node_modules. So the actualized node_modules permanently disagrees with the canonical lock, `_tui_need_npm_install()` returns True on every launch, and the runtime `npm install` it triggers (a) can never converge against the partial monorepo and (b) races itself across concurrent /api/pty connections -> ENOTEMPTY -> the launcher `sys.exit(1)`s, the slow install blows past Fly's WS-upgrade window -> 502 -> the browser shows "[session ended]". Fix: set `ENV HERMES_TUI_DIR=/opt/hermes/ui-tui` so `_make_tui_argv` takes the prebuilt-bundle fast path (`node --expose-gc /opt/hermes/ui-tui/dist/entry.js`) and never reaches the install check — exactly the nix/packaged-release path the launcher was designed for. The bundle is already built at Layer 8 (`ui-tui && npm run build`); this just tells the launcher to use it. Verified on a freshly-built image: HERMES_TUI_DIR is set, the prebuilt dist/entry.js is present, `_make_tui_argv` resolves to the prebuilt node invocation (no npm), and `docker run ... --tui` no longer prints "npm install failed". New regression guard: tests/docker/test_tui_prebuilt_bundle.py. A separate launcher hardening (make _tui_need_npm_install tolerant of partial-monorepo installs) is tracked independently; this Docker-side fix resolves the hosted-chat symptom on its own. Area: docker (Dockerfile + tests/docker).	2026-06-03 15:30:45 +10:00
Ben	a618789dba	fix(dashboard-auth): share /api/* public allowlist between legacy and OAuth gates Two parallel public-path allowlists drifted: _PUBLIC_API_PATHS in hermes_cli/web_server.py (legacy _SESSION_TOKEN middleware) and _GATE_PUBLIC_PREFIXES in hermes_cli/dashboard_auth/middleware.py (OAuth gate). The legacy list included /api/status (documented as a non-sensitive read-only liveness target); the OAuth gate's list did not. Effect: every wildcard-subdomain agent surfaced as STARTING/down to the portal even though the dashboard was serving correctly. Nous account service (src/server/agents/fly-provider.ts getInstanceRuntimeStatus) fetches ``/api/status`` without a cookie as its sole liveness probe; the OAuth gate's 401 looked identical to 'agent dead' on the portal side. Fix: lift the allowlist into hermes_cli/dashboard_auth/public_paths.py and have both middlewares import it. _path_is_public now consults the shared frozenset first, then falls back to the gate's auth-bootstrap/static prefix list. Future additions to the public list hit both gates automatically. Endpoint inventory (verified safe to remain public): * /api/status — version, gateway state, active session count, auth-gate shape. Portal liveness probe target. * /api/config/defaults — config-defaults feed for the SPA's Config page * /api/config/schema — config schema for the SPA's Config page * /api/model/info — model catalogue metadata (context windows) * /api/dashboard/themes — theme manifests for the skin engine * /api/dashboard/plugins — plugin manifests for the dashboard No user data, no session content, no secrets. Same shape an external monitoring agent would hit on /healthz. Tests: * New: test_gated_status_is_public (regression guard with the NAS fly-provider.ts liveness-probe rationale spelled out in the docstring) * New: test_other_public_api_paths_are_public_under_gate (parametrised over the rest of PUBLIC_API_PATHS — proves 401 / 302-to-login is never the response) * New: docker integration check #3 in test_dashboard_oauth_gate_engaged_by_default — /api/status remains 200 under the gate AND reports auth_required=True so the portal can distinguish modes * Updated: test_full_login_round_trip_unlocks_gated_api now probes /api/sessions instead of /api/status (status is public, so it can no longer distinguish 'logged in' from 'gate accidentally disabled') * Updated: TestApi401Envelope (the no-cookie / invalid-cookie / dead-cookie tests) probes /api/sessions for the same reason * Updated: docker integration check #2 in test_dashboard_oauth_gate_engaged_by_default probes /api/sessions to prove the gate is intercepting * Removed: dead _login() helper in test_dashboard_auth_status_endpoint.py (no longer needed since /api/status is reachable cold) Companion to docs/handover/hermes-agent-dashboard-s6-insecure-fix.md (the --insecure flag fix that shipped earlier).	2026-05-29 12:17:12 +10:00
Ben	1b1e30510a	test(docker): repair dashboard tests broken by the insecure-opt-in fix The Docker integration test job started failing on main after `fb5125362` ("docker: opt in to dashboard --insecure via env var"). Two distinct failures, both fallout from that change being more behaviour-changing than the existing test harness anticipated. Failure 1 — test_dashboard_port_override (silent regression in an already-existing test) The test starts the container with just HERMES_DASHBOARD=1, defaults to host=0.0.0.0, no HERMES_DASHBOARD_OAUTH_CLIENT_ID, no HERMES_DASHBOARD_INSECURE. Pre-fix that combination got --insecure auto-injected by the s6 run script (anything non-loopback was implicitly insecure), so the OAuth gate stayed off and start_server bound the port. Post-fix the gate engages, no provider is registered, and start_server raises SystemExit before binding — under s6 the dashboard goes into a restart loop and the test's /proc/net/tcp poll finds nothing. Same silent regression was masking three sibling tests (test_dashboard_slot_reports_up_when_enabled, test_dashboard_opt_in_starts, test_dashboard_restarts_after_crash) — they all only sample pgrep or s6-svstat and so caught the supervised process mid-restart loop, appearing to pass while the dashboard was actually never reaching a healthy state. Fix: pin HERMES_DASHBOARD_INSECURE=1 on every test that enables the dashboard but doesn't itself exercise the auth gate. Each pinned site carries an inline comment pointing back to test_dashboard_slot_reports_up_when_enabled for the full rationale. Failure 2 — test_dashboard_oauth_gate_engages_on_non_loopback_bind (bug in the test I added in `fb5125362`) The probe used urllib.request.urlopen() against /api/status. Under the now-engaged OAuth gate /api/status no longer answers unauthenticated callers (the gate middleware runs upstream of the legacy _SESSION_TOKEN allowlist and 401s anything without a valid session cookie). urlopen() raises HTTPError on the 401, the wrapper treated that as "not ready yet", and the poll loop hit timeout. Fix: split the probe into a generic _http_probe() helper that returns (status_code, body) for any HTTP response — including 401, which IS the gate-engaged success signal. The helper feeds a multi-line Python program over stdin via a POSIX heredoc so the try/except branch reads naturally; far less fragile than the earlier semicolon-laden -c one-liner. The OAuth-gate test now verifies two independent observable consequences of the gate being on: 1. GET /api/auth/providers (publicly reachable through the gate so the login page can bootstrap) returns 200 with `nous` in the provider list — proves the bundled provider registered. 2. GET /api/status returns 401 — proves the OAuth gate runs upstream of the legacy public-paths allowlist and is actively intercepting unauthenticated callers. The insecure-opt-out test still hits /api/status, but now asserts status_code == 200 first (proves the gate is bypassed) before parsing the JSON for auth_required: false (proves the gate-state flag is also correctly off). Verified locally end-to-end against a fresh image build on a real Docker daemon: all 41 tests under tests/docker/ pass in 2m38s, including the two formerly-failing dashboard tests and the three sibling tests that were passing by accident.	2026-05-29 10:30:52 +10:00
Ben	fb51253620	docker: opt in to dashboard --insecure via env var, never derive from bind host The s6 dashboard run script flipped `--insecure` on whenever `HERMES_DASHBOARD_HOST` was anything other than 127.0.0.1 / localhost. That comment ("the dashboard refuses otherwise") predates the OAuth auth gate: back when it was written, `start_server` would SystemExit on any non-loopback bind, so the run script's `--insecure` was the only way to make in-container deployments work at all. The gate has since been replaced by `should_require_auth(host, allow_public)`, which engages the OAuth flow when a `DashboardAuthProvider` is registered (the bundled `dashboard_auth/nous` provider auto-registers on `HERMES_DASHBOARD_OAUTH_CLIENT_ID`) and fails closed with a specific operator-facing error when none is. The host-derived `--insecure` ran upstream of all that and silently disabled the gate on every container-deployed dashboard. Most visible under the portal's wildcard-subdomain rollout: every Fly machine binds 0.0.0.0 so the edge can reach Flycast, every machine boots with the correct `HERMES_DASHBOARD_OAUTH_CLIENT_ID`, the nous provider registers — and `/api/status` still returns `{"auth_required": false, "auth_providers": ["nous"]}` because the run script disabled the gate before `start_server` ever saw the request. The dashboard SPA was served to anyone, no `/login` redirect, no OAuth challenge. Fix: derive `--insecure` from an explicit opt-in env var, `HERMES_DASHBOARD_INSECURE` (truthy values matching the rest of the s6 boolean envs: 1, true, TRUE, True, yes, YES, Yes). Operators on trusted LANs behind a reverse proxy without the OAuth contract (the existing `docker-compose.windows.yml` use case) opt in explicitly; portal-managed agent deployments leave it unset and let the gate engage. `docker-compose.windows.yml` already passes `--insecure` on the `command:` array directly (line 38), so it doesn't depend on the s6 auto-injection. No compose-file change required. Tests: * `tests/test_docker_home_override_scripts.py` — extends the existing static-text guard with a regression assertion that the legacy host-derived case-statement is gone and the new env-var opt-in is present (locks against accidental revert). * `tests/docker/test_dashboard.py` — adds two Docker-in-Docker tests exercising the actual `/api/status` round-trip: - 0.0.0.0 bind + `HERMES_DASHBOARD_OAUTH_CLIENT_ID` → gate engaged - 0.0.0.0 bind + `HERMES_DASHBOARD_INSECURE=1` → gate disabled Docs: * `website/docs/user-guide/docker.md` + zh-Hans i18n — adds the new env var to the table, replaces the stale prose ("the entrypoint no longer auto-enables insecure mode" — which until this PR was flat-out wrong) with an accurate description of the gate's trigger conditions and the explicit opt-out. shellcheck clean. Python static-text test passes locally. Behavioural test will run against any future image build (CI's Docker harness).	2026-05-29 09:56:40 +10:00
Ben	66489f38c7	fix(docker): bake build-time git SHA into the image `hermes dump` and the startup banner both call `git rev-parse HEAD` to report the running commit, but `.dockerignore` line 2 excludes `.git` — so inside the published image `hermes dump` shows `version: ... [(unknown)]` and the banner drops its `· upstream <sha>` suffix entirely. That makes support triage from container bug reports impossible: we can't tell which commit the user is actually running. Fix: thread the build-time SHA through as a Docker build-arg, write it to `/opt/hermes/.hermes_build_sha` in the image, and have a new `hermes_cli/build_info.get_build_sha()` read it as a fallback after the existing live-git lookup fails. Output format is unchanged in both callsites — same 8-char short SHA whether resolved live or baked. Wiring: - Dockerfile: `ARG HERMES_GIT_SHA=` + write-file step after the source copy. Empty/missing arg → no file written → callers fall through to live git (so local `docker build` without --build-arg is unchanged). - docker-publish.yml: passes `HERMES_GIT_SHA=${{ github.sha }}` on all four build-push-action steps (amd64/arm64, smoke-test + final push). - dump.py:_get_git_commit() / banner.py:get_git_banner_state(): try live git first, fall back to baked SHA, then to legacy `(unknown)` / None. Banner returns `upstream == local, ahead=0` because a built image is by definition pinned to one commit. Coverage: - Unit tests cover build_info (file present/absent/empty/error, truncation, whitespace), dump (live-git wins, both fallbacks, identical output-format regression guard), and banner (no-repo + baked, no-repo + no-sha, shallow-clone fallback). - tests/docker/test_dump_build_sha.py is an integration regression guard that runs against the real image, reads `/opt/hermes/.hermes_build_sha`, and asserts `hermes dump` surfaces its content (or stays at `(unknown)` if no file). - Verified end-to-end: `docker build --build-arg HERMES_GIT_SHA=abc...` → `docker run ... dump` reports `[abc12345]`; without the build-arg it reports `[(unknown)]` as before.	2026-05-28 15:14:05 +10:00
Ben Barclay	aeb992d343	fix(docker): drop `docker exec` to hermes uid before invoking the CLI When operators ran `docker exec <c> hermes login` (or anything else that wrote under $HERMES_HOME) they defaulted to root, leaving /opt/data/auth.json root:root mode 0600. The supervised gateway (UID 10000) then couldn't read its own credentials and returned "Provider authentication failed: Hermes is not logged into Nous Portal" on every Telegram/Discord/etc. message — even though `docker exec <c> hermes chat -q ping` (also root) succeeded because root could read its own root-owned file. _load_auth_store swallowed PermissionError as a parse failure and copied the file aside as auth.json.corrupt, making the diagnostic more misleading. Fix: install a privilege-drop shim at /opt/hermes/bin/hermes, prepended ahead of the venv on PATH. When invoked as root the shim exec's the real venv binary via `s6-setuidgid hermes` — so any file the docker-exec session writes is uid-aligned with the supervised processes. Non-root callers (the supervised processes themselves, `docker exec --user hermes`, kanban subagents, anything inside the container that's not coming through docker-exec) hit a single exec to the absolute venv path with no privilege change. Recursion is impossible: the shim exec's the venv binary by absolute path (/opt/hermes/.venv/bin/hermes), so the second hop cannot re-enter the shim regardless of PATH state. No sentinel env var needed (unlike #33583's gateway-run redirect which DOES need HERMES_S6_SUPERVISED_CHILD because there's no absolute-path equivalent for the s6 dispatch). Opt-out: `docker exec -e HERMES_DOCKER_EXEC_AS_ROOT=1 …` for diagnostic sessions where the operator deliberately wants root. Strict truthiness (1/true/yes case-insensitive); typos like `=0` do not silently opt out, mirroring HERMES_GATEWAY_NO_SUPERVISE in #33583. If `s6-setuidgid` is missing (someone stripped s6-overlay in a downstream fork), the shim exits 126 with a remediation message pointing at `--user hermes` and the opt-out — never silently runs as root. Test plan: - tests/docker/test_docker_exec_privilege_drop.py — 11 tests - shim drops root to hermes uid (file ownership check) - shim short-circuits for non-root docker exec - HERMES_DOCKER_EXEC_AS_ROOT=1 keeps root - strict-truthiness parametrization (5 falsy values reject) - main CMD path unaffected (recursion guard) - E2E: every file written by docker-exec is readable by uid 10000 - Full tests/docker/ harness: 32/32 pass against fresh image build - shellcheck --severity=error: clean - hadolint: clean - Manual: reproduced the original symptom (root-owned auth.json) by bypassing the shim; confirmed default docker-exec produces hermes-owned files; confirmed opt-out env keeps root semantics. Known follow-up: this prevents NEW instances of the bug. Volumes that already have root:root /opt/data/auth.json from a pre-shim image need a one-time `chown hermes:hermes` before rebooting onto the new image. A stage2-hook chown sweep can self-heal that, but is deferred per scope decision.	2026-05-28 13:30:36 +10:00
Ben Barclay	b345323195	fix(docker): tee supervised gateway stdout to docker logs Follow-up to #33583 (the gateway-run-supervised redirect). Before this fix, the supervised gateway's stdout (most visibly the "Hermes Gateway Starting…" rich-console banner) was swallowed by `s6-log` into the rotated file at `${HERMES_HOME}/logs/gateways/<profile>/current` and never reached `docker logs`. Operational signal lived in two places: * docker logs — saw stderr (Python `logging` defaults to stderr), so warnings/errors were visible. * the rotated file — saw stdout (rich banners, `print()` output, third-party libs that wrote to fd 1). This was surprising for users coming from the pre-s6 image, where `docker run … gateway run` produced a single unified stream in `docker logs`. They'd see partial output, conclude something was broken, and dig around for the missing pieces. Fix: add the `1` s6-log action directive before the file destination so each line is forwarded to s6-log's stdout — which propagates up the s6-supervise pipeline to /init's stdout = container stdout = `docker logs`. The file destination is preserved as a second destination, so the rotated log (with ISO 8601 timestamps) still exists for `hermes logs` and for survival across container restarts. Trade-off considered: timestamps. Putting `T` between `1` and the file destination (not before `1`) means: * docker logs sees raw lines — Python's logging formatter has its own timestamps, and `docker logs --timestamps` adds another layer when desired. No double-stamping in the common reading path. * The persisted file gets s6-log's ISO 8601 timestamp so even output that lacked a Python-logger timestamp (rich banners, third-party raw prints) is correlatable in `current`. Verification: * New unit-test assertion in `test_service_manager.py` locks the `s6-log 1` directive into the rendered run-script. Mutation- tested by reverting to the pre-fix script (no `1`); the assert catches it cleanly. * New docker-harness test `test_supervised_gateway_stdout_reaches_docker_logs` builds the image, runs `docker run … gateway run`, and asserts the unique `⚕` banner glyph reaches `docker logs`. Also verifies the rotated file still contains the banner (no regression on the existing file destination). Mutation-tested end-to-end: built a deliberately-broken image without the `1` directive and the test failed exactly as designed, citing the banner present in `current` but absent from `docker logs`. * `website/docs/user-guide/docker.md` gains a new `:::note Where gateway logs go` admonition documenting both destinations and the audit-log file at `${HERMES_HOME}/logs/container-boot.log`. Existing functionality preserved: every other docker-harness test still passes against the new image. Unit-test sweep across `tests/hermes_cli/` (5561 tests) is green.	2026-05-28 13:18:41 +10:00
Ben Barclay	0927fb5584	feat(docker): auto-redirect `gateway run` to supervised mode inside s6 image Pre-s6, `docker run nousresearch/hermes-agent gateway run` was the standard invocation: gateway ran as the container's main process, tini reaped zombies, container exit code matched gateway exit code, no supervision. With s6-overlay as PID 1, the same invocation now auto-upgrades to supervised semantics — auto-restart on crash, dashboard supervised alongside (when HERMES_DASHBOARD=1 is set), multiple profile gateways under the same /init. Users get the new behavior with zero changes to their docker run command. A loud one-line breadcrumb on stderr explains the upgrade and points at the opt-out for users who genuinely want pre-s6 foreground semantics. How it works: 1. `_gateway_command_inner` (the `gateway run` handler) checks if we're inside a container with s6 as PID 1. 2. If yes, dispatches `start` to the s6 service manager (registers and starts gateway-default), then `exec sleep infinity` to keep the CMD process alive without binding container lifetime to gateway PID lifetime. The supervised gateway can flap freely; `docker stop` still tears everything down via /init stage 3. 3. If no, falls through to the existing foreground code path unchanged. Host runs of `hermes gateway run` are unaffected. Three gates make the redirect inert outside the intended scope: * `detect_service_manager() != "s6"` — host/non-s6-container runs. * `HERMES_S6_SUPERVISED_CHILD=1` env var (recursion guard) — exported by `S6ServiceManager._render_run_script` for the s6-supervised invocation itself. Without this guard, the supervised `gateway run --replace` would re-enter the redirect and recurse (run → start → run → start → ...) infinitely. * `--no-supervise` CLI flag OR `HERMES_GATEWAY_NO_SUPERVISE=1` env var — explicit user opt-out for CI smoke tests, debugging the foreground startup path, or any case wanting "CMD exit = container exit" semantics. Strict truthiness (1/true/yes, case-insensitive); typos like `=0` do NOT silently opt out. Tests: * Unit tests in tests/hermes_cli/test_gateway_s6_dispatch.py cover all five paths (host no-op, supervised fire, sentinel recursion guard, CLI flag, env var truthy + falsy). The two load-bearing gates (sentinel + opt-out) were mutation-tested by removing each gate in isolation and confirming the dedicated test fails with the expected error. * Docker harness tests in tests/docker/test_gateway_run_supervised.py cover the round trips end-to-end against a built image: redirect fires (sleep-infinity heartbeat + supervised gateway-default slot + breadcrumb), --no-supervise opt-out (foreground gateway, no want-up on the slot), HERMES_GATEWAY_NO_SUPERVISE env var works identically, recursion is impossible (≤1 supervised python gateway-run + exactly 1 sleep-infinity parented to the CMD wrapper), and HERMES_DASHBOARD=1 produces both supervised gateway and supervised dashboard. Docs: * Added a `:::tip Gateway runs supervised` admonition near the main docker.md example explaining the upgrade and pointing at the opt-out. Pre-s6 (tini-based) images still run gateway run as the foreground main process, so the note is scoped to the s6 image only. Trade-off documented in the helper docstring: container exit code under the redirect is sleep's exit code (always 0 on SIGTERM), not the gateway's. That was an explicit design call — the supervised gateway is allowed to flap without taking the container with it, which is what "supervision" means. CI users who want exit-code forwarding can pass --no-supervise.	2026-05-28 12:42:13 +10:00
Ben	c524b8a4dc	test(docker): fix svstat 'want up' assertion in profile-gateway lifecycle test After the supervise-perms fix lands, the s6 lifecycle actually works for the hermes user — hermes -p <profile> gateway start now genuinely brings the supervised gateway up rather than silently no-op'ing on EACCES. That exposes a latent bug in this test's assertion: it expected 'want up' to appear literally in s6-svstat output, but s6-svstat elides redundancies — when the slot is currently up AND s6 wants it up, the output is just 'up (pid N pgid N) X seconds'; the explicit 'want up' token only appears when current ≠ wanted (e.g. 'down (exitcode 1) … , want up' on a crash-loop). Add a small helper _svstat_wants_up() that reads the want-state correctly across both spellings: * 'up …' → wanted up (unless explicit 'want down') * 'down …, want up' → wanted up explicitly * 'down …' → wanted down Both stop and start assertions now use the helper. Also rewords the module docstring to acknowledge that the supervised process may succeed OR crash-loop depending on environment, but the want- state contract holds either way. (cherry picked from commit 02c933aedc8500e5672aed12475a9ba0534bd77a)	2026-05-25 12:25:06 +10:00
Ben	cd5b2c4123	test(docker): poll for boot-log signal instead of fixed sleeps PR #30136 review item O6: test_container_restart.py used fixed `time.sleep(8)` calls after `docker restart` to wait for the cont-init reconciler to finish. Fixed sleeps are slow when the event happens fast and false-fail when the event happens slow. Replace with two polling helpers: * `_wait_for_path(container, path, kind='f' \| 'd', deadline_s=...)` — generic `test -f/-d` poller. Returns True on success, False on timeout; callers assert with a clear message. * `_wait_for_reconcile_log_mention(container, profile, ...)` — the reconciler's per-profile log line is the canonical signal that the cont-init reconcile has finished for that profile. Poll on it instead of a sleep that hopes 8 seconds is enough. The fixture-level setup wait is similarly migrated: it now polls for `profile=default` in the boot log (every container always gets a default-slot entry per item I1) and raises a clear timeout error from the fixture if the container never finishes cont-init — much better diagnostics than a mid-test KeyError. The remaining `time.sleep()` calls are all internal interval_s between probe attempts; no fixed wait points left.	2026-05-24 18:05:33 -07:00
Ben	d735b083e8	fix(service_manager): rip out dead port parameter PR #30136 review caught: `_allocate_gateway_port()` in profiles.py computed a SHA-256-derived port that was threaded through `register_profile_gateway(profile, port=N)` → `_render_run_script(profile, port, extra_env)` → and then ignored. The rendered run script picked the bind port from the profile's config.yaml (`[gateway] port = …`), never from the allocator. So the entire allocator + parameter chain was dead code. Remove: * `hermes_cli.profiles._allocate_gateway_port` (deterministic SHA-256 → [9200, 9800) — never used). * `port` kwarg from `ServiceManager.register_profile_gateway` (Protocol + Mixin + S6 implementation). * `port` positional arg from `_render_run_script(profile, port, extra_env)` — now `_render_run_script(profile, extra_env)`. * The pass-through call in `profiles._maybe_register_gateway_service`. config.yaml is now the single source of truth for gateway port selection — matches reality and reduces the API surface. Three explanatory comments in service_manager.py / profiles.py document the retirement so future readers don't reach for the allocator and find a ghost. Tests: drop the three `_allocate_gateway_port` tests; update fakes' signatures throughout test_service_manager.py and test_profiles_s6_hooks.py to match the new no-port API.	2026-05-24 18:05:33 -07:00
Ben	1dfabe47b3	fix(docker): dashboard slot stays 'down' when HERMES_DASHBOARD unset PR #30136 review caught a false positive: when HERMES_DASHBOARD was unset, the dashboard run script did `exec sleep infinity`, so `s6-svstat /run/service/dashboard` reported the slot as 'up'. `hermes doctor` and any other s6-svstat-based health check saw the dashboard as supervised-running even though no dashboard process existed. Add cont-init.d/03-dashboard-toggle: writes a `down` marker file into `/run/service/dashboard/` when HERMES_DASHBOARD is falsy, removes any leftover marker when it's truthy. s6-supervise honors `down` by not starting the service, so s6-svstat reports 'down' — matching reality. The run script's HERMES_DASHBOARD case-statement stays in place as a belt-and-suspenders guard, so the two layers can never disagree. Two new integration tests lock the behavior: slot reports down when unset; slot reports up when set to 1.	2026-05-24 18:05:33 -07:00
Ben	fc39296e1f	fix(service_manager): s6 detection works for unprivileged hermes user PR #30136 review surfaced two issues, both rooted in the same audit gap: docker integration tests were running as root, not the unprivileged `hermes` user (UID 10000) that the runtime actually uses via `s6-setuidgid hermes`. Anything that probed PID-1 state or wrote to the s6 control surface worked as root in the tests but was inert in production. Fixes: 1. `_s6_running()` previously called `Path("/proc/1/exe").resolve()`, which is root-only readable. For UID 10000 the symlink yields PermissionError, `resolve()` silently returns the unresolved path, and `exe.name == "exe"` — so detection always returned False, the service-manager runtime-registration path was inert, and every `hermes profile create` / `hermes -p X gateway start` silently skipped the s6 hook. Replace with `/proc/1/comm` (world-readable) + `/run/s6/basedir` (s6-overlay-specific) — both required, fail closed. 2. `02-reconcile-profiles` now also chowns `/run/service/.s6-svscan/` {control,lock} to hermes so `s6-svscanctl -a/-an` works without root. Previously the directory chown stopped at `/run/service` and the FIFO inside stayed root-owned, so `register_profile_gateway` from hermes failed at the rescan-trigger step with EACCES — the wrapper in profiles.py caught the exception and printed a swallowed warning, so profile creation appeared to succeed while the slot was rolled back. Audit changes to flush this class of bug next time: - Add `docker_exec` / `docker_exec_sh` helpers to `tests/docker/conftest.py` that default to `-u hermes`. The module docstring explains why and flags `user="root"` as opt-in only for tests that explicitly need root (none currently do). - Refactor every `docker exec` call in tests/docker/ through the new helpers (test_dashboard.py, test_zombie_reaping.py, test_profile_gateway.py, test_container_restart.py, test_s6_profile_gateway_integration.py). - Add 5 unit tests covering `_s6_running` under various probe states (both signals present; comm wrong; basedir missing; PermissionError on /proc/1/comm; missing /proc — non-Linux). The PermissionError test is the explicit regression guard for the original bug. Known follow-up: the per-service `supervise/control` FIFO inside each `/run/service/gateway-<profile>/supervise/` is created root-owned by s6-supervise (which runs as root because s6-svscan is PID 1). `s6-svc -u/-d/-t` from the hermes user will get EACCES on those. The audit under `-u hermes` will reveal this in lifecycle tests — surfacing the issue cleanly so it can be fixed in a focused follow-up (likely via a small SUID helper or a polling chown loop in cont-init.d). The detection + svscanctl fixes here are independent and complete on their own.	2026-05-24 18:05:33 -07:00
Ben	2afefc501c	feat(docker): per-profile s6 supervision + container-restart reconciliation Phase 4 of the s6-overlay supervision plan. Activates the Phase 3 S6ServiceManager by hooking it into the profile lifecycle and the `hermes gateway start/stop/restart` dispatcher, and adds a cont- init.d-time reconciliation pass that survives `docker restart`. Task 4.0 — container-boot reconciliation: /run/service/ is tmpfs, so every `docker restart` wipes every per-profile gateway slot. /etc/cont-init.d/02-reconcile-profiles invokes hermes_cli.container_boot.reconcile_profile_gateways() on every boot, which walks $HERMES_HOME/profiles/<name>/, reads each gateway_state.json, recreates the s6 service slot, and auto-starts only those whose last state was 'running'. Other states (stopped, starting, startup_failed, missing) register the slot in the down state — avoiding crash-loops across restarts for a gateway that was broken last boot. Per-profile outcome is recorded to $HERMES_HOME/logs/container-boot.log. Implementation: hermes_cli/container_boot.py + 12 unit tests. Profile-marker is SOUL.md, not config.yaml, because `hermes profile create` only seeds SOUL.md by default (config.yaml comes from `hermes setup`). Task 4.1 / 4.2 — profile create/delete hooks: hermes_cli/profiles.py::create_profile now calls _maybe_register_gateway_service(<canon>) at the end, which routes through ServiceManager.register_profile_gateway when running on s6 and no-ops on host backends. delete_profile mirrors with _maybe_unregister_gateway_service. _allocate_gateway_port produces a deterministic SHA-256-derived port in [9200, 9800). Task 4.3 — gateway dispatch + remove rejection arms: _dispatch_via_service_manager_if_s6(action) intercepts start/stop/restart at the top of each subcommand and routes them through S6ServiceManager.{start,stop,restart}. The pre-Phase-4 `elif is_container():` rejection arms are kept as fallback for pre-s6 containers / unsupported runtimes, but only ever fire when detect_service_manager() != 's6'. install/uninstall under s6 print informational guidance pointing users at profile create/delete. Removed the two xfail(strict=True) markers from tests/docker/test_profile_gateway.py — both tests now pass strictly. Task 4.4 — status reporting: get_gateway_runtime_snapshot() reports Manager: 's6 (container supervisor)' inside an s6 container instead of 'docker (foreground)'. Plan-vs-reality drift fixed in this commit: - Plan's S6ServiceManager._render_run_script used `gateway start --foreground --port {port}` — invented args; the real CLI is `gateway run`. Switched accordingly. port arg retained for API parity but now documented as 'currently ignored'. - Plan's reconciler keyed on config.yaml; switched to SOUL.md (config.yaml is created by hermes setup, not by hermes profile create, so the original gate caught nothing). - The plan's _dispatch helper used _profile_arg() which returns '--profile <name>' (i.e. with the flag prefix). Switched to _profile_suffix() which returns the bare name. - Architecture B's docker exec doesn't get /command on PATH or the venv on PATH; Dockerfile's runtime PATH now includes /opt/hermes/.venv/bin so 'docker exec <c> hermes ...' works without sourcing the venv. - stage2-hook now chowns $HERMES_HOME/profiles to hermes on every boot, not just on the UID-remap path. Without this, files created by docker-exec-as-root accumulate and the next reconciler run fails with PermissionError reading SOUL.md. Test harness: 19 passed, 0 xfailed (the two pre-Phase-4 xfail targets flip to passing). 78 unit tests across service_manager + container_boot + profiles_s6_hooks + gateway_s6_dispatch. Hadolint + shellcheck pass cleanly. Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md	2026-05-24 18:05:33 -07:00
Ben	0abf661f71	feat(service_manager): add S6ServiceManager for runtime gateway supervision Phase 3 of the s6-overlay supervision plan. Implements the runtime- registration surface from D4 — only the s6 backend supports register_profile_gateway / unregister_profile_gateway / list_profile_gateways; host backends continue to raise NotImplementedError. No caller yet (Phase 4 wires in the profile create/delete hooks). Key implementation notes: - Service directory shape: /run/service/gateway-<profile>/{type,run,log/run}. Atomic register: write to gateway-<profile>.tmp, fsync via os.rename. Cleanup on rescan failure. - Run script uses #!/command/with-contenv sh so HERMES_HOME and any extra_env arrive at exec time. The hermes -p <profile> gateway start --foreground --port <port> command is wrapped in s6-setuidgid hermes for the per-service privilege drop (OQ2-A). - Log script (OQ8-C): persists via s6-log to ${HERMES_HOME}/logs/gateways/<profile>/. CRITICAL — HERMES_HOME is a runtime env-var expansion in the rendered script, NOT a Python f-string substitution. Negative-asserted in test_s6_register_creates_service_dir_and_triggers_scan so regressions are caught. - PATH gotcha: /command/ is only on PATH for processes spawned by the supervision tree (services, cont-init.d). `docker exec` and profile-create hooks don't get it. S6ServiceManager calls all s6-* binaries via absolute path through the new _S6_BIN_DIR constant so callers don't have to fix up env vars. - validate_profile_name rejects path-traversal, leading-dash (s6 would parse as a flag), uppercase, whitespace, and names >251 chars (s6-svscan default name_max). Test coverage: - 13 new unit tests in tests/hermes_cli/test_service_manager.py (kind detection, run-script content, env quoting, register rollback on rescan failure, unregister idempotence, list filter, lifecycle dispatch, svstat parsing). Total: 36 passing. - 2 new in-container integration tests in tests/docker/test_s6_profile_gateway_integration.py validating end-to-end registration against a real s6 supervision tree. Docker harness: 14 passed, 2 xfailed (Phase 4 target unchanged). Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md	2026-05-24 18:05:33 -07:00
Ben	e0e9c895d3	feat(docker)!: replace tini with s6-overlay as PID 1 BREAKING CHANGE: the container ENTRYPOINT is now /init (s6-overlay) instead of /usr/bin/tini. Main hermes runs as the container CMD with TTY inherited (preserving --tui), dashboard runs as a supervised s6-rc service (HERMES_DASHBOARD=1 starts it; crashes auto-restart), and the ground is laid for per-profile gateway supervision (Phase 3+4). All five pre-s6 docker run invocation patterns continue to work identically — verified by the Phase 0 docker harness: docker run <image> → `hermes` with no args docker run <image> chat -q "..." → `hermes chat -q ...` passthrough docker run <image> sleep infinity → `sleep infinity` direct docker run <image> bash → interactive bash docker run -it <image> --tui → interactive Ink TUI Phase 2 harness result: 12 passed, 2 xfailed (Phase 4 target). Hadolint + shellcheck pass cleanly. Architecture pivot from plan v3 (documented in main-hermes/run header): the plan called for main hermes to be an s6-supervised service, but two real s6-overlay v3 mechanics blocked that — cont-init.d scripts receive no arguments (CMD args are not visible to stage2-hook), and `/run/s6/basedir/bin/halt` after writing the exit code did not propagate the desired exit code (container exits 143). We use the s6-overlay-native CMD pattern instead: main-wrapper.sh is the container's main program (ENTRYPOINT prepends it so leading-dash args like --version aren't intercepted by /init), exec's the final program with stdin/stdout/stderr inherited, and the program's exit code becomes the container exit code. main-hermes is now a no-op `sleep infinity` slot kept for future supervised-gateway-container modes. This trades "supervised restart of main hermes" for arg- parity with the pre-s6 contract — main hermes was already unsupervised under tini, so we lose nothing functional. Dashboard supervision is the only new guarantee added by this phase. Files added: docker/main-wrapper.sh # arg routing + s6-setuidgid drop docker/stage2-hook.sh # gosu-equivalent + chown + seed docker/s6-rc.d/main-hermes/{type,run,dependencies.d/base} docker/s6-rc.d/dashboard/{type,run,dependencies.d/base} docker/s6-rc.d/user/contents.d/{main-hermes,dashboard} Files changed: Dockerfile: tini → s6-overlay install + ENTRYPOINT flip + service wiring docker/entrypoint.sh: thin shim to stage2-hook.sh for back-compat tests/docker/test_dashboard.py: add test_dashboard_restarts_after_crash Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md	2026-05-24 18:05:33 -07:00
Ben	440147ebea	test(docker): stabilize Phase 0 baseline harness Two pre-existing baseline issues found while running the Phase 0 harness against the tini image that need fixing before later phases can use the harness as a behavior-parity oracle: 1. The autouse `_enforce_test_timeout` fixture in tests/conftest.py hard-coded a 30s SIGALRM, which preempted any `pytest.mark.timeout` marker (already honored by pytest-timeout). Honor the marker if present; fall back to 30s otherwise. Docker harness tests carry a 180s marker applied at collection time in tests/docker/conftest.py. 2. test_dashboard_port_override polled via `ss -tlnp` / `netstat -tln` — neither is installed in the Hermes image, so the probe trivially failed even when the dashboard was bound. The dashboard also takes 8-15s to bind on cold image; the 5s sleep was insufficient. Replace with a poll loop reading /proc/net/tcp directly (port 9120 = 0x23A0, state 0A = LISTEN). Bump probe deadline to 60s and switch test_dashboard_opt_in_starts to a similar poll for pgrep so we don't regress to the same race. Result: 11 passed, 2 xfailed (Phase 4 target) on tini image. Harness now ready to serve as Phase 2's behavior-parity oracle.	2026-05-24 18:05:14 -07:00
Ben	a18f69eb55	test(docker): apply 180s timeout to docker harness tests The agent-test suite default is 30s; docker test_no_args (the dashboard spin-up, the container restart) routinely take 60-90s. Without this they intermittently fail in CI with TimeoutError.	2026-05-24 18:05:14 -07:00
Ben	6e6acdea2a	test(docker): lock baseline behavior for Phase 0 harness Tasks 0.2-0.6 of the s6-overlay supervision plan. Locks the user-visible behavior we must preserve through the Phase 2 init- system swap: - test_main_invocation.py (Task 0.2): docker run <image> with no args, chat subcommand passthrough, bare executable passthrough, bash pattern, exit-code propagation - test_tui_passthrough.py (Task 0.3): TTY allocation via docker -t using the host's script(1) for a PTY - test_dashboard.py (Task 0.4): HERMES_DASHBOARD=1 opt-in, HERMES_DASHBOARD_PORT override - test_profile_gateway.py (Task 0.5): per-profile gateway start/stop and profile-delete-stops-gateway. Both marked xfail(strict=True) because the current tini image refuses gateway lifecycle commands inside the container; Phase 4 Task 4.3 flips them to passing. - test_zombie_reaping.py (Task 0.6): PID 1 reaps orphaned zombies. tini does this today; s6-overlay's /init must continue to. Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md	2026-05-24 18:05:14 -07:00
Ben	08302135b6	test(docker): add conftest fixtures for docker harness Task 0.1 of the s6-overlay supervision plan. Establishes the test infrastructure for tests/docker/: skip-on-missing-Docker collection hook, session-scoped image-build fixture (overridable via the HERMES_TEST_IMAGE env var for faster local iteration), and a container_name fixture that ensures cleanup on test exit. Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md	2026-05-24 18:05:14 -07:00

22 Commits