main
13 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
| 380ce4789b | Remove prviliges drop when you never ran as root (#34837) | |||
| c341a2d107 | fix(docker): align HOME for dashboard and s6 gateway services (#33481) | |||
| b345323195 |
fix(docker): tee supervised gateway stdout to docker logs
Follow-up to #33583 (the gateway-run-supervised redirect). Before this fix, the supervised gateway's stdout (most visibly the "Hermes Gateway Starting…" rich-console banner) was swallowed by `s6-log` into the rotated file at `${HERMES_HOME}/logs/gateways/<profile>/current` and never reached `docker logs`. Operational signal lived in two places: * **docker logs** — saw stderr (Python `logging` defaults to stderr), so warnings/errors were visible. * **the rotated file** — saw stdout (rich banners, `print()` output, third-party libs that wrote to fd 1). This was surprising for users coming from the pre-s6 image, where `docker run … gateway run` produced a single unified stream in `docker logs`. They'd see partial output, conclude something was broken, and dig around for the missing pieces. Fix: add the `1` s6-log action directive before the file destination so each line is forwarded to s6-log's stdout — which propagates up the s6-supervise pipeline to /init's stdout = container stdout = `docker logs`. The file destination is preserved as a second destination, so the rotated log (with ISO 8601 timestamps) still exists for `hermes logs` and for survival across container restarts. Trade-off considered: timestamps. Putting `T` between `1` and the file destination (not before `1`) means: * docker logs sees raw lines — Python's logging formatter has its own timestamps, and `docker logs --timestamps` adds another layer when desired. No double-stamping in the common reading path. * The persisted file gets s6-log's ISO 8601 timestamp so even output that lacked a Python-logger timestamp (rich banners, third-party raw prints) is correlatable in `current`. Verification: * New unit-test assertion in `test_service_manager.py` locks the `s6-log 1` directive into the rendered run-script. Mutation- tested by reverting to the pre-fix script (no `1`); the assert catches it cleanly. * New docker-harness test `test_supervised_gateway_stdout_reaches_docker_logs` builds the image, runs `docker run … gateway run`, and asserts the unique `⚕` banner glyph reaches `docker logs`. Also verifies the rotated file still contains the banner (no regression on the existing file destination). Mutation-tested end-to-end: built a deliberately-broken image without the `1` directive and the test failed exactly as designed, citing the banner present in `current` but absent from `docker logs`. * `website/docs/user-guide/docker.md` gains a new `:::note Where gateway logs go` admonition documenting both destinations and the audit-log file at `${HERMES_HOME}/logs/container-boot.log`. Existing functionality preserved: every other docker-harness test still passes against the new image. Unit-test sweep across `tests/hermes_cli/` (5561 tests) is green. |
|||
| 0927fb5584 |
feat(docker): auto-redirect gateway run to supervised mode inside s6 image
Pre-s6, `docker run nousresearch/hermes-agent gateway run` was the
standard invocation: gateway ran as the container's main process,
tini reaped zombies, container exit code matched gateway exit code,
no supervision. With s6-overlay as PID 1, the same invocation now
auto-upgrades to supervised semantics — auto-restart on crash,
dashboard supervised alongside (when HERMES_DASHBOARD=1 is set),
multiple profile gateways under the same /init.
Users get the new behavior with zero changes to their docker run
command. A loud one-line breadcrumb on stderr explains the upgrade
and points at the opt-out for users who genuinely want pre-s6
foreground semantics.
How it works:
1. `_gateway_command_inner` (the `gateway run` handler) checks if
we're inside a container with s6 as PID 1.
2. If yes, dispatches `start` to the s6 service manager (registers
and starts gateway-default), then `exec sleep infinity` to keep
the CMD process alive without binding container lifetime to
gateway PID lifetime. The supervised gateway can flap freely;
`docker stop` still tears everything down via /init stage 3.
3. If no, falls through to the existing foreground code path
unchanged. Host runs of `hermes gateway run` are unaffected.
Three gates make the redirect inert outside the intended scope:
* `detect_service_manager() != "s6"` — host/non-s6-container runs.
* `HERMES_S6_SUPERVISED_CHILD=1` env var (recursion guard) —
exported by `S6ServiceManager._render_run_script` for the
s6-supervised invocation itself. Without this guard, the
supervised `gateway run --replace` would re-enter the redirect
and recurse (run → start → run → start → ...) infinitely.
* `--no-supervise` CLI flag OR `HERMES_GATEWAY_NO_SUPERVISE=1` env
var — explicit user opt-out for CI smoke tests, debugging the
foreground startup path, or any case wanting "CMD exit =
container exit" semantics. Strict truthiness (1/true/yes,
case-insensitive); typos like `=0` do NOT silently opt out.
Tests:
* Unit tests in tests/hermes_cli/test_gateway_s6_dispatch.py
cover all five paths (host no-op, supervised fire, sentinel
recursion guard, CLI flag, env var truthy + falsy). The two
load-bearing gates (sentinel + opt-out) were mutation-tested
by removing each gate in isolation and confirming the dedicated
test fails with the expected error.
* Docker harness tests in tests/docker/test_gateway_run_supervised.py
cover the round trips end-to-end against a built image: redirect
fires (sleep-infinity heartbeat + supervised gateway-default
slot + breadcrumb), --no-supervise opt-out (foreground gateway,
no want-up on the slot), HERMES_GATEWAY_NO_SUPERVISE env var
works identically, recursion is impossible (≤1 supervised
python gateway-run + exactly 1 sleep-infinity parented to the
CMD wrapper), and HERMES_DASHBOARD=1 produces both supervised
gateway and supervised dashboard.
Docs:
* Added a `:::tip Gateway runs supervised` admonition near the
main docker.md example explaining the upgrade and pointing at
the opt-out. Pre-s6 (tini-based) images still run gateway run
as the foreground main process, so the note is scoped to the
s6 image only.
Trade-off documented in the helper docstring: container exit code
under the redirect is sleep's exit code (always 0 on SIGTERM), not
the gateway's. That was an explicit design call — the supervised
gateway is allowed to flap without taking the container with it,
which is what "supervision" means. CI users who want exit-code
forwarding can pass --no-supervise.
|
|||
| 4f416fc40c |
fix(docker): make s6 lifecycle work for the unprivileged hermes user
Resolves the explicit "Known follow-up" left by commit 2f8ceeab9 and the resulting CI failures in tests/docker/test_dashboard.py and tests/docker/test_s6_profile_gateway_integration.py. The product gap --------------- Every hermes runtime operation inside the container runs as the hermes user (UID 10000) via s6-setuidgid. But s6-supervise — spawned by s6-svscan running as PID 1 — creates each service's supervise/ and top-level event/ directories with mode 0700 owned by its effective UID (root). That left every s6-svc / s6-svstat / s6-svwait call from hermes hitting EACCES on the supervise/control FIFO and supervise/status — i.e. the entire S6ServiceManager lifecycle (register, start, stop, unregister) was inert in production. The 2f8ceeab9 commit message called this out and deferred the fix. The audit changes that landed alongside it (defaulting docker_exec to -u hermes) made the integration tests reproduce the bug deterministically; the fix below resolves it. The fix: pre-create the supervise/ skeleton hermes-owned ---------------------------------------------------------- Reading s6's source (src/supervision/s6-supervise.c::trymkdir + control_init), the mkdir and mkfifo calls that build the supervise tree are EEXIST-safe: if the directory or FIFO is already present, s6-supervise reuses it and skips the chown/chmod fix-up that would normally make event/ 03730 root:root. So if we lay the skeleton down with hermes ownership before triggering s6-svscanctl -a, s6-supervise inherits our layout and never touches it. The death_tally / lock / status regular files written later by s6-supervise (still as root) land mode 0644 — world-readable — which is all s6-svstat needs. New module-level helper _seed_supervise_skeleton(svc_dir) in hermes_cli/service_manager.py lays down: svc_dir/event/ hermes:hermes 03730 svc_dir/supervise/ hermes:hermes 0755 svc_dir/supervise/event/ hermes:hermes 03730 svc_dir/supervise/control hermes:hermes 0660 (FIFO) svc_dir/log/event/ hermes:hermes 03730 (if log/ present) svc_dir/log/supervise/ hermes:hermes 0755 svc_dir/log/supervise/event/ hermes:hermes 03730 svc_dir/log/supervise/control hermes:hermes 0660 (FIFO) The log/ branch matters because the logger is a second s6-supervise instance — without it, unregister rmtree races on the logger's root-owned supervise dir even after the parent slot's supervise/ is hermes-owned. The helper is idempotent and swallows PermissionError on chown so it works equally well when called from root (cont-init.d) or hermes (runtime register). Wiring ------ 1. S6ServiceManager.register_profile_gateway calls _seed_supervise_skeleton(tmp_dir) just before publishing the slot via Path.replace. Runtime-registered profile gateways are set up by hermes. 2. container_boot._register_service does the same in the cont-init.d reconciliation path so boot-time-restored profile slots inherit the same layout. 3. New cont-init.d/015-supervise-perms script chowns the supervise/ and event/ trees for STATIC s6-rc services (dashboard, main-hermes). These are spawned by s6-rc before cont-init.d gets to run, so the EEXIST-trick doesn't apply; we chown the already-existing tree instead. s6-supervise keeps using the same files; it never re-asserts ownership on a running service. The script skips s6-overlay internal services (s6rc-*, s6-linux-*) so the supervision tree itself stays root-only. 015- slot is intentional: lex-sorts between 01-hermes-setup and 02-reconcile-profiles in the container's C-locale, so the chown finishes before the reconciler walks the scandir. Unregister teardown reordering ------------------------------ S6ServiceManager.unregister_profile_gateway now fires s6-svscanctl -an BEFORE rmtree (with a 200ms grace), so s6-svscan reaps the supervise child and releases its file handles on supervise/lock + supervise/status before we try to remove the directory. Previously rmtree raced s6-supervise on a set of files inside the supervise dir, and even with the parent supervise/ now hermes-owned, the contained files (death_tally, lock, status, written by root) could still be in use. Dashboard down-state redesign ----------------------------- The original PR #30136 review fix wrote a 'down' marker file into /run/service/dashboard/ via cont-init.d/03-dashboard-toggle. That approach was broken in two ways: (a) /run/service/dashboard is a symlink to a TRANSIENT /run/s6-rc:s6-rc-init:<tmpdir>/ directory while s6-rc is mid-transaction; the touch landed in a soon-to-be-discarded tmp. (b) Even when written to the final /run/s6-rc/servicedirs/ location, the 'down' file is only consulted by s6-supervise at slot startup. s6-rc's user-bundle explicitly transitions 'dashboard' to 'up' on every boot, overriding any down marker. The right fix is the canonical s6 pattern: when HERMES_DASHBOARD is unset, the dashboard run script exits 0 and a companion finish script exits 125. Per s6-supervise(8), exit code 125 from the finish script is the 'permanent failure, do not restart' marker — equivalent to s6-svc -O. The slot reports as 'down' to s6-svstat, matching the reality that no dashboard process is running. When HERMES_DASHBOARD IS truthy, finish exits 0 and restart-on-crash semantics apply. 03-dashboard-toggle is removed (its function is now subsumed by the run/finish pair). Tests ----- Adds four unit tests for _seed_supervise_skeleton covering the produced layout, the log/ subservice case, the skip-when-no-log case, and idempotency. The live-container verification continues to live in tests/docker/test_s6_profile_gateway_integration.py and tests/docker/test_dashboard.py — both now pass against the rebuilt image. References ---------- * Skarnet skaware mailing list 2020-02-02 (Laurent Bercot + Guillermo Diaz Hartusch) on unprivileged s6 tool semantics: http://skarnet.org/lists/skaware/1424.html * just-containers/s6-overlay#130 — same EEXIST-preseed pattern, community-validated 2016 onward * https://skarnet.org/software/s6/servicedir.html — exit-code 125 semantics in finish scripts (cherry picked from commit c41f908ad46043728d884f4b1929435636cf1bcb) |
|||
| 5cbb132c1d |
fix(ci): exclude tests/docker/ from regular test shards; pin read_text encoding
Two CI follow-ups to @benbarclay's #30136 salvage: 1. scripts/run_tests_parallel.py — add 'docker' to _SKIP_PARTS so the new tests/docker/ harness doesn't run in the regular test (N) matrix. The harness builds the real Dockerfile in a session fixture, which can exceed pytest-timeout's 180s ceiling on ubuntu-latest where Docker IS available — it surfaced as 6 identical setup-timeout failures across slices 1–6 on the first CI run. The docker harness has its own dedicated runner via .github/actions/hermes-smoke-test (added in #30136) plus the docker-lint workflow. Same treatment as tests/integration/ and tests/e2e/ — runs separately, not in the main shards. 2. hermes_cli/service_manager.py — pin encoding='utf-8' on the /proc/1/comm read_text call. Ruff PLW1514 enforcement rolled in between Ben's last push and the salvage; pure ruff-fix, no behavior change. |
|||
| d735b083e8 |
fix(service_manager): rip out dead port parameter
PR #30136 review caught: `_allocate_gateway_port()` in profiles.py computed a SHA-256-derived port that was threaded through `register_profile_gateway(profile, port=N)` → `_render_run_script(profile, port, extra_env)` → and then **ignored**. The rendered run script picked the bind port from the profile's config.yaml (`[gateway] port = …`), never from the allocator. So the entire allocator + parameter chain was dead code. Remove: * `hermes_cli.profiles._allocate_gateway_port` (deterministic SHA-256 → [9200, 9800) — never used). * `port` kwarg from `ServiceManager.register_profile_gateway` (Protocol + Mixin + S6 implementation). * `port` positional arg from `_render_run_script(profile, port, extra_env)` — now `_render_run_script(profile, extra_env)`. * The pass-through call in `profiles._maybe_register_gateway_service`. config.yaml is now the single source of truth for gateway port selection — matches reality and reduces the API surface. Three explanatory comments in service_manager.py / profiles.py document the retirement so future readers don't reach for the allocator and find a ghost. Tests: drop the three `_allocate_gateway_port` tests; update fakes' signatures throughout test_service_manager.py and test_profiles_s6_hooks.py to match the new no-port API. |
|||
| b28b3f51d3 |
fix(service_manager): friendly errors for missing slots and s6-svc failures
PR #30136 review caught: `S6ServiceManager.start/stop/restart` called `subprocess.run(check=True)` on `s6-svc`, so any failure surfaced as a raw `CalledProcessError` traceback. The two cases operators actually hit are: 1. The service slot doesn't exist — most commonly because the user typed a profile name wrong (`hermes -p typo gateway start`). 2. s6-svc itself fails — most commonly EACCES on the supervise control FIFO when running unprivileged. Both deserve named errors with actionable messages, not stacktraces. Changes: * Add `S6Error` base + two concrete errors in `hermes_cli.service_manager`: - `GatewayNotRegisteredError(profile)` — carries the unprefixed profile name; message: `no such gateway 'typo': register it with `hermes profile create typo` first, or pass an existing profile name via `-p <name>``. - `S6CommandError(service, action, returncode, stderr)` — carries the s6-svc rc and stderr; message: `s6-svc start on 'gateway-coder' failed (rc=111): <stderr>`. * Factor lifecycle dispatch through `_run_svc(flag, label, name)`: pre-checks that the service directory exists (raises GatewayNotRegisteredError before invoking s6-svc), then runs s6-svc and translates any CalledProcessError into S6CommandError. * `_dispatch_via_service_manager_if_s6` in `hermes_cli.gateway` catches both errors and prints `✗ <message>` + `sys.exit(1)` instead of letting the exception bubble. The dispatch path that used to dump a traceback at the user now gives an actionable one-liner. Tests: 6 new tests for the error types and their CLI rendering; existing lifecycle test pre-seeds the slot directory before calling `mgr.start` etc. |
|||
| b044c1ac29 |
fix(container_boot): always register gateway-default slot
PR #30136 review caught: `hermes gateway start` (no `-p`) inside the container resolves `_profile_suffix() == ""` → service name `gateway-default`, but no such slot was ever registered. The Phase 4 profile-create hook only fired on `hermes profile create <name>`, and the root profile (which lives at the top of $HERMES_HOME, not under `profiles/`) was never one of those. So bare `hermes gateway start` landed on `s6-svc -u /run/service/gateway-default` → uncaught `CalledProcessError` → traceback to the user. Changes: 1. `reconcile_profile_gateways` now always registers a `gateway-default` slot before iterating named profiles. Its prior state is read from `$HERMES_HOME/gateway_state.json` (sibling to the profile root, not under `profiles/`); stale runtime files there are swept the same way. Auto-up only if the prior state was `running` — same rule as named profiles. 2. `S6ServiceManager._render_run_script` special-cases `profile == "default"` to emit `hermes gateway run` with NO `-p` flag. Passing `-p default` would resolve to `$HERMES_HOME/profiles/default/` — a different profile that almost certainly doesn't exist. The empty profile-suffix convention is the dispatcher's contract and the run script has to match. 3. A user-created `profiles/default/` collides with the reserved root-profile slot; the reconciler now skips it with a warning rather than producing two registrations of the same service name. Action-list ordering is stable: `default` first, then named profiles in directory order. Boot-log readers can rely on this. Tests: 8 new dedicated default-slot tests plus updates to every existing test that asserted against the action list (via the new `_named_actions` helper that drops the always-present default entry). |
|||
| fc39296e1f |
fix(service_manager): s6 detection works for unprivileged hermes user
PR #30136 review surfaced two issues, both rooted in the same audit gap: docker integration tests were running as root, not the unprivileged `hermes` user (UID 10000) that the runtime actually uses via `s6-setuidgid hermes`. Anything that probed PID-1 state or wrote to the s6 control surface worked as root in the tests but was inert in production. Fixes: 1. `_s6_running()` previously called `Path("/proc/1/exe").resolve()`, which is root-only readable. For UID 10000 the symlink yields PermissionError, `resolve()` silently returns the unresolved path, and `exe.name == "exe"` — so detection always returned False, the service-manager runtime-registration path was inert, and every `hermes profile create` / `hermes -p X gateway start` silently skipped the s6 hook. Replace with `/proc/1/comm` (world-readable) + `/run/s6/basedir` (s6-overlay-specific) — both required, fail closed. 2. `02-reconcile-profiles` now also chowns `/run/service/.s6-svscan/` {control,lock} to hermes so `s6-svscanctl -a/-an` works without root. Previously the directory chown stopped at `/run/service` and the FIFO inside stayed root-owned, so `register_profile_gateway` from hermes failed at the rescan-trigger step with EACCES — the wrapper in profiles.py caught the exception and printed a swallowed warning, so profile creation appeared to succeed while the slot was rolled back. Audit changes to flush this class of bug next time: - Add `docker_exec` / `docker_exec_sh` helpers to `tests/docker/conftest.py` that default to `-u hermes`. The module docstring explains why and flags `user="root"` as opt-in only for tests that explicitly need root (none currently do). - Refactor every `docker exec` call in tests/docker/ through the new helpers (test_dashboard.py, test_zombie_reaping.py, test_profile_gateway.py, test_container_restart.py, test_s6_profile_gateway_integration.py). - Add 5 unit tests covering `_s6_running` under various probe states (both signals present; comm wrong; basedir missing; PermissionError on /proc/1/comm; missing /proc — non-Linux). The PermissionError test is the explicit regression guard for the original bug. Known follow-up: the per-service `supervise/control` FIFO inside each `/run/service/gateway-<profile>/supervise/` is created root-owned by s6-supervise (which runs as root because s6-svscan is PID 1). `s6-svc -u/-d/-t` from the hermes user will get EACCES on those. The audit under `-u hermes` will reveal this in lifecycle tests — surfacing the issue cleanly so it can be fixed in a focused follow-up (likely via a small SUID helper or a polling chown loop in cont-init.d). The detection + svscanctl fixes here are independent and complete on their own. |
|||
| 2afefc501c |
feat(docker): per-profile s6 supervision + container-restart reconciliation
Phase 4 of the s6-overlay supervision plan. Activates the Phase 3
S6ServiceManager by hooking it into the profile lifecycle and the
`hermes gateway start/stop/restart` dispatcher, and adds a cont-
init.d-time reconciliation pass that survives `docker restart`.
Task 4.0 — container-boot reconciliation:
/run/service/ is tmpfs, so every `docker restart` wipes every
per-profile gateway slot. /etc/cont-init.d/02-reconcile-profiles
invokes hermes_cli.container_boot.reconcile_profile_gateways() on
every boot, which walks $HERMES_HOME/profiles/<name>/, reads each
gateway_state.json, recreates the s6 service slot, and auto-starts
only those whose last state was 'running'. Other states
(stopped, starting, startup_failed, missing) register the slot
in the down state — avoiding crash-loops across restarts for a
gateway that was broken last boot. Per-profile outcome is recorded
to $HERMES_HOME/logs/container-boot.log.
Implementation: hermes_cli/container_boot.py + 12 unit tests.
Profile-marker is SOUL.md, not config.yaml, because `hermes profile
create` only seeds SOUL.md by default (config.yaml comes from
`hermes setup`).
Task 4.1 / 4.2 — profile create/delete hooks:
hermes_cli/profiles.py::create_profile now calls
_maybe_register_gateway_service(<canon>) at the end, which routes
through ServiceManager.register_profile_gateway when running on s6
and no-ops on host backends. delete_profile mirrors with
_maybe_unregister_gateway_service. _allocate_gateway_port produces
a deterministic SHA-256-derived port in [9200, 9800).
Task 4.3 — gateway dispatch + remove rejection arms:
_dispatch_via_service_manager_if_s6(action) intercepts
start/stop/restart at the top of each subcommand and routes them
through S6ServiceManager.{start,stop,restart}. The pre-Phase-4
`elif is_container():` rejection arms are kept as fallback for
pre-s6 containers / unsupported runtimes, but only ever fire when
detect_service_manager() != 's6'. install/uninstall under s6
print informational guidance pointing users at profile create/delete.
Removed the two xfail(strict=True) markers from
tests/docker/test_profile_gateway.py — both tests now pass strictly.
Task 4.4 — status reporting:
get_gateway_runtime_snapshot() reports
Manager: 's6 (container supervisor)' inside an s6 container instead
of 'docker (foreground)'.
Plan-vs-reality drift fixed in this commit:
- Plan's S6ServiceManager._render_run_script used
`gateway start --foreground --port {port}` — invented args; the
real CLI is `gateway run`. Switched accordingly. port arg
retained for API parity but now documented as 'currently ignored'.
- Plan's reconciler keyed on config.yaml; switched to SOUL.md
(config.yaml is created by hermes setup, not by hermes profile
create, so the original gate caught nothing).
- The plan's _dispatch helper used _profile_arg() which returns
'--profile <name>' (i.e. with the flag prefix). Switched to
_profile_suffix() which returns the bare name.
- Architecture B's docker exec doesn't get /command on PATH or
the venv on PATH; Dockerfile's runtime PATH now includes
/opt/hermes/.venv/bin so 'docker exec <c> hermes ...' works
without sourcing the venv.
- stage2-hook now chowns $HERMES_HOME/profiles to hermes on every
boot, not just on the UID-remap path. Without this, files created
by docker-exec-as-root accumulate and the next reconciler run
fails with PermissionError reading SOUL.md.
Test harness:
19 passed, 0 xfailed (the two pre-Phase-4 xfail targets flip to
passing). 78 unit tests across service_manager + container_boot +
profiles_s6_hooks + gateway_s6_dispatch. Hadolint + shellcheck
pass cleanly.
Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md
|
|||
| 0abf661f71 |
feat(service_manager): add S6ServiceManager for runtime gateway supervision
Phase 3 of the s6-overlay supervision plan. Implements the runtime-
registration surface from D4 — only the s6 backend supports
register_profile_gateway / unregister_profile_gateway /
list_profile_gateways; host backends continue to raise
NotImplementedError. No caller yet (Phase 4 wires in the profile
create/delete hooks).
Key implementation notes:
- Service directory shape: /run/service/gateway-<profile>/{type,run,log/run}.
Atomic register: write to gateway-<profile>.tmp, fsync via
os.rename. Cleanup on rescan failure.
- Run script uses #!/command/with-contenv sh so HERMES_HOME and any
extra_env arrive at exec time. The hermes -p <profile> gateway
start --foreground --port <port> command is wrapped in
s6-setuidgid hermes for the per-service privilege drop (OQ2-A).
- Log script (OQ8-C): persists via s6-log to
${HERMES_HOME}/logs/gateways/<profile>/. CRITICAL — HERMES_HOME is
a runtime env-var expansion in the rendered script, NOT a Python
f-string substitution. Negative-asserted in
test_s6_register_creates_service_dir_and_triggers_scan so
regressions are caught.
- PATH gotcha: /command/ is only on PATH for processes spawned by
the supervision tree (services, cont-init.d). `docker exec` and
profile-create hooks don't get it. S6ServiceManager calls all
s6-* binaries via absolute path through the new _S6_BIN_DIR
constant so callers don't have to fix up env vars.
- validate_profile_name rejects path-traversal, leading-dash (s6
would parse as a flag), uppercase, whitespace, and names >251
chars (s6-svscan default name_max).
Test coverage:
- 13 new unit tests in tests/hermes_cli/test_service_manager.py
(kind detection, run-script content, env quoting, register
rollback on rescan failure, unregister idempotence, list filter,
lifecycle dispatch, svstat parsing). Total: 36 passing.
- 2 new in-container integration tests in
tests/docker/test_s6_profile_gateway_integration.py validating
end-to-end registration against a real s6 supervision tree.
Docker harness: 14 passed, 2 xfailed (Phase 4 target unchanged).
Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md
|
|||
| 51914b0514 |
feat(service_manager): add ServiceManager protocol + host wrappers
Phase 1 of the s6-overlay supervision plan. Pure-refactor addition: introduces the abstract interface (with runtime_checkable Protocol), detect_service_manager(), validate_profile_name(), and thin SystemdServiceManager / LaunchdServiceManager / WindowsServiceManager wrappers around the existing systemd_* / launchd_* / gateway_windows.* module-level functions. No host call site was modified — host code continues to use the existing functions directly; the protocol is for new backend-agnostic code (Phase 4 profile create/delete hooks and the Phase 4 s6 dispatch path in 'hermes gateway start/stop/restart'). WindowsServiceManager.install() forwards the v3 kwargs (start_now, start_on_login, elevated_handoff) added in PRs #28169-adjacent so non-Windows callers — there aren't any today — can opt in. The s6 backend lands in Phase 3; until then get_service_manager() raises a clear error if invoked on a host that detects as 's6'. |