hermes-agent

Author	SHA1	Message	Date
Ben Barclay	03ba06ebfb	fix(docker): chown gateway install tree on UID remap (salvage #37928 ) (#38655 ) Salvage of #37928 (@sarvesh1327), reduced to the still-needed delta. `/opt/hermes/gateway` is a runtime-writable Python package: on first import the supervised gateway writes `__pycache__` beneath it, and the image does not set PYTHONDONTWRITEBYTECODE. When HERMES_UID/PUID is remapped at boot (e.g. Unraid 99), `usermod -u` only re-chowns the hermes home dir; the build trees under /opt/hermes keep the build-time UID (10000). main already chowns `.venv`, `ui-tui`, and `node_modules` on remap (#38556) but missed `gateway`, so the remapped gateway hits EACCES writing `__pycache__` (#27221). Add `/opt/hermes/gateway` to both chown sites — the Dockerfile build-time `chown -R hermes:hermes` line and the stage2-hook build-tree repair — so it tracks the remapped UID like the sibling trees. Differs from #37928 as submitted: dropped the `uid_gid_remapped` flag and the `\|\| [ "$uid_gid_remapped" = true ]` chown gate. main's #38556 already solved that half, and more correctly — it probes the actual tree ownership (`venv_owner != actual_hermes_uid`) rather than tracking same-boot remaps, which also catches pre-existing ownership drift and stays idempotent. Keeping #37928's flag would regress that. The salvage is the `gateway`-tree addition only. Verified end-to-end against a real image build: on baseline main a remap to UID 99 leaves `gateway` owned by 10000 and a write as uid 99 fails EACCES; with this change `gateway` is chowned to 99:100 and the write succeeds, while the default-uid (no-remap) path is unchanged. Fixes #27221. Co-authored-by: Sarvesh <sarveshagl1327@gmail.com>	2026-06-04 13:34:23 +10:00
cornna	7402706c5e	fix(docker): accept Unraid uid mappings (#38098 ) Co-authored-by: Cornna <96944678+ymylive@users.noreply.github.com>	2026-06-04 12:38:24 +10:00
Ben Barclay	04d620d91f	fix(docker): run config migrations during container boot (salvage #35508 ) (#36627 ) Salvage of #35508 (@dchenk), rebased onto current main. Resolved the tests/tools/test_stage2_hook_puid_pgid.py conflict (kept both the envdir-creation regression test on main and the new config-migration tests). Docker image upgrades replace code under $INSTALL_DIR but preserve $HERMES_HOME on the mounted volume, so the persisted config.yaml never received the schema migrations that non-Docker `hermes update` runs (#35406). This adds scripts/docker_config_migrate.py, invoked from stage2-hook after first-boot seeding and before gateway services start: it backs up config.yaml + .env, runs migrate_config(interactive=False), and honors HERMES_SKIP_CONFIG_MIGRATION=1 for manual control. Also fixes a latent bug in check_config_version(): it called load_config() which deep-merges DEFAULT_CONFIG, so a legacy config with no raw _config_version falsely reported as already-current. It now reads the raw on-disk file so legacy configs are correctly detected for migration. Differs from #35508 as submitted (Option B cleanup): dropped the `_config_version` line added to cli-config.yaml.example and removed the accompanying test_cli_config_example_declares_latest_version change-detector test. The example is a copy-template and has no business asserting a schema version; check_config_version() reads the user's real config.yaml, not the example. This removes a second sync point that drifts on every version bump. Closes #35508. Fixes #35406. Co-authored-by: Dmitriy Cherchenko <17372886+dchenk@users.noreply.github.com>	2026-06-04 11:11:27 +10:00
Ben Barclay	343c54e35b	fix(docker): reject unsupported --user <arbitrary-uid> start with clear guidance (#38579 ) `docker run --user $(id -u):$(id -g)` was a tini-era trick to make container-written files match the host user. Under s6-overlay it no longer works: the bootstrap (UID remap, volume + build-tree chown, config seeding) needs root, and the baked image dirs (/opt/data, /opt/hermes/.venv, ui-tui, node_modules) are owned by the hermes build UID (10000). A pinned arbitrary UID can't write them, so the runtime fails with EACCES on a bind mount or hard-crashes on a named volume (Docker inits the volume from the image as 10000; the non-root start can't even `cd /opt/data`, and the profile reconciler dies with PermissionError on gateway_state.json). Detect that start early in both the cont-init hook (stage2-hook.sh) and the CMD wrapper (main-wrapper.sh) and fail fast with actionable guidance pointing at the supported path: root start + HERMES_UID/HERMES_GID (or the PUID/PGID aliases), which remaps the hermes user and chowns the volume — the same host-UID-matching outcome --user was used for, without breaking s6. The guard fires only when the current UID is neither root NOR the hermes UID. This preserves the supported non-root start from #34648/#34837 (running with `--user 10000:10000`, i.e. pinned to the hermes UID itself), which is unaffected — only the arbitrary-UID variant that #34837 never actually made writable is rejected. Verified live across five scenarios (built image, bind + named volume): arbitrary --user on bind -> rejected with guidance, hermes does not run; arbitrary --user on named volume -> guidance shown, no raw 'can't cd' crash; --user 10000:10000 -> boots; root + HERMES_UID=4242 remap -> boots, guard not tripped; default root start -> boots. Pre-fix control reproduces the raw PermissionError + 'can't cd' crash with no guidance.	2026-06-04 10:51:51 +10:00
Ben Barclay	5446153c98	fix(docker): chown build trees on UID remap independently of $HERMES_HOME (#35027 regression) (#38556 ) The stage2 hook gates the recursive chown of the build trees under $INSTALL_DIR (.venv, ui-tui, node_modules) so a HERMES_UID/PUID remap leaves them writable by the new runtime UID — needed for lazy_deps 'uv pip install' of platform extras (#15012, #21100) and the TUI esbuild rebuild into ui-tui/dist (#28851). #35027 folded that chown under the $HERMES_HOME ownership check ('stat $HERMES_HOME != hermes_uid'). But 'usermod -u <new> hermes' re-chowns the hermes home dir ($HERMES_HOME == /opt/data) to the new UID as a side effect, so after any remap that stat is already satisfied and needs_chown is false — silently skipping the build-tree chown on the common PUID/NAS path. The venv stays owned by the build-time UID (10000), so lazy installs and TUI rebuilds fail with EACCES. Probe the build trees directly instead: chown only when /opt/hermes/.venv is not already owned by the runtime hermes UID. Independent of $HERMES_HOME ownership, idempotent across restarts. Verified live: built the image, booted with HERMES_UID/HERMES_GID on a fresh named volume, confirmed .venv/ui-tui/node_modules end up owned by the remapped UID and 'uv pip install' into the venv succeeds; confirmed the recursive chown fires once and is skipped on restart.	2026-06-04 10:17:55 +10:00
Ben Barclay	d9f7e7ac81	fix(docker): seed gateway_state.json from HERMES_GATEWAY_BOOTSTRAP_STATE on first boot (#37896 ) On a fresh volume there is no gateway_state.json, so the boot reconciler (cont-init.d/02-reconcile-profiles) registers the gateway-default s6 slot but leaves it down — it only auto-starts when the last recorded state was "running". A freshly-provisioned container therefore comes up with the gateway down until something starts it (e.g. the dashboard's start button). Add a generic, first-boot-only env-seed in stage2-hook.sh (which runs before 02-reconcile-profiles): when HERMES_GATEWAY_BOOTSTRAP_STATE=running and no gateway_state.json exists yet, seed {"gateway_state":"running"} so the reconciler brings the supervised slot up on the very first boot. This mirrors the existing HERMES_AUTH_JSON_BOOTSTRAP pattern: it seeds the same state file the reconciler already consults, guarded by [ ! -f ] so persisted runtime state always wins on later boots (a deliberately-stopped gateway stays stopped across restarts). Only the literal "running" is honoured (the sole value in the reconciler's _AUTOSTART_STATES). Generic container contract — no host-specific code. Useful to any orchestrator that provisions a blank volume and wants the gateway up from first boot (the supervised gateway/dashboard already work on such hosts; only the first-boot autostart was missing because the CLI lifecycle commands can't drive the s6 layer when container self-detection misses). Adds a shell-level contract test and documents the env var.	2026-06-03 15:11:15 +10:00
Nicolay	b3aaf2676b	fix(docker): discover Playwright headless_shell browser (#35717 ) Co-authored-by: Nic <nicsequenzy@gmail.com>	2026-06-01 16:06:44 +10:00
Amin Vakil	f106e58afa	fix(docker): create s6 envdir before browser path export (#34601 )	2026-06-01 15:44:30 +10:00
Ben Barclay	bdceedf784	fix(docker): chown hermes-owned top-level state files on boot (#35098 ) (#36236 ) The targeted data-volume chown in stage2-hook.sh only covers hermes-owned subdirectories; loose state files living directly under $HERMES_HOME (auth.json, state.db, gateway.lock, gateway_state.json, …) are missed. When created or rewritten by `docker exec <container> hermes …` (root unless `-u` is passed) they land root-owned, and the unprivileged hermes runtime then hits PermissionError on next startup, producing a gateway restart loop. Fix: reset ownership of an explicit allowlist of hermes-owned top-level files on every boot. The list mirrors the top-level file entries of hermes_cli.profile_distribution.USER_OWNED_EXCLUDE plus the runtime lock files. This uses a targeted allowlist rather than the originally-proposed blanket `find $HERMES_HOME -maxdepth 1 -user root` sweep, preserving the targeted-ownership contract from #19788 / PR #19795: a bind-mounted $HERMES_HOME may contain host-owned files Hermes does not manage, and those must never be chowned. Verified end-to-end: allowlisted root-owned files are reset to hermes on restart while a non-allowlisted host file keeps its root ownership. Co-authored-by: x1am1 <2663402852@qq.com>	2026-06-01 14:38:08 +10:00
Nacho Avecilla	380ce4789b	Remove prviliges drop when you never ran as root (#34837 )	2026-06-01 13:54:18 +10:00
Foldblade	1031031dec	fix(docker): skip unnecessary boot chown when volume ownership already matches remapped UID (#35027 )	2026-06-01 11:59:43 +10:00
Teknium	758454d1e4	fix(docker): validate HERMES_UID/GID to prevent privilege escalation in stage2-hook (#35340 ) Co-authored-by: sprmn24 <oncuevtv@gmail.com>	2026-06-01 11:46:53 +10:00
Ben	ec7736f8a7	fix(docker): auto-join Docker socket group for docker-in-docker backend When users bind-mount /var/run/docker.sock to use TERMINAL_ENV=docker from inside the container, the supervised hermes user (UID 10000) lacks permission to talk to the socket — every `docker` invocation EACCES'es and check_terminal_requirements() returns False. In messaging mode this also silently strips the file/terminal toolset from the registered tool list, so the agent rationalizes the missing tools as a platform restriction. The naive workaround (docker run --group-add <socket-gid>) does NOT work with our s6-setuidgid privilege drop: s6-setuidgid calls initgroups() for the target user, which rebuilds supp groups from /etc/group. Without a matching /etc/group entry the kernel-granted supp group is wiped between PID 1 and the dropped hermes process. Verified empirically: --group-add 998 alone: PID 1 Groups: 0 998 → after drop: Groups: 10000 This fix's /etc/group add: id hermes shows 998 → after drop: Groups: 998 10000 Detect the socket's GID at boot in stage2-hook (runs as root before the privilege drop), reuse an existing group name if one matches the GID, otherwise create 'hostdocker'. Idempotent across container restarts. Silent no-op when no socket is mounted. End-to-end verified by building the image and running the supervised hermes user against the real host Docker daemon: `docker version` succeeds and check_terminal_requirements() returns True. Fixes #16703	2026-05-29 16:15:44 +10:00
Ben Barclay	48083211ef	fix(docker): accept PUID/PGID as aliases for HERMES_UID/HERMES_GID (#25872 ) (#34401 ) Salvages #25872 by @konsisumer against current main. NAS users (UGOS, Synology, unRAID) expect the LinuxServer.io PUID/PGID convention and bind-mount /opt/data from a host directory owned by their own UID. Without this alias those vars are silently ignored and the s6-setuidgid drop to UID 10000 leaves the runtime unable to read the volume. HERMES_UID/HERMES_GID still take precedence when both are set. The original PR targeted docker/entrypoint.sh, which is now a 27-line deprecation shim under s6-overlay (the May 2026 rework moved all bootstrap logic to docker/stage2-hook.sh, installed as /etc/cont-init.d/01-hermes-setup). Re-applied the same 2-line alias resolution at the equivalent spot in stage2-hook.sh just before the existing UID/GID remap block. Test was retargeted at docker/stage2-hook.sh; docs hunk adapted to current main's wording ("stage2 hook" + s6-setuidgid, not the obsolete "entrypoint drops via gosu") with the NAS bind-mount example preserved verbatim. Test-first regression verification: reverted just docker/stage2-hook.sh to origin/main and re-ran the new tests. Result: FAILED test_stage2_hook_resolves_puid_pgid_aliases FAILED test_puid_pgid_populate_hermes_uid_gid AssertionError: assert ':' == '1000:10' That's the exact bug shape — PUID=1000 PGID=10 silently ignored, HERMES_UID/HERMES_GID stay empty. With the salvage applied, all 4 tests pass. Closes #25872 Co-authored-by: konsisumer <11262660+konsisumer@users.noreply.github.com>	2026-05-29 16:07:15 +10:00
Ben	3e33e14335	fix(docker): discover agent-browser Chromium binary at boot The image's Dockerfile runs npx playwright install chromium, which populates $PLAYWRIGHT_BROWSERS_PATH (=/opt/hermes/.playwright) with a `chromium_headless_shell-<build>/chrome-headless-shell-linux64/` tree. agent-browser (the runtime CLI Hermes spawns for the browser tool) doesn't recognise this layout in its own cache scan and fails with `Auto-launch failed: Chrome not found` — even though the binary is right there. Reproduction on current main: $ docker run --rm <image> sh -c 'npx -y agent-browser snapshot --url about:blank' ✗ Auto-launch failed: Chrome not found. Checked: - agent-browser cache: /tmp/.../.agent-browser/browsers - System Chrome installations - Puppeteer browser cache - Playwright browser cache Run `agent-browser install` to download Chrome, or use --executable-path. Fix: at boot, locate the binary under $PLAYWRIGHT_BROWSERS_PATH and export AGENT_BROWSER_EXECUTABLE_PATH via /run/s6/container_environment so the with-contenv shebang on main-wrapper.sh propagates it into the supervised `hermes` process and thence to agent-browser subprocesses. Filename-matched (chrome / chromium / chrome-headless-shell / chromium-browser), not path-matched: the chromium dir contains many shared libraries (libGLESv2.so, libEGL.so, ...) which inherit the executable bit from Playwright's tarball but are NOT browser binaries. Compare PR #18635's earlier `find \| grep -Ei 'chrome\|chromium'` which would match the path .../chrome-headless-shell-linux64/libGLESv2.so and pick a .so as the browser binary. User overrides (e.g. `-e AGENT_BROWSER_EXECUTABLE_PATH=/usr/bin/...`) are respected — the discovery block is skipped when the env var is already set. Quietly skipped when $PLAYWRIGHT_BROWSERS_PATH doesn't exist (e.g. custom builds that strip Playwright). This salvages PR #18635 by @jackey8616, who identified the bug and proposed the same env-var approach but in the now-deprecated docker/entrypoint.sh shim and with a path-match find command that selected .so files instead of the chrome binary. The fix retargets docker/stage2-hook.sh (the s6-overlay cont-init script where boot-time env setup belongs) with a corrected filename-match query. Fixes #15697 Closes #18635 Co-authored-by: Clooooode <12930377+jackey8616@users.noreply.github.com>	2026-05-27 20:43:27 +10:00
Ben	fb298a958c	fix(docker): mkdir HERMES_HOME as root in stage2 before chown / privilege drop (#18488 ) When HERMES_HOME points at a custom path whose parent directories only root can create (e.g. HERMES_HOME=/home/hermes/.hermes in a Compose file, or any path under a fresh / not pre-populated by the image), stage2-hook.sh fails on first boot: [stage2] Warning: chown failed (rootless container?) - continuing mkdir: cannot create directory '/custom': Permission denied mkdir: cannot create directory '/custom': Permission denied ... (one per s6-setuidgid hermes mkdir invocation) cont-init: info: /etc/cont-init.d/01-hermes-setup exited 1 The mkdirs fail because s6-setuidgid drops to hermes (UID 10000) before invoking mkdir -p, and the runtime user has no permission to create root-owned ancestor directories. 02-reconcile-profiles then crashes with FileNotFoundError, .install_method never lands, and the container limps on in a half-initialized state. Bootstrap HERMES_HOME with mkdir -p while still root, before the ownership normalization. Idempotent on the default /opt/data path (directory already exists from the Dockerfile RUN mkdir -p) and on any subsequent restart. (#18482) Retargeted from the original PR's docker/entrypoint.sh (now a deprecated shim) to docker/stage2-hook.sh where the related chown logic moved during the s6-overlay rework. Co-authored-by: wpengpeng168 <133926080+wpengpeng168@users.noreply.github.com>	2026-05-27 17:16:40 +10:00
Ben	22eb4d13f7	fix(docker): chown ui-tui and node_modules on UID remap so TUI esbuild works (#28851 ) When HERMES_UID remaps the hermes user from 10000 to another UID (e.g. matching the host user's UID for bind-mount ergonomics), the TUI launcher's esbuild step fails: ✘ [ERROR] Failed to write to output file: open /opt/hermes/ui-tui/dist/entry.js: permission denied TUI build failed. This is because the Dockerfile's build-time `chown -R hermes:hermes` on `/opt/hermes/{.venv,ui-tui,node_modules}` (line 154) wrote UID 10000, and stage2-hook.sh only re-chowned `.venv` on UID remap — leaving the TUI build trees still owned by the old UID. Extend the stage2 re-chown to include the same set as the build-time chown: `.venv`, `ui-tui`, `node_modules`. These are the runtime-writable trees under $INSTALL_DIR; everything else under /opt/hermes is read-only at runtime so keeping it root-owned is fine. Original fix targeted docker/entrypoint.sh which is now a deprecated shim; retargeted to docker/stage2-hook.sh where the .venv chown moved during the s6-overlay rework. Co-authored-by: Andreas Steffan <623481+deas@users.noreply.github.com>	2026-05-27 15:41:48 +10:00
Ben	9eadb6805c	fix(docker): targeted chown to preserve host file ownership in HERMES_HOME (#19795 ) Replaces the recursive chown of $HERMES_HOME in stage2-hook.sh with a targeted approach: chown the top-level dir (so hermes can create new subdirs) plus the specific hermes-owned subdirectories (cron/, sessions/, logs/, hooks/, memories/, skills/, skins/, plans/, workspace/, home/, profiles/) — the same canonical list seeded by the s6-setuidgid mkdir -p block below. Avoids clobbering host-side file ownership when $HERMES_HOME is a bind mount that contains user-owned files not managed by hermes (issue #19788). Original fix targeted docker/entrypoint.sh which is now a deprecated shim; retargeted to docker/stage2-hook.sh where the recursive chown moved during the s6-overlay rework. Co-authored-by: Ptichalouf <1809721+ptichalouf@users.noreply.github.com>	2026-05-27 15:08:41 +10:00
dusterbloom	79fc92e9cb	fix(security): tighten .env file permissions to 0600 at all creation sites .env holds API keys and secrets. Multiple creation sites used `cp` / `touch` / `shutil.copy2` which obey the process umask — commonly 0o022, leaving the file at 0o644 (world-readable). Apply chmod 0o600 explicitly at every site that creates or copies .env. Sites covered: - docker/stage2-hook.sh: after the seed_one '.env' call, applied unconditionally (not just on first-seed) so a host-mounted .env with loose perms gets tightened on every container restart - hermes_cli/doctor.py: 'hermes doctor --fix' touches an empty .env when missing - hermes_cli/profiles.py: 'hermes profile create --clone' copies .env from the source profile; shutil.copy2 preserves source mode, so a source .env at 0o644 was being cloned into 0o644 - setup-hermes.sh: in-tree setup script's cp .env.example .env path, plus the already-exists branch (mirror of install.sh which already chmods 600 unconditionally on line 1442) scripts/install.sh was NOT changed — it already chmod 600's the .env unconditionally after the create/already-exists branches (line 1442). Salvaged from PR #25726 by @dusterbloom. The docker/entrypoint.sh portion of the original PR was dropped because main switched to an s6-overlay shim — the .env creation logic moved to stage2-hook.sh, which is where the chmod now lives. Closes #25497 (subset — install.sh + setup-hermes.sh) and #8448 (subset — install.sh only) as superseded. Co-authored-by: teknium1 <127238744+teknium1@users.noreply.github.com>	2026-05-25 03:40:47 -07:00
Ben	9914bfc594	docker: drop sh -c wrappers from stage2-hook.sh PR #30136 review caught: three `s6-setuidgid hermes sh -c "..."` invocations in stage2-hook.sh interpolated $HERMES_HOME into a nested shell context. Practically low-risk (a malicious HERMES_HOME already requires container-launch privileges) but the cleaner pattern is to invoke commands directly so the shell isn't a second interpreter. * `mkdir -p` of the data subdirs now runs directly via s6-setuidgid, one path per arg. * The .install_method stamp is written via `printf \| tee` — also no shell wrapper. * The skills_sync invocation uses the venv's python by absolute path instead of sourcing activate inside a shell. skills_sync.py doesn't need anything from activate beyond sys.path, which the bin-stub python already provides. No behavior change. Just a smaller attack surface and a script that's easier to read.	2026-05-24 18:05:33 -07:00
Ben	2afefc501c	feat(docker): per-profile s6 supervision + container-restart reconciliation Phase 4 of the s6-overlay supervision plan. Activates the Phase 3 S6ServiceManager by hooking it into the profile lifecycle and the `hermes gateway start/stop/restart` dispatcher, and adds a cont- init.d-time reconciliation pass that survives `docker restart`. Task 4.0 — container-boot reconciliation: /run/service/ is tmpfs, so every `docker restart` wipes every per-profile gateway slot. /etc/cont-init.d/02-reconcile-profiles invokes hermes_cli.container_boot.reconcile_profile_gateways() on every boot, which walks $HERMES_HOME/profiles/<name>/, reads each gateway_state.json, recreates the s6 service slot, and auto-starts only those whose last state was 'running'. Other states (stopped, starting, startup_failed, missing) register the slot in the down state — avoiding crash-loops across restarts for a gateway that was broken last boot. Per-profile outcome is recorded to $HERMES_HOME/logs/container-boot.log. Implementation: hermes_cli/container_boot.py + 12 unit tests. Profile-marker is SOUL.md, not config.yaml, because `hermes profile create` only seeds SOUL.md by default (config.yaml comes from `hermes setup`). Task 4.1 / 4.2 — profile create/delete hooks: hermes_cli/profiles.py::create_profile now calls _maybe_register_gateway_service(<canon>) at the end, which routes through ServiceManager.register_profile_gateway when running on s6 and no-ops on host backends. delete_profile mirrors with _maybe_unregister_gateway_service. _allocate_gateway_port produces a deterministic SHA-256-derived port in [9200, 9800). Task 4.3 — gateway dispatch + remove rejection arms: _dispatch_via_service_manager_if_s6(action) intercepts start/stop/restart at the top of each subcommand and routes them through S6ServiceManager.{start,stop,restart}. The pre-Phase-4 `elif is_container():` rejection arms are kept as fallback for pre-s6 containers / unsupported runtimes, but only ever fire when detect_service_manager() != 's6'. install/uninstall under s6 print informational guidance pointing users at profile create/delete. Removed the two xfail(strict=True) markers from tests/docker/test_profile_gateway.py — both tests now pass strictly. Task 4.4 — status reporting: get_gateway_runtime_snapshot() reports Manager: 's6 (container supervisor)' inside an s6 container instead of 'docker (foreground)'. Plan-vs-reality drift fixed in this commit: - Plan's S6ServiceManager._render_run_script used `gateway start --foreground --port {port}` — invented args; the real CLI is `gateway run`. Switched accordingly. port arg retained for API parity but now documented as 'currently ignored'. - Plan's reconciler keyed on config.yaml; switched to SOUL.md (config.yaml is created by hermes setup, not by hermes profile create, so the original gate caught nothing). - The plan's _dispatch helper used _profile_arg() which returns '--profile <name>' (i.e. with the flag prefix). Switched to _profile_suffix() which returns the bare name. - Architecture B's docker exec doesn't get /command on PATH or the venv on PATH; Dockerfile's runtime PATH now includes /opt/hermes/.venv/bin so 'docker exec <c> hermes ...' works without sourcing the venv. - stage2-hook now chowns $HERMES_HOME/profiles to hermes on every boot, not just on the UID-remap path. Without this, files created by docker-exec-as-root accumulate and the next reconciler run fails with PermissionError reading SOUL.md. Test harness: 19 passed, 0 xfailed (the two pre-Phase-4 xfail targets flip to passing). 78 unit tests across service_manager + container_boot + profiles_s6_hooks + gateway_s6_dispatch. Hadolint + shellcheck pass cleanly. Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md	2026-05-24 18:05:33 -07:00
Ben	e0e9c895d3	feat(docker)!: replace tini with s6-overlay as PID 1 BREAKING CHANGE: the container ENTRYPOINT is now /init (s6-overlay) instead of /usr/bin/tini. Main hermes runs as the container CMD with TTY inherited (preserving --tui), dashboard runs as a supervised s6-rc service (HERMES_DASHBOARD=1 starts it; crashes auto-restart), and the ground is laid for per-profile gateway supervision (Phase 3+4). All five pre-s6 docker run invocation patterns continue to work identically — verified by the Phase 0 docker harness: docker run <image> → `hermes` with no args docker run <image> chat -q "..." → `hermes chat -q ...` passthrough docker run <image> sleep infinity → `sleep infinity` direct docker run <image> bash → interactive bash docker run -it <image> --tui → interactive Ink TUI Phase 2 harness result: 12 passed, 2 xfailed (Phase 4 target). Hadolint + shellcheck pass cleanly. Architecture pivot from plan v3 (documented in main-hermes/run header): the plan called for main hermes to be an s6-supervised service, but two real s6-overlay v3 mechanics blocked that — cont-init.d scripts receive no arguments (CMD args are not visible to stage2-hook), and `/run/s6/basedir/bin/halt` after writing the exit code did not propagate the desired exit code (container exits 143). We use the s6-overlay-native CMD pattern instead: main-wrapper.sh is the container's main program (ENTRYPOINT prepends it so leading-dash args like --version aren't intercepted by /init), exec's the final program with stdin/stdout/stderr inherited, and the program's exit code becomes the container exit code. main-hermes is now a no-op `sleep infinity` slot kept for future supervised-gateway-container modes. This trades "supervised restart of main hermes" for arg- parity with the pre-s6 contract — main hermes was already unsupervised under tini, so we lose nothing functional. Dashboard supervision is the only new guarantee added by this phase. Files added: docker/main-wrapper.sh # arg routing + s6-setuidgid drop docker/stage2-hook.sh # gosu-equivalent + chown + seed docker/s6-rc.d/main-hermes/{type,run,dependencies.d/base} docker/s6-rc.d/dashboard/{type,run,dependencies.d/base} docker/s6-rc.d/user/contents.d/{main-hermes,dashboard} Files changed: Dockerfile: tini → s6-overlay install + ENTRYPOINT flip + service wiring docker/entrypoint.sh: thin shim to stage2-hook.sh for back-compat tests/docker/test_dashboard.py: add test_dashboard_restarts_after_crash Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md	2026-05-24 18:05:33 -07:00

22 Commits