From d9f7e7ac815e747e8b55d51dcb3329f2f29a30ca Mon Sep 17 00:00:00 2001
From: Ben Barclay
Date: Wed, 3 Jun 2026 15:11:15 +1000
Subject: [PATCH] fix(docker): seed gateway_state.json from
 HERMES_GATEWAY_BOOTSTRAP_STATE on first boot (#37896)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

On a fresh volume there is no gateway_state.json, so the boot reconciler
(cont-init.d/02-reconcile-profiles) registers the gateway-default s6 slot
but leaves it down — it only auto-starts when the last recorded state was
"running". A freshly-provisioned container therefore comes up with the
gateway down until something starts it (e.g. the dashboard's start
button).

Add a generic, first-boot-only env-seed in stage2-hook.sh (which runs
before 02-reconcile-profiles): when HERMES_GATEWAY_BOOTSTRAP_STATE=running
and no gateway_state.json exists yet, seed {"gateway_state":"running"} so
the reconciler brings the supervised slot up on the very first boot.

This mirrors the existing HERMES_AUTH_JSON_BOOTSTRAP pattern: it seeds the
same state file the reconciler already consults, guarded by [ ! -f ] so
persisted runtime state always wins on later boots (a deliberately-stopped
gateway stays stopped across restarts). Only the literal "running" is
honoured (the sole value in the reconciler's _AUTOSTART_STATES).

Generic container contract — no host-specific code. Useful to any
orchestrator that provisions a blank volume and wants the gateway up from
first boot (the supervised gateway/dashboard already work on such hosts;
only the first-boot autostart was missing because the CLI lifecycle
commands can't drive the s6 layer when container self-detection misses).

Adds a shell-level contract test and documents the env var.
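
For reviewers, the intended first-boot behaviour can be modelled roughly
as below. This is only an illustrative sketch, not the shipped code: the
real seed is the shell block in stage2-hook.sh and the real autostart
decision lives in the reconciler. The names seed_state, should_autostart
and first_boot are hypothetical, and AUTOSTART_STATES is assumed to match
the reconciler's _AUTOSTART_STATES as described above.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional

# Assumption: mirrors the reconciler's _AUTOSTART_STATES ({"running"}).
AUTOSTART_STATES = {"running"}


def seed_state(home: Path, bootstrap_env: Optional[str]) -> None:
    """Model of the seed: write only when no state file exists yet."""
    state_file = home / "gateway_state.json"
    if not state_file.exists() and bootstrap_env == "running":
        state_file.write_text('{"gateway_state":"running"}\n')


def should_autostart(home: Path) -> bool:
    """Hypothetical stand-in for the reconciler's autostart decision."""
    state_file = home / "gateway_state.json"
    if not state_file.exists():
        return False  # no prior state: slot is registered but left down
    prior = json.loads(state_file.read_text()).get("gateway_state")
    return prior in AUTOSTART_STATES


def first_boot(bootstrap_env: Optional[str],
               persisted: Optional[str] = None) -> bool:
    """Simulate one boot in a scratch HERMES_HOME; return autostart."""
    with tempfile.TemporaryDirectory() as d:
        home = Path(d)
        if persisted is not None:
            # Pre-existing runtime state left by an earlier boot.
            (home / "gateway_state.json").write_text(
                json.dumps({"gateway_state": persisted})
            )
        seed_state(home, bootstrap_env)
        return should_autostart(home)
```

In this model, first_boot("running") starts the gateway on a blank
volume, while first_boot("running", persisted="stopped") leaves it down
because the persisted state wins over the env var.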
---
 docker/stage2-hook.sh                         |  32 ++++
 ...est_stage2_hook_gateway_bootstrap_state.py | 152 ++++++++++++++++++
 .../docs/reference/environment-variables.md   |   1 +
 3 files changed, 185 insertions(+)
 create mode 100644 tests/tools/test_stage2_hook_gateway_bootstrap_state.py

diff --git a/docker/stage2-hook.sh b/docker/stage2-hook.sh
index 413e9a211..56925198e 100755
--- a/docker/stage2-hook.sh
+++ b/docker/stage2-hook.sh
@@ -278,6 +278,38 @@ if [ ! -f "$HERMES_HOME/auth.json" ] && [ -n "${HERMES_AUTH_JSON_BOOTSTRAP:-}" ]
     chmod 600 "$HERMES_HOME/auth.json"
 fi
 
+# gateway_state.json: declare the gateway's INITIAL supervised state on a
+# fresh volume. Same first-boot-only env-seed pattern as auth.json above.
+#
+# On a blank volume there is no gateway_state.json, so the boot reconciler
+# (cont-init.d/02-reconcile-profiles → container_boot.reconcile_profile_gateways)
+# registers the gateway-default s6 slot but leaves it DOWN — it only
+# auto-starts when the last recorded state was "running". That means a
+# freshly-provisioned container comes up with the gateway down until
+# someone starts it (e.g. from the dashboard). An orchestrator that
+# provisions a fresh volume and wants the gateway running from first boot
+# can set HERMES_GATEWAY_BOOTSTRAP_STATE=running; we seed the state file
+# here, BEFORE 02-reconcile-profiles runs (cont-init.d scripts run in
+# lexicographic order), so the reconciler sees prior_state=running and
+# brings the supervised slot up on the very first boot.
+#
+# This is a generic container contract, not specific to any host: it seeds
+# the SAME gateway_state.json the reconciler already consults, exactly as
+# HERMES_AUTH_JSON_BOOTSTRAP seeds auth.json. The [ ! -f ] guard is the
+# load-bearing part — on every subsequent boot the persisted state wins,
+# so a gateway the operator deliberately stopped stays stopped across
+# restarts and we never clobber real runtime state.
+#
+# Only a literal "running" is honoured (the sole value in the reconciler's
+# _AUTOSTART_STATES); any other value is ignored so a typo can't write a
+# bogus state the reconciler would treat as "no prior state" anyway.
+if [ ! -f "$HERMES_HOME/gateway_state.json" ] && \
+   [ "${HERMES_GATEWAY_BOOTSTRAP_STATE:-}" = "running" ]; then
+    printf '{"gateway_state":"running"}\n' > "$HERMES_HOME/gateway_state.json"
+    chown hermes:hermes "$HERMES_HOME/gateway_state.json" 2>/dev/null || true
+    chmod 644 "$HERMES_HOME/gateway_state.json"
+fi
+
 # --- Sync bundled skills ---
 # Invoke the venv's python by absolute path so we don't need a `sh -c`
 # wrapper to source the activate script. This is safe because
diff --git a/tests/tools/test_stage2_hook_gateway_bootstrap_state.py b/tests/tools/test_stage2_hook_gateway_bootstrap_state.py
new file mode 100644
index 000000000..813d18d89
--- /dev/null
+++ b/tests/tools/test_stage2_hook_gateway_bootstrap_state.py
@@ -0,0 +1,152 @@
+"""Contract test: the s6-overlay stage2 hook seeds gateway_state.json from
+HERMES_GATEWAY_BOOTSTRAP_STATE on first boot, so a freshly-provisioned
+container can come up with the gateway already running.
+
+Background. On a blank volume there is no gateway_state.json, so the boot
+reconciler (cont-init.d/02-reconcile-profiles ->
+container_boot.reconcile_profile_gateways) registers the gateway-default s6
+slot but leaves it DOWN — it only auto-starts when the last recorded state was
+"running". A container provisioned on a fresh volume therefore comes up with
+the gateway down until something starts it.
+
+An orchestrator that wants the gateway running from first boot sets
+HERMES_GATEWAY_BOOTSTRAP_STATE=running; stage2-hook.sh (installed as
+/etc/cont-init.d/01-hermes-setup, which runs lexicographically BEFORE
+02-reconcile-profiles) seeds the state file so the reconciler sees
+prior_state=running and brings the slot up on the very first boot.
+
+This mirrors the existing HERMES_AUTH_JSON_BOOTSTRAP env-seed pattern: it seeds
+the SAME gateway_state.json the reconciler already consults, guarded by
+``[ ! -f ]`` so persisted runtime state always wins on subsequent boots (a
+deliberately-stopped gateway must stay stopped across restarts).
+"""
+from __future__ import annotations
+
+import json
+import re
+import shutil
+import subprocess
+import tempfile
+from pathlib import Path
+
+import pytest
+
+REPO_ROOT = Path(__file__).resolve().parents[2]
+STAGE2_HOOK = REPO_ROOT / "docker" / "stage2-hook.sh"
+
+
+@pytest.fixture(scope="module")
+def stage2_text() -> str:
+    if not STAGE2_HOOK.exists():
+        pytest.skip("docker/stage2-hook.sh not present in this checkout")
+    return STAGE2_HOOK.read_text()
+
+
+def _seed_block(text: str) -> str:
+    """Extract the ``if [ ! -f "$HERMES_HOME/gateway_state.json" ] && … fi``
+    block that seeds the gateway state file from the bootstrap env var."""
+    m = re.search(
+        r'(if \[ ! -f "\$HERMES_HOME/gateway_state\.json" \] && \\\n'
+        r"(?:.*\n)*?fi)",
+        text,
+    )
+    assert m, (
+        "stage2-hook.sh must contain the gateway_state.json bootstrap-seed block "
+        "guarded on HERMES_GATEWAY_BOOTSTRAP_STATE"
+    )
+    return m.group(1)
+
+
+def test_seed_block_present_and_guarded(stage2_text: str) -> None:
+    block = _seed_block(stage2_text)
+    # Must be a first-boot-only seed (the [ ! -f ] guard) keyed on the env var.
+    assert '[ ! -f "$HERMES_HOME/gateway_state.json" ]' in block, (
+        "seed must be guarded by [ ! -f ] so persisted state wins on restart"
+    )
+    assert "HERMES_GATEWAY_BOOTSTRAP_STATE" in block
+    assert "gateway_state" in block
+
+
+def _run_seed(
+    text: str, *, env_value: str | None, preexisting: str | None
+) -> str | None:
+    """Run the extracted seed block in a sandbox $HERMES_HOME.
+
+    ``env_value`` is the HERMES_GATEWAY_BOOTSTRAP_STATE value (None = unset).
+    ``preexisting`` is the contents of a gateway_state.json placed before the
+    block runs (None = no file). Returns the file's contents afterwards, or
+    None if it doesn't exist. ``chown``/``chmod`` are stubbed so the block
+    runs without real root.
+    """
+    bash = shutil.which("bash")
+    if bash is None:
+        pytest.skip("bash not available")
+    block = _seed_block(text)
+
+    with tempfile.TemporaryDirectory() as d:
+        dpath = Path(d)
+        home = dpath / "home"
+        home.mkdir()
+        state_file = home / "gateway_state.json"
+        if preexisting is not None:
+            state_file.write_text(preexisting)
+
+        env_line = (
+            f'export HERMES_GATEWAY_BOOTSTRAP_STATE="{env_value}"\n'
+            if env_value is not None
+            else "unset HERMES_GATEWAY_BOOTSTRAP_STATE\n"
+        )
+        script = (
+            "set -e\n"
+            f'HERMES_HOME="{home}"\n'
+            # Stub privilege ops — the sandbox isn't root.
+            "chown() { :; }\n"
+            "chmod() { :; }\n"
+            + env_line
+            + block
+        )
+        script_path = dpath / "harness.sh"
+        script_path.write_text(script)
+
+        proc = subprocess.run(
+            [bash, str(script_path)], capture_output=True, text=True
+        )
+        assert proc.returncode == 0, proc.stderr
+
+        if not state_file.exists():
+            return None
+        return state_file.read_text()
+
+
+def test_seeds_running_state_on_blank_volume(stage2_text: str) -> None:
+    """env=running + no pre-existing file -> writes a valid running state."""
+    out = _run_seed(stage2_text, env_value="running", preexisting=None)
+    assert out is not None, "seed must create gateway_state.json"
+    assert json.loads(out).get("gateway_state") == "running"
+
+
+def test_does_not_clobber_existing_state(stage2_text: str) -> None:
+    """The [ ! -f ] guard: an existing state file is never overwritten, even
+    when the bootstrap env var says running. A deliberately-stopped gateway
+    must stay stopped across restarts."""
+    existing = json.dumps({"gateway_state": "stopped", "pid": 123})
+    out = _run_seed(stage2_text, env_value="running", preexisting=existing)
+    assert out == existing, "seed must not clobber a persisted state file"
+
+
+def test_no_seed_when_env_unset(stage2_text: str) -> None:
+    """No env var -> no file written (preserves the default down-on-first-boot
+    behaviour for orchestrators that don't opt in)."""
+    out = _run_seed(stage2_text, env_value=None, preexisting=None)
+    assert out is None, "seed must not run when HERMES_GATEWAY_BOOTSTRAP_STATE is unset"
+
+
+def test_non_running_value_ignored(stage2_text: str) -> None:
+    """Only a literal "running" is honoured; any other value is ignored so a
+    typo can't write a bogus state. (The reconciler's _AUTOSTART_STATES is
+    exactly {"running"}.)"""
+    for bogus in ("stopped", "Running", "1", "true", "starting"):
+        out = _run_seed(stage2_text, env_value=bogus, preexisting=None)
+        assert out is None, (
+            f"only 'running' should seed a state file, not {bogus!r}"
+        )
diff --git a/website/docs/reference/environment-variables.md b/website/docs/reference/environment-variables.md
index 664a4f068..ae620c9a3 100644
--- a/website/docs/reference/environment-variables.md
+++ b/website/docs/reference/environment-variables.md
@@ -519,6 +519,7 @@ Advanced per-platform knobs for throttling the outbound message batcher. Most us
 | `HERMES_GATEWAY_BUSY_INPUT_MODE` | Default gateway busy-input behavior: `queue`, `steer`, or `interrupt`. Can be overridden per chat with `/busy`. |
 | `HERMES_GATEWAY_BUSY_ACK_ENABLED` | Whether the gateway sends an acknowledgment message (⚡/⏳/⏩) when a user sends input while the agent is busy (default: `true`). Set to `false` to suppress these messages entirely — the input is still queued/steered/interrupts as normal, only the chat reply is silenced. Bridged from `display.busy_ack_enabled` in `config.yaml`. |
 | `HERMES_GATEWAY_NO_SUPERVISE` | Inside the s6-overlay Docker image, opt out of auto-supervision when running `hermes gateway run` and use pre-s6 foreground semantics (no auto-restart, gateway is the container's main process). Truthy values: `1`, `true`, `yes`. Equivalent to the `--no-supervise` CLI flag. No-op outside the s6 image. |
+| `HERMES_GATEWAY_BOOTSTRAP_STATE` | Inside the s6-overlay Docker image, declare the gateway's **initial** supervised state on a fresh volume. On a blank volume there is no persisted `gateway_state.json`, so the boot reconciler registers the `gateway-default` slot but leaves it **down** (it only auto-starts when the last recorded state was `running`). Set this to `running` and the first-boot setup hook seeds `gateway_state.json` *before* the reconciler runs, so the gateway comes up on the very first boot. Only the literal value `running` is honoured. First-boot-only: an existing `gateway_state.json` is never overwritten, so a deliberately-stopped gateway stays stopped across restarts. No-op outside the s6 image. |
 | `HERMES_FILE_MUTATION_VERIFIER` | Enable the per-turn file-mutation verifier footer (default: `true`). When enabled, Hermes appends an advisory listing any `write_file` / `patch` calls that failed during the turn and were not superseded by a successful write. Set to `0`, `false`, `no`, or `off` to suppress. Mirrors `display.file_mutation_verifier` in `config.yaml`; the env var wins when set. |
 | `HERMES_CRON_TIMEOUT` | Inactivity timeout for cron job agent runs in seconds (default: `600`). The agent can run indefinitely while actively calling tools or receiving stream tokens — this only triggers when idle. Set to `0` for unlimited. |
 | `HERMES_CRON_SCRIPT_TIMEOUT` | Timeout for pre-run scripts attached to cron jobs in seconds (default: `120`). Override for scripts that need longer execution (e.g., randomized delays for anti-bot timing). Also configurable via `cron.script_timeout_seconds` in `config.yaml`. |