fix(docker): seed gateway_state.json from HERMES_GATEWAY_BOOTSTRAP_STATE on first boot (#37896)

On a fresh volume there is no gateway_state.json, so the boot reconciler
(cont-init.d/02-reconcile-profiles) registers the gateway-default s6 slot
but leaves it down — it only auto-starts when the last recorded state was
"running". A freshly-provisioned container therefore comes up with the
gateway down until something starts it (e.g. the dashboard's start button).

Add a generic, first-boot-only env-seed in stage2-hook.sh (which runs
before 02-reconcile-profiles): when HERMES_GATEWAY_BOOTSTRAP_STATE=running
and no gateway_state.json exists yet, seed {"gateway_state":"running"} so
the reconciler brings the supervised slot up on the very first boot.

This mirrors the existing HERMES_AUTH_JSON_BOOTSTRAP pattern: it seeds the
same state file the reconciler already consults, guarded by [ ! -f ] so
persisted runtime state always wins on later boots (a deliberately-stopped
gateway stays stopped across restarts). Only the literal "running" is
honoured (the sole value in the reconciler's _AUTOSTART_STATES).

Generic container contract — no host-specific code. Useful to any
orchestrator that provisions a blank volume and wants the gateway up from
first boot (the supervised gateway/dashboard already work on such hosts;
only the first-boot autostart was missing because the CLI lifecycle
commands can't drive the s6 layer when container self-detection misses).

Adds a shell-level contract test and documents the env var.
This commit is contained in:
Ben Barclay
2026-06-03 15:11:15 +10:00
committed by GitHub
parent e618cbee44
commit d9f7e7ac81
3 changed files with 185 additions and 0 deletions

View File

@ -278,6 +278,38 @@ if [ ! -f "$HERMES_HOME/auth.json" ] && [ -n "${HERMES_AUTH_JSON_BOOTSTRAP:-}" ]
chmod 600 "$HERMES_HOME/auth.json"
fi
# gateway_state.json: declare the gateway's INITIAL supervised state on a
# fresh volume. Same first-boot-only env-seed pattern as auth.json above.
#
# On a blank volume there is no gateway_state.json, so the boot reconciler
# (cont-init.d/02-reconcile-profiles → container_boot.reconcile_profile_gateways)
# registers the gateway-default s6 slot but leaves it DOWN — it only
# auto-starts when the last recorded state was "running". That means a
# freshly-provisioned container comes up with the gateway down until
# someone starts it (e.g. from the dashboard). An orchestrator that
# provisions a fresh volume and wants the gateway running from first boot
# can set HERMES_GATEWAY_BOOTSTRAP_STATE=running; we seed the state file
# here, BEFORE 02-reconcile-profiles runs (cont-init.d scripts run in
# lexicographic order), so the reconciler sees prior_state=running and
# brings the supervised slot up on the very first boot.
#
# This is a generic container contract, not specific to any host: it seeds
# the SAME gateway_state.json the reconciler already consults, exactly as
# HERMES_AUTH_JSON_BOOTSTRAP seeds auth.json. The [ ! -f ] guard is the
# load-bearing part — on every subsequent boot the persisted state wins,
# so a gateway the operator deliberately stopped stays stopped across
# restarts and we never clobber real runtime state.
#
# Only a literal "running" is honoured (the sole value in the reconciler's
# _AUTOSTART_STATES); any other value is ignored so a typo can't write a
# bogus state the reconciler would treat as "no prior state" anyway.
if [ ! -f "$HERMES_HOME/gateway_state.json" ] && \
[ "${HERMES_GATEWAY_BOOTSTRAP_STATE:-}" = "running" ]; then
printf '{"gateway_state":"running"}\n' > "$HERMES_HOME/gateway_state.json"
chown hermes:hermes "$HERMES_HOME/gateway_state.json" 2>/dev/null || true
chmod 644 "$HERMES_HOME/gateway_state.json"
fi
# --- Sync bundled skills ---
# Invoke the venv's python by absolute path so we don't need a `sh -c`
# wrapper to source the activate script. This is safe because