Resolves the explicit "Known follow-up" left by commit 2f8ceeab9 and the resulting CI failures in tests/docker/test_dashboard.py and tests/docker/test_s6_profile_gateway_integration.py. The product gap --------------- Every hermes runtime operation inside the container runs as the hermes user (UID 10000) via s6-setuidgid. But s6-supervise — spawned by s6-svscan running as PID 1 — creates each service's supervise/ and top-level event/ directories with mode 0700 owned by its effective UID (root). That left every s6-svc / s6-svstat / s6-svwait call from hermes hitting EACCES on the supervise/control FIFO and supervise/status — i.e. the entire S6ServiceManager lifecycle (register, start, stop, unregister) was inert in production. The 2f8ceeab9 commit message called this out and deferred the fix. The audit changes that landed alongside it (defaulting docker_exec to -u hermes) made the integration tests reproduce the bug deterministically; the fix below resolves it. The fix: pre-create the supervise/ skeleton hermes-owned ---------------------------------------------------------- Reading s6's source (src/supervision/s6-supervise.c::trymkdir + control_init), the mkdir and mkfifo calls that build the supervise tree are EEXIST-safe: if the directory or FIFO is already present, s6-supervise reuses it and skips the chown/chmod fix-up that would normally make event/ 03730 root:root. So if we lay the skeleton down with hermes ownership before triggering s6-svscanctl -a, s6-supervise inherits our layout and never touches it. The death_tally / lock / status regular files written later by s6-supervise (still as root) land mode 0644 — world-readable — which is all s6-svstat needs. New module-level helper _seed_supervise_skeleton(svc_dir) in hermes_cli/service_manager.py lays down: svc_dir/event/ hermes:hermes 03730 svc_dir/supervise/ hermes:hermes 0755 svc_dir/supervise/event/ hermes:hermes 03730 svc_dir/supervise/control hermes:hermes 0660 (FIFO) svc_dir/log/event/ hermes:hermes 03730 (if log/ present) svc_dir/log/supervise/ hermes:hermes 0755 svc_dir/log/supervise/event/ hermes:hermes 03730 svc_dir/log/supervise/control hermes:hermes 0660 (FIFO) The log/ branch matters because the logger is a second s6-supervise instance — without it, unregister rmtree races on the logger's root-owned supervise dir even after the parent slot's supervise/ is hermes-owned. The helper is idempotent and swallows PermissionError on chown so it works equally well when called from root (cont-init.d) or hermes (runtime register). Wiring ------ 1. S6ServiceManager.register_profile_gateway calls _seed_supervise_skeleton(tmp_dir) just before publishing the slot via Path.replace. Runtime-registered profile gateways are set up by hermes. 2. container_boot._register_service does the same in the cont-init.d reconciliation path so boot-time-restored profile slots inherit the same layout. 3. New cont-init.d/015-supervise-perms script chowns the supervise/ and event/ trees for STATIC s6-rc services (dashboard, main-hermes). These are spawned by s6-rc before cont-init.d gets to run, so the EEXIST-trick doesn't apply; we chown the already-existing tree instead. s6-supervise keeps using the same files; it never re-asserts ownership on a running service. The script skips s6-overlay internal services (s6rc-*, s6-linux-*) so the supervision tree itself stays root-only. 015- slot is intentional: lex-sorts between 01-hermes-setup and 02-reconcile-profiles in the container's C-locale, so the chown finishes before the reconciler walks the scandir. Unregister teardown reordering ------------------------------ S6ServiceManager.unregister_profile_gateway now fires s6-svscanctl -an BEFORE rmtree (with a 200ms grace), so s6-svscan reaps the supervise child and releases its file handles on supervise/lock + supervise/status before we try to remove the directory. Previously rmtree raced s6-supervise on a set of files inside the supervise dir, and even with the parent supervise/ now hermes-owned, the contained files (death_tally, lock, status, written by root) could still be in use. Dashboard down-state redesign ----------------------------- The original PR #30136 review fix wrote a 'down' marker file into /run/service/dashboard/ via cont-init.d/03-dashboard-toggle. That approach was broken in two ways: (a) /run/service/dashboard is a symlink to a TRANSIENT /run/s6-rc:s6-rc-init:<tmpdir>/ directory while s6-rc is mid-transaction; the touch landed in a soon-to-be-discarded tmp. (b) Even when written to the final /run/s6-rc/servicedirs/ location, the 'down' file is only consulted by s6-supervise at slot startup. s6-rc's user-bundle explicitly transitions 'dashboard' to 'up' on every boot, overriding any down marker. The right fix is the canonical s6 pattern: when HERMES_DASHBOARD is unset, the dashboard run script exits 0 and a companion finish script exits 125. Per s6-supervise(8), exit code 125 from the finish script is the 'permanent failure, do not restart' marker — equivalent to s6-svc -O. The slot reports as 'down' to s6-svstat, matching the reality that no dashboard process is running. When HERMES_DASHBOARD IS truthy, finish exits 0 and restart-on-crash semantics apply. 03-dashboard-toggle is removed (its function is now subsumed by the run/finish pair). Tests ----- Adds four unit tests for _seed_supervise_skeleton covering the produced layout, the log/ subservice case, the skip-when-no-log case, and idempotency. The live-container verification continues to live in tests/docker/test_s6_profile_gateway_integration.py and tests/docker/test_dashboard.py — both now pass against the rebuilt image. References ---------- * Skarnet skaware mailing list 2020-02-02 (Laurent Bercot + Guillermo Diaz Hartusch) on unprivileged s6 tool semantics: http://skarnet.org/lists/skaware/1424.html * just-containers/s6-overlay#130 — same EEXIST-preseed pattern, community-validated 2016 onward * https://skarnet.org/software/s6/servicedir.html — exit-code 125 semantics in finish scripts (cherry picked from commit c41f908ad46043728d884f4b1929435636cf1bcb)
887 lines
35 KiB
Python
887 lines
35 KiB
Python
"""Abstract service manager interface.
|
|
|
|
Wraps the existing systemd (Linux host), launchd (macOS host), Windows
|
|
Scheduled Task (native Windows host), and s6 (container) backends behind
|
|
a common Protocol. Only the s6 backend supports runtime registration
|
|
(for per-profile gateways) — host backends raise NotImplementedError
|
|
from those methods, and callers MUST check supports_runtime_registration()
|
|
before invoking them.
|
|
|
|
Host-side call sites (setup wizard, uninstall, status) continue to use
|
|
the existing module-level functions in hermes_cli.gateway and
|
|
hermes_cli.gateway_windows directly. This protocol is a thin facade
|
|
used by new code that needs to be backend-agnostic — specifically the
|
|
profile create/delete hooks (Phase 4) and the s6 dispatch path in
|
|
``hermes gateway start/stop/restart`` when running inside a container.
|
|
"""
|
|
from __future__ import annotations
|
|
|
|
import re
|
|
from pathlib import Path
|
|
from typing import Literal, Protocol, runtime_checkable
|
|
|
|
ServiceManagerKind = Literal["systemd", "launchd", "windows", "s6", "none"]
|
|
|
|
# Profile name → service directory mapping. Profile names must be safe
|
|
# as filesystem directory names because the s6 backend creates a service
|
|
# directory at ``<scandir>/gateway-<profile>/``. We reject anything that
|
|
# could traverse paths, span filesystems, or break s6's own naming rules.
|
|
_VALID_PROFILE_RE = re.compile(r"^[a-z0-9][a-z0-9_-]*$")
|
|
_MAX_PROFILE_LEN = 251 # s6-svscan default name_max
|
|
|
|
|
|
def validate_profile_name(name: str) -> None:
|
|
"""Raise ValueError if ``name`` is not usable as a profile name.
|
|
|
|
Profile names are used as s6 service directory names, so they must
|
|
match a conservative subset of filesystem-safe characters. Reject
|
|
empty strings, uppercase, paths-traversal sequences, and anything
|
|
longer than s6's default ``name_max``.
|
|
"""
|
|
if not name:
|
|
raise ValueError("profile name must not be empty")
|
|
if len(name) > _MAX_PROFILE_LEN:
|
|
raise ValueError(
|
|
f"profile name too long ({len(name)} > {_MAX_PROFILE_LEN})"
|
|
)
|
|
if not _VALID_PROFILE_RE.match(name):
|
|
raise ValueError(
|
|
f"profile name must match [a-z0-9][a-z0-9_-]*, got {name!r}"
|
|
)
|
|
|
|
|
|
@runtime_checkable
|
|
class ServiceManager(Protocol):
|
|
"""Abstract interface for init-system-specific service operations.
|
|
|
|
Lifecycle methods (start / stop / restart / is_running) are
|
|
implemented by every backend. Runtime registration
|
|
(register_profile_gateway / unregister_profile_gateway /
|
|
list_profile_gateways) is implemented only by the s6 backend —
|
|
callers MUST check ``supports_runtime_registration()`` before
|
|
invoking the registration methods.
|
|
"""
|
|
|
|
kind: ServiceManagerKind
|
|
|
|
# Lifecycle of a pre-declared service.
|
|
def start(self, name: str) -> None: ...
|
|
def stop(self, name: str) -> None: ...
|
|
def restart(self, name: str) -> None: ...
|
|
def is_running(self, name: str) -> bool: ...
|
|
|
|
# Runtime registration (s6 only).
|
|
def supports_runtime_registration(self) -> bool: ...
|
|
def register_profile_gateway(
|
|
self,
|
|
profile: str,
|
|
*,
|
|
extra_env: dict[str, str] | None = None,
|
|
) -> None: ...
|
|
def unregister_profile_gateway(self, profile: str) -> None: ...
|
|
def list_profile_gateways(self) -> list[str]: ...
|
|
|
|
|
|
def detect_service_manager() -> ServiceManagerKind:
|
|
"""Detect which service manager is available in this environment.
|
|
|
|
Returns:
|
|
"s6" — inside a container when /init is s6-svscan (Phase 2+)
|
|
"windows" — native Windows host
|
|
"launchd" — macOS host
|
|
"systemd" — Linux host with a working user/system bus
|
|
"none" — anything else (Termux, sandbox shells, etc.)
|
|
|
|
This function does NOT replace ``supports_systemd_services()`` —
|
|
host call sites continue to use that. It exists for new backend-
|
|
agnostic code (profile create/delete hooks, the s6 dispatch path
|
|
in ``hermes gateway start/stop/restart``).
|
|
"""
|
|
# Imports deferred so importing this module doesn't drag in the
|
|
# whole gateway dependency graph for callers that only need the
|
|
# Protocol type or validate_profile_name().
|
|
from hermes_constants import is_container
|
|
from hermes_cli.gateway import (
|
|
is_macos,
|
|
is_windows,
|
|
supports_systemd_services,
|
|
)
|
|
|
|
if is_container() and _s6_running():
|
|
return "s6"
|
|
if is_windows():
|
|
return "windows"
|
|
if is_macos():
|
|
return "launchd"
|
|
if supports_systemd_services():
|
|
return "systemd"
|
|
return "none"
|
|
|
|
|
|
def _s6_running() -> bool:
|
|
"""True when s6-svscan is running as PID 1 in this container.
|
|
|
|
Detection has to work for **both** root and the unprivileged hermes
|
|
user (UID 10000). The obvious probe — ``Path('/proc/1/exe').resolve()``
|
|
— only works as root: for any other UID, the symlink at
|
|
``/proc/1/exe`` is unreadable and ``resolve()`` silently returns the
|
|
path unchanged, so the resolved name is the literal ``"exe"`` and
|
|
detection always fails. Since every Hermes runtime call inside the
|
|
container drops to hermes via ``s6-setuidgid``, that silent failure
|
|
made the entire service-manager runtime-registration path inert in
|
|
production (PR #30136 review).
|
|
|
|
Probe instead via:
|
|
* ``/proc/1/comm`` — world-readable, contains the process comm
|
|
(``s6-svscan`` when s6-overlay is PID 1).
|
|
* ``/run/s6/basedir`` — s6-overlay-specific directory created by
|
|
stage1. World-readable. More specific than ``/run/s6`` (which
|
|
other tools occasionally create).
|
|
|
|
Both signals are required; either alone could false-positive
|
|
(e.g. a container with the s6 binaries installed but a different
|
|
init, or an unrelated process named ``s6-svscan``).
|
|
"""
|
|
try:
|
|
comm = Path("/proc/1/comm").read_text(encoding="utf-8").strip()
|
|
except OSError:
|
|
return False
|
|
if comm != "s6-svscan":
|
|
return False
|
|
return Path("/run/s6/basedir").is_dir()
|
|
|
|
|
|
# ---------------------------------------------------------------------------
|
|
# Backend wrappers
|
|
#
|
|
# These adapters are thin facades over the existing module-level functions
|
|
# in ``hermes_cli.gateway`` (systemd/launchd) and ``hermes_cli.gateway_windows``
|
|
# (Windows Scheduled Tasks). The protocol's ``name`` parameter is currently
|
|
# unused for host backends — they operate on whichever profile is currently
|
|
# active (set via the ``hermes -p <profile>`` flag before the call). This
|
|
# matches existing host-side semantics; the parameter shape is designed
|
|
# for s6 where each profile maps to a distinct service directory.
|
|
# ---------------------------------------------------------------------------
|
|
|
|
|
|
class _RegistrationUnsupportedMixin:
|
|
"""Mixin for host backends that don't support runtime registration."""
|
|
|
|
def supports_runtime_registration(self) -> bool:
|
|
return False
|
|
|
|
def register_profile_gateway(
|
|
self,
|
|
profile: str,
|
|
*,
|
|
extra_env: dict[str, str] | None = None,
|
|
) -> None:
|
|
raise NotImplementedError(
|
|
f"{type(self).__name__} does not support runtime profile "
|
|
"gateway registration (container-only feature)"
|
|
)
|
|
|
|
def unregister_profile_gateway(self, profile: str) -> None:
|
|
raise NotImplementedError(
|
|
f"{type(self).__name__} does not support runtime profile "
|
|
"gateway unregistration (container-only feature)"
|
|
)
|
|
|
|
def list_profile_gateways(self) -> list[str]:
|
|
return []
|
|
|
|
|
|
class SystemdServiceManager(_RegistrationUnsupportedMixin):
|
|
"""Thin wrapper around the ``systemd_*`` functions in hermes_cli.gateway.
|
|
|
|
Existing host call sites continue to use those functions directly;
|
|
this wrapper exists for new code that needs to be backend-agnostic
|
|
(the Phase 4 profile create/delete hooks).
|
|
"""
|
|
|
|
kind: ServiceManagerKind = "systemd"
|
|
|
|
def start(self, name: str) -> None:
|
|
from hermes_cli.gateway import systemd_start
|
|
systemd_start()
|
|
|
|
def stop(self, name: str) -> None:
|
|
from hermes_cli.gateway import systemd_stop
|
|
systemd_stop()
|
|
|
|
def restart(self, name: str) -> None:
|
|
from hermes_cli.gateway import systemd_restart
|
|
systemd_restart()
|
|
|
|
def is_running(self, name: str) -> bool:
|
|
from hermes_cli.gateway import _probe_systemd_service_running
|
|
_, running = _probe_systemd_service_running()
|
|
return running
|
|
|
|
|
|
class LaunchdServiceManager(_RegistrationUnsupportedMixin):
|
|
"""Thin wrapper around the ``launchd_*`` functions in hermes_cli.gateway."""
|
|
|
|
kind: ServiceManagerKind = "launchd"
|
|
|
|
def start(self, name: str) -> None:
|
|
from hermes_cli.gateway import launchd_start
|
|
launchd_start()
|
|
|
|
def stop(self, name: str) -> None:
|
|
from hermes_cli.gateway import launchd_stop
|
|
launchd_stop()
|
|
|
|
def restart(self, name: str) -> None:
|
|
from hermes_cli.gateway import launchd_restart
|
|
launchd_restart()
|
|
|
|
def is_running(self, name: str) -> bool:
|
|
from hermes_cli.gateway import _probe_launchd_service_running
|
|
return _probe_launchd_service_running()
|
|
|
|
|
|
class WindowsServiceManager(_RegistrationUnsupportedMixin):
|
|
"""Thin wrapper around ``hermes_cli.gateway_windows`` (Scheduled Task /
|
|
Startup-folder fallback).
|
|
|
|
The native Windows backend uses a Scheduled Task rather than a true
|
|
init-system service, but for protocol purposes the lifecycle is the
|
|
same: start / stop / restart / is_running. ``install`` accepts a
|
|
handful of Windows-specific kwargs (start_now, start_on_login,
|
|
elevated_handoff) that are passed straight through — non-Windows
|
|
callers should never invoke ``install`` on this wrapper.
|
|
"""
|
|
|
|
kind: ServiceManagerKind = "windows"
|
|
|
|
def install(
|
|
self,
|
|
*,
|
|
force: bool = False,
|
|
start_now: bool | None = None,
|
|
start_on_login: bool | None = None,
|
|
elevated_handoff: bool = False,
|
|
) -> None:
|
|
from hermes_cli import gateway_windows
|
|
gateway_windows.install(
|
|
force=force,
|
|
start_now=start_now,
|
|
start_on_login=start_on_login,
|
|
elevated_handoff=elevated_handoff,
|
|
)
|
|
|
|
def start(self, name: str) -> None:
|
|
from hermes_cli import gateway_windows
|
|
gateway_windows.start()
|
|
|
|
def stop(self, name: str) -> None:
|
|
from hermes_cli import gateway_windows
|
|
gateway_windows.stop()
|
|
|
|
def restart(self, name: str) -> None:
|
|
from hermes_cli import gateway_windows
|
|
gateway_windows.restart()
|
|
|
|
def is_running(self, name: str) -> bool:
|
|
from hermes_cli import gateway_windows
|
|
from hermes_cli.gateway import find_gateway_pids
|
|
if not gateway_windows.is_installed():
|
|
return False
|
|
return bool(find_gateway_pids())
|
|
|
|
|
|
def get_service_manager() -> ServiceManager:
|
|
"""Return the ServiceManager instance for the current environment.
|
|
|
|
Raises:
|
|
RuntimeError: when no supported backend is available.
|
|
"""
|
|
kind = detect_service_manager()
|
|
if kind == "systemd":
|
|
return SystemdServiceManager()
|
|
if kind == "launchd":
|
|
return LaunchdServiceManager()
|
|
if kind == "windows":
|
|
return WindowsServiceManager()
|
|
if kind == "s6":
|
|
return S6ServiceManager()
|
|
raise RuntimeError("no supported service manager detected")
|
|
|
|
|
|
# ---------------------------------------------------------------------------
|
|
# S6ServiceManager (container-only)
|
|
#
|
|
# Per-profile gateways are registered dynamically when `hermes profile create`
|
|
# runs inside the container (Phase 4). Static services (main-hermes, dashboard)
|
|
# live in /etc/s6-overlay/s6-rc.d/ and are NOT managed by this class — they're
|
|
# part of the image, not runtime-created.
|
|
# ---------------------------------------------------------------------------
|
|
|
|
|
|
# s6-overlay's dynamic scandir for runtime-registered services. Lives on
|
|
# tmpfs and is the directory s6-svscan watches. Writes here trigger
|
|
# automatic supervision on the next rescan.
|
|
S6_DYNAMIC_SCANDIR = Path("/run/service")
|
|
S6_SERVICE_PREFIX = "gateway-"
|
|
|
|
# s6-overlay installs its binaries under /command/ and only adds that
|
|
# directory to PATH for processes started under the supervision tree
|
|
# (services started by s6-svscan, cont-init.d scripts, etc.). Code
|
|
# that runs via `docker exec` or any other out-of-tree entry point —
|
|
# notably our Phase 4 profile create/delete hooks — inherits the
|
|
# container's base PATH which does NOT include /command/.
|
|
#
|
|
# Rather than asking every caller to fix up its environment, the
|
|
# S6ServiceManager calls s6-* binaries by absolute path via this
|
|
# constant. We don't use `/usr/bin/s6-…` symlinks because the
|
|
# s6-overlay-symlinks-noarch tarball only links a subset, and we
|
|
# want every s6 invocation to be guaranteed-findable.
|
|
_S6_BIN_DIR = "/command"
|
|
|
|
|
|
# UID/GID of the in-image ``hermes`` user. Hardcoded to match what
|
|
# ``stage2-hook.sh`` enforces (the runtime invariant — see also
|
|
# tests/docker/test_uid_remap.py). The container starts s6-supervise
|
|
# under root and immediately drops to this UID via ``s6-setuidgid``.
|
|
_HERMES_UID = 10000
|
|
_HERMES_GID = 10000
|
|
|
|
|
|
def _seed_supervise_skeleton(svc_dir: Path) -> None:
|
|
"""Pre-create the ``supervise/`` and top-level ``event/`` skeleton
|
|
inside a service directory, owned by the hermes user.
|
|
|
|
Why this exists
|
|
---------------
|
|
When s6-supervise spawns a service it tries to ``mkdir`` two
|
|
directories: ``<svc>/event`` and ``<svc>/supervise``, both with mode
|
|
``0700``. It also ``mkfifo``s ``<svc>/supervise/control`` with mode
|
|
``0600``. Because s6-supervise runs as PID 1's effective UID (root)
|
|
these dirs end up root-owned mode 0700, and an unprivileged client
|
|
(the ``hermes`` user — UID 10000 — running every Hermes runtime
|
|
operation via ``s6-setuidgid``) gets ``EACCES`` on any ``s6-svc``,
|
|
``s6-svstat``, or ``s6-svwait`` invocation against the slot.
|
|
|
|
The PR #30136 review surfaced this as a real product gap: the
|
|
entire S6ServiceManager lifecycle (``register/start/stop/unregister
|
|
_profile_gateway``) was inert in production because every operation
|
|
is dispatched as the hermes user.
|
|
|
|
Why this works
|
|
--------------
|
|
Reading s6's source (src/supervision/s6-supervise.c::trymkdir +
|
|
control_init): the ``mkdir`` and ``mkfifo`` calls both treat
|
|
``EEXIST`` as success. If the directory is already present, the
|
|
chown/chmod fix-up that would normally make event/ ``03730
|
|
root:root`` is **skipped** entirely — s6-supervise just opens the
|
|
pre-existing FIFOs and proceeds. So if we lay the skeleton down
|
|
with hermes ownership before triggering ``s6-svscanctl -a``,
|
|
s6-supervise inherits our layout and never touches it.
|
|
|
|
Layout produced
|
|
---------------
|
|
``svc_dir/`` hermes:hermes, 0755 (parent must already exist)
|
|
``svc_dir/event/`` hermes:hermes, 03730 (setgid + g+rwx + sticky)
|
|
``svc_dir/supervise/`` hermes:hermes, 0755
|
|
``svc_dir/supervise/event/`` hermes:hermes, 03730
|
|
``svc_dir/supervise/control`` hermes:hermes, 0660 (FIFO)
|
|
|
|
The ``death_tally``, ``lock``, and ``status`` regular files end up
|
|
written by s6-supervise itself (as root), but those land mode 0644 —
|
|
world-readable — and ``s6-svstat`` only needs read access, so the
|
|
hermes user reads them fine.
|
|
|
|
If ``svc_dir/log/`` is present (the canonical s6 logger pattern —
|
|
one s6-supervise instance per service, plus a second for its
|
|
logger), the same skeleton is seeded under ``log/`` as well:
|
|
``log/event/``, ``log/supervise/``, ``log/supervise/event/``,
|
|
``log/supervise/control``. Without this, unregister teardown
|
|
would EACCES on the logger's supervise dir even after the parent
|
|
slot's supervise/ was hermes-owned.
|
|
|
|
Idempotency
|
|
-----------
|
|
Safe to call against a directory where the skeleton already exists.
|
|
Existing entries are left untouched (the helper doesn't try to
|
|
re-chown / re-chmod live FIFOs that s6-supervise may have already
|
|
opened).
|
|
|
|
Reference
|
|
---------
|
|
Discussed at length on the skarnet `skaware` mailing list in 2020
|
|
(`<http://skarnet.org/lists/skaware/1424.html>`_); see also
|
|
just-containers/s6-overlay#130. The pre-creation pattern was
|
|
historically called out as forward-compatibility-fragile, but the
|
|
EEXIST handling in s6-supervise has been stable since 2015 — it's
|
|
the same pattern ``s6-svperms`` and ``fix-attrs.d`` rely on.
|
|
"""
|
|
import os
|
|
|
|
def _mkdir_owned(path: Path, mode: int) -> None:
|
|
if path.exists():
|
|
return
|
|
path.mkdir(parents=False, exist_ok=False)
|
|
path.chmod(mode)
|
|
try:
|
|
os.chown(path, _HERMES_UID, _HERMES_GID)
|
|
except PermissionError:
|
|
# Running as the hermes user already — directory is hermes-
|
|
# owned by default. The chown is a no-op in that case, so
|
|
# swallowing this keeps both root and unprivileged callers
|
|
# on one code path.
|
|
pass
|
|
|
|
# Top-level event/ dir (this is the s6-svlisten1 event-subscription
|
|
# dir at the service root, distinct from supervise/event/).
|
|
_mkdir_owned(svc_dir / "event", 0o3730)
|
|
|
|
# supervise/ dir + its inner event/ dir.
|
|
supervise = svc_dir / "supervise"
|
|
_mkdir_owned(supervise, 0o755)
|
|
_mkdir_owned(supervise / "event", 0o3730)
|
|
|
|
# supervise/control FIFO. Same EEXIST-safe pattern: if it's already
|
|
# there (s6-supervise has already started against this slot), leave
|
|
# it alone. The explicit chmod after mkfifo is required because
|
|
# mkfifo honors the process umask, which can strip group-write
|
|
# (e.g. the default 0022 on most dev hosts → 0o660 becomes 0o640).
|
|
# The container runs with umask 0 inside s6-overlay's stage2, but
|
|
# being defensive here keeps the helper consistent under any
|
|
# invocation context.
|
|
control = supervise / "control"
|
|
if not control.exists():
|
|
os.mkfifo(control, 0o660)
|
|
control.chmod(0o660)
|
|
try:
|
|
os.chown(control, _HERMES_UID, _HERMES_GID)
|
|
except PermissionError:
|
|
pass
|
|
|
|
# If a log/ subdir is present (the canonical s6 logger pattern —
|
|
# see servicedir(7)), it gets its own s6-supervise instance and
|
|
# needs the same skeleton. Without this, unregister teardown
|
|
# would EACCES on the logger's root-owned supervise/ dir even
|
|
# when the parent slot's supervise/ is hermes-owned.
|
|
log_dir = svc_dir / "log"
|
|
if log_dir.is_dir():
|
|
_mkdir_owned(log_dir / "event", 0o3730)
|
|
log_supervise = log_dir / "supervise"
|
|
_mkdir_owned(log_supervise, 0o755)
|
|
_mkdir_owned(log_supervise / "event", 0o3730)
|
|
log_control = log_supervise / "control"
|
|
if not log_control.exists():
|
|
os.mkfifo(log_control, 0o660)
|
|
log_control.chmod(0o660)
|
|
try:
|
|
os.chown(log_control, _HERMES_UID, _HERMES_GID)
|
|
except PermissionError:
|
|
pass
|
|
|
|
|
|
class S6Error(RuntimeError):
|
|
"""Base error for S6ServiceManager lifecycle failures.
|
|
|
|
Concrete subclasses carry the slot name (and, where useful, the
|
|
underlying subprocess output) so the CLI can render an actionable
|
|
message instead of leaking a raw ``CalledProcessError`` traceback.
|
|
"""
|
|
|
|
def __init__(self, message: str, *, service: str | None = None) -> None:
|
|
super().__init__(message)
|
|
self.service = service
|
|
|
|
|
|
class GatewayNotRegisteredError(S6Error):
|
|
"""Raised when a lifecycle method targets a slot that doesn't exist.
|
|
|
|
Most commonly: ``hermes -p typo gateway start`` when no profile
|
|
``typo`` exists. Carries the unprefixed profile name (not the
|
|
full ``gateway-<profile>`` service-dir name) so callers can phrase
|
|
a user-facing message like "no such gateway 'typo'".
|
|
"""
|
|
|
|
def __init__(self, profile: str) -> None:
|
|
self.profile = profile
|
|
super().__init__(
|
|
f"no such gateway {profile!r}: register it with "
|
|
f"`hermes profile create {profile}` first, or pass "
|
|
"an existing profile name via `-p <name>`",
|
|
service=f"gateway-{profile}",
|
|
)
|
|
|
|
|
|
class S6CommandError(S6Error):
|
|
"""Raised when an s6 command fails for a reason other than a
|
|
missing slot — e.g. permission denied on the supervise control
|
|
FIFO, or s6-svc returning a non-zero exit for an unexpected
|
|
reason. Carries the stderr from the failing command so callers
|
|
can surface it.
|
|
"""
|
|
|
|
def __init__(
|
|
self, *, service: str, action: str, returncode: int, stderr: str,
|
|
) -> None:
|
|
self.action = action
|
|
self.returncode = returncode
|
|
self.stderr = stderr
|
|
message = (
|
|
f"s6-svc {action} on {service!r} failed (rc={returncode})"
|
|
)
|
|
if stderr.strip():
|
|
message += f": {stderr.strip()}"
|
|
super().__init__(message, service=service)
|
|
|
|
|
|
class S6ServiceManager:
|
|
"""Per-profile gateway supervision via s6-overlay.
|
|
|
|
Only handles runtime-registered services under
|
|
``S6_DYNAMIC_SCANDIR``. Static services (main-hermes, dashboard)
|
|
are managed by s6-rc at image-build time and are out of scope.
|
|
"""
|
|
|
|
kind: ServiceManagerKind = "s6"
|
|
|
|
def __init__(self, scandir: Path = S6_DYNAMIC_SCANDIR) -> None:
|
|
self.scandir = scandir
|
|
|
|
# -- internal helpers --------------------------------------------------
|
|
|
|
def _service_dir(self, profile: str) -> Path:
|
|
validate_profile_name(profile)
|
|
return self.scandir / f"{S6_SERVICE_PREFIX}{profile}"
|
|
|
|
def _service_name(self, profile: str) -> str:
|
|
return f"{S6_SERVICE_PREFIX}{profile}"
|
|
|
|
@staticmethod
|
|
def _render_run_script(
|
|
profile: str,
|
|
extra_env: dict[str, str],
|
|
) -> str:
|
|
"""Generate the run script for a profile-gateway s6 service.
|
|
|
|
The script:
|
|
1. Sources HERMES_HOME (and any extra env) via with-contenv —
|
|
so e.g. ``-e HERMES_HOME=/data/hermes`` is honored at run
|
|
time, not Python-substituted at registration time (OQ8-C).
|
|
2. Activates the bundled venv.
|
|
3. Drops to the hermes user and exec's
|
|
``hermes -p <profile> gateway run`` (or just ``hermes
|
|
gateway run`` for the default profile — see below).
|
|
|
|
Special case: ``profile == "default"`` emits ``hermes gateway
|
|
run`` with **no** ``-p`` flag. This is the sentinel for "the
|
|
root HERMES_HOME profile" (the implicit profile that exists at
|
|
the top of $HERMES_HOME, not under profiles/). It must be
|
|
spelled this way because ``_profile_suffix()`` returns the
|
|
empty string for the root profile, and the dispatcher in
|
|
``hermes_cli.gateway`` maps that empty string to the
|
|
``gateway-default`` service slot. Passing ``-p default`` here
|
|
would instead look up ``$HERMES_HOME/profiles/default/`` — a
|
|
completely different (and almost always nonexistent) profile.
|
|
|
|
Port selection: the gateway picks its bind port from the
|
|
profile's ``config.yaml`` (``[gateway] port = ...``) — that
|
|
is the single source of truth. Previously this method took a
|
|
``port`` parameter that was passed in but never substituted
|
|
into the rendered script (it was carried in for "API parity"
|
|
with a deterministic SHA-256 allocator in
|
|
``hermes_cli.profiles._allocate_gateway_port``). PR #30136
|
|
review item I5 retired both the allocator and the parameter
|
|
because they were dead code through the entire stack.
|
|
"""
|
|
import shlex
|
|
lines = [
|
|
"#!/command/with-contenv sh",
|
|
"# shellcheck shell=sh",
|
|
"set -e",
|
|
"cd /opt/data",
|
|
". /opt/hermes/.venv/bin/activate",
|
|
]
|
|
for k, v in sorted(extra_env.items()):
|
|
lines.append(f"export {k}={shlex.quote(v)}")
|
|
if profile == "default":
|
|
lines.append("exec s6-setuidgid hermes hermes gateway run")
|
|
else:
|
|
lines.append(
|
|
f"exec s6-setuidgid hermes hermes -p {shlex.quote(profile)} gateway run"
|
|
)
|
|
return "\n".join(lines) + "\n"
|
|
|
|
@staticmethod
|
|
def _render_log_run(profile: str) -> str:
|
|
"""Generate the log/run script for a profile-gateway service.
|
|
|
|
OQ8-C: persist to ``${HERMES_HOME}/logs/gateways/<profile>/``.
|
|
CRITICAL: the HERMES_HOME path is sourced from the runtime env
|
|
via with-contenv — NOT Python-substituted at registration time
|
|
— so a container started with ``-e HERMES_HOME=/data/hermes``
|
|
gets its logs under /data/hermes/logs/..., not the build-time
|
|
default.
|
|
"""
|
|
import shlex
|
|
prof = shlex.quote(profile)
|
|
return (
|
|
f"#!/command/with-contenv sh\n"
|
|
f"# shellcheck shell=sh\n"
|
|
f': "${{HERMES_HOME:=/opt/data}}"\n'
|
|
f'log_dir="$HERMES_HOME/logs/gateways/{prof}"\n'
|
|
f'mkdir -p "$log_dir"\n'
|
|
f'chown -R hermes:hermes "$log_dir" 2>/dev/null || true\n'
|
|
f'exec s6-setuidgid hermes s6-log n10 s1000000 T "$log_dir"\n'
|
|
)
|
|
|
|
# -- lifecycle ---------------------------------------------------------
|
|
|
|
def _run_svc(self, action_flag: str, action_label: str, name: str) -> None:
|
|
"""Shared lifecycle dispatch for start / stop / restart.
|
|
|
|
Translates the two failure modes operators care about into
|
|
named errors:
|
|
|
|
* ``GatewayNotRegisteredError`` — the service directory at
|
|
``<scandir>/<name>/`` doesn't exist. ``s6-svc`` would
|
|
exit non-zero with a fairly opaque message; we pre-empt
|
|
it with a clear "no such gateway 'X'" tied to the profile
|
|
name (without the ``gateway-`` prefix).
|
|
* ``S6CommandError`` — anything else (EACCES on the
|
|
supervise control FIFO, timeout, etc.). Carries the
|
|
subprocess return code and stderr so callers can render
|
|
them inline.
|
|
|
|
``action_flag`` is the ``s6-svc`` flag (``-u`` / ``-d`` /
|
|
``-t``); ``action_label`` is the human verb (``start`` /
|
|
``stop`` / ``restart``) used in error messages.
|
|
"""
|
|
import subprocess
|
|
|
|
service_dir = self.scandir / name
|
|
if not service_dir.is_dir():
|
|
# Strip the gateway- prefix back off so the message
|
|
# matches what the user typed on the CLI (``-p <profile>``).
|
|
profile = (
|
|
name[len(S6_SERVICE_PREFIX):]
|
|
if name.startswith(S6_SERVICE_PREFIX)
|
|
else name
|
|
)
|
|
raise GatewayNotRegisteredError(profile)
|
|
|
|
try:
|
|
subprocess.run(
|
|
[f"{_S6_BIN_DIR}/s6-svc", action_flag, str(service_dir)],
|
|
check=True, capture_output=True, text=True, timeout=5,
|
|
)
|
|
except subprocess.CalledProcessError as exc:
|
|
raise S6CommandError(
|
|
service=name,
|
|
action=action_label,
|
|
returncode=exc.returncode,
|
|
stderr=exc.stderr or "",
|
|
) from exc
|
|
|
|
def start(self, name: str) -> None:
|
|
"""Bring up a registered service (``s6-svc -u``).
|
|
|
|
Raises:
|
|
GatewayNotRegisteredError: no service directory for ``name``.
|
|
S6CommandError: s6-svc exited non-zero for any other reason
|
|
(permission denied on the supervise FIFO, timeout, etc.).
|
|
"""
|
|
self._run_svc("-u", "start", name)
|
|
|
|
def stop(self, name: str) -> None:
|
|
"""Bring down a registered service (``s6-svc -d``).
|
|
|
|
Raises:
|
|
GatewayNotRegisteredError: no service directory for ``name``.
|
|
S6CommandError: s6-svc exited non-zero for any other reason.
|
|
"""
|
|
self._run_svc("-d", "stop", name)
|
|
|
|
def restart(self, name: str) -> None:
|
|
"""Restart a registered service (``s6-svc -t`` = SIGTERM).
|
|
|
|
Raises:
|
|
GatewayNotRegisteredError: no service directory for ``name``.
|
|
S6CommandError: s6-svc exited non-zero for any other reason.
|
|
"""
|
|
self._run_svc("-t", "restart", name)
|
|
|
|
def is_running(self, name: str) -> bool:
|
|
"""True iff ``s6-svstat`` reports the service as up."""
|
|
import subprocess
|
|
result = subprocess.run(
|
|
[f"{_S6_BIN_DIR}/s6-svstat", str(self.scandir / name)],
|
|
capture_output=True, text=True, timeout=5,
|
|
)
|
|
return result.returncode == 0 and "up " in result.stdout
|
|
|
|
# -- runtime registration ---------------------------------------------
|
|
|
|
def supports_runtime_registration(self) -> bool:
|
|
return True
|
|
|
|
def register_profile_gateway(
|
|
self,
|
|
profile: str,
|
|
*,
|
|
extra_env: dict[str, str] | None = None,
|
|
) -> None:
|
|
"""Create the s6 service directory for a profile gateway.
|
|
|
|
Triggers ``s6-svscanctl -a`` so s6-svscan picks the new directory
|
|
up immediately. The service is created in the *up* state — to
|
|
register without auto-starting, follow up with ``stop(profile)``
|
|
(or pass the start flag via the future ``start_now=False`` arg,
|
|
which the Phase 4 reconciliation path uses via a ``down``
|
|
marker file written directly).
|
|
|
|
Raises:
|
|
ValueError: if the profile name is invalid or the service
|
|
directory already exists.
|
|
RuntimeError: if ``s6-svscanctl`` fails.
|
|
"""
|
|
import shutil
|
|
import subprocess
|
|
|
|
svc_dir = self._service_dir(profile)
|
|
if svc_dir.exists():
|
|
raise ValueError(
|
|
f"profile gateway {profile!r} already registered at {svc_dir}"
|
|
)
|
|
|
|
# Build the service directory atomically: write to a sibling
|
|
# temp dir, then rename. Avoids s6-svscan observing a half-
|
|
# populated directory on a fast rescan.
|
|
tmp_dir = svc_dir.with_name(svc_dir.name + ".tmp")
|
|
if tmp_dir.exists():
|
|
shutil.rmtree(tmp_dir, ignore_errors=True)
|
|
tmp_dir.mkdir(parents=True)
|
|
|
|
try:
|
|
(tmp_dir / "type").write_text("longrun\n")
|
|
|
|
run_script = self._render_run_script(profile, extra_env or {})
|
|
run_path = tmp_dir / "run"
|
|
run_path.write_text(run_script)
|
|
run_path.chmod(0o755)
|
|
|
|
# Persistent log rotation (OQ8-C).
|
|
log_subdir = tmp_dir / "log"
|
|
log_subdir.mkdir()
|
|
log_run = log_subdir / "run"
|
|
log_run.write_text(self._render_log_run(profile))
|
|
log_run.chmod(0o755)
|
|
|
|
# Pre-create the supervise/ skeleton with hermes ownership
|
|
# BEFORE we publish the slot. s6-supervise will EEXIST our
|
|
# dirs/FIFOs and inherit the ownership, so the runtime
|
|
# s6-svc / s6-svstat / s6-svwait calls (all dispatched as
|
|
# the hermes user) won't hit EACCES on root-owned 0700
|
|
# dirs. See ``_seed_supervise_skeleton`` for the full
|
|
# rationale.
|
|
_seed_supervise_skeleton(tmp_dir)
|
|
|
|
tmp_dir.rename(svc_dir)
|
|
except Exception:
|
|
shutil.rmtree(tmp_dir, ignore_errors=True)
|
|
raise
|
|
|
|
# Trigger rescan so s6-svscan picks up the new service.
|
|
result = subprocess.run(
|
|
[f"{_S6_BIN_DIR}/s6-svscanctl", "-a", str(self.scandir)],
|
|
capture_output=True, text=True, timeout=5,
|
|
)
|
|
if result.returncode != 0:
|
|
# Clean up: rescan failed, leave the directory in place would
|
|
# be confusing (no supervisor watching it).
|
|
shutil.rmtree(svc_dir, ignore_errors=True)
|
|
raise RuntimeError(
|
|
f"s6-svscanctl failed: {result.stderr or result.stdout}"
|
|
)
|
|
|
|
def unregister_profile_gateway(self, profile: str) -> None:
|
|
"""Stop the profile gateway service and remove its directory.
|
|
|
|
Idempotent: absent services are a no-op. Best-effort stop +
|
|
wait-for-down before removal so the running gateway process
|
|
gets a chance to shut down cleanly before its service dir
|
|
disappears.
|
|
|
|
Teardown ordering matters: ``s6-svscanctl -an`` is fired
|
|
**before** ``rmtree`` so s6-svscan reaps the supervise child
|
|
process (releasing its handle on ``supervise/lock`` and the
|
|
regular files inside the supervise dir), giving us a clean
|
|
directory to remove. Without the reap-first ordering, the
|
|
rmtree races s6-supervise on a set of root-owned files inside
|
|
the supervise dir and the dir is left half-removed.
|
|
"""
|
|
import shutil
|
|
import subprocess
|
|
import time
|
|
|
|
svc_dir = self._service_dir(profile)
|
|
if not svc_dir.exists():
|
|
return
|
|
|
|
# Stop the service (best effort — service may already be down).
|
|
subprocess.run(
|
|
[f"{_S6_BIN_DIR}/s6-svc", "-d", str(svc_dir)],
|
|
capture_output=True, text=True, timeout=5,
|
|
check=False,
|
|
)
|
|
# Wait for it to actually go down (up to 10s).
|
|
subprocess.run(
|
|
[f"{_S6_BIN_DIR}/s6-svwait", "-D", "-t", "10000", str(svc_dir)],
|
|
capture_output=True, text=True, timeout=15,
|
|
check=False,
|
|
)
|
|
|
|
# Reap the supervise child FIRST: -n tells s6-svscan to drop
|
|
# any supervise processes whose service dir is gone (which
|
|
# includes any service dir we're about to remove). This
|
|
# releases the file handles s6-supervise holds against the
|
|
# supervise/lock + supervise/status + supervise/death_tally
|
|
# files inside the slot, so the upcoming rmtree doesn't race.
|
|
subprocess.run(
|
|
[f"{_S6_BIN_DIR}/s6-svscanctl", "-an", str(self.scandir)],
|
|
capture_output=True, text=True, timeout=5,
|
|
check=False,
|
|
)
|
|
# Give s6-svscan a moment to reap. There's no synchronous
|
|
# "scan completed" handshake — the -a/-n trigger just sets a
|
|
# flag s6-svscan reads on its next loop iteration. 200ms is
|
|
# comfortably above the loop's resolution but well under any
|
|
# user-perceived latency.
|
|
time.sleep(0.2)
|
|
|
|
# Now the supervise dir's files are no longer held open by a
|
|
# live s6-supervise, so rmtree can remove them. Files inside
|
|
# supervise/ are root-owned (death_tally, lock, status, written
|
|
# by s6-supervise itself) — but the parent supervise/ directory
|
|
# is hermes-owned (see ``_seed_supervise_skeleton``), and on
|
|
# POSIX you only need write+execute on the parent to remove
|
|
# contained files regardless of file ownership.
|
|
shutil.rmtree(svc_dir, ignore_errors=True)
|
|
|
|
def list_profile_gateways(self) -> list[str]:
|
|
"""Return the profile names of all currently-registered gateway services.
|
|
|
|
Filters the scandir to entries that match the ``gateway-`` prefix.
|
|
Other services (e.g. ``s6-linux-init-shutdownd``) are ignored.
|
|
"""
|
|
if not self.scandir.exists():
|
|
return []
|
|
profiles: list[str] = []
|
|
for entry in self.scandir.iterdir():
|
|
if entry.name.startswith("."):
|
|
continue
|
|
if not entry.is_dir():
|
|
continue
|
|
if not entry.name.startswith(S6_SERVICE_PREFIX):
|
|
continue
|
|
profiles.append(entry.name[len(S6_SERVICE_PREFIX):])
|
|
return profiles
|