fix(file-safety): add sandbox-mirror soft guard for writes to per-task .hermes mirrors (#32213)

#32049 reports that under terminal.backend: docker, write_file / patch calls to authoritative profile state (SOUL.md, memories, etc.) land on the sandbox-local mirror at ``<HERMES_HOME>/profiles/<name>/sandboxes/<backend>/<task>/home/.hermes/...``, a path the host Hermes process never reads. The tool reports success, the user sees no behavior change, and on disk two divergent copies of SOUL.md (or any other profile file) accumulate.

The existing classify_cross_profile_target guard does not catch this: its parts[2] check sees "sandboxes" and returns None, and the path is in-profile from the inner-mirror perspective, so even a fixed version would not fire.

Add a parallel sandbox-mirror classifier in agent/file_safety:

* classify_sandbox_mirror_target() detects the ``…/sandboxes/<backend>/<task>/home/.hermes/…`` shape via path parts. Detection is path-shape only: backend-agnostic, does not require the file to exist, and works regardless of which HERMES_HOME resolves.
* get_sandbox_mirror_warning() returns a model-facing warning that names the mirror root and the inner authoritative path the agent likely meant.

Wire both detectors through tools/file_tools._check_cross_profile_path so the existing write_file and v4a patch call sites pick up the new guard with no API change. The bypass kwarg (``cross_profile=True``) remains shared between the two guards: the same "I know what I'm doing" escape valve after explicit user direction.

This is the defense-in-depth piece of the proposal in #32049 ("any …/sandboxes/<backend>/…/home/…hermes/… path as sandbox-mirror"). It catches the host-side speculation case where the agent writes a literal sandbox-mirror path. The inner-container case (where the bind mount strips the ``sandboxes/`` prefix from the agent's path view) is out of scope for this surgical change; that requires either a dispatch-layer host-side check before the container handoff, or the host-side ``profile_state`` / ``soul`` tool the issue also proposes.

Soft guard, NOT a security boundary; this matches the existing classify_cross_profile_target contract.

Co-authored-by: briandevans <252620095+briandevans@users.noreply.github.com>
Co-authored-by: Ben Barclay <ben@nousresearch.com>
2026-06-02 02:29:24 +01:00
parent 4f7fe9bcff
commit 162c7856ca
3 changed files with 360 additions and 11 deletions
--- a/agent/file_safety.py
+++ b/agent/file_safety.py
@@ -451,3 +451,113 @@ def get_cross_profile_warning(path: str) -> Optional[str]:
        f"``cross_profile=True``. (Defense-in-depth — not a security "
        f"boundary; the terminal tool can still bypass.)"
    )
+
+
+# ---------------------------------------------------------------------------
+# Sandbox-mirror write guard (#32049)
+#
+# Non-local terminal backends (Docker, Daytona, etc.) bind a sandbox-local
+# directory to the container's ``$HOME``. The on-disk layout looks like
+#
+#   <HERMES_HOME>/profiles/<name>/sandboxes/<backend>/<task>/home/.hermes/...
+#
+# When the agent (running host-side) speculates that authoritative profile
+# state lives at one of those sandbox-mirror paths, the write lands on the
+# mirror — never read by the host process — while the host file is left
+# untouched. The agent reports success, the user sees no change, and on
+# disk two divergent copies accumulate. See #32049 for evidence.
+#
+# This guard is path-shape-only: it detects the
+# ``…/sandboxes/<backend>/<task>/home/.hermes/…`` segment and warns
+# regardless of which Hermes profile is active. It does NOT cover the
+# inner-container case where the bind mount strips the ``sandboxes/`` prefix
+# (the agent's view inside the container is plain ``/root/.hermes/...``);
+# that case needs a separate dispatch-layer or host-side ``profile_state``
+# tool.
+# ---------------------------------------------------------------------------
+
+
+def _find_sandbox_mirror_segments(parts: tuple) -> Optional[int]:
+    """Return the index of the inner ``.hermes`` part in a sandbox-mirror path.
+
+    Matches ``…/sandboxes/<backend>/<task>/home/.hermes/<thing>…`` and returns
+    the index of that inner ``.hermes`` part (Hermes state starts one past it).
+    Returns ``None`` for paths that do not contain the sandbox-mirror shape.
+    """
+    for i, part in enumerate(parts):
+        if part != "sandboxes":
+            continue
+        # Need at least: sandboxes / <backend> / <task> / home / .hermes / <thing>
+        if i + 5 >= len(parts):
+            continue
+        if parts[i + 3] == "home" and parts[i + 4] == ".hermes":
+            return i + 4
+    return None
+
+
+def classify_sandbox_mirror_target(path: str) -> Optional[dict]:
+    """Classify a write target as a sandbox-mirror of authoritative Hermes state.
+
+    Returns ``None`` when the path does not match the sandbox-mirror shape.
+    Otherwise returns a dict with:
+
+      * ``target_path``: the resolved path string
+      * ``mirror_root``: the ``…/sandboxes/<backend>/<task>/home/.hermes``
+        prefix (so callers can show users which sandbox owns the mirror)
+      * ``inner_path``: the portion under the mirror's ``.hermes`` (what the
+        agent likely meant to address on the host)
+
+    Detection is path-shape-only — does not require any Hermes resolver to
+    succeed, so it works correctly even when called from contexts where
+    HERMES_HOME resolution would be ambiguous.
+    """
+    try:
+        target = Path(os.path.expanduser(str(path))).resolve()
+    except (OSError, RuntimeError):
+        return None
+
+    parts = target.parts
+    inner_idx = _find_sandbox_mirror_segments(parts)
+    if inner_idx is None:
+        return None
+
+    mirror_root = str(Path(*parts[: inner_idx + 1]))
+    inner_path = str(Path(*parts[inner_idx + 1 :])) if inner_idx + 1 < len(parts) else ""
+
+    return {
+        "target_path": str(target),
+        "mirror_root": mirror_root,
+        "inner_path": inner_path,
+    }
+
+
+def get_sandbox_mirror_warning(path: str) -> Optional[str]:
+    """Return a model-facing warning when ``path`` lands in a sandbox mirror.
+
+    Returns ``None`` when the path is not a sandbox-mirror target. Caller
+    is expected to surface the warning to the agent as a tool-result
+    error. The bypass kwarg (``cross_profile=True``) is shared with the
+    cross-profile guard: both are soft "I know what I'm doing" overrides
+    a user can authorise.
+
+    Defense-in-depth, NOT a security boundary: the terminal tool runs as
+    the same OS user and can write the mirror path directly. The guard
+    exists to surface the misclassification before the silent-success +
+    divergent-copy footgun in #32049 fires.
+    """
+    info = classify_sandbox_mirror_target(path)
+    if info is None:
+        return None
+    return (
+        f"Sandbox-mirror write blocked by soft guard: {info['target_path']} "
+        f"sits under {info['mirror_root']!r}, which is a per-task mirror "
+        f"created by a non-local terminal backend (docker/daytona/etc.). "
+        f"Writes here land on a copy that the host Hermes process never "
+        f"reads — the authoritative file is likely {info['inner_path']!r} "
+        f"under the real HERMES_HOME. Use the host-side tool for "
+        f"authoritative state (e.g. ``memory`` for memories), or address "
+        f"the host path directly. To bypass this guard after explicit "
+        f"user direction, retry the call with ``cross_profile=True``. "
+        f"(Defense-in-depth — not a security boundary; the terminal tool "
+        f"can still bypass.)"
+    )
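
The path-shape check and the shared bypass described in the commit message can be tried in isolation. The following is a minimal sketch, not the code from this commit: the function names `find_inner_hermes_index`, `classify`, and `check_write_path` are hypothetical, the example paths are made up, and `PurePosixPath` is used instead of `Path(...).resolve()` so the sketch never touches the filesystem. The real wiring in tools/file_tools._check_cross_profile_path is not shown in this diff and may differ.

```python
from pathlib import PurePosixPath
from typing import Optional


def find_inner_hermes_index(parts: tuple) -> Optional[int]:
    """Return the index of the inner ``.hermes`` part, or None if the shape is absent."""
    for i, part in enumerate(parts):
        if part != "sandboxes":
            continue
        # Need sandboxes/<backend>/<task>/home/.hermes plus at least one more part.
        if i + 5 >= len(parts):
            continue
        if parts[i + 3] == "home" and parts[i + 4] == ".hermes":
            return i + 4
    return None


def classify(path: str) -> Optional[dict]:
    """Split a sandbox-mirror path into mirror root and inner path (pure string work)."""
    parts = PurePosixPath(path).parts
    idx = find_inner_hermes_index(parts)
    if idx is None:
        return None
    return {
        "mirror_root": str(PurePosixPath(*parts[: idx + 1])),
        "inner_path": str(PurePosixPath(*parts[idx + 1 :])),
    }


def check_write_path(path: str, cross_profile: bool = False) -> Optional[str]:
    """Hypothetical wrapper: return a warning string, or None to allow the write."""
    if cross_profile:
        # Shared soft bypass: the caller asserts explicit user direction.
        return None
    info = classify(path)
    if info is None:
        return None
    return f"sandbox-mirror write: {info['inner_path']!r} under {info['mirror_root']!r}"


# Made-up example paths, for illustration only.
mirror = "/home/u/.hermes/profiles/dev/sandboxes/docker/task-1/home/.hermes/SOUL.md"
host = "/home/u/.hermes/profiles/dev/SOUL.md"

print(classify(mirror))
print(check_write_path(mirror))
print(check_write_path(mirror, cross_profile=True))
print(classify(host))
```

Because the check requires at least one component after the inner ``.hermes``, the bare mirror root itself is not flagged; only paths to something inside it are.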