The per-session icon picker added more noise than value — rip it out end
to end (sessions.icon column, set_session_icon, the PATCH field, the
picker UI, and the SessionInfo.icon type).
The cross-profile session aggregator now opens each profile's state.db
read-only (mode=ro, no schema init), so listing other profiles on every
sidebar refresh never DDLs or takes a write lock on their live DBs. The
single-profile hot path stays on par with /api/sessions.
Add first-class profile support to the desktop app without app reloads.
- Swap the single live gateway onto a session's profile lazily (spawned on
demand by the Electron backend pool), so one backend serves the active
profile and others stay cold — no OOM with many profiles.
- Aggregate sessions across profiles by reading each profile's state.db
read-only; unified "All profiles" view groups sessions per profile with
per-profile pagination, while the default view stays scoped to one profile.
- Add an Arc-style profile rail at the sidebar foot: a default<->all toggle
pinned left, colored named-profile squares scrolling between, Manage pinned
right. Profile identity is a deterministic per-name color.
- Route profile-scoped REST (config/env/skills/tools/model) to the active
gateway profile and invalidate React Query caches on swap. Single-profile
users never trigger a swap, so their path is unchanged.
Backend:
- web_server: profile-aware active/list endpoints + per-profile session
totals; hermes_state: session_count(exclude_children); main.py: honor
--profile over HERMES_HOME env for pooled backends.
UI primitives:
- Add a position-aware Tip tooltip (instant, themed) as a drop-in for native
title=, and strip redundant tooltips from self-descriptive chrome.
The salvaged detector validated each cached electron-*.zip with
zipfile.testzip() and only purged ones it judged corrupt. But stdlib
zipfile reads from the end-of-central-directory backward, so it silently
tolerates prepended/concatenated junk — which is exactly the corruption
the bug report names ('86257938 extra bytes at beginning or within
zipfile', a partial download resumed into the same file). testzip()
returns clean on those zips, so the self-heal never fired for the
reported failure mode.
Drop the self-rolled validator: on any packaged-build failure, purge the
version's cached zips AND the half-written unpacked dir, then retry once.
@electron/get re-downloads with its own SHASUM verification — the real
source of truth, which catches prepend/concat/truncate alike. An
unrelated failure just costs one clean re-download and fails the same way.
Verified empirically: zipfile.testzip() returns None (clean) on a
prepended-junk zip; the unconditional purge removes it correctly.
hermes desktop failed on Linux with an ENOENT renaming
release/linux-unpacked/electron -> Hermes. Root cause is a corrupt
cached Electron zip (~/.cache/electron/electron-*.zip): app-builder
unpack-electron extracts a partial tree from the bad zip that is
missing the electron binary, so electron-builder dies on the final
rename. Re-running repeats the broken extraction, leaving the desktop
app permanently unlaunchable until the cache is manually purged.
- Add _electron_download_cache_dirs() + _purge_corrupt_electron_cache()
to hermes_cli/main.py: validate every electron-*.zip via
zipfile.testzip() and delete corrupt ones; honor electron_config_cache
/ ELECTRON_CACHE overrides with per-OS defaults.
- Wire purge + single retry into cmd_gui packaged-build failure path so
a poisoned download self-heals (electron re-downloads clean).
- Add beforePack hook (apps/desktop/scripts/before-pack.cjs) to wipe the
target unpacked dir before staging, making packaging idempotent across
interrupted runs. Cross-platform, best-effort.
- Tests: corrupt-zip detector, cmd_gui purge/retry/launch path,
no-retry-when-clean path, and node --test for the cleanup helper.
Two complementary fixes for a silent partial-install failure that bit
``hermes update`` in the wild: a fresh checkout pulled 145 commits,
``rebuild_venv`` failed to recreate the venv on Windows because
``shutil.rmtree(ignore_errors=True)`` couldn't delete files held open by
the running ``hermes.exe`` shim. ``uv venv`` then refused with
"A directory already exists at: venv" and the update fell back to
installing on top of the stale venv. The resulting partial install
missed exactly one newly-added base dep — ``pathspec==1.1.1`` — which
``hermes desktop --build-only`` imports at the top of its content-hash
check. The desktop rebuild died with ModuleNotFoundError and the parent
update only logged "⚠ Desktop build failed (non-fatal)". Same root cause
made the "default: sync failed" line in the skill-sync stage, because
that sync subprocess hit the same missing import.
Fix 1: ``rebuild_venv`` retries with ``--clear``
------------------------------------------------
If ``uv venv`` fails with "already exists" in stderr (which is what uv
prints, and what uv's own hint tells you to fix with --clear), retry
once with ``--clear``. Only this specific failure pattern triggers the
retry — disk-full / interpreter-download failures still surface as
before so we don't mask real problems.
Fix 2: post-install dep verification
------------------------------------
Belt-and-suspenders so future uv resolver quirks (or any other cause of
partial installs) surface immediately instead of hours later in a
downstream subprocess. After ``_install_python_dependencies_with_optional_fallback``
runs, ``_verify_core_dependencies_installed``:
1. Reads ``[project.dependencies]`` straight from pyproject.toml
(so we don't trust the venv's stale metadata).
2. Filters by environment markers via ``packaging.requirements.Requirement``
so cross-platform exclusions (``ptyprocess ; sys_platform != 'win32'``)
don't false-positive on Windows.
3. Runs ``importlib.metadata.version()`` for each remaining dep inside
the *target* venv interpreter (resolved from ``VIRTUAL_ENV``, not
``sys.executable``).
4. If anything is missing, reinstalls the base group with
``--reinstall`` to force re-resolution. If a second probe still
reports missing deps, force-installs each one with its pinned spec.
5. Treats final failure as a warning rather than a hard error — a
single broken-on-PyPI dep shouldn't block an otherwise-successful
update — but the message points at ``hermes update --force`` and
names the missing packages so the user knows what's wrong.
Tests
-----
- ``TestRebuildVenv::test_retries_with_clear_when_dir_already_exists`` —
simulates the rmtree-couldn't-delete-it failure mode and asserts the
``--clear`` retry path is taken and succeeds.
- ``TestRebuildVenv::test_does_not_retry_when_first_failure_is_not_dir_exists``
— guards against masking real failures (disk full, etc.).
- ``test_verify_core_dependencies.py`` — 7 tests covering the happy
path, the regression (missing pathspec triggers --reinstall), the
per-package fallback when --reinstall doesn't help, the platform-
marker filter so Windows doesn't try to install ptyprocess, the
missing-pyproject noop, and the VIRTUAL_ENV resolver.
Co-authored-by: Kyssta <218078013+kyssta-exe@users.noreply.github.com>
hermes auth add qwen-oauth called pool.add_entry() but never wrote to
providers["qwen-oauth"] or set active_provider in auth.json.
_model_section_has_credentials() checks get_active_provider() first; with
active_provider unset and no api_key_env_vars configured for oauth_external
providers, the setup wizard reported "No inference provider configured" even
after a successful Qwen CLI OAuth login.
Add _mark_qwen_oauth_active() in auth.py: writes a minimal provider state
entry (base_url for display only) and calls _save_provider_state() to set
active_provider. The function deliberately does not copy the api_key — that
lives in the Qwen CLI credential file managed by _save_qwen_cli_tokens /
resolve_qwen_runtime_credentials and must not be duplicated in auth.json
where it would become stale.
pool.add_entry() is retained so "hermes auth list" continues to show the entry.
Runtime credential resolution continues to use resolve_qwen_runtime_credentials.
Mirrors the fix applied to openai-codex (#37517) and xai-oauth (#37576).
hermes auth add xai-oauth called pool.add_entry() directly, writing only the
credential-pool entry (source "manual:xai_pkce") without touching
providers["xai-oauth"] or setting active_provider in auth.json.
_model_section_has_credentials() checks get_active_provider() first; with
active_provider unset and no api_key_env_vars configured for oauth_external
providers, the setup wizard reported "No inference provider configured" even
after a successful OAuth login.
Use _save_xai_oauth_tokens() — the canonical path already called from the
hermes model xAI login flow — which writes providers["xai-oauth"]["tokens"]
(setting active_provider) and lets _seed_from_singletons seed the pool with
a "loopback_pkce" entry on the next load_pool() call.
Mirrors the fix applied to openai-codex in #37517.
hermes auth add google-gemini-cli called pool.add_entry() but never wrote
to providers["google-gemini-cli"] or set active_provider in auth.json.
_model_section_has_credentials() checks get_active_provider() first; with
active_provider unset and no api_key_env_vars configured for oauth_external
providers, the setup wizard reported "No inference provider configured" even
after a successful OAuth login.
Add _mark_google_gemini_cli_active() in auth.py: writes a minimal provider
state entry (email for display only) and calls _save_provider_state() to set
active_provider. The function deliberately does not copy access_token or
refresh_token — those are managed by agent.google_oauth in the Google
credential file and must not be duplicated in auth.json where they would
become stale.
pool.add_entry() is retained so "hermes auth list" continues to show the entry.
Runtime credential resolution continues to use agent.google_oauth directly.
Mirrors the fix applied to openai-codex (#37517) and xai-oauth (#37576).
Replace KeepAlive.SuccessfulExit=false dict with <key>KeepAlive</key><true/>
so launchd restarts hermes-gateway on any exit, matching the documented
drain-then-exit restart protocol used by --graceful-restart.
The hermes tools save summary printed '- kanban' (and would print
'+ kanban') for a platform even though kanban is never offered as a
checklist option. kanban is a check_fn-gated toolset whose tools are a
subset of the platform composite, so _get_platform_tools resolves it as
enabled, but _prompt_toolset_checklist only renders CONFIGURABLE_TOOLSETS
— so it can never survive into the returned selection. The added/removed
diff (current_enabled - new_enabled) then surfaced kanban as removed.
Scope the printed diff to the checklist's actual universe via the new
_checklist_toolset_keys() helper at all three diff sites (first-install,
all-platforms, per-platform). The persisted config is unaffected —
_save_platform_tools already preserves non-configurable entries; this was
purely a false-signal in the UI.
The gated dashboard verifies a session cookie by trying each registered
DashboardAuthProvider's verify_session in turn (the session cookie stores
only the access token, not which provider issued it). A provider that
doesn't recognise a token returns None; a provider whose IDP/JWKS is
unreachable raises ProviderError.
The loop used to return HTTP 503 on the FIRST ProviderError, before any
later provider got a turn. With multiple providers stacked, that means an
unreachable IDP for a session you didn't even use blocks login through a
different, reachable provider.
Concrete repro: a self-hosted-OIDC session hits the 'nous' provider first
(registered earlier); nous tries to reach Nous Portal's JWKS, which is
unreachable in a self-hosted deployment, so it raises — and the gate
503s before the 'self-hosted' provider can verify the token. Hit live
while testing the new self-hosted OIDC plugin against a local Keycloak.
Fix: a ProviderError from one provider is logged and the loop continues
to the next. A 503 is returned only if NO provider verified the token
AND at least one was unreachable — distinguishing a transient IDP outage
(don't force a needless re-login) from a token that's genuinely invalid
(fall through to refresh/relogin). Single-provider behaviour is
unchanged.
Tests: adds an _UnreachableProvider stub and three cases — unreachable
provider first must not block a working second; all-unreachable still
503s; reachable-but-unrecognised falls through to 401/relogin (not 503).
Mutation-tested: reverting the fix makes the first case fail with the
exact 503 bug.
The dashboard's embedded Chat surface (/chat, /api/ws, /api/pty) was gated
behind `hermes dashboard --tui` / HERMES_DASHBOARD_TUI=1. The desktop app and
the dashboard's own Chat tab both drive the agent over the /api/ws + /api/pty
WebSockets, so a dashboard started without the flag would pass the /api/status
health check but slam the chat WebSocket shut with WS code 4403 — the app
connects, reports "ready", and chat stays dead. This was the root cause behind
multiple user reports of the desktop app failing to connect to a self-hosted
gateway/dashboard, and it bit Docker and host installs alike.
Make the embedded chat unconditional:
- web_server.py: _DASHBOARD_EMBEDDED_CHAT_ENABLED defaults to True; drop the
embedded_chat parameter and the runtime reassignment from start_server().
The WS gates still read the constant (now always true) so the seam — and its
"rejects when disabled" contract test — stays meaningful.
- main.py: remove the `--tui` argument from the dashboard subparser and the
`embedded_chat = args.tui or HERMES_DASHBOARD_TUI==1` derivation.
- web/: isDashboardEmbeddedChatEnabled() returns true unconditionally; drop the
deprecated __HERMES_DASHBOARD_TUI__ alias and the dead LEGACY_TUI_RE scrape in
the vite dev-token plugin.
- apps/desktop/electron/main.cjs: drop `--tui` from the spawned dashboardArgs
(it would now error with "unrecognized arguments: --tui") and the redundant
HERMES_DASHBOARD_TUI env injection.
- Docker: no s6 run-script change needed — the script never passed --tui; the
HERMES_DASHBOARD_TUI env var is now simply a no-op, so the image works out of
the box with no extra var.
- Docs: remove every dashboard --tui / HERMES_DASHBOARD_TUI reference across the
CLI reference, env-var reference, docker/desktop/web-dashboard guides, in-app
tips, and the zh-Hans translations. The terminal `hermes --tui` / HERMES_TUI
references are intentionally left untouched.
Tests: 270 passing across web_server, dashboard lifecycle, host-header,
auth-gate, and docker-override-scripts suites.
Root installs on Linux (FHS layout, #15608) put the `hermes` command in
`/usr/local/bin` (on PATH) but symlinked the bundled node/npm/npx into
`~/.local/bin`, which isn't on PATH for a stock root shell. `node`/`npm`
were 'command not found' and `hermes dashboard` failed with 'npm is not
available' because its build-on-demand fallback couldn't find npm.
Fix: `install_node()` now symlinks into `get_command_link_dir()` — the same
helper the `hermes` command link already uses — so node/npm/npx land
wherever the command does (`/usr/local/bin` on FHS root, `~/.local/bin`
otherwise, `$PREFIX/bin` on Termux). Non-root and Termux installs are
unchanged.
Also fixes:
- `scripts/lib/node-bootstrap.sh`: adds `_nb_get_link_dir()` mirroring
the same root/Termux/user logic for the standalone bootstrap path
(used by `hermes update`, TUI node bootstrap, etc.)
- `hermes_cli/uninstall.py`: `remove_node_symlinks()` now checks all
candidate directories (`~/.local/bin`, `/usr/local/bin`, `$PREFIX/bin`)
so root FHS uninstalls don't leave orphan symlinks
Regression from #15608, which created the FHS path for the command but
left `install_node` pointed at the legacy user-local dir.
When 'hermes update' rebuilds the project venv (rmtree + uv venv on the
first managed-uv migration), the desktop-rebuild and profile-skills-sync
steps that follow both spawn sys.executable. Firing while the venv is
mid-rewrite makes the child interpreter abort with the bare stderr line
'No pyvenv.cfg file', surfacing as a spurious 'Desktop build failed' /
'default: sync failed' on an update that actually succeeded.
Add _wait_for_interpreter_venv_ready(): resolve the venv hosting
sys.executable and poll briefly for pyvenv.cfg to (re)appear before each
of those subprocess steps. No-op when the interpreter isn't venv-hosted.
The desktop rebuild also retries once after re-waiting, and keeps
streaming its output live (no capture). Best-effort throughout — callers
proceed regardless, so a genuinely broken venv still surfaces the real
error.
A bundled, zero-infrastructure 'just put a password on my dashboard'
provider that uses the supports_password extension point. No external IDP,
no database: sessions are stateless HMAC-signed tokens the provider mints
and verifies itself, and passwords are hashed with stdlib scrypt (no
third-party dependency — deliberately avoids bcrypt to keep the dep
surface unchanged).
- plugins/dashboard_auth/basic: BasicAuthProvider (scrypt verify with a
constant-time dummy-hash path for unknown users so the endpoint is not
a username-timing oracle; access/refresh tokens carry a 'kind' claim
that verify/refresh enforce; cross-secret tokens are rejected). The
register() entry point mirrors the Nous plugin's config/env precedence
(env wins; empty treated as unset) and LAST_SKIP_REASON channel.
- config.py: document the canonical dashboard.basic_auth.* surface
(username / password_hash / password / secret / session_ttl_seconds).
Activates only when username + (password or password_hash) are set, so
OAuth users and loopback/--insecure operators are unaffected. Without an
explicit secret a random per-process key is generated (logged): fine for a
single process, but sessions then don't survive restart or span workers.
The dashboard auth gate was OAuth-only: a DashboardAuthProvider could
authenticate only via a redirect to an IDP (start_login -> /auth/callback
-> complete_login). There was no first-class path for username/password
auth, so self-hosters who just want a password on their dashboard had no
clean option short of an external OAuth IDP.
Extend the provider framework with a parallel, non-redirect front door
that converges on the same Session + cookie + refresh machinery:
- base.py: add the optional supports_password flag and
complete_password_login(username, password) -> Session (default
raises NotImplementedError so an OAuth-only provider that forgets the
flag fails loudly). Add InvalidCredentialsError. OAuth providers are
unaffected (flag defaults False; the method is never called).
- routes.py: add POST /auth/password-login, mirroring the cookie-minting
tail of /auth/callback but skipping PKCE/state/code. Returns JSON
{ok, next} (the form POSTs via fetch). Generic 401 for both unknown
user and wrong password (no enumeration oracle); 404 hides whether a
provider exists or supports passwords; per-IP sliding-window rate
limit (10/min -> 429). /api/auth/providers now reports
supports_password so the login page can branch.
- middleware.py: allowlist /auth/password-login (a bootstrap route).
verify/refresh/revoke/ws-tickets/logout need zero changes — a password
session is just a Session with provider-minted opaque tokens.
- login_page.py: render a credential form (instead of a redirect button)
for supports_password providers, wired by a small inline script that
POSTs to /auth/password-login and navigates on success. OAuth-only
pages stay script-free.
* Port from google-gemini/gemini-cli#21541: back up corrupted config.yaml
When config.yaml fails to parse, load_config() silently falls back to
DEFAULT_CONFIG and leaves the broken file on disk. If the user then re-runs
the setup wizard or hermes config set (both rewrite config.yaml), their
broken-but-recoverable overrides are lost for good.
Adapts the policy-file recovery from gemini-cli#21541: on the first parse
warning for a given broken file, snapshot it to config.yaml.corrupt.<ts>.bak
(best-effort, symlink-guarded, size-deduped) and tell the user where it
landed. Unlike Gemini's version we deliberately do NOT reset config.yaml to a
clean state — hermes never silently mutates user config, and leaving it means
a hand-fixed file is re-read on the next load.
Tests: 3 new cases (backup created + content preserved + original untouched;
same-size backup dedup; symlink not copied). E2E verified with isolated
HERMES_HOME and a real tab-indented broken config.
* fix(dashboard): explain WHY a chat WS connection was refused
The embedded-chat PTY WebSocket (/api/pty) collapsed every rejection
into a bare close code: 4401 for any auth failure, 4403 for three
unrelated failures (host mismatch, origin mismatch, peer-IP). Neither
the server log nor the browser said which gate fired or why, so a
"chat won't connect" report was undiagnosable without a repro.
Server (web_server.py):
- _ws_auth_reason / _ws_host_origin_reason / _ws_client_reason return a
short machine-parseable reason; old bool wrappers kept for callers/tests.
- pty_ws splits the overloaded 4403 into 4401 (auth), 4403 (host/origin),
4408 (peer not allowed), 4404 (chat disabled), and sends the reason on
the close frame (clamped to the 123-byte RFC6455 limit).
- Each path logs one line: 'pty auth rejected reason=.. mode=.. cred=.. peer=..'
/ 'pty refused: <reason> ..'. Accepted path logs 'pty accepted peer=..
mode=.. cred=..' so an audit shows HOW a peer authed, not just that it did.
tui_gateway/ws.py:
- 'ws send/write failed' now logs error_type=<ExcName> so an exception
whose str() is empty (closed-transport sends) no longer logs 'error='.
web/src/pages/ChatPage.tsx:
- console.warn the real close code + server reason on every close.
- Map 4404/4408 to specific banners; 4401/4403 banners echo the server
reason; [session ended] prints the close code.
E2E verified all five reject paths + accepted path produce matching
close code, wire reason, and server log line.
The register command resolved the portal base URL purely from the stored
login, ignoring any override. That meant `HERMES_DASHBOARD_PORTAL_URL` (and
the absence of any flag) gave no way to point registration at a staging or
preview portal — the request always hit the login's portal, returning 404
against a branch that wasn't deployed there.
- _resolve_portal_base_url now takes an optional override (precedence:
override > stored login portal > prod default).
- New --portal-url flag; falls back to HERMES_DASHBOARD_PORTAL_URL env.
- Documents that the access token must be valid at the overridden portal
(it's minted by whoever you logged into).
- 3 new tests for override precedence.
Verified live against the PR #324 Vercel preview: CLI -> preview endpoint ->
real agent:{id} client_id written to .env.
Adds a CLI command that registers this install as a self-hosted dashboard
with the user's Nous Portal account, automating the manual browser flow on
/local-dashboards.
- New hermes_cli/dashboard_register.py: resolves a fresh Nous access token
from auth.json (fast-fails with a `hermes setup` hint when not logged in),
POSTs to {portal}/api/oauth/self-hosted-client, and writes
HERMES_DASHBOARD_OAUTH_CLIENT_ID into ~/.hermes/.env idempotently.
- Docker-style adjective_noun auto-naming; --name and --redirect-uri overrides.
- Persists HERMES_DASHBOARD_PORTAL_URL only when non-default and unset (so a
Vercel preview / staging portal sticks, prod default stays implicit).
- Refuses in managed/hosted installs (the orchestrator stamps the client_id).
- Post-register hint explains the OAuth gate only engages on a non-loopback bind.
- Nested 'register' subparser leaves bare `hermes dashboard` unchanged.
- 9 unit tests (name gen, fast-fails, POST shape, env writes, redirect URI,
portal-URL persistence, 401/403 mapping); dashboard lifecycle tests still green.
Depends on NousResearch/nous-account-service#324 (the portal endpoint).
`_install_dependencies` (hermes memory setup) hard-aborted with
"uv not found — cannot install dependencies" whenever `uv` was not on
PATH, even when a perfectly good `pip` was available. Slim container
images and some CI environments don't ship uv, so memory-provider
dependency installation dead-ended there for no good reason.
Now: use `uv pip install` when uv is present, otherwise fall back to
`<python> -m pip install` when pip3/pip is available, and only abort
(with the uv install hint) when neither is found. The "Run manually:"
hints reflect whichever installer was selected.
Salvages #5954 by @MustafaKara7. Their patch added redundant local
`import subprocess` / `import sys` (both are already in scope — module
-level `sys`, function-top `subprocess`); this salvage drops those and
adds a regression test (TestInstallDependenciesRunner) covering all
three paths (uv / pip-fallback / abort). Verified adversarially: the
pip-fallback test fails against origin/main's unfixed code with the
exact dead-end symptom and passes with the fix.
Closes#5954.
Co-authored-by: MustafaKara7 <186085093+MustafaKara7@users.noreply.github.com>
Both POST /api/model/set and the profile-model writer hand-rolled the same
provider/default/base_url/context_length reconciliation. Extract it into
_apply_main_model_assignment so the custom-vs-hosted base_url logic lives in
one place — removing the future-drift risk where one site learns about
custom base_url persistence and the other forgets.
Behavior unchanged; pinned with a direct helper unit test.
* Port from google-gemini/gemini-cli#21541: back up corrupted config.yaml
When config.yaml fails to parse, load_config() silently falls back to
DEFAULT_CONFIG and leaves the broken file on disk. If the user then re-runs
the setup wizard or hermes config set (both rewrite config.yaml), their
broken-but-recoverable overrides are lost for good.
Adapts the policy-file recovery from gemini-cli#21541: on the first parse
warning for a given broken file, snapshot it to config.yaml.corrupt.<ts>.bak
(best-effort, symlink-guarded, size-deduped) and tell the user where it
landed. Unlike Gemini's version we deliberately do NOT reset config.yaml to a
clean state — hermes never silently mutates user config, and leaving it means
a hand-fixed file is re-read on the next load.
Tests: 3 new cases (backup created + content preserved + original untouched;
same-size backup dedup; symlink not copied). E2E verified with isolated
HERMES_HOME and a real tab-indented broken config.
* feat(dashboard): add Debug Share to the System page
Surface `hermes debug share` in the dashboard. The System > Operations
section gets a dedicated card that uploads a redacted report + full logs
and returns the paste URLs as real, copyable links instead of a log tail.
- debug.py: factor a pure build_debug_share() returning structured
{urls, failures, redacted, auto_delete_seconds}; run_debug_share now
calls it (CLI output unchanged).
- web_server.py: POST /api/ops/debug-share runs the share core in a
worker thread and returns the structured payload synchronously (the
URLs are the whole point — not a backgrounded action).
- api.ts: runDebugShare() + DebugShareResponse.
- SystemPage.tsx: share card with a redaction toggle (on by default),
per-link + copy-all buttons, and the 6h auto-delete countdown.
- tests: build_debug_share core + endpoint (redact toggle, failure 502,
token gate).
Salvage of #35508 (@dchenk), rebased onto current main. Resolved the
tests/tools/test_stage2_hook_puid_pgid.py conflict (kept both the
envdir-creation regression test on main and the new config-migration
tests).
Docker image upgrades replace code under $INSTALL_DIR but preserve
$HERMES_HOME on the mounted volume, so the persisted config.yaml never
received the schema migrations that non-Docker `hermes update` runs
(#35406). This adds scripts/docker_config_migrate.py, invoked from
stage2-hook after first-boot seeding and before gateway services start:
it backs up config.yaml + .env, runs migrate_config(interactive=False),
and honors HERMES_SKIP_CONFIG_MIGRATION=1 for manual control.
Also fixes a latent bug in check_config_version(): it called load_config()
which deep-merges DEFAULT_CONFIG, so a legacy config with no raw
_config_version falsely reported as already-current. It now reads the raw
on-disk file so legacy configs are correctly detected for migration.
Differs from #35508 as submitted (Option B cleanup): dropped the
`_config_version` line added to cli-config.yaml.example and removed the
accompanying test_cli_config_example_declares_latest_version change-detector
test. The example is a copy-template and has no business asserting a schema
version; check_config_version() reads the user's real config.yaml, not the
example. This removes a second sync point that drifts on every version bump.
Closes#35508. Fixes#35406.
Co-authored-by: Dmitriy Cherchenko <17372886+dchenk@users.noreply.github.com>
`hermes mcp add --auth header` built `Authorization: Bearer ${MCP_X_API_KEY}`
and passed it straight to the discovery probe without interpolation, so the
probe sent the literal placeholder and auth-requiring servers (e.g. n8n)
returned 401. Runtime tool loading worked because `_load_mcp_config()`
interpolates, but the four CLI probe call sites (add/test/login/configure)
all used unresolved config.
Resolve `${ENV}` inside `_probe_single_server` via a new
`_resolve_mcp_server_config()` (load_hermes_dotenv + _interpolate_env_vars),
mirroring runtime loading. This covers all four call sites, not just add.
Also strip a leading `Bearer ` from pasted tokens before saving to
`MCP_*_API_KEY`, so a token pasted with the prefix doesn't produce
`Bearer Bearer <jwt>` (also a 401).
Reported with a precise root-cause analysis in #37792.
Co-authored-by: ThyFriendlyFox <116314616+ThyFriendlyFox@users.noreply.github.com>
The runtime resolver reads model.base_url from config and ignores the
OPENAI_BASE_URL env var, so a self-hosted endpoint could not be configured
from the GUI. Two changes enable it:
- POST /api/model/set accepts an optional base_url and persists it as
model.base_url when provider=custom (still clearing stale base_url for
hosted providers).
- POST /api/providers/validate now returns the model ids a custom endpoint
advertises at /v1/models, so the GUI can auto-pick a default without
asking the user to type a model name.
Refs desktop onboarding "Local / custom endpoint" bug.
The setup-flow provider list showed two Anthropic/Claude entries with
ambiguous labels ('Anthropic (Claude API)' and 'Claude Code (subscription)')
in no deliberate order. Relabel and reorder so the distinction and the
subscription caveat are explicit:
- 'Anthropic API Key' (PKCE, API path)
- 'Anthropic OAuth: Required Extra Usage Credits to Use Subscription' (external)
- Both Anthropic entries moved to the bottom of the list.
- 'OpenAI Codex (ChatGPT)' -> 'OpenAI OAuth (ChatGPT)', now first after Nous.
Applied consistently to the backend OAuth catalog (web_server.py) and the
desktop onboarding overlay's PROVIDER_DISPLAY title/order map; test
assertions updated to the new titles.
The TUI hardcoded --max-old-space-size=8192. V8 is not cgroup-aware, so in a
Docker/k8s container capped below ~9-10GB the heap grows past the container
limit and the cgroup OOM-killer SIGKILLs the Node parent BEFORE V8's own heap
monitor fires. SIGKILL runs no JS handler, writes no [tui-parent] breadcrumb,
and closes the gateway child's stdin — the user sees only a bare gateway
'stdin EOF'. Complements #38224 (trail-text cap), which reduced pressure but
left the 8GB-vs-container mismatch in place.
- _read_cgroup_memory_limit(): read cgroup v2 (memory.max) then v1
(memory.limit_in_bytes); handle 'max', the v1 unlimited sentinel, blank/zero,
and >=1PB as unconstrained.
- _resolve_tui_heap_mb(): unconstrained -> 8192; constrained -> 75% of the
cgroup limit (headroom for non-heap RSS + the Python child sharing the
cgroup), floored at 1536MB, never above 8192.
- NODE_OPTIONS block uses the sized value; still respects a user-supplied
--max-old-space-size.
Net: V8 now GCs/exits gracefully (onCritical breadcrumb fires) instead of being
reaped silently. Display/transport only — no agent context or behavior change.
Tests: tests/hermes_cli/test_tui_heap_sizing.py (20 tests).
The stash/restore cycle in the update path was observed to clobber
freshly-pulled source files (apps/desktop/ deletion -> Vite
'[UNRESOLVED_ENTRY] Cannot resolve entry module index.html'). On a
managed clone the user never edits the source tree, so any 'dirty' state
is pure git artifact (CRLF renormalization, npm lockfile churn, files
left behind when a directory was deleted upstream such as
apps/bootstrap-installer/). Stashing that and re-applying it after a pull
is fragile and unnecessary.
- hermes update (hermes_cli/main.py): on a non-fork (managed) clone,
discard working-tree dirt via reset --hard HEAD + clean -fd instead of
stash/apply. Forks keep the stash machinery so intentional edits
survive. Also pin core.autocrlf=false on Windows so the dirt is never
created (mirrors install.ps1 #38239).
- install.sh: replace the update-path stash/restore dance with a hard
reset to origin/<branch>; the installer is a managed-only entry point.
- install.sh + install.ps1 desktop stage: prefer 'npm ci' (wipes and
reinstalls node_modules from the lockfile) over bare 'npm install',
which can report 'up to date' against a stale marker while node_modules
is empty -- leaving tsc unresolved so 'npm run pack' fails.
Tests: managed clone cleans instead of stashing; fork still stashes;
existing stash tests force the stash path explicitly.
Self-review of #38465 surfaced three real items:
1. SystemExit escape (defense): `_login_nous` raises SystemExit(130)/(1) on
cancel/failure. The logged-out login path inside `_model_flow_nous` catches
it, but the expired-session re-login path (main.py) only catches Exception,
so a Ctrl-C during re-auth could propagate past `_run_portal_one_shot` and
kill the CLI. Add SystemExit to the portal handler so all cancel/abort cases
end with the graceful 'Setup cancelled / retry later' message.
2. Doc sweep: the model-pick step was only added to the bare-`hermes portal`
prose. Propagate it to the surfaces describing `hermes setup --portal`
behavior that still omitted model selection:
- `--portal` argparse help (main.py)
- nous-portal.md intro + the numbered 'what it does' step list (EN + zh-Hans)
- run-hermes-with-nous-portal.md 'default model after setup --portal' line,
which was now contradictory (there's a picker, not a forced default) (EN + zh)
3. Test coverage: add parametrized regression test asserting the portal handler
swallows KeyboardInterrupt / EOFError / SystemExit (returns None, no escape).
Note on 'Skip (keep current)': delegating to _model_flow_nous means picking
Skip preserves the prior provider instead of force-switching to nous — this is
intentional and matches quick setup exactly; docs now say 'sets Nous as your
provider (when you pick a model)' rather than unconditionally.
`hermes portal` / `hermes setup --portal` previously logged in and set
provider=nous but left the model UNSELECTED (blank -> runtime default) and
never showed a picker — unlike the first-time quick setup, which runs the
model picker.
Route `_run_portal_one_shot` through `_model_flow_nous` — the exact same
routine quick setup (`_run_first_time_quick_setup`) and `hermes model` -> Nous
use. It handles both the logged-out path (device-code OAuth, which picks a
model internally) and the logged-in path (curated Nous model picker), then
offers the Tool Gateway opt-in and sets provider=nous. Net effect: `hermes
portal` now offers a model picker every time and is a true single-command
collapse of quick setup's Nous step.
Removes the hand-rolled auth_add_command + manual provider write + separate
Tool Gateway prompt (now a single source of truth). Re-syncs the in-memory
config from disk afterward so a caller's later save_config can't clobber the
model/provider written by the login flow.
Docs (CLI help, portal_cli docstrings, nous-portal EN + zh-Hans) updated to
mention model selection. New regression test asserts `_run_portal_one_shot`
delegates to `_model_flow_nous`.
Verified live: `hermes portal` now shows the 27-model curated picker, 'Skip
(keep current)' preserves prior provider/model.
The two retry hints inside _run_portal_one_shot (shown when the OAuth login
fails) still suggested `hermes auth add nous --type oauth`. Since this path
backs both `hermes portal` and `hermes setup --portal`, point users at the
new human-readable `hermes portal` for consistency.
`hermes portal` (no subcommand) now runs the one-shot Nous Portal onboarding
— OAuth login, switch provider to Nous, offer Tool Gateway — identical to
`hermes setup --portal` and the human-readable alias for
`hermes auth add nous --type oauth` (which still works).
The prior status default moves to `hermes portal info`; `status` is kept as a
hidden back-compat alias. `open`/`tools` subcommands are unchanged.
User-facing hints and docs (status.py, conversation_loop 401 guidance,
SystemPage, README, website docs + zh-Hans) now point at `hermes portal` /
`hermes portal info`. `--manual-paste` references keep the explicit auth
command since `hermes portal` does not expose that flag.
Avoid stale WebSocket events from an old reconnect attempt flipping the gateway state after a newer socket opens. Also limit session-search dedupe to compression edges so branch-specific hits still open the branch instead of collapsing to the parent.
Four related desktop session-management bugs:
- Pins lost until refresh: pinned sessions are joined against the
paginated in-memory session list, so a pinned chat that aged off the
most-recent page got evicted on the next refresh (every message.complete
triggers one) and the Pinned section went empty. mergeWorkingSessions ->
mergeSessionPage now also preserves pinned rows (matched by live id or
lineage root). Pin id checks in the chat header, command center, and
delete/archive are normalized to the durable sessionPinId so pins survive
auto-compression.
- Stuck on "Starting Hermes" after sleep: macOS sleep drops the renderer
WebSocket; nothing reconnected on wake so the composer stayed disabled.
The gateway boot hook now auto-reconnects with backoff on close/error and
on wake signals (powerMonitor resume/unlock-screen IPC, window online,
visibilitychange). connect() gains an open timeout so a hung reconnect
can't deadlock in 'connecting'. Composer placeholder distinguishes
"Reconnecting to Hermes" from a cold start.
- Loses chats from itself: the same hard-replace that dropped pins also
dropped loaded sessions; mergeSessionPage keeps them.
- Multiple copies/branches in search: /api/sessions/search deduped only by
raw session_id, so compression segments and branches surfaced as separate
hits. It now dedupes by lineage root and returns the live compression tip,
matching the session_search tool's behavior.
_exec_schtasks ran schtasks.exe with text=True but no encoding/errors, so
localized Windows (e.g. Chinese) output in the console code page raised
UnicodeDecodeError tracebacks from subprocess' reader threads during
`hermes gateway status`. Decode with the locale's preferred encoding and
errors="replace" so non-UTF-8 status output is read cleanly.
Fixes#38172
* fix(dashboard): clamp PTY resize dimensions for WSL2 winsize garbage
WSL2 reports columns=131072, rows=1 from a broken winsize probe. The
dashboard /chat tab forwards xterm.js dimensions through PtyBridge.resize(),
which packs them as unsigned short via struct.pack. 131072 > 65535 raised
struct.error — uncaught (only OSError was handled) — breaking the resize
path and leaving the TUI laid out for a one-row, absurdly-wide screen, which
surfaces as blank/disappearing text.
Clamp cols/rows to a sane [1, 2000]x[1, 1000] range before packing.
Non-finite/non-integer probes fall back to the minimum so nothing can reach
struct.pack and raise.
* test(dashboard): de-flake pub/events broadcast test
test_pub_broadcasts_to_events_subscribers round-tripped a frame through
two nested Starlette TestClient WebSocket portals within a 10s wall-clock
budget. Under heavy parallel CI load a starved ASGI thread occasionally
blew that budget even though the server logic is correct, producing
intermittent 'broadcast not received within 10s' failures.
Drive _broadcast_event directly under asyncio with fake subscribers
instead. Same fan-out contract (verbatim delivery to every subscriber on
the channel, nothing to other channels), zero scheduling surface. Runs in
~0.3s, deterministic across 10 consecutive runs.
* feat(desktop): enrich profiles dashboard and de-dupe channel env vars
Add active-profile switching, role descriptions (manual + auto-generate
via the auxiliary LLM), per-profile model selection, and gateway-running
/ distribution badges to the GUI Profiles page. New profile creation
gains clone-all, optional description and model assignment.
Hide messaging-platform credentials (channel_managed) from the Keys/Env
page since the Channels page is the canonical surface for them, and
relabel the trimmed "messaging" category as "Gateway".
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(desktop): address review feedback on profiles/env changes
- ProfilesPage: scope the action-menu outside-click handler to the menu's
own container via a ref so opening one card's menu no longer leaves
others open.
- EnvPage: route the "Gateway" label and hint through i18n
(t.common.gateway / gatewayHint) instead of hard-coded English, with an
English fallback for untranslated locales.
- web_server: only report description_auto=true when auto-generation
actually succeeded.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(desktop): address second-round review on profiles
- ProfilesPage: treat describe-auto success by null-checking the
description and trust the response's description_auto flag instead of
assuming true; disable the model-editor Save button unless the selected
choice resolves to a real /api/model/options entry (avoids silent
no-op saves).
- tests: cover the new profile endpoints (active get/set + 404,
description round-trip + 404, model round-trip + 400 validation, and
describe-auto success/failure contracts).
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(desktop): more profiles review fixes (toggles, races, tests)
- ProfilesPage: use the canonical `active` returned by setActiveProfile;
make the SOUL/description/model action-menu items toggle their editor
closed when already open; guard description save/auto-describe against
stale responses via an activeDescRequest ref so a late reply can't
clobber a different open editor.
- tests: assert /api/env channel_managed classification matches
_channel_managed_env_keys().
Co-authored-by: Cursor <cursoragent@cursor.com>
---------
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(doctor): detect + repair stale HERMES_MAX_ITERATIONS .env ghost shadowing config.yaml
hermes doctor now flags when ~/.hermes/.env carries a HERMES_MAX_ITERATIONS
value that disagrees with agent.max_turns in config.yaml, and 'hermes doctor
--fix' removes the stale .env line so config.yaml is authoritative. 'hermes
config show' surfaces the same drift inline under Max turns.
The setup wizard stopped dual-writing this value, but users who edited only
config.yaml from a pre-fix install keep a .env ghost. The gateway bridge
normally overrides it at startup, but if the bridge bails on any earlier
config-parse error the ghost silently wins — config says 400 while the
gateway activity line reads N/90.
The detector reads the .env FILE directly (load_env), not get_env_value/
os.environ, since the startup bridge may already have overwritten os.environ
with the config value.
Closes#17534.
* fix(config): stop offering HERMES_MAX_ITERATIONS as an editable env var
Removes HERMES_MAX_ITERATIONS from OPTIONAL_ENV_VARS so the dashboard env
editor (PUT /api/env) and any env-var prompt no longer let a user write it
to .env — which would recreate the stale ghost that shadows config.yaml's
agent.max_turns (issue #17534). The iteration budget is configured only via
config.yaml; the env var stays a read-only backward-compat fallback in the
gateway/CLI, never a promoted write target.
Regression test asserts it is absent from OPTIONAL_ENV_VARS.
Adds backend-neutral observer hooks for plugins: session, turn, API
request, tool, approval, and subagent lifecycle events with stable
correlation IDs (session_id, task_id, turn_id, api_request_id,
tool_call_id, parent/child subagent ids). Extends VALID_HOOKS with
api_request_error and subagent_start.
Hot path is zero-cost when no plugin subscribes: has_hook()/presence
checks gate all payload construction, request payloads are returned
by reference when no middleware rewrites, and the sanitized response
payload no longer embeds raw response objects.
Bundles the optional NeMo-Relay observability plugin
(plugins/observability/nemo_relay) as an in-repo consumer of the new
hooks, peer to the existing langfuse plugin. Fails open when the
optional nemo-relay package is not installed.
Authored-by: Bryan Bednarski <bbednarski@nvidia.com>
Salvaged from #29722 onto current main.
A kanban worker that exhausted its retries purely on a provider rate
limit / quota wall (e.g. opencode-go's 5-hour window) exited with code 1.
The dispatcher counted that as a crash, and with DEFAULT_FAILURE_LIMIT=2
two quota-wall hits permanently blocked the card. Fanning out many
workers against one shared quota made this routine.
Now a rate-limited worker exits with EX_TEMPFAIL (75); the dispatcher
classifies that as a 'rate_limited' exit, releases the task back to
'ready' WITHOUT incrementing consecutive_failures (the breaker can't trip
on a transient throttle), and the respawn guard defers the next attempt
on a cooldown (default 5min, HERMES_KANBAN_RATE_LIMIT_COOLDOWN_SECONDS)
until the quota window clears. Genuine crashes still count and trip the
breaker as before. The 120s Retry-After cap is unchanged — no worker
parks for hours holding a slot.
- conversation_loop.py: surface failure_reason in the exhaustion return
- cli.py: kanban worker picks exit 75 on rate_limit/billing failure
- kanban_db.py: rate_limited exit kind, no-count requeue, cooldown guard
The dashboard's update button ran 'hermes update' immediately with no
preview. Now the System page shows whether an update is available and
asks the user to confirm before applying it.
- New GET /api/hermes/update/check: reports install method, current
version, and commits-behind (via banner.check_for_updates, 6h-cached;
?force=1 busts the cache). Soft-fails to behind=null on network error;
marks docker/nix/homebrew as can_apply=false with the out-of-band cmd.
- System page: update-status badge on the Hermes version row (latest /
N behind), a Check-for-updates button, and an Update-now button that
opens a ConfirmDialog showing the commit count before POST /api/hermes/
update fires. Cached status loads with the rest of the page.
- Docs + 5 endpoint tests (git/up-to-date/docker/soft-failure + auth gate).
The Electron desktop app writes boot failures, backend spawn output, and
Python tracebacks to HERMES_HOME/logs/desktop.log, but debug-share only
captured agent/errors/gateway — so desktop boot issues never made it into
shared debug reports.
- logs.py: register desktop -> desktop.log (enables 'hermes logs desktop')
- debug.py: capture desktop snapshot, add to summary report, upload full
desktop.log in 'share', update privacy notice
- gateway /debug inherits the desktop tail via collect_debug_report()
- main.py + docs: help text and log-name table (also adds missing gui row)
- tests: desktop seed in fixture, new report test, three_pastes -> four_pastes
get_mcp_status() treated every non-connected server as a failure, so a
server configured with enabled: false rendered as red '— failed' in the
startup banner even though it was intentionally off. Add a 'disabled'
field derived from the enabled flag and render disabled servers dim as
'— disabled' instead.
Follow-up to Ben's PR #37892. Adds a TestInternalCredential block to
test_dashboard_auth_ws_tickets.py exercising the mint-once stability,
multi-use, unminted-rejection, empty-value, wrong-value, reset-and-remint,
and ticket-store-independence branches directly (previously only covered
indirectly via _ws_auth_ok, which left the unminted and empty-value
branches unexercised).
Also corrects the consume_internal_credential docstring: the returned
identity dict is discarded by the current _ws_auth_ok caller (which only
needs the boolean outcome), so the prior 'carry it into its session log'
wording over-promised.
The embedded-TUI PTY child attaches to two server-internal WebSockets:
/api/ws (its primary JSON-RPC gateway backend) and /api/pub (the event
sidecar). Both URLs are built server-side in web_server.py and handed to
the child via its environment.
In OAuth-gated mode (auth_required=true, every hosted Fly agent), _ws_auth_ok
unconditionally rejects the legacy ?token=<_SESSION_TOKEN> path — a leaked
session token must not grant WS access once the gate is engaged. But
_build_gateway_ws_url() still only emitted ?token=, with no gated-mode
branch (its sibling _build_sidecar_url had been given a ticket branch; the
gateway-url builder was missed). So the TUI child's /api/ws upgrade was
rejected 4401 -> 'gateway websocket connection failed' -> 'gateway startup
timeout', leaving the embedded chat unusable on every gated deployment.
A single-use 30s browser ticket is the wrong shape for this link: the child
reads its attach URL once at startup and reuses it on every reconnect, and
on a slow cold boot it may not dial within the TTL. (_build_sidecar_url's
own docstring already flagged this fragility.)
Fix: add a process-lifetime, multi-use internal credential to
dashboard_auth.ws_tickets (internal_ws_credential / consume_internal_credential),
minted once per process and NEVER injected into the SPA — it only leaves the
process via a spawned child's env, so browser-side XSS can't read it, and a
leak grants no more than a ticket already does. _ws_auth_ok accepts it via
?internal= in gated mode only. Both _build_gateway_ws_url and
_build_sidecar_url now use it, so the child can reconnect both sockets.
Loopback / --insecure behavior is unchanged (still ?token=).
Needs review: touches _ws_auth_ok + dashboard_auth (core auth surface).