hermes-agent

Author	SHA1	Message	Date
Brooklyn Nicholson	cf9dc366dd	refactor(desktop): drop per-session icons, read-only cross-profile reads The per-session icon picker added more noise than value — rip it out end to end (sessions.icon column, set_session_icon, the PATCH field, the picker UI, and the SessionInfo.icon type). The cross-profile session aggregator now opens each profile's state.db read-only (mode=ro, no schema init), so listing other profiles on every sidebar refresh never DDLs or takes a write lock on their live DBs. The single-profile hot path stays on par with /api/sessions.	2026-06-04 18:24:35 -05:00
Brooklyn Nicholson	b94b3622b5	feat(desktop): per-session profile switching + cross-profile sessions Add first-class profile support to the desktop app without app reloads. - Swap the single live gateway onto a session's profile lazily (spawned on demand by the Electron backend pool), so one backend serves the active profile and others stay cold — no OOM with many profiles. - Aggregate sessions across profiles by reading each profile's state.db read-only; unified "All profiles" view groups sessions per profile with per-profile pagination, while the default view stays scoped to one profile. - Add an Arc-style profile rail at the sidebar foot: a default<->all toggle pinned left, colored named-profile squares scrolling between, Manage pinned right. Profile identity is a deterministic per-name color. - Route profile-scoped REST (config/env/skills/tools/model) to the active gateway profile and invalidate React Query caches on swap. Single-profile users never trigger a swap, so their path is unchanged. Backend: - web_server: profile-aware active/list endpoints + per-profile session totals; hermes_state: session_count(exclude_children); main.py: honor --profile over HERMES_HOME env for pooled backends. UI primitives: - Add a position-aware Tip tooltip (instant, themed) as a drop-in for native title=, and strip redundant tooltips from self-descriptive chrome.	2026-06-04 16:35:34 -05:00
worlldz	081694c111	fix(kanban): isolate board override per concurrent call	2026-06-04 07:39:53 -07:00
Teknium	fef04a197e	fix(desktop): purge electron cache unconditionally, not via stdlib zipfile gate The salvaged detector validated each cached electron-*.zip with zipfile.testzip() and only purged ones it judged corrupt. But stdlib zipfile reads from the end-of-central-directory backward, so it silently tolerates prepended/concatenated junk — which is exactly the corruption the bug report names ('86257938 extra bytes at beginning or within zipfile', a partial download resumed into the same file). testzip() returns clean on those zips, so the self-heal never fired for the reported failure mode. Drop the self-rolled validator: on any packaged-build failure, purge the version's cached zips AND the half-written unpacked dir, then retry once. @electron/get re-downloads with its own SHASUM verification — the real source of truth, which catches prepend/concat/truncate alike. An unrelated failure just costs one clean re-download and fails the same way. Verified empirically: zipfile.testzip() returns None (clean) on a prepended-junk zip; the unconditional purge removes it correctly.	2026-06-04 07:17:33 -07:00
Harry Riddle	f583c6ebd5	fix(desktop): recover from corrupt cached Electron download on build hermes desktop failed on Linux with an ENOENT renaming release/linux-unpacked/electron -> Hermes. Root cause is a corrupt cached Electron zip (~/.cache/electron/electron-.zip): app-builder unpack-electron extracts a partial tree from the bad zip that is missing the electron binary, so electron-builder dies on the final rename. Re-running repeats the broken extraction, leaving the desktop app permanently unlaunchable until the cache is manually purged. - Add _electron_download_cache_dirs() + _purge_corrupt_electron_cache() to hermes_cli/main.py: validate every electron-.zip via zipfile.testzip() and delete corrupt ones; honor electron_config_cache / ELECTRON_CACHE overrides with per-OS defaults. - Wire purge + single retry into cmd_gui packaged-build failure path so a poisoned download self-heals (electron re-downloads clean). - Add beforePack hook (apps/desktop/scripts/before-pack.cjs) to wipe the target unpacked dir before staging, making packaging idempotent across interrupted runs. Cross-platform, best-effort. - Tests: corrupt-zip detector, cmd_gui purge/retry/launch path, no-retry-when-clean path, and node --test for the cleanup helper.	2026-06-04 07:17:33 -07:00
Frowtek	3858cf4307	fix(cli): honor global-root active_provider fallback for named profiles	2026-06-04 07:08:30 -07:00
ethernet	a6a0a5b1b0	fix(desktop): detect linux arm64 binary	2026-06-04 09:51:26 -04:00
teknium1	c136eb4de1	fix(update): harden venv rebuild + verify core deps after install Two complementary fixes for a silent partial-install failure that bit ``hermes update`` in the wild: a fresh checkout pulled 145 commits, ``rebuild_venv`` failed to recreate the venv on Windows because ``shutil.rmtree(ignore_errors=True)`` couldn't delete files held open by the running ``hermes.exe`` shim. ``uv venv`` then refused with "A directory already exists at: venv" and the update fell back to installing on top of the stale venv. The resulting partial install missed exactly one newly-added base dep — ``pathspec==1.1.1`` — which ``hermes desktop --build-only`` imports at the top of its content-hash check. The desktop rebuild died with ModuleNotFoundError and the parent update only logged "⚠ Desktop build failed (non-fatal)". Same root cause made the "default: sync failed" line in the skill-sync stage, because that sync subprocess hit the same missing import. Fix 1: ``rebuild_venv`` retries with ``--clear`` ------------------------------------------------ If ``uv venv`` fails with "already exists" in stderr (which is what uv prints, and what uv's own hint tells you to fix with --clear), retry once with ``--clear``. Only this specific failure pattern triggers the retry — disk-full / interpreter-download failures still surface as before so we don't mask real problems. Fix 2: post-install dep verification ------------------------------------ Belt-and-suspenders so future uv resolver quirks (or any other cause of partial installs) surface immediately instead of hours later in a downstream subprocess. After ``_install_python_dependencies_with_optional_fallback`` runs, ``_verify_core_dependencies_installed``: 1. Reads ``[project.dependencies]`` straight from pyproject.toml (so we don't trust the venv's stale metadata). 2. Filters by environment markers via ``packaging.requirements.Requirement`` so cross-platform exclusions (``ptyprocess ; sys_platform != 'win32'``) don't false-positive on Windows. 3. Runs ``importlib.metadata.version()`` for each remaining dep inside the target venv interpreter (resolved from ``VIRTUAL_ENV``, not ``sys.executable``). 4. If anything is missing, reinstalls the base group with ``--reinstall`` to force re-resolution. If a second probe still reports missing deps, force-installs each one with its pinned spec. 5. Treats final failure as a warning rather than a hard error — a single broken-on-PyPI dep shouldn't block an otherwise-successful update — but the message points at ``hermes update --force`` and names the missing packages so the user knows what's wrong. Tests ----- - ``TestRebuildVenv::test_retries_with_clear_when_dir_already_exists`` — simulates the rmtree-couldn't-delete-it failure mode and asserts the ``--clear`` retry path is taken and succeeds. - ``TestRebuildVenv::test_does_not_retry_when_first_failure_is_not_dir_exists`` — guards against masking real failures (disk full, etc.). - ``test_verify_core_dependencies.py`` — 7 tests covering the happy path, the regression (missing pathspec triggers --reinstall), the per-package fallback when --reinstall doesn't help, the platform-marker filter so Windows doesn't try to install ptyprocess, the missing-pyproject noop, and the VIRTUAL_ENV resolver. Co-authored-by: Kyssta <218078013+kyssta-exe@users.noreply.github.com>	2026-06-04 06:05:41 -07:00
AhmetArif0	cd68b8f0e8	fix(auth): set active_provider after hermes auth add qwen-oauth hermes auth add qwen-oauth called pool.add_entry() but never wrote to providers["qwen-oauth"] or set active_provider in auth.json. _model_section_has_credentials() checks get_active_provider() first; with active_provider unset and no api_key_env_vars configured for oauth_external providers, the setup wizard reported "No inference provider configured" even after a successful Qwen CLI OAuth login. Add _mark_qwen_oauth_active() in auth.py: writes a minimal provider state entry (base_url for display only) and calls _save_provider_state() to set active_provider. The function deliberately does not copy the api_key — that lives in the Qwen CLI credential file managed by _save_qwen_cli_tokens / resolve_qwen_runtime_credentials and must not be duplicated in auth.json where it would become stale. pool.add_entry() is retained so "hermes auth list" continues to show the entry. Runtime credential resolution continues to use resolve_qwen_runtime_credentials. Mirrors the fix applied to openai-codex (#37517) and xai-oauth (#37576).	2026-06-04 05:58:33 -07:00
AhmetArif0	5f62ba8e4b	fix(auth): use _save_xai_oauth_tokens in auth_commands to set active_provider hermes auth add xai-oauth called pool.add_entry() directly, writing only the credential-pool entry (source "manual:xai_pkce") without touching providers["xai-oauth"] or setting active_provider in auth.json. _model_section_has_credentials() checks get_active_provider() first; with active_provider unset and no api_key_env_vars configured for oauth_external providers, the setup wizard reported "No inference provider configured" even after a successful OAuth login. Use _save_xai_oauth_tokens() — the canonical path already called from the hermes model xAI login flow — which writes providers["xai-oauth"]["tokens"] (setting active_provider) and lets _seed_from_singletons seed the pool with a "loopback_pkce" entry on the next load_pool() call. Mirrors the fix applied to openai-codex in #37517.	2026-06-04 05:48:50 -07:00
AhmetArif0	34a2903527	fix(auth): set active_provider after hermes auth add google-gemini-cli hermes auth add google-gemini-cli called pool.add_entry() but never wrote to providers["google-gemini-cli"] or set active_provider in auth.json. _model_section_has_credentials() checks get_active_provider() first; with active_provider unset and no api_key_env_vars configured for oauth_external providers, the setup wizard reported "No inference provider configured" even after a successful OAuth login. Add _mark_google_gemini_cli_active() in auth.py: writes a minimal provider state entry (email for display only) and calls _save_provider_state() to set active_provider. The function deliberately does not copy access_token or refresh_token — those are managed by agent.google_oauth in the Google credential file and must not be duplicated in auth.json where they would become stale. pool.add_entry() is retained so "hermes auth list" continues to show the entry. Runtime credential resolution continues to use agent.google_oauth directly. Mirrors the fix applied to openai-codex (#37517) and xai-oauth (#37576).	2026-06-04 05:44:22 -07:00
ashishpatel26	c9b62061d4	fix(cli): launchd KeepAlive unconditional restart (#37388) Replace KeepAlive.SuccessfulExit=false dict with <key>KeepAlive</key><true/> so launchd restarts hermes-gateway on any exit, matching the documented drain-then-exit restart protocol used by --graceful-restart.	2026-06-04 05:38:12 -07:00
Teknium	df9fb8e5e6	fix(tools): stop hermes tools reporting kanban as removed (#38918) The hermes tools save summary printed '- kanban' (and would print '+ kanban') for a platform even though kanban is never offered as a checklist option. kanban is a check_fn-gated toolset whose tools are a subset of the platform composite, so _get_platform_tools resolves it as enabled, but _prompt_toolset_checklist only renders CONFIGURABLE_TOOLSETS — so it can never survive into the returned selection. The added/removed diff (current_enabled - new_enabled) then surfaced kanban as removed. Scope the printed diff to the checklist's actual universe via the new _checklist_toolset_keys() helper at all three diff sites (first-install, all-platforms, per-platform). The persisted config is unaffected — _save_platform_tools already preserves non-configurable entries; this was purely a false-signal in the UI.	2026-06-04 03:31:43 -07:00
Ben	616c0a36b6	fix(dashboard-auth): don't abort verify chain on one provider's ProviderError The gated dashboard verifies a session cookie by trying each registered DashboardAuthProvider's verify_session in turn (the session cookie stores only the access token, not which provider issued it). A provider that doesn't recognise a token returns None; a provider whose IDP/JWKS is unreachable raises ProviderError. The loop used to return HTTP 503 on the FIRST ProviderError, before any later provider got a turn. With multiple providers stacked, that means an unreachable IDP for a session you didn't even use blocks login through a different, reachable provider. Concrete repro: a self-hosted-OIDC session hits the 'nous' provider first (registered earlier); nous tries to reach Nous Portal's JWKS, which is unreachable in a self-hosted deployment, so it raises — and the gate 503s before the 'self-hosted' provider can verify the token. Hit live while testing the new self-hosted OIDC plugin against a local Keycloak. Fix: a ProviderError from one provider is logged and the loop continues to the next. A 503 is returned only if NO provider verified the token AND at least one was unreachable — distinguishing a transient IDP outage (don't force a needless re-login) from a token that's genuinely invalid (fall through to refresh/relogin). Single-provider behaviour is unchanged. Tests: adds an _UnreachableProvider stub and three cases — unreachable provider first must not block a working second; all-unreachable still 503s; reachable-but-unrecognised falls through to 401/relogin (not 503). Mutation-tested: reverting the fix makes the first case fail with the exact 503 bug.	2026-06-04 03:23:45 -07:00
Ben	cae6b5486f	feat(dashboard): always enable embedded chat; remove dashboard --tui flag The dashboard's embedded Chat surface (/chat, /api/ws, /api/pty) was gated behind `hermes dashboard --tui` / HERMES_DASHBOARD_TUI=1. The desktop app and the dashboard's own Chat tab both drive the agent over the /api/ws + /api/pty WebSockets, so a dashboard started without the flag would pass the /api/status health check but slam the chat WebSocket shut with WS code 4403 — the app connects, reports "ready", and chat stays dead. This was the root cause behind multiple user reports of the desktop app failing to connect to a self-hosted gateway/dashboard, and it bit Docker and host installs alike. Make the embedded chat unconditional: - web_server.py: _DASHBOARD_EMBEDDED_CHAT_ENABLED defaults to True; drop the embedded_chat parameter and the runtime reassignment from start_server(). The WS gates still read the constant (now always true) so the seam — and its "rejects when disabled" contract test — stays meaningful. - main.py: remove the `--tui` argument from the dashboard subparser and the `embedded_chat = args.tui or HERMES_DASHBOARD_TUI==1` derivation. - web/: isDashboardEmbeddedChatEnabled() returns true unconditionally; drop the deprecated __HERMES_DASHBOARD_TUI__ alias and the dead LEGACY_TUI_RE scrape in the vite dev-token plugin. - apps/desktop/electron/main.cjs: drop `--tui` from the spawned dashboardArgs (it would now error with "unrecognized arguments: --tui") and the redundant HERMES_DASHBOARD_TUI env injection. - Docker: no s6 run-script change needed — the script never passed --tui; the HERMES_DASHBOARD_TUI env var is now simply a no-op, so the image works out of the box with no extra var. - Docs: remove every dashboard --tui / HERMES_DASHBOARD_TUI reference across the CLI reference, env-var reference, docker/desktop/web-dashboard guides, in-app tips, and the zh-Hans translations. The terminal `hermes --tui` / HERMES_TUI references are intentionally left untouched. Tests: 270 passing across web_server, dashboard lifecycle, host-header, auth-gate, and docker-override-scripts suites.	2026-06-04 03:03:35 -07:00
alt-glitch	aeec88c77f	fix(installer): symlink bundled node/npm into command bin dir for FHS root installs Root installs on Linux (FHS layout, #15608) put the `hermes` command in `/usr/local/bin` (on PATH) but symlinked the bundled node/npm/npx into `~/.local/bin`, which isn't on PATH for a stock root shell. `node`/`npm` were 'command not found' and `hermes dashboard` failed with 'npm is not available' because its build-on-demand fallback couldn't find npm. Fix: `install_node()` now symlinks into `get_command_link_dir()` — the same helper the `hermes` command link already uses — so node/npm/npx land wherever the command does (`/usr/local/bin` on FHS root, `~/.local/bin` otherwise, `$PREFIX/bin` on Termux). Non-root and Termux installs are unchanged. Also fixes: - `scripts/lib/node-bootstrap.sh`: adds `_nb_get_link_dir()` mirroring the same root/Termux/user logic for the standalone bootstrap path (used by `hermes update`, TUI node bootstrap, etc.) - `hermes_cli/uninstall.py`: `remove_node_symlinks()` now checks all candidate directories (`~/.local/bin`, `/usr/local/bin`, `$PREFIX/bin`) so root FHS uninstalls don't leave orphan symlinks Regression from #15608, which created the FHS path for the command but left `install_node` pointed at the legacy user-local dir.	2026-06-04 02:31:49 -07:00
Teknium	4ed63170e4	fix(update): don't fail desktop rebuild / skills sync on mid-rebuild venv (#38885) When 'hermes update' rebuilds the project venv (rmtree + uv venv on the first managed-uv migration), the desktop-rebuild and profile-skills-sync steps that follow both spawn sys.executable. Firing while the venv is mid-rewrite makes the child interpreter abort with the bare stderr line 'No pyvenv.cfg file', surfacing as a spurious 'Desktop build failed' / 'default: sync failed' on an update that actually succeeded. Add _wait_for_interpreter_venv_ready(): resolve the venv hosting sys.executable and poll briefly for pyvenv.cfg to (re)appear before each of those subprocess steps. No-op when the interpreter isn't venv-hosted. The desktop rebuild also retries once after re-waiting, and keeps streaming its output live (no capture). Best-effort throughout — callers proceed regardless, so a genuinely broken venv still surfaces the real error.	2026-06-04 02:20:11 -07:00
Ben	acb0e2bacb	feat(dashboard-auth): add BasicAuthProvider username/password plugin A bundled, zero-infrastructure 'just put a password on my dashboard' provider that uses the supports_password extension point. No external IDP, no database: sessions are stateless HMAC-signed tokens the provider mints and verifies itself, and passwords are hashed with stdlib scrypt (no third-party dependency — deliberately avoids bcrypt to keep the dep surface unchanged). - plugins/dashboard_auth/basic: BasicAuthProvider (scrypt verify with a constant-time dummy-hash path for unknown users so the endpoint is not a username-timing oracle; access/refresh tokens carry a 'kind' claim that verify/refresh enforce; cross-secret tokens are rejected). The register() entry point mirrors the Nous plugin's config/env precedence (env wins; empty treated as unset) and LAST_SKIP_REASON channel. - config.py: document the canonical dashboard.basic_auth.* surface (username / password_hash / password / secret / session_ttl_seconds). Activates only when username + (password or password_hash) are set, so OAuth users and loopback/--insecure operators are unaffected. Without an explicit secret a random per-process key is generated (logged): fine for a single process, but sessions then don't survive restart or span workers.	2026-06-04 01:02:25 -07:00
Ben	ed9e8ba097	feat(dashboard-auth): add pluggable password (non-redirect) login The dashboard auth gate was OAuth-only: a DashboardAuthProvider could authenticate only via a redirect to an IDP (start_login -> /auth/callback -> complete_login). There was no first-class path for username/password auth, so self-hosters who just want a password on their dashboard had no clean option short of an external OAuth IDP. Extend the provider framework with a parallel, non-redirect front door that converges on the same Session + cookie + refresh machinery: - base.py: add the optional supports_password flag and complete_password_login(username, password) -> Session (default raises NotImplementedError so an OAuth-only provider that forgets the flag fails loudly). Add InvalidCredentialsError. OAuth providers are unaffected (flag defaults False; the method is never called). - routes.py: add POST /auth/password-login, mirroring the cookie-minting tail of /auth/callback but skipping PKCE/state/code. Returns JSON {ok, next} (the form POSTs via fetch). Generic 401 for both unknown user and wrong password (no enumeration oracle); 404 hides whether a provider exists or supports passwords; per-IP sliding-window rate limit (10/min -> 429). /api/auth/providers now reports supports_password so the login page can branch. - middleware.py: allowlist /auth/password-login (a bootstrap route). verify/refresh/revoke/ws-tickets/logout need zero changes — a password session is just a Session with provider-minted opaque tokens. - login_page.py: render a credential form (instead of a redirect button) for supports_password providers, wired by a small inline script that POSTs to /auth/password-login and navigates on success. OAuth-only pages stay script-free.	2026-06-04 01:02:25 -07:00
Teknium	6717914e0a	fix(dashboard): explain WHY a chat WS connection was refused (#38743) * Port from google-gemini/gemini-cli#21541: back up corrupted config.yaml When config.yaml fails to parse, load_config() silently falls back to DEFAULT_CONFIG and leaves the broken file on disk. If the user then re-runs the setup wizard or hermes config set (both rewrite config.yaml), their broken-but-recoverable overrides are lost for good. Adapts the policy-file recovery from gemini-cli#21541: on the first parse warning for a given broken file, snapshot it to config.yaml.corrupt.<ts>.bak (best-effort, symlink-guarded, size-deduped) and tell the user where it landed. Unlike Gemini's version we deliberately do NOT reset config.yaml to a clean state — hermes never silently mutates user config, and leaving it means a hand-fixed file is re-read on the next load. Tests: 3 new cases (backup created + content preserved + original untouched; same-size backup dedup; symlink not copied). E2E verified with isolated HERMES_HOME and a real tab-indented broken config. * fix(dashboard): explain WHY a chat WS connection was refused The embedded-chat PTY WebSocket (/api/pty) collapsed every rejection into a bare close code: 4401 for any auth failure, 4403 for three unrelated failures (host mismatch, origin mismatch, peer-IP). Neither the server log nor the browser said which gate fired or why, so a "chat won't connect" report was undiagnosable without a repro. Server (web_server.py): - _ws_auth_reason / _ws_host_origin_reason / _ws_client_reason return a short machine-parseable reason; old bool wrappers kept for callers/tests. - pty_ws splits the overloaded 4403 into 4401 (auth), 4403 (host/origin), 4408 (peer not allowed), 4404 (chat disabled), and sends the reason on the close frame (clamped to the 123-byte RFC6455 limit). - Each path logs one line: 'pty auth rejected reason=.. mode=.. cred=.. peer=..' / 'pty refused: <reason> ..'. Accepted path logs 'pty accepted peer=.. mode=.. cred=..' so an audit shows HOW a peer authed, not just that it did. tui_gateway/ws.py: - 'ws send/write failed' now logs error_type=<ExcName> so an exception whose str() is empty (closed-transport sends) no longer logs 'error='. web/src/pages/ChatPage.tsx: - console.warn the real close code + server reason on every close. - Map 4404/4408 to specific banners; 4401/4403 banners echo the server reason; [session ended] prints the close code. E2E verified all five reject paths + accepted path produce matching close code, wire reason, and server log line.	2026-06-04 00:36:03 -07:00
Ben	c2ca3f01ab	fix(dashboard): honor --portal-url / HERMES_DASHBOARD_PORTAL_URL override in register The register command resolved the portal base URL purely from the stored login, ignoring any override. That meant `HERMES_DASHBOARD_PORTAL_URL` (and the absence of any flag) gave no way to point registration at a staging or preview portal — the request always hit the login's portal, returning 404 against a branch that wasn't deployed there. - _resolve_portal_base_url now takes an optional override (precedence: override > stored login portal > prod default). - New --portal-url flag; falls back to HERMES_DASHBOARD_PORTAL_URL env. - Documents that the access token must be valid at the overridden portal (it's minted by whoever you logged into). - 3 new tests for override precedence. Verified live against the PR #324 Vercel preview: CLI -> preview endpoint -> real agent:{id} client_id written to .env.	2026-06-04 00:17:57 -07:00
Ben	bb291b6bbc	feat(dashboard): `hermes dashboard register` for self-hosted OAuth client Adds a CLI command that registers this install as a self-hosted dashboard with the user's Nous Portal account, automating the manual browser flow on /local-dashboards. - New hermes_cli/dashboard_register.py: resolves a fresh Nous access token from auth.json (fast-fails with a `hermes setup` hint when not logged in), POSTs to {portal}/api/oauth/self-hosted-client, and writes HERMES_DASHBOARD_OAUTH_CLIENT_ID into ~/.hermes/.env idempotently. - Docker-style adjective_noun auto-naming; --name and --redirect-uri overrides. - Persists HERMES_DASHBOARD_PORTAL_URL only when non-default and unset (so a Vercel preview / staging portal sticks, prod default stays implicit). - Refuses in managed/hosted installs (the orchestrator stamps the client_id). - Post-register hint explains the OAuth gate only engages on a non-loopback bind. - Nested 'register' subparser leaves bare `hermes dashboard` unchanged. - 9 unit tests (name gen, fast-fails, POST shape, env writes, redirect URI, portal-URL persistence, 401/403 mapping); dashboard lifecycle tests still green. Depends on NousResearch/nous-account-service#324 (the portal endpoint).	2026-06-04 00:17:57 -07:00
Ben Barclay	30c7b787d1	fix(memory): fall back to pip when uv is unavailable (salvage #5954) (#38668) `_install_dependencies` (hermes memory setup) hard-aborted with "uv not found — cannot install dependencies" whenever `uv` was not on PATH, even when a perfectly good `pip` was available. Slim container images and some CI environments don't ship uv, so memory-provider dependency installation dead-ended there for no good reason. Now: use `uv pip install` when uv is present, otherwise fall back to `<python> -m pip install` when pip3/pip is available, and only abort (with the uv install hint) when neither is found. The "Run manually:" hints reflect whichever installer was selected. Salvages #5954 by @MustafaKara7. Their patch added redundant local `import subprocess` / `import sys` (both are already in scope — module-level `sys`, function-top `subprocess`); this salvage drops those and adds a regression test (TestInstallDependenciesRunner) covering all three paths (uv / pip-fallback / abort). Verified adversarially: the pip-fallback test fails against origin/main's unfixed code with the exact dead-end symptom and passes with the fix. Closes #5954. Co-authored-by: MustafaKara7 <186085093+MustafaKara7@users.noreply.github.com>	2026-06-04 14:03:02 +10:00
Teknium	e45dd2b0e7	refactor(web): unify main-slot model assignment base_url/context handling (#38593) Both POST /api/model/set and the profile-model writer hand-rolled the same provider/default/base_url/context_length reconciliation. Extract it into _apply_main_model_assignment so the custom-vs-hosted base_url logic lives in one place — removing the future-drift risk where one site learns about custom base_url persistence and the other forgets. Behavior unchanged; pinned with a direct helper unit test.	2026-06-03 20:25:33 -07:00
Dusk1e	2059707fce	fix(gateway-windows): anchor detached/startup cwd at HERMES_HOME	2026-06-03 19:37:29 -07:00
Teknium	e3313c50a7	feat(dashboard): add Debug Share to the System page (#38600) * Port from google-gemini/gemini-cli#21541: back up corrupted config.yaml When config.yaml fails to parse, load_config() silently falls back to DEFAULT_CONFIG and leaves the broken file on disk. If the user then re-runs the setup wizard or hermes config set (both rewrite config.yaml), their broken-but-recoverable overrides are lost for good. Adapts the policy-file recovery from gemini-cli#21541: on the first parse warning for a given broken file, snapshot it to config.yaml.corrupt.<ts>.bak (best-effort, symlink-guarded, size-deduped) and tell the user where it landed. Unlike Gemini's version we deliberately do NOT reset config.yaml to a clean state — hermes never silently mutates user config, and leaving it means a hand-fixed file is re-read on the next load. Tests: 3 new cases (backup created + content preserved + original untouched; same-size backup dedup; symlink not copied). E2E verified with isolated HERMES_HOME and a real tab-indented broken config. * feat(dashboard): add Debug Share to the System page Surface `hermes debug share` in the dashboard. The System > Operations section gets a dedicated card that uploads a redacted report + full logs and returns the paste URLs as real, copyable links instead of a log tail. - debug.py: factor a pure build_debug_share() returning structured {urls, failures, redacted, auto_delete_seconds}; run_debug_share now calls it (CLI output unchanged). - web_server.py: POST /api/ops/debug-share runs the share core in a worker thread and returns the structured payload synchronously (the URLs are the whole point — not a backgrounded action). - api.ts: runDebugShare() + DebugShareResponse. - SystemPage.tsx: share card with a redaction toggle (on by default), per-link + copy-all buttons, and the 6h auto-delete countdown. - tests: build_debug_share core + endpoint (redact toggle, failure 502, token gate).	2026-06-03 19:37:04 -07:00
Ben Barclay	04d620d91f	fix(docker): run config migrations during container boot (salvage #35508) (#36627) Salvage of #35508 (@dchenk), rebased onto current main. Resolved the tests/tools/test_stage2_hook_puid_pgid.py conflict (kept both the envdir-creation regression test on main and the new config-migration tests). Docker image upgrades replace code under $INSTALL_DIR but preserve $HERMES_HOME on the mounted volume, so the persisted config.yaml never received the schema migrations that non-Docker `hermes update` runs (#35406). This adds scripts/docker_config_migrate.py, invoked from stage2-hook after first-boot seeding and before gateway services start: it backs up config.yaml + .env, runs migrate_config(interactive=False), and honors HERMES_SKIP_CONFIG_MIGRATION=1 for manual control. Also fixes a latent bug in check_config_version(): it called load_config() which deep-merges DEFAULT_CONFIG, so a legacy config with no raw _config_version falsely reported as already-current. It now reads the raw on-disk file so legacy configs are correctly detected for migration. Differs from #35508 as submitted (Option B cleanup): dropped the `_config_version` line added to cli-config.yaml.example and removed the accompanying test_cli_config_example_declares_latest_version change-detector test. The example is a copy-template and has no business asserting a schema version; check_config_version() reads the user's real config.yaml, not the example. This removes a second sync point that drifts on every version bump. Closes #35508. Fixes #35406. Co-authored-by: Dmitriy Cherchenko <17372886+dchenk@users.noreply.github.com>	2026-06-04 11:11:27 +10:00
Teknium	b0a52d74ac	fix(mcp): resolve ${ENV} in discovery probe so header auth works (#38571) `hermes mcp add --auth header` built `Authorization: Bearer ${MCP_X_API_KEY}` and passed it straight to the discovery probe without interpolation, so the probe sent the literal placeholder and auth-requiring servers (e.g. n8n) returned 401. Runtime tool loading worked because `_load_mcp_config()` interpolates, but the four CLI probe call sites (add/test/login/configure) all used unresolved config. Resolve `${ENV}` inside `_probe_single_server` via a new `_resolve_mcp_server_config()` (load_hermes_dotenv + _interpolate_env_vars), mirroring runtime loading. This covers all four call sites, not just add. Also strip a leading `Bearer ` from pasted tokens before saving to `MCP_*_API_KEY`, so a token pasted with the prefix doesn't produce `Bearer Bearer <jwt>` (also a 401). Reported with a precise root-cause analysis in #37792. Co-authored-by: ThyFriendlyFox <116314616+ThyFriendlyFox@users.noreply.github.com>	2026-06-03 17:49:39 -07:00
xxxigm	ca06715721	feat(web): wire local/custom endpoints into model assignment The runtime resolver reads model.base_url from config and ignores the OPENAI_BASE_URL env var, so a self-hosted endpoint could not be configured from the GUI. Two changes enable it: - POST /api/model/set accepts an optional base_url and persists it as model.base_url when provider=custom (still clearing stale base_url for hosted providers). - POST /api/providers/validate now returns the model ids a custom endpoint advertises at /v1/models, so the GUI can auto-pick a default without asking the user to type a model name. Refs desktop onboarding "Local / custom endpoint" bug.	2026-06-03 17:48:55 -07:00
Teknium	d50741af90	fix(onboarding): clarify Anthropic API vs OAuth provider entries and reorder (#38577) The setup-flow provider list showed two Anthropic/Claude entries with ambiguous labels ('Anthropic (Claude API)' and 'Claude Code (subscription)') in no deliberate order. Relabel and reorder so the distinction and the subscription caveat are explicit: - 'Anthropic API Key' (PKCE, API path) - 'Anthropic OAuth: Required Extra Usage Credits to Use Subscription' (external) - Both Anthropic entries moved to the bottom of the list. - 'OpenAI Codex (ChatGPT)' -> 'OpenAI OAuth (ChatGPT)', now first after Nous. Applied consistently to the backend OAuth catalog (web_server.py) and the desktop onboarding overlay's PROVIDER_DISPLAY title/order map; test assertions updated to the new titles.	2026-06-03 17:46:04 -07:00
Teknium	2f523a4691	fix(tui): cgroup-aware V8 heap cap so memory-limited containers stop dying silently (#38541) The TUI hardcoded --max-old-space-size=8192. V8 is not cgroup-aware, so in a Docker/k8s container capped below ~9-10GB the heap grows past the container limit and the cgroup OOM-killer SIGKILLs the Node parent BEFORE V8's own heap monitor fires. SIGKILL runs no JS handler, writes no [tui-parent] breadcrumb, and closes the gateway child's stdin — the user sees only a bare gateway 'stdin EOF'. Complements #38224 (trail-text cap), which reduced pressure but left the 8GB-vs-container mismatch in place. - _read_cgroup_memory_limit(): read cgroup v2 (memory.max) then v1 (memory.limit_in_bytes); handle 'max', the v1 unlimited sentinel, blank/zero, and >=1PB as unconstrained. - _resolve_tui_heap_mb(): unconstrained -> 8192; constrained -> 75% of the cgroup limit (headroom for non-heap RSS + the Python child sharing the cgroup), floored at 1536MB, never above 8192. - NODE_OPTIONS block uses the sized value; still respects a user-supplied --max-old-space-size. Net: V8 now GCs/exits gracefully (onCritical breadcrumb fires) instead of being reaped silently. Display/transport only — no agent context or behavior change. Tests: tests/hermes_cli/test_tui_heap_sizing.py (20 tests).	2026-06-03 16:40:28 -07:00
Teknium	8a19884bf3	fix(update): stop stash/restore from clobbering desktop source on managed clones (#38542) The stash/restore cycle in the update path was observed to clobber freshly-pulled source files (apps/desktop/ deletion -> Vite '[UNRESOLVED_ENTRY] Cannot resolve entry module index.html'). On a managed clone the user never edits the source tree, so any 'dirty' state is pure git artifact (CRLF renormalization, npm lockfile churn, files left behind when a directory was deleted upstream such as apps/bootstrap-installer/). Stashing that and re-applying it after a pull is fragile and unnecessary. - hermes update (hermes_cli/main.py): on a non-fork (managed) clone, discard working-tree dirt via reset --hard HEAD + clean -fd instead of stash/apply. Forks keep the stash machinery so intentional edits survive. Also pin core.autocrlf=false on Windows so the dirt is never created (mirrors install.ps1 #38239). - install.sh: replace the update-path stash/restore dance with a hard reset to origin/<branch>; the installer is a managed-only entry point. - install.sh + install.ps1 desktop stage: prefer 'npm ci' (wipes and reinstalls node_modules from the lockfile) over bare 'npm install', which can report 'up to date' against a stale marker while node_modules is empty -- leaving tsc unresolved so 'npm run pack' fails. Tests: managed clone cleans instead of stashing; fork still stashes; existing stash tests force the stash path explicitly.	2026-06-03 16:40:13 -07:00
kshitijk4poor	26a57467a8	fix(cli): harden `hermes portal` SystemExit handling + finish model-pick doc sweep Self-review of #38465 surfaced three real items: 1. SystemExit escape (defense): `_login_nous` raises SystemExit(130)/(1) on cancel/failure. The logged-out login path inside `_model_flow_nous` catches it, but the expired-session re-login path (main.py) only catches Exception, so a Ctrl-C during re-auth could propagate past `_run_portal_one_shot` and kill the CLI. Add SystemExit to the portal handler so all cancel/abort cases end with the graceful 'Setup cancelled / retry later' message. 2. Doc sweep: the model-pick step was only added to the bare-`hermes portal` prose. Propagate it to the surfaces describing `hermes setup --portal` behavior that still omitted model selection: - `--portal` argparse help (main.py) - nous-portal.md intro + the numbered 'what it does' step list (EN + zh-Hans) - run-hermes-with-nous-portal.md 'default model after setup --portal' line, which was now contradictory (there's a picker, not a forced default) (EN + zh) 3. Test coverage: add parametrized regression test asserting the portal handler swallows KeyboardInterrupt / EOFError / SystemExit (returns None, no escape). Note on 'Skip (keep current)': delegating to _model_flow_nous means picking Skip preserves the prior provider instead of force-switching to nous — this is intentional and matches quick setup exactly; docs now say 'sets Nous as your provider (when you pick a model)' rather than unconditionally.	2026-06-04 02:33:33 +05:30
kshitijk4poor	cd188b814e	feat(cli): make `hermes portal` run the full quick-setup Nous flow (model picker) `hermes portal` / `hermes setup --portal` previously logged in and set provider=nous but left the model UNSELECTED (blank -> runtime default) and never showed a picker — unlike the first-time quick setup, which runs the model picker. Route `_run_portal_one_shot` through `_model_flow_nous` — the exact same routine quick setup (`_run_first_time_quick_setup`) and `hermes model` -> Nous use. It handles both the logged-out path (device-code OAuth, which picks a model internally) and the logged-in path (curated Nous model picker), then offers the Tool Gateway opt-in and sets provider=nous. Net effect: `hermes portal` now offers a model picker every time and is a true single-command collapse of quick setup's Nous step. Removes the hand-rolled auth_add_command + manual provider write + separate Tool Gateway prompt (now a single source of truth). Re-syncs the in-memory config from disk afterward so a caller's later save_config can't clobber the model/provider written by the login flow. Docs (CLI help, portal_cli docstrings, nous-portal EN + zh-Hans) updated to mention model selection. New regression test asserts `_run_portal_one_shot` delegates to `_model_flow_nous`. Verified live: `hermes portal` now shows the 27-model curated picker, 'Skip (keep current)' preserves prior provider/model.	2026-06-04 02:20:31 +05:30
kshitijk4poor	9ba7e5b1b4	fix(setup): point Portal login-failure retry hints at `hermes portal` The two retry hints inside _run_portal_one_shot (shown when the OAuth login fails) still suggested `hermes auth add nous --type oauth`. Since this path backs both `hermes portal` and `hermes setup --portal`, point users at the new human-readable `hermes portal` for consistency.	2026-06-04 01:40:11 +05:30
kshitijk4poor	da4f407e51	feat(cli): make `hermes portal` the human-readable Portal onboarding alias `hermes portal` (no subcommand) now runs the one-shot Nous Portal onboarding — OAuth login, switch provider to Nous, offer Tool Gateway — identical to `hermes setup --portal` and the human-readable alias for `hermes auth add nous --type oauth` (which still works). The prior status default moves to `hermes portal info`; `status` is kept as a hidden back-compat alias. `open`/`tools` subcommands are unchanged. User-facing hints and docs (status.py, conversation_loop 401 guidance, SystemPage, README, website docs + zh-Hans) now point at `hermes portal` / `hermes portal info`. `--manual-paste` references keep the explicit auth command since `hermes portal` does not expose that flag.	2026-06-04 01:19:28 +05:30
Brooklyn Nicholson	1b89715e15	fix(desktop): guard reconnect sockets and keep branch search precise Avoid stale WebSocket events from an old reconnect attempt flipping the gateway state after a newer socket opens. Also limit session-search dedupe to compression edges so branch-specific hits still open the branch instead of collapsing to the parent.	2026-06-03 13:13:21 -05:00
Brooklyn Nicholson	93228d5299	fix(desktop): persist pins, reconnect after sleep, dedupe session search Four related desktop session-management bugs: - Pins lost until refresh: pinned sessions are joined against the paginated in-memory session list, so a pinned chat that aged off the most-recent page got evicted on the next refresh (every message.complete triggers one) and the Pinned section went empty. mergeWorkingSessions -> mergeSessionPage now also preserves pinned rows (matched by live id or lineage root). Pin id checks in the chat header, command center, and delete/archive are normalized to the durable sessionPinId so pins survive auto-compression. - Stuck on "Starting Hermes" after sleep: macOS sleep drops the renderer WebSocket; nothing reconnected on wake so the composer stayed disabled. The gateway boot hook now auto-reconnects with backoff on close/error and on wake signals (powerMonitor resume/unlock-screen IPC, window online, visibilitychange). connect() gains an open timeout so a hung reconnect can't deadlock in 'connecting'. Composer placeholder distinguishes "Reconnecting to Hermes" from a cold start. - Loses chats from itself: the same hard-replace that dropped pins also dropped loaded sessions; mergeSessionPage keeps them. - Multiple copies/branches in search: /api/sessions/search deduped only by raw session_id, so compression segments and branches surfaced as separate hits. It now dedupes by lineage root and returns the live compression tip, matching the session_search tool's behavior.	2026-06-03 12:39:31 -05:00
xxxigm	973decc050	fix(gateway): decode schtasks output with locale encoding on Windows _exec_schtasks ran schtasks.exe with text=True but no encoding/errors, so localized Windows (e.g. Chinese) output in the console code page raised UnicodeDecodeError tracebacks from subprocess' reader threads during `hermes gateway status`. Decode with the locale's preferred encoding and errors="replace" so non-UTF-8 status output is read cleanly. Fixes #38172	2026-06-03 09:29:19 -07:00
Teknium	9666305630	fix(dashboard): clamp PTY resize dimensions for WSL2 winsize garbage (#38200) * fix(dashboard): clamp PTY resize dimensions for WSL2 winsize garbage WSL2 reports columns=131072, rows=1 from a broken winsize probe. The dashboard /chat tab forwards xterm.js dimensions through PtyBridge.resize(), which packs them as unsigned short via struct.pack. 131072 > 65535 raised struct.error — uncaught (only OSError was handled) — breaking the resize path and leaving the TUI laid out for a one-row, absurdly-wide screen, which surfaces as blank/disappearing text. Clamp cols/rows to a sane [1, 2000]x[1, 1000] range before packing. Non-finite/non-integer probes fall back to the minimum so nothing can reach struct.pack and raise. * test(dashboard): de-flake pub/events broadcast test test_pub_broadcasts_to_events_subscribers round-tripped a frame through two nested Starlette TestClient WebSocket portals within a 10s wall-clock budget. Under heavy parallel CI load a starved ASGI thread occasionally blew that budget even though the server logic is correct, producing intermittent 'broadcast not received within 10s' failures. Drive _broadcast_event directly under asyncio with fake subscribers instead. Same fan-out contract (verbatim delivery to every subscriber on the channel, nothing to other channels), zero scheduling surface. Runs in ~0.3s, deterministic across 10 consecutive runs.	2026-06-03 09:00:16 -07:00
Austin Pickett	7fb8a6b5c5	feat(dashboard): enrich profiles dashboard and de-dupe channel env vars (#37872) * feat(desktop): enrich profiles dashboard and de-dupe channel env vars Add active-profile switching, role descriptions (manual + auto-generate via the auxiliary LLM), per-profile model selection, and gateway-running / distribution badges to the GUI Profiles page. New profile creation gains clone-all, optional description and model assignment. Hide messaging-platform credentials (channel_managed) from the Keys/Env page since the Channels page is the canonical surface for them, and relabel the trimmed "messaging" category as "Gateway". Co-authored-by: Cursor <cursoragent@cursor.com> * fix(desktop): address review feedback on profiles/env changes - ProfilesPage: scope the action-menu outside-click handler to the menu's own container via a ref so opening one card's menu no longer leaves others open. - EnvPage: route the "Gateway" label and hint through i18n (t.common.gateway / gatewayHint) instead of hard-coded English, with an English fallback for untranslated locales. - web_server: only report description_auto=true when auto-generation actually succeeded. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(desktop): address second-round review on profiles - ProfilesPage: treat describe-auto success by null-checking the description and trust the response's description_auto flag instead of assuming true; disable the model-editor Save button unless the selected choice resolves to a real /api/model/options entry (avoids silent no-op saves). - tests: cover the new profile endpoints (active get/set + 404, description round-trip + 404, model round-trip + 400 validation, and describe-auto success/failure contracts).
Co-authored-by: Cursor <cursoragent@cursor.com> * fix(desktop): more profiles review fixes (toggles, races, tests) - ProfilesPage: use the canonical `active` returned by setActiveProfile; make the SOUL/description/model action-menu items toggle their editor closed when already open; guard description save/auto-describe against stale responses via an activeDescRequest ref so a late reply can't clobber a different open editor. - tests: assert /api/env channel_managed classification matches _channel_managed_env_keys(). Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-03 10:37:36 -04:00
Teknium	6ee046a72f	fix(doctor): detect + repair stale HERMES_MAX_ITERATIONS .env ghost shadowing config.yaml (#38222) * fix(doctor): detect + repair stale HERMES_MAX_ITERATIONS .env ghost shadowing config.yaml hermes doctor now flags when ~/.hermes/.env carries a HERMES_MAX_ITERATIONS value that disagrees with agent.max_turns in config.yaml, and 'hermes doctor --fix' removes the stale .env line so config.yaml is authoritative. 'hermes config show' surfaces the same drift inline under Max turns. The setup wizard stopped dual-writing this value, but users who edited only config.yaml from a pre-fix install keep a .env ghost. The gateway bridge normally overrides it at startup, but if the bridge bails on any earlier config-parse error the ghost silently wins — config says 400 while the gateway activity line reads N/90. The detector reads the .env FILE directly (load_env), not get_env_value/os.environ, since the startup bridge may already have overwritten os.environ with the config value. Closes #17534. * fix(config): stop offering HERMES_MAX_ITERATIONS as an editable env var Removes HERMES_MAX_ITERATIONS from OPTIONAL_ENV_VARS so the dashboard env editor (PUT /api/env) and any env-var prompt no longer let a user write it to .env — which would recreate the stale ghost that shadows config.yaml's agent.max_turns (issue #17534). The iteration budget is configured only via config.yaml; the env var stays a read-only backward-compat fallback in the gateway/CLI, never a promoted write target. Regression test asserts it is absent from OPTIONAL_ENV_VARS.	2026-06-03 06:38:40 -07:00
Bryan Bednarski	0d9b7132ff	feat(observability): observer-grade telemetry hooks + NeMo-Relay plugin Adds backend-neutral observer hooks for plugins: session, turn, API request, tool, approval, and subagent lifecycle events with stable correlation IDs (session_id, task_id, turn_id, api_request_id, tool_call_id, parent/child subagent ids). Extends VALID_HOOKS with api_request_error and subagent_start. Hot path is zero-cost when no plugin subscribes: has_hook()/presence checks gate all payload construction, request payloads are returned by reference when no middleware rewrites, and the sanitized response payload no longer embeds raw response objects. Bundles the optional NeMo-Relay observability plugin (plugins/observability/nemo_relay) as an in-repo consumer of the new hooks, peer to the existing langfuse plugin. Fails open when the optional nemo-relay package is not installed. Authored-by: Bryan Bednarski <bbednarski@nvidia.com> Salvaged from #29722 onto current main.	2026-06-03 06:36:46 -07:00
Teknium	4c544b633d	fix(kanban): don't permanently block tasks that hit a provider rate limit (#38223) A kanban worker that exhausted its retries purely on a provider rate limit / quota wall (e.g. opencode-go's 5-hour window) exited with code 1. The dispatcher counted that as a crash, and with DEFAULT_FAILURE_LIMIT=2 two quota-wall hits permanently blocked the card. Fanning out many workers against one shared quota made this routine. Now a rate-limited worker exits with EX_TEMPFAIL (75); the dispatcher classifies that as a 'rate_limited' exit, releases the task back to 'ready' WITHOUT incrementing consecutive_failures (the breaker can't trip on a transient throttle), and the respawn guard defers the next attempt on a cooldown (default 5min, HERMES_KANBAN_RATE_LIMIT_COOLDOWN_SECONDS) until the quota window clears. Genuine crashes still count and trip the breaker as before. The 120s Retry-After cap is unchanged — no worker parks for hours holding a slot. - conversation_loop.py: surface failure_reason in the exhaustion return - cli.py: kanban worker picks exit 75 on rate_limit/billing failure - kanban_db.py: rate_limited exit kind, no-count requeue, cooldown guard	2026-06-03 06:19:32 -07:00
Teknium	c5d199eada	feat(dashboard): check-before-update flow on the System page (#38205) The dashboard's update button ran 'hermes update' immediately with no preview. Now the System page shows whether an update is available and asks the user to confirm before applying it. - New GET /api/hermes/update/check: reports install method, current version, and commits-behind (via banner.check_for_updates, 6h-cached; ?force=1 busts the cache). Soft-fails to behind=null on network error; marks docker/nix/homebrew as can_apply=false with the out-of-band cmd. - System page: update-status badge on the Hermes version row (latest / N behind), a Check-for-updates button, and an Update-now button that opens a ConfirmDialog showing the commit count before POST /api/hermes/update fires. Cached status loads with the rest of the page. - Docs + 5 endpoint tests (git/up-to-date/docker/soft-failure + auth gate).	2026-06-03 05:57:15 -07:00
Teknium	1b302a0474	feat(debug): include desktop.log in hermes debug share / /debug / hermes logs (#38203) The Electron desktop app writes boot failures, backend spawn output, and Python tracebacks to HERMES_HOME/logs/desktop.log, but debug-share only captured agent/errors/gateway — so desktop boot issues never made it into shared debug reports. - logs.py: register desktop -> desktop.log (enables 'hermes logs desktop') - debug.py: capture desktop snapshot, add to summary report, upload full desktop.log in 'share', update privacy notice - gateway /debug inherits the desktop tail via collect_debug_report() - main.py + docs: help text and log-name table (also adds missing gui row) - tests: desktop seed in fixture, new report test, three_pastes -> four_pastes	2026-06-03 05:41:35 -07:00
Teknium	1d90b23982	fix(mcp): banner shows 'disabled' not 'failed' for enabled:false servers (#38204) get_mcp_status() treated every non-connected server as a failure, so a server configured with enabled: false rendered as red '— failed' in the startup banner even though it was intentionally off. Add a 'disabled' field derived from the enabled flag and render disabled servers dim as '— disabled' instead.	2026-06-03 05:41:13 -07:00
liuhao1024	192020992d	fix(cli): exclude desktop-managed backend from stale-dashboard kill Fixes #37532	2026-06-03 04:59:49 -07:00
kshitijk4poor	e114b31eda	test(dashboard): direct unit coverage for internal WS credential + docstring fix Follow-up to Ben's PR #37892. Adds a TestInternalCredential block to test_dashboard_auth_ws_tickets.py exercising the mint-once stability, multi-use, unminted-rejection, empty-value, wrong-value, reset-and-remint, and ticket-store-independence branches directly (previously only covered indirectly via _ws_auth_ok, which left the unminted and empty-value branches unexercised). Also corrects the consume_internal_credential docstring: the returned identity dict is discarded by the current _ws_auth_ok caller (which only needs the boolean outcome), so the prior 'carry it into its session log' wording over-promised.	2026-06-02 23:43:27 -07:00
Ben	fd1ec8033d	fix(dashboard): authenticate server-spawned PTY child WS with a process-internal credential The embedded-TUI PTY child attaches to two server-internal WebSockets: /api/ws (its primary JSON-RPC gateway backend) and /api/pub (the event sidecar). Both URLs are built server-side in web_server.py and handed to the child via its environment. In OAuth-gated mode (auth_required=true, every hosted Fly agent), _ws_auth_ok unconditionally rejects the legacy ?token=<_SESSION_TOKEN> path — a leaked session token must not grant WS access once the gate is engaged. But _build_gateway_ws_url() still only emitted ?token=, with no gated-mode branch (its sibling _build_sidecar_url had been given a ticket branch; the gateway-url builder was missed). So the TUI child's /api/ws upgrade was rejected 4401 -> 'gateway websocket connection failed' -> 'gateway startup timeout', leaving the embedded chat unusable on every gated deployment. A single-use 30s browser ticket is the wrong shape for this link: the child reads its attach URL once at startup and reuses it on every reconnect, and on a slow cold boot it may not dial within the TTL. (_build_sidecar_url's own docstring already flagged this fragility.) Fix: add a process-lifetime, multi-use internal credential to dashboard_auth.ws_tickets (internal_ws_credential / consume_internal_credential), minted once per process and NEVER injected into the SPA — it only leaves the process via a spawned child's env, so browser-side XSS can't read it, and a leak grants no more than a ticket already does. _ws_auth_ok accepts it via ?internal= in gated mode only. Both _build_gateway_ws_url and _build_sidecar_url now use it, so the child can reconnect both sockets. Loopback / --insecure behavior is unchanged (still ?token=). Needs review: touches _ws_auth_ok + dashboard_auth (core auth surface).	2026-06-02 23:43:27 -07:00
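
The heap-sizing rule described in commit 2f523a4691 can be sketched as follows. This is a minimal illustration and not the code from the commit: the function names `parse_cgroup_limit` and `resolve_heap_mb` are hypothetical stand-ins for `_read_cgroup_memory_limit()` and `_resolve_tui_heap_mb()`, file reading is left out, and the 1PB cut-off is assumed here to be 2**50 bytes.

```python
from typing import Optional

DEFAULT_HEAP_MB = 8192          # value used when no cgroup limit applies
MIN_HEAP_MB = 1536              # floor named in the commit message
UNLIMITED_BYTES = 1 << 50       # assumption: ">=1PB" interpreted as 2**50 bytes


def parse_cgroup_limit(raw: str) -> Optional[int]:
    """Turn the text of memory.max / memory.limit_in_bytes into bytes.

    Returns None when the value means "unconstrained": the literal 'max',
    blank or zero content, unparsable text, or a huge sentinel value.
    """
    text = raw.strip()
    if not text or text == "max":
        return None
    try:
        value = int(text)
    except ValueError:
        return None
    if value <= 0 or value >= UNLIMITED_BYTES:
        return None
    return value


def resolve_heap_mb(limit_bytes: Optional[int]) -> int:
    """Pick a V8 old-space cap in MB: 75% of the limit, clamped to [1536, 8192]."""
    if limit_bytes is None:
        return DEFAULT_HEAP_MB
    budget_mb = (limit_bytes * 3 // 4) // (1024 * 1024)
    return max(MIN_HEAP_MB, min(DEFAULT_HEAP_MB, budget_mb))
```

Under these assumptions a 4 GiB container would get a 3072 MB cap, a 1 GiB container is raised to the 1536 MB floor, and a 16 GiB container stays at the 8192 MB ceiling.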


2498 Commits
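
The PTY resize clamp described in commit 9666305630 can be illustrated with a short sketch. It is not the project's `PtyBridge.resize()` code: `clamp_dim` and `pack_winsize` are hypothetical names, and rejecting booleans and strings is an assumption about what counts as a "non-integer probe". The field order follows the standard `struct winsize` layout (rows, columns, x pixels, y pixels).

```python
import math
import struct

MAX_COLS = 2000   # upper bounds taken from the commit message
MAX_ROWS = 1000


def clamp_dim(value, upper: int) -> int:
    """Clamp one terminal dimension into [1, upper].

    Anything that is not a finite whole number falls back to the minimum,
    so a bad probe can never reach struct.pack and raise struct.error.
    """
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return 1
    if isinstance(value, float) and (not math.isfinite(value) or not value.is_integer()):
        return 1
    return max(1, min(upper, int(value)))


def pack_winsize(cols, rows) -> bytes:
    """Pack a winsize struct (four unsigned shorts) from clamped dimensions."""
    safe_cols = clamp_dim(cols, MAX_COLS)
    safe_rows = clamp_dim(rows, MAX_ROWS)
    return struct.pack("HHHH", safe_rows, safe_cols, 0, 0)
```

With this sketch the WSL2 probe from the commit message (columns=131072, rows=1) packs as 2000 columns by 1 row instead of raising.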