Commit Graph

4913 Commits

Author SHA1 Message Date
93b5df3189 fix(test): patch async_is_safe_url in web-provider SSRF mocks
web_tools.is_safe_url was replaced by async_is_safe_url, but three
web-provider test files still monkeypatched the old sync name, raising
AttributeError. Patch the async variant with an async lambda.
2026-06-04 18:04:47 -07:00
c60952ba94 fix(web): run URL SSRF checks off the event loop in async paths
Add async_is_safe_url() wrapping is_safe_url via asyncio.to_thread, and route
all async SSRF call sites through it: web_extract_tool, the vision/video
preflight checks, and both download redirect guards. socket.getaddrinfo blocks;
calling it inline from async tool paths froze the event loop for the duration of
DNS resolution.

vision_tools: split _validate_image_url into _image_url_shape_ok (no DNS) +
sync _validate_image_url (for sync callers/tests) + async _validate_image_url_async.

Widened beyond the original PR #3691 to sibling async sites that also blocked
the loop (second redirect guard, video preflight).

Salvage of #3691 by @Kewe63 — surgically re-applied onto current main because
the original branch was too stale to cherry-pick cleanly (would have reverted
the web_crawl_tool refactor).

Co-authored-by: Kewe63 <kewe.3217@gmail.com>
2026-06-04 18:04:47 -07:00
19db9cd076 fix(acp): replace direct db._lock/_conn access with public update_session_meta()
session.py _persist() bypassed SessionDB's thread-safe write path by
accessing private internals db._lock and db._conn directly:

    with db._lock:
        db._conn.execute("UPDATE sessions SET model_config = ? ...")
        db._conn.commit()

This was fragile for three reasons:
1. It bypassed _execute_write()'s BEGIN IMMEDIATE + jitter-retry logic,
   so concurrent writes could hit SQLite BUSY without retrying.
2. It called db._conn.commit() manually, breaking the transactional
   contract that _execute_write() enforces.
3. Any internal rename of _lock or _conn would silently break this
   call site with an AttributeError at runtime.

Fix:
- Add SessionDB.update_session_meta(session_id, model_config_json, model)
  to hermes_state.py. Routes through _execute_write() for the standard
  BEGIN IMMEDIATE + lock + jitter-retry guarantee. Uses COALESCE so
  passing model=None leaves the stored model column unchanged.
- Replace the db._lock / db._conn block in session.py _persist() with
  a single db.update_session_meta() call.

Tests (tests/acp/test_session_db_private_access.py, 11 tests):
- Unit tests for update_session_meta: updates model_config, updates
  model, preserves existing model on None, routes through _execute_write,
  no-op on non-existent session.
- AST checks: db._lock and db._conn not referenced in session.py;
  _persist() calls update_session_meta().
- Integration round-trips: cwd and model persisted correctly; COALESCE
  prevents overwriting an existing model with NULL.
2026-06-04 17:54:59 -07:00
4a4b9bd2dc fix(test): add platform guard for grp import
Tests in test_gateway_service.py imported grp inline without a
platform guard, causing ImportError on systems where grp is
unavailable (e.g. macOS, WSL without grp module).

Added pytest.importorskip('grp') at module level alongside the
existing pwd guard, and removed three redundant inline import grp
statements.

Fixes #24531
2026-06-04 17:52:50 -07:00
74e845c000 fix(slack): pass thread_ts in standalone send_message tool path
The standalone `_send_slack()` function used by the send_message tool
and cron delivery fallback was not passing `thread_ts` to the Slack API,
causing messages to post to the top-level channel instead of inside
threads.

- Add `thread_ts` parameter to `_send_slack()`
- Include `thread_ts` in the chat.postMessage payload when present
- Pass `thread_id` from `_send_to_platform()` to `_send_slack()`

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-06-04 17:42:10 -07:00
c14c37d46b fix(openviking): add missing /agent/{agent}/ segment to memory URI — fixes #36969
_build_memory_uri produced URIs of the form:
  viking://user/{user}/memories/{subdir}/mem_{slug}.md

The /agent/{agent}/ segment was missing, causing every agent under
the same user to write into the same flat namespace. In multi-agent
deployments agents silently overwrite each other's memories and
vector retrieval cross-pollinates results.

self._agent was already populated correctly (from OPENVIKING_AGENT
env var, default 'hermes') and sent via X-OpenViking-Agent header —
it was simply not interpolated into the URI.

Fix: add the missing segment so URIs follow the documented shape:
  viking://user/{user}/agent/{agent}/memories/{subdir}/mem_{slug}.md

Tests: 4 new regression tests in TestOpenVikingMemoryUriBuilder,
13/13 passed (9 existing + 4 new).
2026-06-04 17:40:33 -07:00
8a888441d7 fix(docker): recover from out-of-band container removal in persistent mode (salvage #36631) (#39415)
Salvage of #36631 (@annguyenNous), rebased onto current main with
regression tests added. Fixes #36266.

When a persistent Docker sandbox container is removed out-of-band (idle
reaper, `docker prune`, OOM kill, daemon restart), the gateway kept
issuing `docker exec` against the dead container ID, returning
"No such container" on every subsequent tool call — the agent was
permanently blocked until the gateway process restarted.

DockerEnvironment.execute() now detects the "No such container" /
"is not running" error after a non-zero exit (gated on
persist_across_processes) and calls _recreate_container(): it tries
label-based reuse first, falls back to a fresh container replaying the
same image + full all_run_args set, re-runs init_session(), and retries
the command once. A genuine non-zero exit is NOT misclassified as
container-gone.

Differs from #36631 as submitted: adds the tests the original lacked.
tests/tools/test_docker_environment.py covers _is_container_gone pattern
matching (incl. the negative/control case), the recover-and-retry path,
the persist_across_processes=False opt-out (no recovery), and the
ordinary-failure passthrough (no spurious recreation). _make_dummy_env
now forwards persist_across_processes.

Verified:
- Unit: 67/67 in test_docker_environment.py (4 new + existing).
- Live E2E against the real docker daemon: started a persistent
  container, `docker rm -f`'d it out-of-band, and the next execute()
  transparently recreated a fresh container and succeeded; a follow-up
  command worked in the recovered container; a real `exit N` passed
  through without triggering recovery.

Co-authored-by: annguyenNous <annguyenNous@users.noreply.github.com>
2026-06-05 10:33:44 +10:00
54cae7d1cb switch model order 2026-06-04 17:29:31 -07:00
82c157b267 fix(docker): clean up orphaned container when docker run fails (salvage #7440) (#39412)
When `docker run -d` fails after Docker has already created the container
object (e.g. exit 125 when the daemon isn't ready, or a timeout mid image
pull), the code raised before `self._container_id` was set — so the
container leaked permanently in "Created" state. Reported in #7439:
110+ orphaned containers accumulated over 3 days from hourly cron-
scheduled gateway sessions hitting a Docker Desktop startup race.

The orphan reaper added in #33645 (reap_orphan_containers) does NOT cover
this case: it filters `status=exited`, but a failed-create container is in
`Created` state, so it slips through and is never reaped.

Wrap the `docker run -d` call in try/except and `docker rm -f` the
container by its known name before re-raising.

Salvages #7440 by @Tranquil-Flow. Their branch predated the cross-process
reuse + labels rework on `main`, so a cherry-pick conflicted; reconstructed
the same intent (plus their two regression tests, adapted to mock the new
reuse `docker ps` probe) against current `main`.

Verified adversarially: reverted just the product change to origin/main's
`docker.py`, ran the two new tests -> both FAIL with
`assert 0 == 1 ("docker rm should be called once")`. With the fix applied,
both pass; full test_docker_environment.py is 65/65 green.

Closes #7440. Fixes #7439.

Co-authored-by: Evi Nova <66773372+Tranquil-Flow@users.noreply.github.com>
2026-06-05 10:19:08 +10:00
4690bbc363 fix(local): recognize unqualified hostnames as local endpoints (#9248)
Docker Compose service names (e.g. ollama, litellm, hermes-litellm)
are unqualified hostnames with no dots. These are always local — they
resolve via Docker DNS, /etc/hosts, or mDNS. Without this fix, the
stale stream timeout fires on local LLM proxies, causing infinite
reconnect loops.

Closes #7905
2026-06-05 10:18:10 +10:00
e7a7872a87 fix(tui_gateway): dedup re-queued process notifications flooding TUI
_ notification_poller_loop_ re-emits status.update every cycle
when a background process completes while the session is busy.
The same completion event gets re-queued and re-emitted to the
TUI every few ms, flooding the transcript with duplicate lines.

Add _notification_event_dedup_key(evt) that returns a tuple
identity for each notification event. Only emit status.update
on first sight per identity:
- completions: (sid, type) — one-shot per process session
- watch_match: (sid, type, command, pattern, output, ...)
- watch_overflow/disabled: (sid, type, command, message, ...)

The dedup key design was refined from an initial sid:type approach
after @lordbuffcloud identified that distinct watch_match events
(READY vs DONE) for the same process would be incorrectly collapsed.
Tests from @tymrtn cover distinct watch matches, exact replay
dedup, and completion one-shot behavior.

Co-authored-by: tymrtn <ty@tmrtn.com>
2026-06-04 16:56:34 -07:00
2f0c8e90e6 Add Telegram QR onboarding to dashboard 2026-06-04 16:55:27 -07:00
5300727a08 revert: keep Google Chat OAuth secret + active_provider profile-scoped (#39398)
* Revert "fix(gateway): anchor Google Chat OAuth client secret to default Hermes root"

This reverts commit fff0561441.

* Revert "fix(cli): honor global-root active_provider fallback for named profiles"

This reverts commit 3858cf4307.

* docs(google_chat): describe OAuth client secret as profile-scoped, not host-wide

The setup docs, oauth docstring, and the adapter's 'no credentials'
error message all described the Google Chat OAuth client secret as
host-wide shared infrastructure. That contradicts profile isolation:
profiles are separate auth boundaries, so two profiles can point at
different Google OAuth apps / accounts. Reword all three to say the
secret is profile-scoped and each profile registers its own.
2026-06-04 16:54:40 -07:00
495c3733d8 fix(config): bridge docker_volumes and docker_forward_env in config set (#38611)
Co-authored-by: Ben Barclay <ben@nousresearch.com>
2026-06-05 09:31:01 +10:00
acce1a2452 feat(desktop): polish credentials settings and messaging env routing (#39217)
* feat(desktop): polish credentials settings and messaging env routing

Align Provider API Keys and Tools & Keys with Advanced ListRow inputs,
add Tools & Keys sidebar subnav, move platform env vars to Messaging via
channel_managed discovery, strip toolset emojis, and condense cron actions.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(desktop): align Messaging credential inputs with settings ListRow style

Remove monospace inputs and use CREDENTIAL_CONTROL_CLASS + ListRow layout
to match Provider API Keys and Tools & Keys.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-04 14:01:15 -04:00
a3fb48b2ce fix(state): keep /branch sessions visible after parent reopen
/branch (aka /fork) sessions vanished from /resume and /sessions. Both
surfaces funnel through list_sessions_rich(include_children=False), which
hid any session with a parent_session_id unless identified as a branch via a
heuristic — parent.end_reason == 'branched' AND child.started_at >=
parent.ended_at.

Two ways that heuristic failed:
1. CLI/gateway branches: once the parent was reopened (e.g. resumed) and
   re-ended with a different end_reason (tui_shutdown overwriting 'branched'),
   the heuristic stopped matching and the branch was hidden permanently.
2. TUI branches (tui_gateway session.branch): the TUI never ends the parent
   as 'branched' — it creates the child while the parent is still live — so
   the heuristic NEVER matched and TUI branches were hidden from the moment
   they were created (this is the macOS desktop app's primary symptom).

Fix: persist a stable '_branched_from' marker in the branch session's
model_config at creation time across ALL THREE branch paths (CLI cli.py,
gateway gateway/run.py, and TUI tui_gateway/server.py), and OR a
json_extract(model_config, '$._branched_from') IS NOT NULL check into the
list_sessions_rich filter. The marker is immutable across the parent's
lifecycle, so the branch stays visible regardless of how/whether the parent
is ended. The legacy end_reason heuristic is kept (OR'd) so pre-existing
branches remain visible. Subagent/compression children (no marker, parent
not 'branched') stay correctly hidden. Fixes #20856.

Approach by liuhao1024 (PR #20864); reimplemented on current main, extended
to the TUI branch path (which the original missed), with regression tests for
the reopen+re-end scenario and the TUI marker persistence.
2026-06-04 10:07:20 -07:00
1f347ee543 fix(uv): move venv aside instead of gutting it in place on Windows rebuild
hermes update can brick a Windows install. When 'hermes update --force' runs
past the concurrent-process guard, rebuild_venv runs while the venv is still in
use: shutil.rmtree(ignore_errors=True) deletes site-packages + certifi's cert
bundle but can't remove the locked python.exe, leaving a half-gutted venv that
uv venv then refuses to overwrite. Every later HTTPS call dies with
FileNotFoundError for the missing cacert and there is no recovery.

--clear alone (the c136eb4de retry path) does not fix the real lock case: when
the locked interpreter is *inside* the venv being rebuilt, neither rmtree nor
uv venv --clear can delete it. os.replace of the parent directory *is* allowed
on Windows (a running .exe is tracked by handle, not path), so we move the old
venv aside atomically to <venv>.old, rebuild with --clear in its place, and the
still-running gateway/desktop keep using the moved-aside copy until they
restart. If the venv genuinely can't be moved, we abort cleanly and leave it
fully intact; if the rebuild fails, we restore the moved-aside copy.

Folds in the call-site guards from #38511 (@f3rs3n):
- rebuild_venv() returns False (and restores the backup) if uv exits 0 without
  producing an interpreter.
- both hermes update venv-rebuild call sites abort with RuntimeError instead of
  continuing into dependency install when rebuild_venv() returns False.

Also gitignore /venv.old/ so the update autostash (git stash --include-untracked)
doesn't sweep the moved-aside venv on every run.

Root-cause fix for #37881. Supersedes the --clear-only retry from c136eb4de.

Co-authored-by: f3rs3n <32328813+f3rs3n@users.noreply.github.com>
2026-06-04 12:18:38 -04:00
ee7948ea6e fix(deps): exclude dev tooling from all extra 2026-06-04 08:54:38 -07:00
8077e7d2fb fix(tui): narrow resume lock to avoid blocking session.close
The salvaged fix held _session_resume_lock across _make_agent (MCP discovery
+ AIAgent construction, seconds), serializing it against session.close. Since
session.close runs on the main RPC dispatch thread (not a _LONG_HANDLER), a
close racing a mid-build resume would stall all fast-path RPCs (approval.respond,
session.interrupt).

Restructure to double-checked locking: build the agent outside the lock, then
re-check _find_live_session_by_key under the lock before _init_session. A losing
concurrent resume discards its just-built agent (no worker/poller wired yet) and
reuses the winner. Updated the concurrent-resume regression test to assert the
real invariant (one surviving live session + loser agent closed) rather than the
implementation detail of a single _make_agent call.
2026-06-04 08:18:26 -07:00
bd6d098762 fix(tui): keep resumed live history current 2026-06-04 08:18:26 -07:00
98903d0313 fix(tui): reuse live session on resume 2026-06-04 08:18:26 -07:00
30412a9771 fix(cron): re-validate stale cron-output entries before deletion (#37721)
quick() and dry_run() previously trusted the stored category from
tracked.json without re-validating at delete time. Stale entries from
before #34840 could carry category="cron-output" for cron control-plane
paths (e.g. cron/jobs.json), causing quick() to delete the live
scheduler registry.

Fix:
- Fix guess_category() to only classify cron/output/** as cron-output
  (was classifying ALL cron/* paths, missing the #34840 fix).
- Re-validate cron-output entries via guess_category() at delete time
  in quick() and dry_run(); stale entries that are no longer classified
  as cron-output are skipped and removed from tracked.json.
- Add _is_protected_cron_path() as a hard defense-in-depth guard that
  blocks deletion of cron/cronjobs directories and known control-plane
  files (jobs.json, .tick.lock) regardless of stored category.
- Update test_cron_subtree_categorised to match fixed guess_category
  (only cron/output/* is cron-output, not all of cron/).

Tests: add 5 regression tests in TestStaleCronEntryMigration.
2026-06-04 07:52:04 -07:00
693f4c7e9c fix(gateway): clear zombie agent slot when session_reset races in-flight run
A session_reset (/new, /cc) that bumps the run generation while an agent
turn is in flight left the dead agent in the _running_agents slot: the
in-flight run's own release is generation-guarded and correctly returns
False, and the outer finally's sentinel-only check also missed the
leftover real agent. The session then silently dropped every subsequent
message as 'agent busy' until a full gateway restart. (#28686)

- _process_message_or_command outer finally now calls the unconditional,
  idempotent _release_running_agent_state(key) on all exit paths instead
  of the sentinel-vs-else branch that could strand a dead agent.
- _handle_reset_command evicts the slot right after bumping the
  generation, so the zombie is cleared at reset time regardless of how
  the in-flight run unwinds.

Co-authored-by: CryptoByz <cryptobyz.airdrop@gmail.com>
2026-06-04 07:50:45 -07:00
2982122be7 fix(gateway): deliver $HOME deliverables on root-run gateways
Root-run gateways have $HOME=/root, which is on the MEDIA system-path
denylist, so the gateway silently dropped agent-generated deliverables
under /root (e.g. /root/work/proposal.docx) — the user got a 'here is
your file' reply with nothing attached.

_path_under_denied_prefix now treats the running user's own home as
deliverable: the home tree itself is no longer denied, while the
more-specific denied paths inside it (~/.ssh, ~/.aws, ~/.hermes/.env,
auth.json, config.yaml) stay blocked because they are separate denylist
entries. The exception only matches when the denied prefix IS $HOME, so
a non-root gateway still can't deliver another user's home.

Diagnosis, reproduction, and the failing-case analysis are from
@GodsBoy (#38108 / #38106). Implemented here as the minimal denylist
fix rather than a staging/copy subsystem.

Co-authored-by: GodsBoy <dhuysamen@gmail.com>
2026-06-04 07:50:22 -07:00
580d924097 perf(desktop): make session-id search SQL-bounded, not O(n)
search_sessions_by_id previously fetched up to 10k sessions via
list_sessions_rich and filtered them in Python — O(n) per keystroke.
Push the id match into SQL instead.

- list_sessions_rich gains an optional id_query param: a case-insensitive
  LIKE pushed into the outer WHERE, matched against each surfaced row's id
  AND every id in its forward compression chain (via the existing chain
  CTE). Searching a compression root id or a tip id both resolve to the
  same projected conversation. LIKE wildcards in the needle are escaped.
- search_sessions_by_id now fetches only matching rows (limit*4) and ranks
  exact > prefix > substring in Python over that small set.
- web_server /api/sessions/search: route ID matches and content matches
  through one lineage-keyed dedup helper so an id-hit and a content-hit on
  the same conversation collapse to a single result (the contributor's
  version keyed ID hits by raw sid and content hits by root, which could
  double-list a compression tip).
- command-center haystack also matches _lineage_root_id for parity.

E2E verified against a real DB: exact match over 3000+ sessions
materializes 1 row in Python (was ~3000), 5ms; root-id resolves to tip;
LIKE-wildcard escaping holds.

Follow-up to @0xharryriddle's feat(desktop): search sessions by id.
2026-06-04 07:49:34 -07:00
9ecc331be8 feat(desktop): search sessions by id 2026-06-04 07:49:34 -07:00
081694c111 fix(kanban): isolate board override per concurrent call 2026-06-04 07:39:53 -07:00
fef04a197e fix(desktop): purge electron cache unconditionally, not via stdlib zipfile gate
The salvaged detector validated each cached electron-*.zip with
zipfile.testzip() and only purged ones it judged corrupt. But stdlib
zipfile reads from the end-of-central-directory backward, so it silently
tolerates prepended/concatenated junk — which is exactly the corruption
the bug report names ('86257938 extra bytes at beginning or within
zipfile', a partial download resumed into the same file). testzip()
returns clean on those zips, so the self-heal never fired for the
reported failure mode.

Drop the self-rolled validator: on any packaged-build failure, purge the
version's cached zips AND the half-written unpacked dir, then retry once.
@electron/get re-downloads with its own SHASUM verification — the real
source of truth, which catches prepend/concat/truncate alike. An
unrelated failure just costs one clean re-download and fails the same way.

Verified empirically: zipfile.testzip() returns None (clean) on a
prepended-junk zip; the unconditional purge removes it correctly.
2026-06-04 07:17:33 -07:00
f583c6ebd5 fix(desktop): recover from corrupt cached Electron download on build
hermes desktop failed on Linux with an ENOENT renaming
release/linux-unpacked/electron -> Hermes. Root cause is a corrupt
cached Electron zip (~/.cache/electron/electron-*.zip): app-builder
unpack-electron extracts a partial tree from the bad zip that is
missing the electron binary, so electron-builder dies on the final
rename. Re-running repeats the broken extraction, leaving the desktop
app permanently unlaunchable until the cache is manually purged.

- Add _electron_download_cache_dirs() + _purge_corrupt_electron_cache()
  to hermes_cli/main.py: validate every electron-*.zip via
  zipfile.testzip() and delete corrupt ones; honor electron_config_cache
  / ELECTRON_CACHE overrides with per-OS defaults.
- Wire purge + single retry into cmd_gui packaged-build failure path so
  a poisoned download self-heals (electron re-downloads clean).
- Add beforePack hook (apps/desktop/scripts/before-pack.cjs) to wipe the
  target unpacked dir before staging, making packaging idempotent across
  interrupted runs. Cross-platform, best-effort.
- Tests: corrupt-zip detector, cmd_gui purge/retry/launch path,
  no-retry-when-clean path, and node --test for the cleanup helper.
2026-06-04 07:17:33 -07:00
3858cf4307 fix(cli): honor global-root active_provider fallback for named profiles 2026-06-04 07:08:30 -07:00
b7169f9bbb fix(gateway): keep pending /update completion notifications until the target platform reconnects 2026-06-04 06:56:28 -07:00
fff0561441 fix(gateway): anchor Google Chat OAuth client secret to default Hermes root 2026-06-04 06:45:32 -07:00
07f5382675 fix(gateway): don't treat dm_policy: pairing as open access on own-policy adapters 2026-06-04 06:31:28 -07:00
dd4ba4c2c4 fix(vision): cap pixel dimensions proactively at embed time + declare Pillow
Follow-up to the salvaged #37727. That PR fixed the reactive recovery path
(classifier + post-failure shrinker) but left the PROACTIVE embed-time guard
in vision_tools byte-only — a tall small-byte screenshot (e.g. 1200x12000 at
0.06 MB) still baked into immutable history un-resized, relying on a failed
round-trip to trigger reactive shrink.

- vision_tools: add _image_exceeds_dimension() + _EMBED_MAX_DIMENSION (7900px);
  the embed-time cap now fires on bytes OR pixels and passes max_dimension to
  the resizer, so tall small-byte images are shrunk before they're embedded.
- vision_tools: best-effort lazy-install of Pillow (tool.vision) in the resize
  ImportError fallback so the soft dep self-heals (respects allow_lazy_installs).
- error_classifier: add two more Anthropic dimension-cap wording variants.
- pyproject + lazy_deps: declare Pillow as the [vision] extra / tool.vision
  lazy dep (it was undeclared everywhere; without it ALL resize recovery no-ops).
- tests: cover _image_exceeds_dimension (tall/small/edge/no-Pillow/corrupt).

Co-authored-by: kyssta-exe <kyssta-exe@users.noreply.github.com>
2026-06-04 06:16:45 -07:00
6bdbe30763 fix(vision): guard image pixel dimensions, not just bytes (#37677)
Anthropic enforces two independent ceilings per image:
1. 5 MB encoded byte size
2. 8000 px longest side

Hermes only guarded #1. A tall screenshot (e.g. 1200x12000 at 0.06 MB)
passes every byte check but fails the pixel check, returning a
non-retryable HTTP 400 that permanently bricks the conversation thread.

Fixes:
- error_classifier: add 'image dimensions exceed' pattern to
  _IMAGE_TOO_LARGE_PATTERNS so the 400 is classified as image_too_large
  and triggers the shrink/retry path instead of falling through to
  non-retryable error.
- conversation_compression: check pixel dimensions (via Pillow) even
  when byte size is under the 4 MB target. If max(dims) > 8000, force
  shrink.
- vision_tools._resize_image_for_vision: add optional max_dimension param.
  When set, images exceeding the pixel cap are downscaled even if they're
  under the byte budget. The resize loop now checks both byte AND pixel
  limits before accepting a candidate.

Closes #37677
2026-06-04 06:16:45 -07:00
7314757876 refactor(feishu): slim meeting-invite parser; add AUTHOR_MAP entry
Collapse the payload-shape normalization helpers into one _as_dict and
drop unused dataclass fields (user_type/user_role, duplicate id, bot) on
the meeting-invite handler. Module 274->212 LOC, behavior unchanged.

Add zhaolei.vc@bytedance.com -> zhaoleibd to release.py AUTHOR_MAP.
2026-06-04 06:15:23 -07:00
f3bbfda6d1 feat(gateway): handle Feishu meeting invitations
Change-Id: I8cf5638393dd9adb1d7be5e170ce5082b41f77fa
2026-06-04 06:15:23 -07:00
38d3c49aaf refactor(skills): clean up bundled skill set + add environments: relevance gate (#39028)
* refactor(skills): clean up bundled skill set + add environments: relevance gate

Bundled skills cleanup pass plus a new offer-time relevance gate.

Removals (redundant / dead):
- spotify (covered by the spotify plugin's 7 native tools)
- linear (covered by `hermes mcp install linear`)
- kanban-codex-lane, debugging-hermes-tui-commands
- empty category markers: diagramming, gifs, inference-sh,
  mlops/training, mlops/vector-databases
- domain (stale orphan dup of optional/research/domain-intel)

Bundled -> optional:
- baoyu-article-illustrator, baoyu-comic, creative-ideation, pixel-art
- dspy, subagent-driven-development
- minecraft-modpack-server, pokemon-player
- hermes-s6-container-supervision (-> optional/devops)

Consolidation:
- webhook-subscriptions + native-mcp folded into the hermes-agent skill
  as references/webhooks.md + references/native-mcp.md with SKILL.md pointers
- writing-plans merged into plan (v2.0.0); related_skills + prose refs updated

New: environments: frontmatter gate (agent/skill_utils.skill_matches_environment)
- Offer-time relevance filter (kanban / docker / s6), parallel to platforms:.
- Wired into the 3 OFFER surfaces only (prompt_builder skills index,
  skills_tool.list_skills, skill_commands slash discovery).
- Explicit loads (skill_view, --skills preload) intentionally BYPASS it, so
  load-bearing force-loads like the kanban dispatcher's `--skills kanban-worker`
  always resolve. Verified via E2E.
- kanban-orchestrator/kanban-worker tagged environments: [kanban];
  hermes-s6-container-supervision tagged environments: [s6] + platforms: [linux].

Validation: 8/8 E2E gating assertions (incl force-load invariant);
442 targeted tests green (agent, skills_tool, skill_commands, kanban worker).

* docs: regenerate skill catalogs + pages for the bundled cleanup

Regenerated per-skill doc pages, catalogs, and sidebar to match the skill
moves/removals in the parent commit. Moved skills' pages relocate
bundled -> optional (history preserved); removed skills' pages deleted;
edited skills' pages refreshed (hermes-agent now embeds the webhook +
native-mcp reference pointers). zh-Hans i18n mirror: stale bundled pages
and catalog rows for moved/removed skills pruned (new optional translations
land via the translation pipeline).

* test: drop regression test for removed kanban-codex-lane skill

The kanban-codex-lane skill was removed in the bundled-skills cleanup;
its dedicated regression test read the now-deleted SKILL.md and failed
with FileNotFoundError on CI shard 6.
2026-06-04 06:11:22 -07:00
c136eb4de1 fix(update): harden venv rebuild + verify core deps after install
Two complementary fixes for a silent partial-install failure that bit
``hermes update`` in the wild: a fresh checkout pulled 145 commits,
``rebuild_venv`` failed to recreate the venv on Windows because
``shutil.rmtree(ignore_errors=True)`` couldn't delete files held open by
the running ``hermes.exe`` shim. ``uv venv`` then refused with
"A directory already exists at: venv" and the update fell back to
installing on top of the stale venv. The resulting partial install
missed exactly one newly-added base dep — ``pathspec==1.1.1`` — which
``hermes desktop --build-only`` imports at the top of its content-hash
check. The desktop rebuild died with ModuleNotFoundError and the parent
update only logged "⚠ Desktop build failed (non-fatal)". Same root cause
made the "default: sync failed" line in the skill-sync stage, because
that sync subprocess hit the same missing import.

Fix 1: ``rebuild_venv`` retries with ``--clear``
------------------------------------------------
If ``uv venv`` fails with "already exists" in stderr (which is what uv
prints, and what uv's own hint tells you to fix with --clear), retry
once with ``--clear``. Only this specific failure pattern triggers the
retry — disk-full / interpreter-download failures still surface as
before so we don't mask real problems.

Fix 2: post-install dep verification
------------------------------------
Belt-and-suspenders so future uv resolver quirks (or any other cause of
partial installs) surface immediately instead of hours later in a
downstream subprocess. After ``_install_python_dependencies_with_optional_fallback``
runs, ``_verify_core_dependencies_installed``:

  1. Reads ``[project.dependencies]`` straight from pyproject.toml
     (so we don't trust the venv's stale metadata).
  2. Filters by environment markers via ``packaging.requirements.Requirement``
     so cross-platform exclusions (``ptyprocess ; sys_platform != 'win32'``)
     don't false-positive on Windows.
  3. Runs ``importlib.metadata.version()`` for each remaining dep inside
     the *target* venv interpreter (resolved from ``VIRTUAL_ENV``, not
     ``sys.executable``).
  4. If anything is missing, reinstalls the base group with
     ``--reinstall`` to force re-resolution. If a second probe still
     reports missing deps, force-installs each one with its pinned spec.
  5. Treats final failure as a warning rather than a hard error — a
     single broken-on-PyPI dep shouldn't block an otherwise-successful
     update — but the message points at ``hermes update --force`` and
     names the missing packages so the user knows what's wrong.

Tests
-----
- ``TestRebuildVenv::test_retries_with_clear_when_dir_already_exists`` —
  simulates the rmtree-couldn't-delete-it failure mode and asserts the
  ``--clear`` retry path is taken and succeeds.
- ``TestRebuildVenv::test_does_not_retry_when_first_failure_is_not_dir_exists``
  — guards against masking real failures (disk full, etc.).
- ``test_verify_core_dependencies.py`` — 7 tests covering the happy
  path, the regression (missing pathspec triggers --reinstall), the
  per-package fallback when --reinstall doesn't help, the platform-
  marker filter so Windows doesn't try to install ptyprocess, the
  missing-pyproject noop, and the VIRTUAL_ENV resolver.

Co-authored-by: Kyssta <218078013+kyssta-exe@users.noreply.github.com>
2026-06-04 06:05:41 -07:00
cd68b8f0e8 fix(auth): set active_provider after hermes auth add qwen-oauth
hermes auth add qwen-oauth called pool.add_entry() but never wrote to
providers["qwen-oauth"] or set active_provider in auth.json.
_model_section_has_credentials() checks get_active_provider() first; with
active_provider unset and no api_key_env_vars configured for oauth_external
providers, the setup wizard reported "No inference provider configured" even
after a successful Qwen CLI OAuth login.

Add _mark_qwen_oauth_active() in auth.py: writes a minimal provider state
entry (base_url for display only) and calls _save_provider_state() to set
active_provider. The function deliberately does not copy the api_key — that
lives in the Qwen CLI credential file managed by _save_qwen_cli_tokens /
resolve_qwen_runtime_credentials and must not be duplicated in auth.json
where it would become stale.

pool.add_entry() is retained so "hermes auth list" continues to show the entry.
Runtime credential resolution continues to use resolve_qwen_runtime_credentials.

Mirrors the fix applied to openai-codex (#37517) and xai-oauth (#37576).
2026-06-04 05:58:33 -07:00
71a9f44e80 fix(gateway): retry startup auto-resume when a failed platform reconnects 2026-06-04 05:56:45 -07:00
fa8e2f935b polish(minimax): address Copilot review comments on M3 default-aux fix
Three Copilot inline review comments on #37664, two worth landing
in a polish pass before merge:

1. auxiliary_client.py:270 — Copilot suggested keeping the
   minimax-* entries in _API_KEY_PROVIDER_AUX_MODELS_FALLBACK as
   a safety net for environments where the profile-based
   resolution can't import or run plugin discovery. **Declined.**
   The deepseek precedent (commit 773a0faca) explicitly removed
   deepseek from the same dict for the same reason — the profile
   layer is the source of truth and the dict is a legacy
   pre-profiles-system fallback. We do not want to fragment the
   codebase by provider: either the profile layer is authoritative
   or the dict is. The minimax PR picks profile (matching deepseek)
   and the dict stays cleaned up. The risk Copilot raises is
   real but theoretical — plugin discovery runs at import time of
   the providers module, which is the first thing any modern
   Hermes entrypoint imports.

2. tests/agent/test_minimax_provider.py:162 — Copilot flagged
   that the test class relies on _get_aux_model_for_provider()
   resolving via provider profiles but doesn't explicitly trigger
   plugin discovery. **Fixed.** Added 'import model_tools  # noqa:
   F401' at the top of both test_minimax_aux_is_standard and
   test_minimax_aux_not_highspeed. The fixtures in the parallel
   test_minimax_profile.py already did this; the legacy test in
   test_minimax_provider.py was order-dependent and would silently
   break if anyone reorganised the test ordering. Pinned the
   dependency explicitly so the test is order-independent.

3. tests/plugins/model_providers/test_minimax_profile.py:46 —
   Copilot flagged that the docstring referenced a hard-coded
   line number 'hermes_cli/models.py:298' that would go stale.
   **Fixed.** Replaced with the symbol reference
   'hermes_cli.models._PROVIDER_MODELS[\'minimax\']' which is
   stable under file edits and grep-friendly. The new docstring
   also reads more naturally — readers don't have to look up
   'what's at line 298' to follow the reasoning.

All 221 minimax-related tests still pass.
2026-06-04 05:53:35 -07:00
b531b5d12a fix(minimax): update AUTHOR_MAP entry + test_minimax_oauth_aux_model_registered
Two follow-ups to the M3 default-aux-model PR (#37664):

1. AUTHOR_MAP entry: add fearvox1015@gmail.com -> Fearvox so the
   check-attribution CI job recognises Nolan's real contributor
   email. The previous run of the attribution check on #37664
   failed because the commit was authored as nolan@0xvox.com
   (wrong local git config) which isn't in AUTHOR_MAP. The
   commit itself is now re-authored to fearvox1015@gmail.com
   so both the per-commit check and the AUTHOR_MAP lookup pass.

2. tests/hermes_cli/test_api_key_providers.py::TestMinimaxOAuthProvider
   ::test_minimax_oauth_aux_model_registered was pinning the aux
   model in the legacy _API_KEY_PROVIDER_AUX_MODELS dict, which
   the PR correctly removed (mirrors the deepseek cleanup in
   773a0faca). The test now asserts the new world order: the
   aux model comes from ProviderProfile.default_aux_model on
   the minimax-oauth profile, not the fallback dict. This is
   the same pattern that the profile-layer deepseek fix
   introduced.
2026-06-04 05:53:35 -07:00
3d1d0a49fe fix(minimax): align default_aux_model with M3 frontier on minimax + minimax-cn
The minimax / minimax-cn / minimax-oauth profiles still advertised
M2.7 (and M2.7-highspeed for OAuth) as their default_aux_model,
predating the M3 release (2026-06-01). The user-facing
_PROVIDER_MODELS['minimax'] catalog top entry is M3, and the
recommended config for a Token-Plan install now sets
model.default: MiniMax-M3, so the aux default was the only
remaining drift.

Updates:

  * minimax        default_aux_model: M2.7        -> M3
  * minimax-cn     default_aux_model: M2.7        -> M3
  * minimax-oauth  default_aux_model: M2.7-highspeed -> M2.7
                    (M3 is not on the OAuth / Coding Plan tier per
                    platform docs as of this PR; the highspeed
                    variant was the 2x-cost regression from #4082
                    that PR #6082 collapsed to plain M2.7 for
                    minimax / minimax-cn but missed OAuth)

  * agent/auxiliary_client.py: drop the three legacy
    _API_KEY_PROVIDER_AUX_MODELS_FALLBACK entries for the minimax
    family. _get_aux_model_for_provider() reads from
    ProviderProfile.default_aux_model first (line 250) and only
    falls back to the dict when the profile has no aux model or
    the profile import fails. With the profile now set, the dict
    entries are dead code and a drift hazard. Mirrors the deepseek
    cleanup in 773a0faca.

  * tests/agent/test_minimax_provider.py: update the existing
    TestMinimaxAuxModel assertions from MiniMax-M2.7 to MiniMax-M3
    (the intent — 'standard, not highspeed' — is unchanged; the
    pin value is).

  * tests/plugins/model_providers/test_minimax_profile.py: new
    file mirroring tests/plugins/model_providers/test_deepseek_profile.py.
    Pins each of the three profiles' default_aux_model and
    asserts _get_aux_model_for_provider() returns it. A second
    class guards against the highspeed regression coming back.

Refs:
  - Closes #36196 in spirit (M3 support — the catalog half of
    that issue is #36212; this PR covers the profile half)
  - Related: #4082 (M2.7-highspeed 2x-cost), #6082 (previous
    M2.7-highspeed -> M2.7 fix that missed OAuth + the
    auxiliary_client.py fallback dict)
  - Pattern: 773a0faca (same profile-layer fix for deepseek)
2026-06-04 05:53:35 -07:00
5f62ba8e4b fix(auth): use _save_xai_oauth_tokens in auth_commands to set active_provider
hermes auth add xai-oauth called pool.add_entry() directly, writing only the
credential-pool entry (source "manual:xai_pkce") without touching
providers["xai-oauth"] or setting active_provider in auth.json.

_model_section_has_credentials() checks get_active_provider() first; with
active_provider unset and no api_key_env_vars configured for oauth_external
providers, the setup wizard reported "No inference provider configured" even
after a successful OAuth login.

Use _save_xai_oauth_tokens() — the canonical path already called from the
hermes model xAI login flow — which writes providers["xai-oauth"]["tokens"]
(setting active_provider) and lets _seed_from_singletons seed the pool with
a "loopback_pkce" entry on the next load_pool() call.

Mirrors the fix applied to openai-codex in #37517.
2026-06-04 05:48:50 -07:00
34a2903527 fix(auth): set active_provider after hermes auth add google-gemini-cli
hermes auth add google-gemini-cli called pool.add_entry() but never wrote
to providers["google-gemini-cli"] or set active_provider in auth.json.
_model_section_has_credentials() checks get_active_provider() first; with
active_provider unset and no api_key_env_vars configured for oauth_external
providers, the setup wizard reported "No inference provider configured" even
after a successful OAuth login.

Add _mark_google_gemini_cli_active() in auth.py: writes a minimal provider
state entry (email for display only) and calls _save_provider_state() to set
active_provider. The function deliberately does not copy access_token or
refresh_token — those are managed by agent.google_oauth in the Google
credential file and must not be duplicated in auth.json where they would
become stale.

pool.add_entry() is retained so "hermes auth list" continues to show the entry.
Runtime credential resolution continues to use agent.google_oauth directly.

Mirrors the fix applied to openai-codex (#37517) and xai-oauth (#37576).
2026-06-04 05:44:22 -07:00
9fbfeb31b9 fix(cron): make sequential jobs non-blocking too + sweep MCP after jobs finish
Follow-up on the parallel-dispatch decoupling: the sequential pass for
workdir/profile jobs still ran inline in the ticker thread, so a long
workdir/profile job reintroduced the exact starvation #37312 describes,
just for env-mutating jobs. And the MCP orphan sweep ran immediately
after dispatch in sync=False mode — before jobs finished — defeating its
own 'runs after every job' contract and racing jobs still spawning MCP
children.

- Sequential jobs now queue to a persistent single-thread cron-seq pool
  (preserves one-at-a-time ordering across ticks, never blocks the tick).
- Same in-flight dedup guard now covers sequential jobs.
- MCP orphan sweep runs via a done-callback after the LAST dispatched job
  completes in async mode; inline after as_completed in sync mode.

Verified E2E: tick(sync=False) returns in ~1ms with a 1.5s sequential job
in flight; sweep fires only after that job ends.
2026-06-04 05:40:13 -07:00
eb9cde7346 fix(cron): decouple job dispatch from completion in tick()
PR #13021 fixed serial starvation by adding ThreadPoolExecutor to tick(),
but kept as_completed(timeout=600) which still blocks the ticker thread
until the slowest job finishes. This causes the same starvation pattern:
when one job runs long (15+ min), other jobs' next_run_at expires past the
grace window and they get perpetually fast-forwarded instead of running.

This PR decouples dispatch from completion:
- Persistent ThreadPoolExecutor (reused across ticks, no auto-join)
- Fire-and-forget dispatch: tick submits and returns immediately
- Running-job guard: prevents re-dispatching active jobs
- sync parameter: defaults to True (backward compatible), callers opt
  into sync=False for non-blocking behavior
- atexit shutdown handler for clean pool teardown
- gateway/run.py: production ticker opts into sync=False

Refs #33315 (complementary — that issue's PRs fix grace handling in
jobs.py; this PR prevents the grace from expiring in the first place)
2026-06-04 05:40:13 -07:00
c9b62061d4 fix(cli): launchd KeepAlive unconditional restart (#37388)
Replace KeepAlive.SuccessfulExit=false dict with <key>KeepAlive</key><true/>
so launchd restarts hermes-gateway on any exit, matching the documented
drain-then-exit restart protocol used by --graceful-restart.
2026-06-04 05:38:12 -07:00
153fe28474 fix(vision): use MiniMax type="video" block (not input_video) + tests
The salvaged conversion emitted type:"input_video", which MiniMax M3 rejects
just like the original video_url block. Per MiniMax's Anthropic-compat docs,
the video content block is type:"video" with an image-style source (base64 or
url). Fixes the block type, converts URL-based videos too, and adds 4 video
conversion tests (none shipped with the original PR).
2026-06-04 05:38:11 -07:00