`cron_list` read `job.get("repeat", {})`, but the dict-default only
applies to a MISSING key. A one-shot job persisted with `"repeat": null`
returns None, and the next `.get("times")` raised AttributeError, taking
down the whole `cron list` output. Coalesce with `or {}` so a
present-but-null repeat renders as ∞ like the other cron readers already
do. Adds a regression test.
Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com>
`os.getcwd()` raises FileNotFoundError when the process's working
directory was removed out from under it (e.g. a scratch workspace
cleaned up mid-session), crashing terminal env setup.
Extract a `_safe_getcwd()` helper that falls back to TERMINAL_CWD, then
the user's home, on FileNotFoundError, and route all three `os.getcwd()`
call sites in terminal_tool.py through it (local default_cwd, the Docker
cwd-passthrough source, and the debug-config print) so the same crash
can't resurface at a sibling site. Adds unit tests for the real-cwd path
and both fallback branches.
Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com>
Inside the published Docker image, both the `--tui` banner and the
dashboard-embedded TUI report `1 commit behind — run docker pull
nousresearch/hermes-agent:latest to update` even though the container
has no git repo and no way to compute a commit delta.
Root cause: two independent update-detection paths, only one of which
knows it's running in Docker.
- `recommended_update_command()` → `detect_install_method()` reads the
`.install_method` stamp that `docker/stage2-hook.sh` writes at boot →
returns "docker", so the *command string* correctly says `docker pull`.
- `banner.check_for_updates()` (the source of the "N commits behind"
*count*) has no notion of the docker install method. It only detects a
build via `HERMES_REVISION` (nix-only, unset in the image) or a `.git`
dir (excluded from the image by .dockerignore). Neither matches, so it
silently falls through to `check_via_pypi()`, whose PyPI-version
mismatch flag (1) is then rendered verbatim by the CLI banner
(build_welcome_banner), the Ink TUI badge (branding.tsx), and `hermes
version` as "1 commit behind" — a phantom count, no commit math
involved. `hermes update` already refuses to run in-place in the
container.
The dashboard's REST `/api/hermes/update/check` endpoint already
short-circuits docker (returns behind=None + the docker guidance). This
mirrors that guard inside `check_for_updates()` so the banner/TUI/version
surfaces agree: when `detect_install_method() == "docker"`, return None
before any git/pypi probe (and before writing a cache entry). None makes
the render guards (`typeof === 'number' && > 0`, `behind and behind > 0`)
stay false, so the badge/line disappears entirely — matching the System
page.
Fix is in one place (check_for_updates) because all three consumers route
through it via get_update_result()/_update_result.
Tests: test_check_for_updates_docker_returns_none asserts None + no
git/pypi probe + no cache write; test_check_for_updates_non_docker_still_checks
guards against over-broadening (pip still version-checks). Mutation-tested:
removing the guard fails the docker test.
Verified against a real `docker build` of the image — see PR description.
`_read_json_file` caught OSError but not UnicodeDecodeError, so a status
file holding binary/non-UTF-8 bytes (truncated or clobbered write) would
crash the gateway status path instead of being treated as unreadable.
UnicodeDecodeError is a ValueError subclass, not an OSError, so it
escaped the existing guard.
Widen the catch to (OSError, UnicodeDecodeError) at both read sites in
gateway/status.py — `_read_json_file` and the sibling `_read_pid_record`,
which had the identical gap. Adds tests covering binary input (returns
None) and valid input (still parses) for both.
Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com>
The LINE adapter classified every non-text inbound message as
`MessageType.IMAGE`, which doesn't exist on the enum — so any image,
video, audio, file, sticker, or location message raised AttributeError
the moment it was constructed.
Beyond fixing the crash, every non-text message was being collapsed onto
a single type. The gateway routes on MessageType (voice → STT, files →
document handling, etc.), so misclassification silently mishandled media.
Replace the inline ternary with a `_LINE_MESSAGE_TYPES` lookup that maps
each LINE webhook type to its proper enum member (audio → VOICE to match
how Telegram/WhatsApp treat voice notes), falling back to TEXT for
unknown types. Adds regression tests covering the mapping and the old
AttributeError.
Co-authored-by: Sahibzada Allahyar <94376830+sahibzada-allahyar@users.noreply.github.com>
The Hermes Docker image's venv is built with `uv sync`, which does not
bootstrap pip into the venv. When the google-workspace setup script needs
to install its deps and the running interpreter has no pip,
`sys.executable -m pip install` dead-ends with "No module named pip"
(reported via Discord support).
install_deps() now falls back to `uv pip install --python <interpreter>`
when the pip path fails and uv is on PATH. uv installs into the exact
interpreter the script is running under without needing pip present, so
the pip-less venv self-heals (e.g. a dep evicted on image update, or a
build without the [google]/[all] extra). On environments with neither
pip nor uv, the [google] extra hint is printed as before.
Verified E2E against nousresearch/hermes-agent:latest: under the venv
python with a missing dep, --install-deps now prints "Dependencies
installed." and exits 0 instead of failing.
Adds TestInstallDeps regression coverage: pip path, uv fallback,
uv-not-consulted-when-pip-works control, and both no-installer-available
and uv-also-fails failure cases.
Since #38591 made the dashboard's embedded chat unconditional, every
browser refresh of /chat spins up a fresh session.create (new sid + a
fresh _SlashWorker via _deferred_build) over /api/ws, but the old tab's
WS disconnect only DETACHES the transport (ws.py) — it never closes the
old session or its slash_worker. The dashboard's in-process gateway is
long-lived, so the detached _SlashWorker subprocess's stdin pipe stays
open forever and the worker never reaches EOF: one leaked python process
per refresh.
Fix at the session-lifecycle layer (not PTY signal timing — verified that
a process whose owning gateway dies is always reaped via stdin-EOF; the
leak is specifically the long-lived dashboard process keeping detached
sessions parked). On WS disconnect, schedule a grace-delayed reap of any
session left orphaned (transport detached to stdio, not mid-turn). A quick
reconnect / session.resume / prompt.submit rebinds a live transport and
cancels the reap, preserving the intentional detach-for-reconnect window.
- server.py: extract _teardown_session() (shared with session.close),
add _ws_session_is_orphaned() + _schedule_ws_orphan_reap(), gated by
HERMES_TUI_WS_ORPHAN_REAP_GRACE_S (default 20s, 0 disables).
- ws.py: schedule the reap for each detached session on disconnect.
- tests: reap-closes-worker, spares-reattached/mid-turn/finalized,
disabled-when-grace-zero.
check_execute_code_guard() never called is_approved() before entering the
approval flow, and never persisted session/permanent approvals from the
gateway response. This meant 'Approve session' and 'Always' buttons had
no effect — every execute_code call re-prompted the user.
- Add is_approved() check after get_current_session_key(), matching
check_all_command_guards()
- Persist session ('approve_session') and permanent ('approve_permanent')
approvals based on the gateway choice, same as terminal command guard
- Add 3 regression tests for session persistence, permanent persistence,
and short-circuit on pre-existing approval
Resolve conflicts in desktop settings/cron/messaging/sidebar: adopt main's
ListRow + actions-menu refactors for credential rows; keep our profileColor
import on the sidebar. Drop the now-orphaned Tip-based helpers.
On Windows, os.symlink() raises OSError (WinError 1314) unless the
process has Administrator rights or Developer Mode is enabled. The SSH
bulk-upload staging logic used symlinks to mirror the remote layout
before piping through tar; this caused all ssh_bulk_upload tests to
fail on Windows.
- ssh.py: wrap os.symlink() in try/except OSError and fall back to
shutil.copy2() so staging works on every platform. shutil was already
imported, no new dependency introduced.
- file_sync.py: replace str(Path(remote).parent) with
posixpath.dirname(remote) in unique_parent_dirs(). pathlib.Path uses
the host separator (\ on Windows), but these paths are sent to a
remote Linux host over SSH and must always use forward slashes.
- test_ssh_bulk_upload.py: make test_staging_symlinks_mirror_remote_layout
platform-agnostic — assert file existence and content instead of
os.path.islink() + os.readlink(), since the staged entry may be a
copy on Windows.
web_tools.is_safe_url was replaced by async_is_safe_url, but three
web-provider test files still monkeypatched the old sync name, raising
AttributeError. Patch the async variant with an async lambda.
Add async_is_safe_url() wrapping is_safe_url via asyncio.to_thread, and route
all async SSRF call sites through it: web_extract_tool, the vision/video
preflight checks, and both download redirect guards. socket.getaddrinfo blocks;
calling it inline from async tool paths froze the event loop for the duration of
DNS resolution.
vision_tools: split _validate_image_url into _image_url_shape_ok (no DNS) +
sync _validate_image_url (for sync callers/tests) + async _validate_image_url_async.
Widened beyond the original PR #3691 to sibling async sites that also blocked
the loop (second redirect guard, video preflight).
Salvage of #3691 by @Kewe63 — surgically re-applied onto current main because
the original branch was too stale to cherry-pick cleanly (would have reverted
the web_crawl_tool refactor).
Co-authored-by: Kewe63 <kewe.3217@gmail.com>
session.py _persist() bypassed SessionDB's thread-safe write path by
accessing private internals db._lock and db._conn directly:
with db._lock:
db._conn.execute("UPDATE sessions SET model_config = ? ...")
db._conn.commit()
This was fragile for three reasons:
1. It bypassed _execute_write()'s BEGIN IMMEDIATE + jitter-retry logic,
so concurrent writes could hit SQLite BUSY without retrying.
2. It called db._conn.commit() manually, breaking the transactional
contract that _execute_write() enforces.
3. Any internal rename of _lock or _conn would silently break this
call site with an AttributeError at runtime.
Fix:
- Add SessionDB.update_session_meta(session_id, model_config_json, model)
to hermes_state.py. Routes through _execute_write() for the standard
BEGIN IMMEDIATE + lock + jitter-retry guarantee. Uses COALESCE so
passing model=None leaves the stored model column unchanged.
- Replace the db._lock / db._conn block in session.py _persist() with
a single db.update_session_meta() call.
Tests (tests/acp/test_session_db_private_access.py, 11 tests):
- Unit tests for update_session_meta: updates model_config, updates
model, preserves existing model on None, routes through _execute_write,
no-op on non-existent session.
- AST checks: db._lock and db._conn not referenced in session.py;
_persist() calls update_session_meta().
- Integration round-trips: cwd and model persisted correctly; COALESCE
prevents overwriting an existing model with NULL.
Tests in test_gateway_service.py imported grp inline without a
platform guard, causing ImportError on systems where grp is
unavailable (e.g. macOS, WSL without grp module).
Added pytest.importorskip('grp') at module level alongside the
existing pwd guard, and removed three redundant inline import grp
statements.
Fixes#24531
The standalone `_send_slack()` function used by the send_message tool
and cron delivery fallback was not passing `thread_ts` to the Slack API,
causing messages to post to the top-level channel instead of inside
threads.
- Add `thread_ts` parameter to `_send_slack()`
- Include `thread_ts` in the chat.postMessage payload when present
- Pass `thread_id` from `_send_to_platform()` to `_send_slack()`
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Drag a sidebar session into the composer to drop an @session:<profile>/<id>
chip the agent resolves via session_search. New READ shape dumps a whole
session by id (head+tail when large); a `profile` param reads another
profile's DB read-only, and a cross-profile locate scan resolves bare ids
when the model drops the owning profile from the link.
Also: ASCII "waking up <profile>" overlay during lazy gateway swaps,
global haptic rate-limit to kill the reconnect-storm "clickity" buzz, and
reauth toasts surfaced once per disconnect instead of every backoff tick.
_build_memory_uri produced URIs of the form:
viking://user/{user}/memories/{subdir}/mem_{slug}.md
The /agent/{agent}/ segment was missing, causing every agent under
the same user to write into the same flat namespace. In multi-agent
deployments agents silently overwrite each other's memories and
vector retrieval cross-pollinates results.
self._agent was already populated correctly (from OPENVIKING_AGENT
env var, default 'hermes') and sent via X-OpenViking-Agent header —
it was simply not interpolated into the URI.
Fix: add the missing segment so URIs follow the documented shape:
viking://user/{user}/agent/{agent}/memories/{subdir}/mem_{slug}.md
Tests: 4 new regression tests in TestOpenVikingMemoryUriBuilder,
13/13 passed (9 existing + 4 new).
Salvage of #36631 (@annguyenNous), rebased onto current main with
regression tests added. Fixes#36266.
When a persistent Docker sandbox container is removed out-of-band (idle
reaper, `docker prune`, OOM kill, daemon restart), the gateway kept
issuing `docker exec` against the dead container ID, returning
"No such container" on every subsequent tool call — the agent was
permanently blocked until the gateway process restarted.
DockerEnvironment.execute() now detects the "No such container" /
"is not running" error after a non-zero exit (gated on
persist_across_processes) and calls _recreate_container(): it tries
label-based reuse first, falls back to a fresh container replaying the
same image + full all_run_args set, re-runs init_session(), and retries
the command once. A genuine non-zero exit is NOT misclassified as
container-gone.
Differs from #36631 as submitted: adds the tests the original lacked.
tests/tools/test_docker_environment.py covers _is_container_gone pattern
matching (incl. the negative/control case), the recover-and-retry path,
the persist_across_processes=False opt-out (no recovery), and the
ordinary-failure passthrough (no spurious recreation). _make_dummy_env
now forwards persist_across_processes.
Verified:
- Unit: 67/67 in test_docker_environment.py (4 new + existing).
- Live E2E against the real docker daemon: started a persistent
container, `docker rm -f`'d it out-of-band, and the next execute()
transparently recreated a fresh container and succeeded; a follow-up
command worked in the recovered container; a real `exit N` passed
through without triggering recovery.
Co-authored-by: annguyenNous <annguyenNous@users.noreply.github.com>
When `docker run -d` fails after Docker has already created the container
object (e.g. exit 125 when the daemon isn't ready, or a timeout mid image
pull), the code raised before `self._container_id` was set — so the
container leaked permanently in "Created" state. Reported in #7439:
110+ orphaned containers accumulated over 3 days from hourly cron-
scheduled gateway sessions hitting a Docker Desktop startup race.
The orphan reaper added in #33645 (reap_orphan_containers) does NOT cover
this case: it filters `status=exited`, but a failed-create container is in
`Created` state, so it slips through and is never reaped.
Wrap the `docker run -d` call in try/except and `docker rm -f` the
container by its known name before re-raising.
Salvages #7440 by @Tranquil-Flow. Their branch predated the cross-process
reuse + labels rework on `main`, so a cherry-pick conflicted; reconstructed
the same intent (plus their two regression tests, adapted to mock the new
reuse `docker ps` probe) against current `main`.
Verified adversarially: reverted just the product change to origin/main's
`docker.py`, ran the two new tests -> both FAIL with
`assert 0 == 1 ("docker rm should be called once")`. With the fix applied,
both pass; full test_docker_environment.py is 65/65 green.
Closes#7440. Fixes#7439.
Co-authored-by: Evi Nova <66773372+Tranquil-Flow@users.noreply.github.com>
Docker Compose service names (e.g. ollama, litellm, hermes-litellm)
are unqualified hostnames with no dots. These are always local — they
resolve via Docker DNS, /etc/hosts, or mDNS. Without this fix, the
stale stream timeout fires on local LLM proxies, causing infinite
reconnect loops.
Closes#7905
_ notification_poller_loop_ re-emits status.update every cycle
when a background process completes while the session is busy.
The same completion event gets re-queued and re-emitted to the
TUI every few ms, flooding the transcript with duplicate lines.
Add _notification_event_dedup_key(evt) that returns a tuple
identity for each notification event. Only emit status.update
on first sight per identity:
- completions: (sid, type) — one-shot per process session
- watch_match: (sid, type, command, pattern, output, ...)
- watch_overflow/disabled: (sid, type, command, message, ...)
The dedup key design was refined from an initial sid:type approach
after @lordbuffcloud identified that distinct watch_match events
(READY vs DONE) for the same process would be incorrectly collapsed.
Tests from @tymrtn cover distinct watch matches, exact replay
dedup, and completion one-shot behavior.
Co-authored-by: tymrtn <ty@tmrtn.com>
* Revert "fix(gateway): anchor Google Chat OAuth client secret to default Hermes root"
This reverts commit fff0561441.
* Revert "fix(cli): honor global-root active_provider fallback for named profiles"
This reverts commit 3858cf4307.
* docs(google_chat): describe OAuth client secret as profile-scoped, not host-wide
The setup docs, oauth docstring, and the adapter's 'no credentials'
error message all described the Google Chat OAuth client secret as
host-wide shared infrastructure. That contradicts profile isolation:
profiles are separate auth boundaries, so two profiles can point at
different Google OAuth apps / accounts. Reword all three to say the
secret is profile-scoped and each profile registers its own.
The per-session icon picker added more noise than value — rip it out end
to end (sessions.icon column, set_session_icon, the PATCH field, the
picker UI, and the SessionInfo.icon type).
The cross-profile session aggregator now opens each profile's state.db
read-only (mode=ro, no schema init), so listing other profiles on every
sidebar refresh never DDLs or takes a write lock on their live DBs. The
single-profile hot path stays on par with /api/sessions.
Add first-class profile support to the desktop app without app reloads.
- Swap the single live gateway onto a session's profile lazily (spawned on
demand by the Electron backend pool), so one backend serves the active
profile and others stay cold — no OOM with many profiles.
- Aggregate sessions across profiles by reading each profile's state.db
read-only; unified "All profiles" view groups sessions per profile with
per-profile pagination, while the default view stays scoped to one profile.
- Add an Arc-style profile rail at the sidebar foot: a default<->all toggle
pinned left, colored named-profile squares scrolling between, Manage pinned
right. Profile identity is a deterministic per-name color.
- Route profile-scoped REST (config/env/skills/tools/model) to the active
gateway profile and invalidate React Query caches on swap. Single-profile
users never trigger a swap, so their path is unchanged.
Backend:
- web_server: profile-aware active/list endpoints + per-profile session
totals; hermes_state: session_count(exclude_children); main.py: honor
--profile over HERMES_HOME env for pooled backends.
UI primitives:
- Add a position-aware Tip tooltip (instant, themed) as a drop-in for native
title=, and strip redundant tooltips from self-descriptive chrome.
/branch (aka /fork) sessions vanished from /resume and /sessions. Both
surfaces funnel through list_sessions_rich(include_children=False), which
hid any session with a parent_session_id unless identified as a branch via a
heuristic — parent.end_reason == 'branched' AND child.started_at >=
parent.ended_at.
Two ways that heuristic failed:
1. CLI/gateway branches: once the parent was reopened (e.g. resumed) and
re-ended with a different end_reason (tui_shutdown overwriting 'branched'),
the heuristic stopped matching and the branch was hidden permanently.
2. TUI branches (tui_gateway session.branch): the TUI never ends the parent
as 'branched' — it creates the child while the parent is still live — so
the heuristic NEVER matched and TUI branches were hidden from the moment
they were created (this is the macOS desktop app's primary symptom).
Fix: persist a stable '_branched_from' marker in the branch session's
model_config at creation time across ALL THREE branch paths (CLI cli.py,
gateway gateway/run.py, and TUI tui_gateway/server.py), and OR a
json_extract(model_config, '$._branched_from') IS NOT NULL check into the
list_sessions_rich filter. The marker is immutable across the parent's
lifecycle, so the branch stays visible regardless of how/whether the parent
is ended. The legacy end_reason heuristic is kept (OR'd) so pre-existing
branches remain visible. Subagent/compression children (no marker, parent
not 'branched') stay correctly hidden. Fixes#20856.
Approach by liuhao1024 (PR #20864); reimplemented on current main, extended
to the TUI branch path (which the original missed), with regression tests for
the reopen+re-end scenario and the TUI marker persistence.
hermes update can brick a Windows install. When 'hermes update --force' runs
past the concurrent-process guard, rebuild_venv runs while the venv is still in
use: shutil.rmtree(ignore_errors=True) deletes site-packages + certifi's cert
bundle but can't remove the locked python.exe, leaving a half-gutted venv that
uv venv then refuses to overwrite. Every later HTTPS call dies with
FileNotFoundError for the missing cacert and there is no recovery.
--clear alone (the c136eb4de retry path) does not fix the real lock case: when
the locked interpreter is *inside* the venv being rebuilt, neither rmtree nor
uv venv --clear can delete it. os.replace of the parent directory *is* allowed
on Windows (a running .exe is tracked by handle, not path), so we move the old
venv aside atomically to <venv>.old, rebuild with --clear in its place, and the
still-running gateway/desktop keep using the moved-aside copy until they
restart. If the venv genuinely can't be moved, we abort cleanly and leave it
fully intact; if the rebuild fails, we restore the moved-aside copy.
Folds in the call-site guards from #38511 (@f3rs3n):
- rebuild_venv() returns False (and restores the backup) if uv exits 0 without
producing an interpreter.
- both hermes update venv-rebuild call sites abort with RuntimeError instead of
continuing into dependency install when rebuild_venv() returns False.
Also gitignore /venv.old/ so the update autostash (git stash --include-untracked)
doesn't sweep the moved-aside venv on every run.
Root-cause fix for #37881. Supersedes the --clear-only retry from c136eb4de.
Co-authored-by: f3rs3n <32328813+f3rs3n@users.noreply.github.com>
The salvaged fix held _session_resume_lock across _make_agent (MCP discovery
+ AIAgent construction, seconds), serializing it against session.close. Since
session.close runs on the main RPC dispatch thread (not a _LONG_HANDLER), a
close racing a mid-build resume would stall all fast-path RPCs (approval.respond,
session.interrupt).
Restructure to double-checked locking: build the agent outside the lock, then
re-check _find_live_session_by_key under the lock before _init_session. A losing
concurrent resume discards its just-built agent (no worker/poller wired yet) and
reuses the winner. Updated the concurrent-resume regression test to assert the
real invariant (one surviving live session + loser agent closed) rather than the
implementation detail of a single _make_agent call.
quick() and dry_run() previously trusted the stored category from
tracked.json without re-validating at delete time. Stale entries from
before #34840 could carry category="cron-output" for cron control-plane
paths (e.g. cron/jobs.json), causing quick() to delete the live
scheduler registry.
Fix:
- Fix guess_category() to only classify cron/output/** as cron-output
(was classifying ALL cron/* paths, missing the #34840 fix).
- Re-validate cron-output entries via guess_category() at delete time
in quick() and dry_run(); stale entries that are no longer classified
as cron-output are skipped and removed from tracked.json.
- Add _is_protected_cron_path() as a hard defense-in-depth guard that
blocks deletion of cron/cronjobs directories and known control-plane
files (jobs.json, .tick.lock) regardless of stored category.
- Update test_cron_subtree_categorised to match fixed guess_category
(only cron/output/* is cron-output, not all of cron/).
Tests: add 5 regression tests in TestStaleCronEntryMigration.
A session_reset (/new, /cc) that bumps the run generation while an agent
turn is in flight left the dead agent in the _running_agents slot: the
in-flight run's own release is generation-guarded and correctly returns
False, and the outer finally's sentinel-only check also missed the
leftover real agent. The session then silently dropped every subsequent
message as 'agent busy' until a full gateway restart. (#28686)
- _process_message_or_command outer finally now calls the unconditional,
idempotent _release_running_agent_state(key) on all exit paths instead
of the sentinel-vs-else branch that could strand a dead agent.
- _handle_reset_command evicts the slot right after bumping the
generation, so the zombie is cleared at reset time regardless of how
the in-flight run unwinds.
Co-authored-by: CryptoByz <cryptobyz.airdrop@gmail.com>
Root-run gateways have $HOME=/root, which is on the MEDIA system-path
denylist, so the gateway silently dropped agent-generated deliverables
under /root (e.g. /root/work/proposal.docx) — the user got a 'here is
your file' reply with nothing attached.
_path_under_denied_prefix now treats the running user's own home as
deliverable: the home tree itself is no longer denied, while the
more-specific denied paths inside it (~/.ssh, ~/.aws, ~/.hermes/.env,
auth.json, config.yaml) stay blocked because they are separate denylist
entries. The exception only matches when the denied prefix IS $HOME, so
a non-root gateway still can't deliver another user's home.
Diagnosis, reproduction, and the failing-case analysis are from
@GodsBoy (#38108 / #38106). Implemented here as the minimal denylist
fix rather than a staging/copy subsystem.
Co-authored-by: GodsBoy <dhuysamen@gmail.com>
search_sessions_by_id previously fetched up to 10k sessions via
list_sessions_rich and filtered them in Python — O(n) per keystroke.
Push the id match into SQL instead.
- list_sessions_rich gains an optional id_query param: a case-insensitive
LIKE pushed into the outer WHERE, matched against each surfaced row's id
AND every id in its forward compression chain (via the existing chain
CTE). Searching a compression root id or a tip id both resolve to the
same projected conversation. LIKE wildcards in the needle are escaped.
- search_sessions_by_id now fetches only matching rows (limit*4) and ranks
exact > prefix > substring in Python over that small set.
- web_server /api/sessions/search: route ID matches and content matches
through one lineage-keyed dedup helper so an id-hit and a content-hit on
the same conversation collapse to a single result (the contributor's
version keyed ID hits by raw sid and content hits by root, which could
double-list a compression tip).
- command-center haystack also matches _lineage_root_id for parity.
E2E verified against a real DB: exact match over 3000+ sessions
materializes 1 row in Python (was ~3000), 5ms; root-id resolves to tip;
LIKE-wildcard escaping holds.
Follow-up to @0xharryriddle's feat(desktop): search sessions by id.
The salvaged detector validated each cached electron-*.zip with
zipfile.testzip() and only purged ones it judged corrupt. But stdlib
zipfile reads from the end-of-central-directory backward, so it silently
tolerates prepended/concatenated junk — which is exactly the corruption
the bug report names ('86257938 extra bytes at beginning or within
zipfile', a partial download resumed into the same file). testzip()
returns clean on those zips, so the self-heal never fired for the
reported failure mode.
Drop the self-rolled validator: on any packaged-build failure, purge the
version's cached zips AND the half-written unpacked dir, then retry once.
@electron/get re-downloads with its own SHASUM verification — the real
source of truth, which catches prepend/concat/truncate alike. An
unrelated failure just costs one clean re-download and fails the same way.
Verified empirically: zipfile.testzip() returns None (clean) on a
prepended-junk zip; the unconditional purge removes it correctly.
hermes desktop failed on Linux with an ENOENT renaming
release/linux-unpacked/electron -> Hermes. Root cause is a corrupt
cached Electron zip (~/.cache/electron/electron-*.zip): app-builder
unpack-electron extracts a partial tree from the bad zip that is
missing the electron binary, so electron-builder dies on the final
rename. Re-running repeats the broken extraction, leaving the desktop
app permanently unlaunchable until the cache is manually purged.
- Add _electron_download_cache_dirs() + _purge_corrupt_electron_cache()
to hermes_cli/main.py: validate every electron-*.zip via
zipfile.testzip() and delete corrupt ones; honor electron_config_cache
/ ELECTRON_CACHE overrides with per-OS defaults.
- Wire purge + single retry into cmd_gui packaged-build failure path so
a poisoned download self-heals (electron re-downloads clean).
- Add beforePack hook (apps/desktop/scripts/before-pack.cjs) to wipe the
target unpacked dir before staging, making packaging idempotent across
interrupted runs. Cross-platform, best-effort.
- Tests: corrupt-zip detector, cmd_gui purge/retry/launch path,
no-retry-when-clean path, and node --test for the cleanup helper.
Follow-up to the salvaged #37727. That PR fixed the reactive recovery path
(classifier + post-failure shrinker) but left the PROACTIVE embed-time guard
in vision_tools byte-only — a tall small-byte screenshot (e.g. 1200x12000 at
0.06 MB) still baked into immutable history un-resized, relying on a failed
round-trip to trigger reactive shrink.
- vision_tools: add _image_exceeds_dimension() + _EMBED_MAX_DIMENSION (7900px);
the embed-time cap now fires on bytes OR pixels and passes max_dimension to
the resizer, so tall small-byte images are shrunk before they're embedded.
- vision_tools: best-effort lazy-install of Pillow (tool.vision) in the resize
ImportError fallback so the soft dep self-heals (respects allow_lazy_installs).
- error_classifier: add two more Anthropic dimension-cap wording variants.
- pyproject + lazy_deps: declare Pillow as the [vision] extra / tool.vision
lazy dep (it was undeclared everywhere; without it ALL resize recovery no-ops).
- tests: cover _image_exceeds_dimension (tall/small/edge/no-Pillow/corrupt).
Co-authored-by: kyssta-exe <kyssta-exe@users.noreply.github.com>
Anthropic enforces two independent ceilings per image:
1. 5 MB encoded byte size
2. 8000 px longest side
Hermes only guarded #1. A tall screenshot (e.g. 1200x12000 at 0.06 MB)
passes every byte check but fails the pixel check, returning a
non-retryable HTTP 400 that permanently bricks the conversation thread.
Fixes:
- error_classifier: add 'image dimensions exceed' pattern to
_IMAGE_TOO_LARGE_PATTERNS so the 400 is classified as image_too_large
and triggers the shrink/retry path instead of falling through to
non-retryable error.
- conversation_compression: check pixel dimensions (via Pillow) even
when byte size is under the 4 MB target. If max(dims) > 8000, force
shrink.
- vision_tools._resize_image_for_vision: add optional max_dimension param.
When set, images exceeding the pixel cap are downscaled even if they're
under the byte budget. The resize loop now checks both byte AND pixel
limits before accepting a candidate.
Closes#37677