A kanban worker that exhausted its retries purely on a provider rate
limit / quota wall (e.g. opencode-go's 5-hour window) exited with code 1.
The dispatcher counted that as a crash, and with DEFAULT_FAILURE_LIMIT=2
two quota-wall hits permanently blocked the card. Fanning out many
workers against one shared quota made this routine.
Now a rate-limited worker exits with EX_TEMPFAIL (75); the dispatcher
classifies that as a 'rate_limited' exit, releases the task back to
'ready' WITHOUT incrementing consecutive_failures (the breaker can't trip
on a transient throttle), and the respawn guard defers the next attempt
on a cooldown (default 5min, HERMES_KANBAN_RATE_LIMIT_COOLDOWN_SECONDS)
until the quota window clears. Genuine crashes still count and trip the
breaker as before. The 120s Retry-After cap is unchanged — no worker
parks for hours holding a slot.
- conversation_loop.py: surface failure_reason in the exhaustion return
- cli.py: kanban worker picks exit 75 on rate_limit/billing failure
- kanban_db.py: rate_limited exit kind, no-count requeue, cooldown guard
decompose_triage_task hardcoded every fan-out child to workspace_kind
'scratch', ignoring the root task's workspace. A code-gen task created
with a dir:/worktree: workspace would fan out into throwaway scratch tmp
dirs (GC'd on archive), so generated code never landed in the project.
Children now inherit the root's workspace_kind + workspace_path. A child
dict may still override with its own workspace_kind/workspace_path; the
path only carries over when kinds match. Scratch roots are unchanged.
* feat(kanban): goal_mode cards run workers in a /goal loop
A goal_mode card wraps its dispatched worker in the Ralph-style goal
loop behind /goal: after each turn an auxiliary judge checks the
worker's response against the card title+body, and if not done the
worker keeps going in the SAME session until the judge agrees, the
worker terminates the task itself, or the turn budget runs out (which
blocks the card for human review — never a silent exit).
- kanban_db: goal_mode + goal_max_turns columns (additive migration),
Task fields, create_task params, INSERT wiring, created-event payload.
- kanban_tools: goal_mode/goal_max_turns on the kanban_create tool so
orchestrators can opt cards in when fanning out.
- kanban CLI: --goal / --goal-max-turns on 'kanban create'.
- dashboard API: goal_mode/goal_max_turns on the create endpoint
(auto-surfaced back via asdict).
- _default_spawn: sets HERMES_KANBAN_GOAL_MODE / _GOAL_MAX_TURNS only
when the card opts in.
- goals.run_kanban_goal_loop: standalone, callback-injected loop engine
(no SessionDB persistence; ephemeral worker). cli.py quiet path calls
it after the worker's first turn when the env vars are set.
- Docs: orchestrator skill + kanban feature page.
Tests: DB roundtrip + legacy migration, spawn env gating, and the loop's
continuation/completion/budget-block/finalize-nudge branches. E2E run
against a real kanban DB confirms a budget-exhausted goal worker lands
in a sticky blocked state.
* feat(kanban/dashboard): goal-mode toggle in the create form
Wires the goal_mode card setting into the dashboard UI (the plugin's
hand-written IIFE bundle, no build step):
- InlineCreate: 'goal mode' checkbox after the skills field; checking it
reveals an optional 'max turns' number input. Both reset on submit and
only post goal_mode/goal_max_turns when enabled.
- TaskDrawer: a 'Goal mode: on (max N turns)' MetaRow so a card's
goal-mode setting is visible after creation (auto-fed by asdict via the
existing _task_dict).
Live-tested through the running dashboard with a browser: created a
goal-mode card with max-turns=8, confirmed it persisted to the kanban DB
(goal_mode=1, goal_max_turns=8) and rendered back in the drawer as
'on (max 8 turns)'. No JS console errors.
Tasks can now carry file attachments (PDFs, images, source docs) that
workers read directly — closes the gap where source material had to be
pasted as a path into the task body.
- kanban_db: task_attachments table (additive), Attachment dataclass,
add/list/get/delete accessors, attachments_root/task_attachments_dir
path helpers (per-board, HERMES_KANBAN_ATTACHMENTS_ROOT override)
- build_worker_context: surfaces each attachment's absolute path so the
worker (full file/terminal tool access) reads it via read_file/pdftotext
- dashboard API: POST/GET/DELETE attachment routes (multipart upload,
25MB cap, traversal-safe filenames, root-containment check on download)
- dashboard UI: Attachments section in the task drawer — upload button,
list with download, per-row remove
- docs + tests (13 cases: DB accessors, REST round-trip, traversal
rejection, collision suffixing, worker-context surfacing)
Closes#35338
Follow-up to the budget-exhaustion recovery fix. recompute_ready's
new circuit-breaker guard resolved its effective limit from per-task
max_retries -> DEFAULT_FAILURE_LIMIT, skipping the dispatcher's
configured kanban.failure_limit. _record_task_failure resolves
max_retries -> failure_limit(config) -> DEFAULT, so the two disagreed
whenever an operator set kanban.failure_limit != 2:
- config > 2: a task could get stuck at DEFAULT(2) before reaching its
allowed retry count.
- config < 2: a task the breaker already blocked could be auto-recovered
back to ready, defeating the stricter limit.
Thread the dispatcher's failure_limit through dispatch_once into
recompute_ready so the guard and the breaker share one resolution order.
Updated test_circuit_breaker_block_still_auto_promotes (it asserted a
failures=5 block auto-recovers and resets the counter — that's the
pre-#35072 behavior the loop fix removes); it now exercises a below-limit
transient block, with the at-limit case covered in test_kanban_db.py.
Added two tests for the config-tier and per-task override resolution.
recompute_ready() previously reset consecutive_failures to 0 when
auto-recovering a blocked task. This defeated the circuit-breaker:
a task that repeatedly exhausted its iteration budget would cycle
forever (block → auto-recover with counter=0 → respawn → budget
exhausted → block → …) with no signal to the operator.
Fix: don't auto-recover tasks whose consecutive_failures has reached
the effective failure limit (per-task max_retries or
DEFAULT_FAILURE_LIMIT). The counter is also preserved across
recovery so the breaker can accumulate across cycles.
Fixes#35072
Legacy kanban boards (pre-AUTOINCREMENT schema) crashed the gateway
notifier on every tick — int(None) on a NULL id in unseen_events_for_sub
— silently losing all kanban notifications. CREATE TABLE IF NOT EXISTS
skips existing tables regardless of schema and _add_column_if_missing
only adds columns, so neither could fix a drifted primary-key type.
_rebuild_drifted_tables() detects the legacy shape via PRAGMA table_info
and rebuilds task_events/task_comments/task_runs (TEXT PK -> INTEGER
AUTOINCREMENT) and kanban_notify_subs.last_event_id (TEXT/NULL -> INTEGER
NOT NULL DEFAULT 0), preserving data. The whole pass is one transaction
so an interruption can't leave a table half-renamed, and recreates every
index DROP TABLE would otherwise take down (including idx_events_run).
Co-authored-by: liuhao1024 <liuhao1024@users.noreply.github.com>
Vulture + per-symbol verification (whole-repo grep incl. tests, string
literals, getattr, decorator/registry/argparse dispatch) confirmed each of
these has zero callers anywhere — not reachable via any dynamic-dispatch path,
not referenced by tests, not re-exported.
Removed:
- acp_adapter/tools.py: _build_patch_mode_content
- agent/anthropic_adapter.py: read_claude_managed_key (diagnostics-only, never called)
- agent/bedrock_adapter.py: get_bedrock_model_ids
- agent/browser_registry.py: get_active_browser_provider
- agent/chat_completion_helpers.py: _take_request_client (x2 nested closures, never invoked)
- gateway/platforms/weixin.py: _rewrite_headers_for_weixin, _rewrite_table_block_for_weixin
- hermes_cli/banner.py: _skin_branding
- hermes_cli/debug.py: _delete_hint
- hermes_cli/gateway.py: _setup_email, _setup_sms, _setup_yuanbao
(platform keys absent from the _builtin_setup_fn dispatch dict; handled by
the _setup_standard_platform fallback)
- hermes_cli/kanban_db.py: set_max_runtime, active_run
- hermes_cli/kanban_diagnostics.py: severity_of_highest, _latest_clean_event_ts
- hermes_cli/main.py: _build_provider_choices, cmd_portal
(portal subcommand is wired via portal_cli.add_parser, not this wrapper)
- hermes_cli/model_switch.py: CustomAutoResult (orphaned by the switch_model() extraction)
- hermes_cli/models.py: format_model_pricing_table, fetch_nous_account_tier
- hermes_cli/portal_cli.py: _nous_portal_base_url
- hermes_cli/proxy/server.py: handle_models_fallback (defined but never registered on the router)
- tools/computer_use/cua_backend.py: _parse_element, _is_arm_mac
- tools/file_operations.py: _get_safe_write_root (prod uses the imported
agent.file_safety.get_safe_write_root directly)
- tools/skills_tool.py: _load_category_description
Also dropped two imports left unused by the removals:
- tools/file_operations.py: get_safe_write_root alias
- tools/computer_use/cua_backend.py: import platform
Pure deletion: -551 LOC. No behavior change. Test files covering the edited
modules pass (640/640); the broader suite's pre-existing/env-dependent
failures reproduce unchanged on origin/main.
Reporter diagnosed three independent gaps that together allowed infinite
'unblock → re-stuck' loops with no surfacing or escalation:
GAP 1: `_rule_stuck_in_blocked` resets timer on any `commented`/`unblocked`
event, so a task that cycles every few minutes is invisible to it
regardless of how many times it cycles.
Fix: new `_rule_block_unblock_cycling` rule (`hermes_cli/kanban_diagnostics.py`)
that counts block→unblock cycles in a sliding window. Default threshold
3 cycles within 24h, configurable via `block_cycle_threshold` /
`block_cycle_window_seconds`. Walks events in arrival order (event id)
since multiple events can share the same `created_at` second. Fires as a
warning with a CLI hint to inspect the block reasons.
GAP 2: Iteration-budget-exhausted runs in kanban workers map to
`kanban_block` (status=blocked, but a clean exit from the kernel's
perspective). `_rule_repeated_failures` reads `consecutive_failures`,
which `_record_task_failure` increments only for crashed/timed_out/
spawn_failed — `blocked` outcome bypasses the failure counter, so the
`kanban.failure_limit` circuit breaker never trips on budget-exhaustion
loops.
Fix: `agent/conversation_loop.py` budget-exhaustion path now calls
`_record_task_failure(outcome="timed_out")` instead of `kanban_block`.
Budget exhaustion is genuinely a timeout-shaped failure (the task ran out
of allowed iterations), so this is more honest semantics; it also routes
through the unified failure counter, so repeated budget exhaustions trip
the circuit breaker and the task auto-blocks with `gave_up` after
`failure_limit` retries.
GAP 3: `release_stale_claims` uses `_pid_alive(worker_pid)` only and
ignores `last_heartbeat_at`. Reporter observed a 91-min run that held
its claim with frozen heartbeat because the worker entered a logic loop
with no tool calls — `_pid_alive` kept returning True so the claim was
extended every 15 minutes indefinitely.
Fix: heartbeat-stale backstop. If `last_heartbeat_at` is set AND older
than `DEFAULT_CLAIM_HEARTBEAT_MAX_STALE_SECONDS` (default 1h), reclaim
even if the PID is alive. NULL `last_heartbeat_at` preserves backward
compatibility (no heartbeat yet = extend, as before). The reclaim event
payload now includes a `heartbeat_stale` boolean so operators see why a
live-PID worker was reclaimed.
This works cleanly in concert with PR #34418 (#31752 runtime → heartbeat
bridge): once `_touch_activity` keeps `last_heartbeat_at` fresh as a
side effect of normal API traffic, the backstop only fires for genuinely
wedged workers (no chunks, no tool results, no progress at all).
Co-authored-by: baofuen <45189813+baofuen@users.noreply.github.com>
Remove unused imports (F401) and duplicate/shadowed import
redefinitions (F811) across the codebase using ruff's safe
autofixes. No behavioral changes -- imports only.
- ~1400 safe autofixes applied across 644 files (net -1072 lines)
- __init__.py re-exports preserved (excluded from F401 removal so
public re-export surfaces stay intact)
- Re-exports that are imported or monkeypatched by tests but look
unused in their defining module are kept with explicit # noqa:
F401 (gateway/run.py load_dotenv; run_agent re-exports from
agent.message_sanitization, agent.context_compressor,
agent.retry_utils, agent.prompt_builder, agent.process_bootstrap,
agent.codex_responses_adapter)
- Unsafe F841 (unused-variable) fixes deliberately skipped -- those
can change behavior when the RHS has side effects
- ruff lints remain disabled in pyproject.toml (only PLW1514 is
selected); this is a one-time cleanup, not a config change
Verification:
- python -m compileall: clean
- pytest --collect-only: all 27161 tests collect (zero import errors)
- core entry points import clean (run_agent, model_tools, cli,
toolsets, hermes_state, batch_runner, gateway)
- static scan: every name any test imports directly from an edited
module still resolves
Two related dispatcher behaviors that have been missing for a while.
## kanban.default_assignee (#27145)
Reporter (@agarzon): dashboard creates a task without an assignee, task
parks in 'ready' forever even though the operator's intent ('default')
is perfectly clear. The dispatcher already had a 'skipped_unassigned'
bucket but no fallback routing — users had to manually type 'default'
in the assignee field every time.
Behavior: when 'kanban.default_assignee' is set in config.yaml, the
dispatcher applies that assignee to any unassigned ready task before
deciding whether to spawn. The row is mutated (assignee column + an
'assigned' event with source='kanban.default_assignee' for the audit
trail). Empty/whitespace config value = no fallback, preserving the
existing skipped_unassigned behavior.
Dry-run mode reports what WOULD happen via the new
'auto_assigned_default' bucket on DispatchResult, but does NOT mutate
the DB — operators using 'hermes kanban dispatch --dry-run' see the
routing decision before committing.
## kanban.max_in_progress_per_profile (#21582)
Reporter (@edwardchenchen, @simlu, 4 reactions): fan-out workloads
saturate one profile's local model / API quota / browser pool while
other profiles sit idle. The existing global 'max_in_progress' caps
total workers but doesn't balance across profiles.
Behavior: when 'kanban.max_in_progress_per_profile' is set to a
positive int, the dispatcher tracks per-assignee running counts (one
query at tick start) and refuses to spawn for any assignee already at
the cap. Tasks blocked this way go to a new
'skipped_per_profile_capped' bucket on DispatchResult as
(task_id, assignee, current_running_count) tuples — NOT an
operator-actionable failure, just 'try again next tick when the
profile has capacity'.
Pre-existing 'running' tasks count against the cap (verified via
regression test). The cap respects dry_run mode by incrementing
its in-memory counter on each would-be spawn so dry_run reports
the same balanced subset that a real tick would.
Invalid cap values (0, negative, non-int, None) are treated as 'no
cap', preserving the existing behavior. Backward-compatible for
installs that don't set the config.
## Surfaces
- 'hermes kanban dispatch' CLI now prints 'Auto-assigned to
kanban.default_assignee=X: ...' and 'Deferred (X at per-profile cap,
N running): ...' lines, plus matching JSON keys in --json output.
- Gateway dispatcher logs the configured values at startup
('default_assignee=X', 'max_in_progress_per_profile=N').
- 'kanban.max_in_progress_per_profile' added to DEFAULT_CONFIG with
inline docs.
## Validation
- tests/hermes_cli/test_kanban_default_assignee.py (6 cases): no-cap
baseline, auto-assign + DB mutation, dry-run reports without
mutating, whitespace treated as None, explicit assignees untouched,
DispatchResult field schema.
- tests/hermes_cli/test_kanban_per_profile_cap.py (9 cases including
4 parametrized): no-cap baseline, balanced 2-profile fan-out,
pre-existing running counts against cap, invalid cap values
(0/-1/'abc'/None), capped tasks dispatched on next tick after
running task completes, DispatchResult field schema.
- Broader kanban suite: 464/464 pass (was 449 baseline; +15 new
regression tests across both features).
## Credit
#27145 — Jimmy Johansson reported the dispatcher skipped-unassigned
gap; @agarzon scoped the simpler 'honor kanban.default_assignee' fix
that matches the existing config knob.
#21582 — @edwardchenchen filed the per-profile cap ask after hitting
model 429s on fan-out research projects; @simlu confirmed the same
pain on local-model setups.
Repeated quarantines of an unchanged corrupt kanban.db used to amplify
disk usage by N: the gateway dispatcher's 5-minute retry loop, multi-
profile fleets sharing one DB, and manual reopen attempts each produced
a fresh '.corrupt.<timestamp>.bak' copy of the same bytes. After 10
retries on a 100KB DB you had 11x the disk footprint of duplicate
corrupt data.
Derive the backup filename from a sha256 of the main DB instead of a
timestamp + collision counter. Same bytes → same filename → skip the
copy on retries. Different bytes (partial repair, further damage) →
different filename → preserve separately. Sidecar (-wal/-shm) backups
inherit the same content-addressed name.
Inspired by @hanzckernel's PR #33529, simplified down to ~30 LOC: drop
the persistent JSON marker file, drop the atomic temp+fsync+rename
helper (shutil.copy2 is fine for a quarantine-only path), drop the
gateway-side WAL/SHM fingerprint extension (the existing
(path, mtime, size) tuple still gives the 5-minute retry semantics it
needs), and drop the gateway-side helper extraction. The backup file
existing IS the marker; no separate state needed.
Test: tests/hermes_cli/test_kanban_db.py::test_repeated_corrupt_open_reuses_single_backup
proves 10 retries on the same corrupt bytes produce 1 backup (was 11),
and mutating the corrupt bytes produces a second backup with a
different fingerprint.
Refs #33529
Co-authored-by: hanzckernel <zhicheng.han@mathematik.uni-goettingen.de>
`sqlite3.Connection.__exit__` commits/rollbacks but does NOT close the
underlying FD. `with kb.connect() as conn:` in long-lived processes
(gateway `run_slash`, dashboard `decompose_task_endpoint`) therefore
leaks one FD to `kanban.db` per call. After enough operations the
gateway dies with `[Errno 24] Too many open files` (~4 days uptime
in the production report — #33159).
Fix: add a `connect_closing()` context manager in `hermes_cli/kanban_db`
that wraps `connect()` with a real `try/finally: conn.close()`. Switch
the 42 leak-prone call sites in `hermes_cli/kanban.py` (35),
`hermes_cli/kanban_decompose.py` (4), and `hermes_cli/kanban_specify.py`
(3) over to it.
`kanban.py` matters because `run_slash` (called from the gateway for
every `/kanban` slash command) parses argparse and dispatches to those
`_cmd_*` functions in-process — each one was leaking one FD per
invocation.
Tests inside `tests/` are untouched: short-lived processes where OS
cleanup masks the leak. Regression tests added in
`test_kanban_db.py` cover both happy-path and exception-path closure,
plus an explicit assertion that bare `with kb.connect()` still does
NOT close (documenting the upstream sqlite3 behaviour we're working
around).
Closes#33159.
Reaper now runs at the top of every dispatcher tick regardless of per-board connect() failures. Previously the reaper sat inside dispatch_once after the kanban_db.connect() call — any EIO during connect would skip reaping for that tick, accumulating zombie workers and stale claim_lock rows.
Also: reap_worker_zombies now returns the list of reaped pids (the dispatcher logs them) and a test indentation fix.
Squashes three sibling commits from PR #32301 into one logical change for batch review.
Reads header bytes 28-31 after every COMMIT and compares against actual file size. Raises sqlite3.DatabaseError on torn-extend (actual_pages < page_count). Also sets PRAGMA wal_autocheckpoint=100 in connect().
Refs: #31208 (Bug E - same file, coordinate), #30973 (wal_autocheckpoint)
Refs: #30445, #30896, #30908 (corruption reports)
`detect_crashed_workers` calls `_pid_alive` on every `running` task whose
claim is held by this host. The check can transiently return False for a
freshly-spawned worker (fork → /proc-visibility lag, or reap-race
between SIGCHLD and parent reaping). When a second dispatcher ticks
inside that window it reclaims the task and spawns a duplicate worker.
Add `DEFAULT_CRASH_GRACE_SECONDS = 30` and an
`HERMES_KANBAN_CRASH_GRACE_SECONDS` env-var override.
`detect_crashed_workers` skips the liveness check when
`time.time() - started_at < grace`. The existing 15-minute claim TTL
still reclaims genuinely-crashed workers; grace only suppresses the
launch-window false positive.
`HERMES_KANBAN_CRASH_GRACE_SECONDS=0` is set on the `kanban_home`
fixture in `test_kanban_core_functionality.py` so existing tests that
assert immediate reclaim retain pre-fix semantics.
Companion to merged PR #23442 (`release_stale_claims`, closes#23025),
which addressed the same multi-dispatcher race in the stale-claim path.
Related: #20015 (`_pid_alive` false-negative behaviour),
When code inside a write_txn block raises an OperationalError that SQLite
has already auto-rolled-back (typical for disk I/O error,
database is locked, and database disk image is malformed), the
explicit ROLLBACK in write_txn.__exit__ itself raises
cannot rollback - no transaction is active and the secondary exception
replaces the original in the traceback. Operators see a misleading error
and lose the diagnostic information they need.
Swallow the rollback-time OperationalError so the caller always sees the
original cause.
Confirmed reproducer: tests/hermes_cli/test_kanban_db.py::
test_write_txn_preserves_original_exception_when_rollback_fails
Production corruption #6 left b-tree pages with zeroed headers but intact old cell content — the Bug E pattern. This fix applies three pragma calls on every connect():
- synchronous=FULL (was NORMAL): closes the WAL-checkpoint reordering window where a crash between WAL commit and main-DB write leaves a partially-written b-tree page header. Cost is <1ms per commit on local SSD; negligible at kanban write volume.
- secure_delete=ON: forces SQLite to zero freed page bytes on disk. If a torn write or hardware fault later corrupts a page, the underlying cell content is zero, so corruption is detectable and no stale rows can resurface as live data.
- cell_size_check=ON: adds a read-side guard so corrupt cells surface as errors at read time rather than as silent wrong-data returns.
All three are connection-scoped and re-applied on every connect(). secure_delete also writes a persistent flag into the DB header on the first call against a fresh DB, making the protection durable across processes for new DBs.
Tests added for all four required cases: each pragma active on a fresh connection, and all three re-applied after close+reopen. Also adds the required negative test (migration path does not reset pragmas).
Board defaults represent persistent project checkouts. Scratch workspaces
are auto-deleted on completion and must stay under the per-board scratch
root that resolve_workspace() creates. Inheriting default_workdir for a
scratch task pointed the cleanup path at the user's source tree — the
data-loss vector documented in #28818.
The containment guard in _cleanup_workspace (just added) is the safety
rail. This commit prevents the bad state from being created in the first
place: only persistent kinds (dir/worktree) inherit board defaults.
Tests updated to cover the new semantics: scratch with default_workdir
set keeps workspace_path=None; dir/worktree still inherits the board
default.
Salvaged from PR #31315 by @leeseoki0 — prevention layer on top of the
#28819 containment fix by @briandevans.
Co-authored-by: teknium1 <127238744+teknium1@users.noreply.github.com>
Copilot review on PR #28819 flagged that `_is_managed_scratch_path` accepted
the entire `<kanban_home>/kanban` subtree as managed scratch storage. With
that, a task whose `workspace_kind='scratch'` and `workspace_path` was
mis-set to `<kanban_home>/kanban`, `.../kanban/logs`, or a board's
metadata directory (e.g. `.../kanban/boards/<slug>` without the
`workspaces/` child) would pass the containment guard and let task
completion `shutil.rmtree` Hermes' own DB, metadata, and log subtrees.
Tighten the guard:
* Allowed roots are now exclusively `workspaces/` directories — the
`HERMES_KANBAN_WORKSPACES_ROOT` override, `<kanban_home>/kanban/workspaces`,
and each `<kanban_home>/kanban/boards/<slug>/workspaces` discovered on
disk.
* Require strict descendancy: a path equal to a root itself is rejected
too, because deleting a workspaces root would wipe every task's scratch
dir at once.
Add a regression test covering the three Copilot-named attack paths
(kanban root, kanban/logs, board root without `workspaces/`) plus the
workspaces-root-itself case, and confirm the inner task-id dir still
matches.
A board's ``default_workdir`` (e.g. ``hermes kanban boards
set-default-workdir my-board /path/to/real/source``) is copied into
``tasks.workspace_path`` for tasks created without an explicit
``workspace_kind``. Those tasks default to ``workspace_kind='scratch'``,
so completion calls ``_cleanup_workspace`` and unconditionally runs
``shutil.rmtree(wp, ignore_errors=True)`` — deleting the user's real
source tree as if it were disposable scratch storage.
Add ``_is_managed_scratch_path()`` and gate ``_cleanup_workspace`` on
it: only delete paths under ``HERMES_KANBAN_WORKSPACES_ROOT`` (the
worker-side override the dispatcher injects) or under the active kanban
home's ``kanban/`` subtree (covering both the legacy default-board root
and per-board ``kanban/boards/<slug>/workspaces`` roots). Anything else
gets a warning log and is left alone, so a misconfigured
``default_workdir`` can no longer destroy user data on task completion.
Adds `hermes kanban promote <task_id>` for manual lifecycle recovery
when an auto-promote daemon misses the parent-done transition (issue
#28822). Refuses promotion unless every parent dep is done/archived
(override with --force). Emits a `promoted_manual` audit event distinct
from the automatic `promoted` kind, so audit consumers can filter
human-driven from system-driven promotions. Supports --dry-run and
--json for orchestration. Does not mutate assignee/claim state — the
dispatcher picks the card up via its normal ready polling path.
Closes#28822.
First scratch workspace creation on an install now emits a one-shot
warning log + a 'tip_scratch_workspace' event on the task. Sentinel
file at ~/.hermes/kanban/.scratch_tip_shown silences subsequent
creations across the whole install.
Behavior unchanged — scratch is still ephemeral by design. This just
makes the design visible to new users (reported in user community:
'progress files vanished, no warning anywhere').
Docs (en + ko) updated to spell out 'Deleted when the task completes'
on the scratch bullet and 'Preserved on completion' on worktree/dir.
Path.resolve() before any I/O and confine backup writes to the resolved
parent directory. Adds explicit parent-equality assertions so static
analyzers see the containment guarantee, and walks WAL/SHM sidecars
through the same resolved-parent path so accidental .. segments are
collapsed before shutil.copy2.
Functionally equivalent to the original PR; preserves the corrupt bytes
to <db>.corrupt.<ts>.bak in the same directory, still raises
KanbanDbCorruptError from connect(). E2E with Stefan's exact hex header
+ malformed pages still passes. 163/163 kanban tests still pass.
When a worker calls ``kanban_block(reason="review-required: ...")`` to
hand a task off for human review, the dispatcher's ``recompute_ready``
was treating the resulting ``blocked`` status as eligible for
auto-promotion — exactly the same as a circuit-breaker block. On the
next tick the task flipped back to ``ready``, a fresh worker spawned,
found nothing to do (work already applied, review-required comment
already posted), exited cleanly, got recorded as ``protocol_violation``
→ ``gave_up`` → ``blocked``, and the dispatcher promoted again.
Infinite loop until manual ``hermes kanban reclaim`` + ``kanban block``.
Add ``_has_sticky_block`` which distinguishes the two block sources
using the cheapest available signal: the most recent
``"blocked"``/``"unblocked"`` event in ``task_events``.
* Worker / operator ``kanban_block`` emits ``"blocked"`` →
``_has_sticky_block`` returns True → ``recompute_ready`` skips the
task entirely. ``unblock_task`` emits ``"unblocked"`` which flips
the predicate back, so the only legitimate exit is the documented
human-in-the-loop path.
* Circuit-breaker ``_record_task_failure`` emits ``"gave_up"`` (not
``"blocked"``) → predicate stays False → original
parent-completion-recovery semantics from #40c1decb3 are preserved.
* Tasks blocked purely by direct DB manipulation also recover, since
they have no ``"blocked"`` event row at all — matches the existing
``test_recompute_ready_promotes_blocked_with_done_parents`` fixture
behaviour.
Extends the previous commit to cover the remaining additive-column index
that sits on the same migration trap:
- ``task_events.run_id`` -> ``idx_events_run`` was still in SCHEMA_SQL.
A legacy ``task_events`` table predating #17805 (no ``run_id``) would
still abort ``executescript`` before ``_migrate_add_optional_columns``
could add the column. Hoisted out of SCHEMA_SQL and made unconditional
in the migration alongside the other three indexes.
- Removed the now-redundant ``CREATE INDEX idx_tasks_idempotency`` that
was nested inside the ``if "idempotency_key" not in cols`` branch.
The unconditional create lower in the function makes it idempotent
on both fresh and legacy DBs.
- Strengthened the regression test to cover all four indexes
(``idx_tasks_session_id``, ``idx_tasks_tenant``, ``idx_tasks_idempotency``,
``idx_events_run``) and to seed a pre-#17805 ``task_events`` shape that
exercises the ``run_id`` migration path.
The result: every ``CREATE INDEX`` that depends on an additive column now
runs after the migration ensures the column exists. Verified against a
realistic pre-#16081 board fixture (tasks + task_events both legacy
shape) — origin/main reproduces ``no such column: session_id``; this
branch migrates cleanly and creates all four indexes.
Follow-up to #28455. The respawn guard's blocker_auth rule (last error
matched a quota/auth/429 pattern) was auto-blocking the task on first
occurrence. That's too aggressive: transient rate limits typically
clear in seconds to minutes, but the auto-block puts the task in
'blocked' status which requires manual unblock.
Now treats blocker_auth the same as recent_success and active_pr:
defer the spawn this tick, leave the task in 'ready', let the next
tick try again. If the auth error genuinely persists, the existing
consecutive_failures counter trips the auto-block circuit breaker
after failure_limit failures via the normal path — so a persistent
401/403/quota-exhausted still ends up blocked, just not on first hit.
Also documents the respawn_guarded event in kanban.md's events table
with the three guard reasons.
Updated test_dispatch_respawn_guard_auto_blocks_auth_error → renamed
to test_dispatch_respawn_guard_defers_auth_error_without_auto_block;
asserts task stays in 'ready' and the guard reason is recorded.
Follow-up to #28452. detect_stale_running() was calling
_record_task_failure() on every reclaim, which ticked the
consecutive_failures counter. With the default failure_limit=2,
two legitimately long-running tasks (>4 h without explicit
heartbeat) would auto-block via the spawn-failure circuit
breaker — even though no worker actually failed.
Stale reclaim is dispatcher-side absence-of-heartbeat detection,
not a worker fault. Removed the _record_task_failure() call;
the 'stale' event in task_events is still the audit surface,
but the failure counter is now reserved for spawn_failed /
timed_out / crashed (real failures).
Also documents the heartbeat requirement:
- KANBAN_GUIDANCE in agent/prompt_builder.py now states the
rule ('call kanban_heartbeat at least once an hour for tasks
running longer than 1 hour') so workers learn the contract.
- kanban.md adds the stale event row to the events table and
flags the heartbeat requirement in the worker lifecycle list.
New regression test: test_detect_stale_does_not_tick_failure_counter
locks in the new behaviour.
Salvages #28125 by @Jpalmer95. Adds:
- Drag-to-delete trash zone in the kanban dashboard
- Bulk delete endpoint with cascading delete_task cleanup
- Frontend updates (drag visual + drop handler)
- Confirmation prompt before delete
Resolved end-of-file test conflict by appending both halves.
Salvages #24533 by @roycepersonalassistant. Adds a first-class
'scheduled' Kanban status for time-delay follow-ups that aren't
waiting on human input.
- hermes kanban schedule <task_id> [reason] CLI command
- Dashboard/API transitions to/from Scheduled
- unblock_task() now releases both 'blocked' AND 'scheduled' tasks
(re-checking parent dependencies before moving to ready/todo)
- i18n + docs updates
Resolved conflicts: kept HEAD's failure-counter reset on unblock
alongside the PR's scheduled state, kept HEAD's 'running' direct-set
rejection, combined both bulk-status branches. Dropped the dist/
bundle changes (months-stale; would need rebuild from source).
Salvages #26496 by @aqilaziz. Adds branch_name column + CLI flag so
tasks with workspace_kind='worktree' can pin a target branch on
create. Schema migration added to _migrate_add_optional_columns.
- Task.branch_name field + DB column + migration
- create_task accepts branch_name kwarg
- hermes kanban create --branch <name> flag
- kanban show output includes 'Branch: <name>' when set
Cherry-picked the substantive commit (a7558cf27); the PR's tip was
an unrelated service-path-dirs commit. Resolved 2 INSERT-column-list
and show-output conflicts alongside main's session_id and
max_runtime_seconds additions; kept all three.
Salvages #27484 by @fardoche6. Adds a respawn guard that skips worker
spawn for tasks where:
- a recent run already succeeded (recent_success — within guard window)
- the previous run hit a quota/auth error (blocker_auth, also auto-blocks)
- a recent task comment includes a GitHub PR URL (active_pr)
The guard prevents repeat worker storms on the same bug/task. Includes
the contributor's review-findings fixup (regex hardening, observability,
auth coverage).
Resolved a small DispatchResult conflict alongside main's 'stale' field;
kept both. Authorship preserved via rebase merge.
Salvages #26745 by @nehaaprasaad. Exposes filtering for the existing
workflow_template_id and current_step_key columns:
- list_tasks() accepts workflow_template_id and current_step_key kwargs
- 'hermes kanban list' adds matching CLI flags
- dashboard plugin_api also exposes the filters
Resolved a small conflict in list_tasks signature alongside main's
session_id and order_by additions; combined all three into the single
filter list.
Salvages #23790 by @thewillhuang. Adds detect_stale_running() to
the dispatcher cycle. Running tasks that have been started for longer
than dispatch_stale_timeout_seconds (default 14400 = 4h) without a
heartbeat in the last hour are auto-reclaimed to ready.
- New config kanban.dispatch_stale_timeout_seconds (default 14400, 0 disables)
- New 'stale' field on DispatchResult
- detect_stale_running() in kanban_db.py with heartbeat freshness check
- Records outcome='stale' on run close + 'stale' event; ticks failure counter
- Wires config through gateway embedded dispatcher
- Updates _cmd_dispatch verbose/JSON output and daemon logging
Resolved test-file end-of-file conflict by appending both halves.
Salvages #23772 by @thewillhuang. Adds 'review' as a valid kanban task
status and extends dispatch_once to monitor the review column as a
second dispatch source (in addition to the existing ready column).
- Adds 'review' to VALID_STATUSES
- Adds claim_review_task() — atomically transitions review → running
- Adds has_spawnable_review() — health telemetry mirror
- Extends dispatch_once with a review column dispatch loop
- Review agents get 'sdlc-review' skill auto-loaded
Resolved 2 conflicts (VALID_STATUSES merge with main's 'scheduled' state,
test file additions). Adapted claim_review_task to main's
ttl_seconds: Optional[int] = None convention (matches claim_task).
Salvages #23208 by @awizemann. Tracks which chat session created a
kanban task so clients can render a per-session board without falling
back to tenant + time-window heuristics.
- Schema: tasks gains nullable session_id TEXT column with index
(additive migration in _migrate_add_optional_columns).
- ACP: server.py exposes the originating session id via HERMES_SESSION_ID
with save/restore around the agent loop.
- Tool: kanban_create reads HERMES_SESSION_ID (with explicit override).
- CLI: 'hermes kanban list --session <id>' filter; JSON output exposes
session_id.
Salvages #25745 by @LizerAIDev. Adds --sort {created,created-desc,
priority,priority-desc,status,assignee,title,updated} to 'hermes kanban
list'. Validated against VALID_SORT_ORDERS map; invalid values raise
ValueError. Default behaviour (priority DESC, created ASC) is unchanged
when --sort is omitted.
Salvages #22981 by @SimbaKingjoe. Adds 'kanban.max_in_progress' config
that caps simultaneously running tasks. When the board already has N
running, dispatcher skips spawning so slow workers (local LLMs,
resource-constrained hosts) don't pile up and time out.
Threads through dispatch_once(max_in_progress=) and gateway dispatcher
config parsing with validation (warns on invalid/below-1 values).
Salvages #27369 by @LeonJS. complete_task() now calls _cleanup_workspace()
and _cleanup_worker_tmux() after marking a task complete.
Scratch workspaces (used by swarm agents) accumulate on disk — hundreds
of MB per task, never released. Stale tmux sessions from completed
agents also persist indefinitely.
Both gates are safe:
- workspace_kind == 'scratch' gate preserves user worktree/dir workspaces
- tmux #{pane_dead} == 1 gate only kills sessions where the worker has
already exited
- best-effort: cleanup failures never block task completion
Salvages #27526 by @shunsuke-hikiyama. Adds an --initial-status flag
(running|blocked, default running) to 'kanban create', threaded through
kanban_db.create_task() and the kanban_create tool schema. 'blocked'
parks the task directly in the blocked column for R3 human-ops review,
skipping the brief running-to-blocked transition.
Dropped the unrelated 'add' alias, WIFEXITED Windows compat, and
slash-handler error formatting changes that were bundled in the
original PR — those should ship as their own focused changes if still
wanted.
Salvages #27372 by @oemtalks. The dispatcher unconditionally injected
`--skills kanban-worker` into every worker spawn, but worker profiles
sometimes don't have that bundled skill in their skills dir, which is
fatal at CLI startup (`ValueError: Unknown skill(s): kanban-worker`).
Adds `_kanban_worker_skill_available(hermes_home)` and only injects the
flag when the skill resolves. The MANDATORY lifecycle still ships via
KANBAN_GUIDANCE in the system prompt, so omitting the flag is safe.
Salvages #28301 by @Ade5954. If WAL setup, PRAGMA application, or schema
init raises after sqlite3.connect() succeeds, the new connection was
leaking. Wrap the body in try/except so the connection is closed before
the exception propagates.