fix(kanban): don't permanently block tasks that hit a provider rate limit (#38223)

A kanban worker that exhausted its retries purely on a provider rate
limit / quota wall (e.g. opencode-go's 5-hour window) exited with code 1.
The dispatcher counted that as a crash, and with DEFAULT_FAILURE_LIMIT=2
two quota-wall hits permanently blocked the card. Fanning out many
workers against one shared quota made this routine.

Now a rate-limited worker exits with EX_TEMPFAIL (75); the dispatcher
classifies that as a 'rate_limited' exit, releases the task back to
'ready' WITHOUT incrementing consecutive_failures (the breaker can't trip
on a transient throttle), and the respawn guard defers the next attempt
on a cooldown (default 5min, HERMES_KANBAN_RATE_LIMIT_COOLDOWN_SECONDS)
until the quota window clears. Genuine crashes still count and trip the
breaker as before. The 120s Retry-After cap is unchanged — no worker
parks for hours holding a slot.

- conversation_loop.py: surface failure_reason in the exhaustion return
- cli.py: kanban worker picks exit 75 on rate_limit/billing failure
- kanban_db.py: rate_limited exit kind, no-count requeue, cooldown guard
This commit is contained in:
Teknium
2026-06-03 06:19:32 -07:00
committed by GitHub
parent 60b6352fe5
commit 4c544b633d
4 changed files with 408 additions and 16 deletions

View File

@ -3378,6 +3378,12 @@ def run_conversation(
"completed": False,
"failed": True,
"error": _final_summary,
# Surface the classified reason so callers (notably the
# kanban worker path in cli.py) can distinguish a
# transient throttle from a real failure and choose a
# different exit code. ``rate_limit`` / ``billing`` here
# mean "quota wall, not a task error".
"failure_reason": classified.reason.value,
}
# For rate limits, respect the Retry-After header if present