fix(kanban): don't permanently block tasks that hit a provider rate limit (#38223)
A kanban worker that exhausted its retries purely on a provider rate limit / quota wall (e.g. opencode-go's 5-hour window) exited with code 1. The dispatcher counted that as a crash, and with DEFAULT_FAILURE_LIMIT=2 two quota-wall hits permanently blocked the card. Fanning out many workers against one shared quota made this routine. Now a rate-limited worker exits with EX_TEMPFAIL (75); the dispatcher classifies that as a 'rate_limited' exit, releases the task back to 'ready' WITHOUT incrementing consecutive_failures (the breaker can't trip on a transient throttle), and the respawn guard defers the next attempt on a cooldown (default 5min, HERMES_KANBAN_RATE_LIMIT_COOLDOWN_SECONDS) until the quota window clears. Genuine crashes still count and trip the breaker as before. The 120s Retry-After cap is unchanged — no worker parks for hours holding a slot. - conversation_loop.py: surface failure_reason in the exhaustion return - cli.py: kanban worker picks exit 75 on rate_limit/billing failure - kanban_db.py: rate_limited exit kind, no-count requeue, cooldown guard
This commit is contained in:
@ -3378,6 +3378,12 @@ def run_conversation(
|
||||
"completed": False,
|
||||
"failed": True,
|
||||
"error": _final_summary,
|
||||
# Surface the classified reason so callers (notably the
|
||||
# kanban worker path in cli.py) can distinguish a
|
||||
# transient throttle from a real failure and choose a
|
||||
# different exit code. ``rate_limit`` / ``billing`` here
|
||||
# mean "quota wall, not a task error".
|
||||
"failure_reason": classified.reason.value,
|
||||
}
|
||||
|
||||
# For rate limits, respect the Retry-After header if present
|
||||
|
||||
Reference in New Issue
Block a user