fix(dashboard-auth): don't abort verify chain on one provider's ProviderError

The gated dashboard verifies a session cookie by trying each registered
DashboardAuthProvider's verify_session in turn (the session cookie stores
only the access token, not which provider issued it). A provider that
doesn't recognise a token returns None; a provider whose IDP/JWKS is
unreachable raises ProviderError.
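
A minimal sketch of that three-outcome contract, for illustration only.
Session, its fields, StaticProvider and UnreachableProvider are
hypothetical stand-ins rather than the project's classes; only the
verify_session(access_token=...) call shape, the name attribute and
ProviderError come from the change itself.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


class ProviderError(Exception):
    """Stand-in: the provider's IDP/JWKS could not be reached."""


@dataclass
class Session:
    """Stand-in for the real session object; these fields are assumed."""
    provider: str
    subject: str


class DashboardAuthProvider(Protocol):
    name: str

    def verify_session(self, *, access_token: str) -> Optional[Session]:
        """Return a Session, None if the token is not ours, or raise ProviderError."""
        ...


class StaticProvider:
    """Toy provider that recognises exactly one token."""

    def __init__(self, name: str, token: str) -> None:
        self.name = name
        self._token = token

    def verify_session(self, *, access_token: str) -> Optional[Session]:
        if access_token != self._token:
            return None  # not recognised: let the next provider try
        return Session(provider=self.name, subject="demo-user")


class UnreachableProvider:
    """Toy provider whose IDP is always down."""

    name = "unreachable"

    def verify_session(self, *, access_token: str) -> Optional[Session]:
        # Can neither confirm nor deny the token, so it raises instead of
        # returning None.
        raise ProviderError(f"{self.name}: JWKS endpoint unreachable")
```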

The loop used to return HTTP 503 on the FIRST ProviderError, before any
later provider got a turn. With multiple providers stacked, an unreachable
IDP belonging to a provider the session never used blocks login through a
different, reachable provider.

Concrete repro: a self-hosted-OIDC session hits the 'nous' provider first
(registered earlier); nous tries to reach Nous Portal's JWKS, which is
unreachable in a self-hosted deployment, so it raises — and the gate
503s before the 'self-hosted' provider can verify the token. Hit live
while testing the new self-hosted OIDC plugin against a local Keycloak.

Fix: a ProviderError from one provider is logged and the loop continues
to the next. A 503 is returned only if NO provider verified the token
AND at least one was unreachable — distinguishing a transient IDP outage
(don't force a needless re-login) from a token that's genuinely invalid
(fall through to refresh/relogin). Single-provider behaviour is
unchanged.
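
The decision logic described above can be sketched roughly as follows.
This is a simplified model, not the middleware itself: verify_chain and
gate_status are hypothetical names, the audit logging is reduced to a
log line, and 401 stands in for the refresh/relogin path.

```python
import logging
from typing import Iterable, Optional, Tuple

log = logging.getLogger("dashboard_auth_sketch")  # hypothetical logger name


class ProviderError(Exception):
    """Stand-in for the real exception type."""


def verify_chain(
    providers: Iterable, access_token: str
) -> Tuple[Optional[object], Optional[str]]:
    """Return (session, name of the first unreachable provider, if any)."""
    session = None
    unreachable = None
    for provider in providers:
        try:
            session = provider.verify_session(access_token=access_token)
        except ProviderError as exc:
            # Log and remember the outage, but let later providers try.
            log.warning("provider %r unreachable: %s", provider.name, exc)
            if unreachable is None:
                unreachable = provider.name
            continue
        if session is not None:
            break  # first provider that returns a session wins
    return session, unreachable


def gate_status(session: Optional[object], unreachable: Optional[str]) -> int:
    """Map the chain result onto the three outcomes named in the message."""
    if session is not None:
        return 200
    if unreachable is not None:
        return 503  # transient outage: do not force a re-login
    return 401  # token genuinely invalid: refresh/relogin path
```

With a single provider this collapses to the old behaviour: an
unreachable sole provider still yields 503, and an unrecognised token
still falls through.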

Tests: adds an _UnreachableProvider stub and three cases — unreachable
provider first must not block a working second; all-unreachable still
503s; reachable-but-unrecognised falls through to 401/relogin (not 503).
Mutation-tested: reverting the fix makes the first case fail with the
exact 503 bug.
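
For illustration, the three cases and the mutation check can be modelled
outside the test suite like this. It is a sketch under assumptions:
_Stub, gate_before and gate_after are hypothetical reductions of the
middleware to a status code, not the stubs or tests added by this
commit.

```python
class ProviderError(Exception):
    """Stand-in for the real exception type."""


class _Stub:
    """Hypothetical provider stub: down, recognising, or not recognising."""

    def __init__(self, name, result=None, down=False):
        self.name = name
        self.result = result
        self.down = down

    def verify_session(self, *, access_token):
        if self.down:
            raise ProviderError(f"{self.name} unreachable")
        return self.result


def gate_before(providers, token):
    """Pre-fix behaviour: the first ProviderError aborts the chain."""
    for provider in providers:
        try:
            session = provider.verify_session(access_token=token)
        except ProviderError:
            return 503
        if session is not None:
            return 200
    return 401


def gate_after(providers, token):
    """Post-fix behaviour: 503 only if nobody verified and someone was down."""
    any_unreachable = False
    for provider in providers:
        try:
            session = provider.verify_session(access_token=token)
        except ProviderError:
            any_unreachable = True
            continue
        if session is not None:
            return 200
    return 503 if any_unreachable else 401


down = _Stub("nous", down=True)
also_down = _Stub("other-idp", down=True)
working = _Stub("self-hosted", result={"sub": "demo-user"})
stranger = _Stub("unrelated")  # reachable, but does not recognise the token

CASES = [
    ([down, working], 200),    # unreachable first must not block a working second
    ([down, also_down], 503),  # all unreachable still 503s
    ([stranger], 401),         # reachable but unrecognised falls through
]
```

Running every entry of CASES through gate_after yields the expected
status, while gate_before returns 503 for the first case, which is the
failure the mutation check relies on.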
Ben
2026-06-04 17:02:43 +10:00
committed by Teknium
parent f57ce341dc
commit 616c0a36b6
2 changed files with 151 additions and 4 deletions


@@ -207,6 +207,22 @@ async def gated_auth_middleware(
     # good refresh token — defeating the whole transparent-refresh feature.
     session = None
     if at:
+        # Try every registered provider's verify_session in turn. A provider
+        # that doesn't recognise the token returns None and we move on; the
+        # first provider that returns a Session wins.
+        #
+        # A provider may instead raise ProviderError (its IDP/JWKS is
+        # unreachable, so it can neither confirm nor deny the token). With
+        # multiple providers stacked, that MUST NOT abort the chain — the
+        # token may belong to a *different*, reachable provider. (Concretely:
+        # a self-hosted-OIDC session hits the `nous` provider first, which
+        # tries to reach Nous Portal's JWKS; if that's unreachable it raises,
+        # but the `self-hosted` provider can still verify the token.) So we
+        # remember the unreachable error and keep going. Only if NO provider
+        # verifies the token AND at least one was unreachable do we surface a
+        # 503 — distinguishing "transient IDP outage" (don't force re-login)
+        # from "token genuinely invalid" (fall through to refresh/relogin).
+        unreachable_provider: str | None = None
         for provider in list_providers():
             try:
                 session = provider.verify_session(access_token=at)
@@ -221,12 +237,19 @@ async def gated_auth_middleware(
                     reason="provider_unreachable",
                     ip=_client_ip(request),
                 )
-                return JSONResponse(
-                    {"detail": f"Auth provider {provider.name!r} unreachable"},
-                    status_code=503,
-                )
+                if unreachable_provider is None:
+                    unreachable_provider = provider.name
+                continue
             if session is not None:
                 break
+        if session is None and unreachable_provider is not None:
+            # No provider could verify the token and at least one couldn't be
+            # reached — treat as a transient outage rather than forcing a
+            # re-login through a (possibly also-unreachable) refresh.
+            return JSONResponse(
+                {"detail": f"Auth provider {unreachable_provider!r} unreachable"},
+                status_code=503,
+            )
     if session is None:
         # Access token is expired/invalid. Before forcing re-login, try to