fix(docker): clean up orphaned container when docker run fails (salvage #7440) (#39412)

When `docker run -d` fails after Docker has already created the container
object (e.g. exit 125 when the daemon isn't ready, or a timeout mid image
pull), the code raised before `self._container_id` was set — so the
container leaked permanently in "Created" state. Reported in #7439:
110+ orphaned containers accumulated over 3 days from hourly cron-
scheduled gateway sessions hitting a Docker Desktop startup race.

The orphan reaper added in #33645 (reap_orphan_containers) does NOT cover
this case: it filters `status=exited`, but a failed-create container is in
`Created` state, so it slips through and is never reaped.
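
To make the gap concrete, here is a minimal sketch of why a `status=exited` filter misses these containers. It is not the real `reap_orphan_containers`, whose implementation is not shown here. The helper names and the sample container names are hypothetical; `docker ps -a --filter status=<state>` and `--format` are standard Docker CLI options.

```python
from typing import Dict, List


def ps_filter_cmd(docker_exe: str, status: str) -> List[str]:
    """Build a `docker ps` argv that lists names of containers in one state."""
    return [
        docker_exe, "ps", "-a",
        "--filter", f"status={status}",
        "--format", "{{.Names}}",
    ]


def names_matching(containers: List[Dict[str, str]], status: str) -> List[str]:
    """Pure-Python stand-in for what a single status filter would return."""
    return [c["name"] for c in containers if c["state"] == status]


# Hypothetical snapshot of containers on a host.
containers = [
    {"name": "sandbox-a", "state": "exited"},
    {"name": "sandbox-b", "state": "created"},  # left by a failed `docker run -d`
    {"name": "sandbox-c", "state": "running"},
]

seen_by_exited_filter = names_matching(containers, "exited")
missed_created = names_matching(containers, "created")
```

In this snapshot the exited-only filter sees `sandbox-a` but never `sandbox-b`, which is the state a failed `docker run -d` leaves behind.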

Wrap the `docker run -d` call in try/except and `docker rm -f` the
container by its known name before re-raising.

Salvages #7440 by @Tranquil-Flow. Their branch predated the cross-process
reuse + labels rework on `main`, so a cherry-pick conflicted; reconstructed
the same intent (plus their two regression tests, adapted to mock the new
reuse `docker ps` probe) against current `main`.

Verified adversarially: reverted just the product change to origin/main's
`docker.py`, ran the two new tests -> both FAIL with
`assert 0 == 1 ("docker rm should be called once")`. With the fix applied,
both pass; full test_docker_environment.py is 65/65 green.
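
The regression tests themselves are not shown in this view. As a rough sketch of the idea only, the check could look like the following. `start_container` is a hypothetical, simplified stand-in for the patched start logic (it omits the reuse `docker ps` probe the real tests have to mock), and the container and image names are made up.

```python
import subprocess
from unittest import mock


def start_container(docker_exe, container_name, run_cmd):
    """Hypothetical stand-in: run `docker run -d`, remove the container on failure."""
    try:
        result = subprocess.run(
            run_cmd, capture_output=True, text=True, timeout=120, check=True,
        )
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        # Remove the possibly created container by name, then re-raise.
        subprocess.run(
            [docker_exe, "rm", "-f", container_name],
            capture_output=True, timeout=10,
        )
        raise
    return result.stdout.strip()


def run_failure_scenario():
    """Simulate `docker run` exiting 125; return (raised, recorded rm calls)."""
    calls = []

    def fake_run(cmd, **kwargs):
        calls.append(cmd)
        if cmd[1] == "run":
            raise subprocess.CalledProcessError(125, cmd, stderr="daemon not ready")
        return subprocess.CompletedProcess(cmd, 0, stdout="", stderr="")

    raised = False
    with mock.patch("subprocess.run", side_effect=fake_run):
        try:
            start_container(
                "docker",
                "sandbox-test",
                ["docker", "run", "-d", "--name", "sandbox-test", "example-image"],
            )
        except subprocess.CalledProcessError:
            raised = True

    rm_calls = [c for c in calls if c[:3] == ["docker", "rm", "-f"]]
    return raised, rm_calls
```

A test built on this would assert that the original error still propagates and that exactly one `docker rm -f` call was recorded for the container name. Without the `except` block the recorded list is empty, which corresponds to the `assert 0 == 1` failure quoted above.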

Closes #7440. Fixes #7439.

Co-authored-by: Evi Nova <66773372+Tranquil-Flow@users.noreply.github.com>
Ben Barclay
2026-06-05 10:19:08 +10:00
committed by GitHub
parent 4690bbc363
commit 82c157b267
2 changed files with 103 additions and 7 deletions

@@ -854,13 +854,31 @@ class DockerEnvironment(BaseEnvironment):
             "sleep", "infinity", # no fixed lifetime — idle reaper handles cleanup
         ]
         logger.debug(f"Starting container: {' '.join(run_cmd)}")
-        result = subprocess.run(
-            run_cmd,
-            capture_output=True,
-            text=True,
-            timeout=120, # image pull may take a while
-            check=True,
-        )
+        try:
+            result = subprocess.run(
+                run_cmd,
+                capture_output=True,
+                text=True,
+                timeout=120, # image pull may take a while
+                check=True,
+            )
+        except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as e:
+            # Docker may create the container object before `docker run`
+            # fails to start it (e.g. exit code 125 when the daemon isn't
+            # ready, or a timeout mid-pull). That orphan is left in
+            # "Created" state — which the exited-only orphan reaper
+            # (reap_orphan_containers, status=exited) never catches, so it
+            # leaks permanently. Remove it by its known name before
+            # re-raising. See #7439.
+            logger.warning(
+                "docker run failed for %s, cleaning up orphaned container: %s",
+                container_name, e,
+            )
+            subprocess.run(
+                [self._docker_exe, "rm", "-f", container_name],
+                capture_output=True, timeout=10,
+            )
+            raise
         self._container_id = result.stdout.strip()
         logger.info(f"Started container {container_name} ({self._container_id[:12]})")