When `docker run -d` fails after Docker has already created the container object (e.g. exit 125 when the daemon isn't ready, or a timeout mid image pull), the code raised before `self._container_id` was set, so the container leaked permanently in "Created" state. Reported in #7439: 110+ orphaned containers accumulated over 3 days from hourly cron-scheduled gateway sessions hitting a Docker Desktop startup race.

The orphan reaper added in #33645 (`reap_orphan_containers`) does NOT cover this case: it filters `status=exited`, but a failed-create container is in `Created` state, so it slips through and is never reaped.

Wrap the `docker run -d` call in try/except and `docker rm -f` the container by its known name before re-raising.

Salvages #7440 by @Tranquil-Flow. Their branch predated the cross-process reuse + labels rework on `main`, so a cherry-pick conflicted; reconstructed the same intent (plus their two regression tests, adapted to mock the new reuse `docker ps` probe) against current `main`.

Verified adversarially: reverted just the product change to origin/main's `docker.py`, ran the two new tests -> both FAIL with `assert 0 == 1 ("docker rm should be called once")`. With the fix applied, both pass; full `test_docker_environment.py` is 65/65 green.

Closes #7440.
Fixes #7439.

Co-authored-by: Evi Nova <66773372+Tranquil-Flow@users.noreply.github.com>
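The reaper gap described in the message can be illustrated with a toy model. This is only a sketch: `ContainerRecord` and `select_reapable` are hypothetical names invented here, and the real `reap_orphan_containers` may be structured quite differently. The only thing taken from the message is that the reaper selects on `status=exited`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContainerRecord:
    """Hypothetical stand-in for one row of `docker ps -a` output."""
    name: str
    status: str  # e.g. "created", "running", "exited"


def select_reapable(containers, statuses=("exited",)):
    """Return the names a status-filtered reaper would pick up (toy model)."""
    wanted = set(statuses)
    return [c.name for c in containers if c.status in wanted]


containers = [
    ContainerRecord("session-a", "exited"),   # ran and finished
    ContainerRecord("session-b", "created"),  # `docker run` failed before start
    ContainerRecord("session-c", "running"),  # still in use
]

# With an exited-only filter, the failed-create container is never selected.
reaped = select_reapable(containers)
leaked = [c.name for c in containers
          if c.status == "created" and c.name not in reaped]
```

Under this model the container stuck in "created" is never selected, which is why the fix removes it at the point of failure instead of relying on the reaper.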
@@ -854,13 +854,31 @@ class DockerEnvironment(BaseEnvironment):
             "sleep", "infinity",  # no fixed lifetime — idle reaper handles cleanup
         ]
         logger.debug(f"Starting container: {' '.join(run_cmd)}")
-        result = subprocess.run(
-            run_cmd,
-            capture_output=True,
-            text=True,
-            timeout=120,  # image pull may take a while
-            check=True,
-        )
+        try:
+            result = subprocess.run(
+                run_cmd,
+                capture_output=True,
+                text=True,
+                timeout=120,  # image pull may take a while
+                check=True,
+            )
+        except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as e:
+            # Docker may create the container object before `docker run`
+            # fails to start it (e.g. exit code 125 when the daemon isn't
+            # ready, or a timeout mid-pull). That orphan is left in
+            # "Created" state — which the exited-only orphan reaper
+            # (reap_orphan_containers, status=exited) never catches, so it
+            # leaks permanently. Remove it by its known name before
+            # re-raising. See #7439.
+            logger.warning(
+                "docker run failed for %s, cleaning up orphaned container: %s",
+                container_name, e,
+            )
+            subprocess.run(
+                [self._docker_exe, "rm", "-f", container_name],
+                capture_output=True, timeout=10,
+            )
+            raise
         self._container_id = result.stdout.strip()
         logger.info(f"Started container {container_name} ({self._container_id[:12]})")
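The cleanup-then-re-raise pattern in this hunk can be exercised without a Docker daemon. The following is a minimal sketch, not the project's actual test: `start_container` and `FailingRunner` are hypothetical names, and the injectable `run` parameter is an assumption made here so that `subprocess.run` can be swapped for a fake. The real regression tests mock the subprocess calls inside `DockerEnvironment` instead.

```python
import subprocess


def start_container(run_cmd, container_name, docker_exe="docker",
                    run=subprocess.run):
    """Start a detached container; remove it by name if `docker run` fails.

    `run` is injectable (an assumption of this sketch) so the behaviour
    can be checked without Docker installed.
    """
    try:
        result = run(run_cmd, capture_output=True, text=True,
                     timeout=120, check=True)
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        # Best-effort cleanup: no check=True, so a non-zero exit from
        # `docker rm` does not raise here.
        run([docker_exe, "rm", "-f", container_name],
            capture_output=True, timeout=10)
        raise
    return result.stdout.strip()


class FailingRunner:
    """Fake `subprocess.run`: fails `docker run`, records every call."""

    def __init__(self):
        self.calls = []

    def __call__(self, cmd, **kwargs):
        self.calls.append(list(cmd))
        if cmd[1] == "run":
            raise subprocess.CalledProcessError(125, cmd, stderr="daemon not ready")
        return subprocess.CompletedProcess(cmd, 0, stdout="", stderr="")


runner = FailingRunner()
cmd = ["docker", "run", "-d", "--name", "demo", "alpine", "sleep", "infinity"]
try:
    start_container(cmd, "demo", run=runner)
    raised = False
except subprocess.CalledProcessError as exc:
    raised = exc.returncode == 125

# Exactly one cleanup call is expected, addressed by the known name.
rm_calls = [c for c in runner.calls if c[1:3] == ["rm", "-f"]]
```

One property worth keeping in mind, in this sketch and likely in the hunk as well: if the `rm` call itself raises (for example its own `TimeoutExpired`), that exception propagates instead of the original `docker run` error.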