Container Zombie Buildup & Runtime Wedge Runbook
A phased recovery runbook for container zombie buildup, unhealthy containers, and Docker/containerd runtime wedges on long-running Linux hosts. Covers rapid triage, graceful and forceful recovery, runtime reconciliation, clean recreation, root cause analysis, and permanent hardening with init processes and direct-exec healthchecks.
This entry documents a practical recovery runbook for container zombie buildup, unhealthy containers, and Docker/containerd runtime wedges. It is intended for long-running Linux hosts and SBC-style systems where subtle process-handling issues can accumulate over time and eventually desynchronize the container runtime from the actual process state.
The scenario that inspired this runbook involved a PostgreSQL container that remained running but unhealthy, accumulated repeated [sh] <defunct> zombie processes, and eventually reached a point where Docker could no longer stop, kill, restart, or reliably exec into the container. The runtime began returning the familiar and dreaded message:
`did not receive an exit event`

This guide turns that experience into a repeatable operational response: how to detect the symptoms quickly, recover the runtime cleanly, and harden the container definition to reduce the likelihood of recurrence.
🧭 Phase 0: Trigger Conditions
Use this runbook when one or more of the following symptoms are present on a Docker host:
- `docker stop`, `docker kill`, or `docker restart` fails with `did not receive an exit event`.
- A container remains `Up` but is marked `(unhealthy)`.
- `docker exec` begins failing for a container that was previously reachable.
- The host or container shows repeated zombie processes such as `[sh] <defunct>`.
- `containerd-shim` appears to be consuming an unexpectedly large amount of memory for a single container.
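A quick way to quantify the zombie symptom is to count Z-state processes directly. This is a minimal sketch using only standard `ps` and `awk`; it needs no Docker access, and `count_zombies` is a hypothetical helper name:

```shell
#!/bin/sh
# Count zombie (Z-state) processes host-wide. A handful is normal churn;
# a steadily growing count is a trigger condition for this runbook.
count_zombies() {
  ps -eo stat= | awk '$1 ~ /^Z/ { n++ } END { print n + 0 }'
}

count_zombies   # prints a single non-negative integer
```

Run it periodically (or from cron) and alert when the number trends upward rather than hovering near zero.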
🔎 Phase 1: Rapid Triage
Begin with a fast, low-friction diagnostic pass. The aim is not to explain everything immediately, but to confirm whether the container is merely sick or whether the runtime itself is beginning to wedge.
```shell
# Check for defunct processes on the host
ps aux | grep defunct

# Check container state
docker ps

# Get the host PID for the container
docker inspect -f '{{.State.Pid}}' <container-name>

# Inspect the host-side process state for that PID
ps -o pid,ppid,stat,cmd -p <PID>
```

Healthy containers should not accumulate large numbers of zombies over time. Likewise, a container process that cannot be stopped or killed cleanly is a warning that the runtime and process state may already be drifting apart.
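To attribute zombies to a specific container, list each defunct process alongside its parent; on a containerized host the parent chain usually leads back to a containerd-shim or the container's PID 1. A sketch using standard `ps` columns (`list_zombie_parents` is a hypothetical helper name):

```shell
#!/bin/sh
# List zombie processes with their parent PID and the parent's command,
# so each <defunct> entry can be attributed to a specific container.
list_zombie_parents() {
  ps -eo pid=,ppid=,stat=,comm= | awk '
    $3 ~ /^Z/ {
      parent = "?"
      cmd = "ps -o comm= -p " $2
      cmd | getline parent
      close(cmd)
      printf "zombie pid=%s ppid=%s parent=%s\n", $1, $2, parent
    }'
}

list_zombie_parents
```

If the parent column keeps showing the same shim or container entrypoint, you have found where reaping is failing.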
🩺 Phase 2: Attempt Graceful Recovery
If the runtime is still mostly coherent, start with a normal container stop. This gives the application a chance to exit cleanly and is always preferable to jumping straight to forceful termination.
```shell
# Attempt a graceful stop
docker stop -t 30 <container-name>

# If that fails, try a direct Docker kill
docker kill <container-name>
```

If either command succeeds and the container exits normally, you can usually proceed with a clean restart or a controlled recreation via Compose.
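The stop-then-kill escalation can be wrapped in a small helper so the forceful path only fires when the graceful one fails. A sketch; `stop_or_kill` is a hypothetical name, while the `docker stop` and `docker kill` invocations are standard CLI:

```shell
#!/bin/sh
# Try a graceful stop first; only fall back to docker kill if it fails.
# Returns nonzero if both paths fail, signalling escalation to Phase 3.
stop_or_kill() {
  name=$1
  if docker stop -t 30 "$name"; then
    echo "graceful stop succeeded for $name"
  elif docker kill "$name"; then
    echo "graceful stop failed; kill succeeded for $name"
  else
    echo "both stop and kill failed for $name; escalate to host-level PID" >&2
    return 1
  fi
}

# Usage: stop_or_kill <container-name>
```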
💣 Phase 3: Escalate to Host-Level Termination
If Docker can no longer stop or kill the container, move to the host PID directly. This bypasses the Docker CLI's normal lifecycle path and attempts to terminate the actual process representing the container.
```shell
# Resolve the host PID of the container
PID=$(docker inspect -f '{{.State.Pid}}' <container-name>)

# Try SIGTERM first
sudo kill -TERM $PID
sleep 2

# Escalate to SIGKILL if needed
sudo kill -KILL $PID

# Re-check container state
docker ps
```

If the process disappears and Docker updates accordingly, the issue was recoverable without deeper runtime intervention. If the process dies but Docker still claims the container is up, runtime bookkeeping has likely become desynchronized.
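The desynchronization check can be made explicit: compare Docker's recorded PID against actual host process state. A sketch under the assumption that you already have the PID from `docker inspect`; `check_pid_state` is a hypothetical helper name:

```shell
#!/bin/sh
# Compare a Docker-recorded PID against real host process state.
# Also flags D-state (uninterruptible) and Z-state (zombie) processes,
# which plain kill signals cannot resolve.
check_pid_state() {
  pid=$1
  if ! kill -0 "$pid" 2>/dev/null; then
    echo "desync: no live process with PID $pid (runtime bookkeeping is stale)"
    return 1
  fi
  stat=$(ps -o stat= -p "$pid" | awk '{print $1}')
  case "$stat" in
    D*) echo "PID $pid is in uninterruptible sleep ($stat); SIGKILL may not work yet" ;;
    Z*) echo "PID $pid is itself a zombie ($stat); its parent is failing to reap it" ;;
    *)  echo "PID $pid is alive with state $stat" ;;
  esac
}

# Usage: check_pid_state "$(docker inspect -f '{{.State.Pid}}' <container-name>)"
```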
Note that if the process is stuck in uninterruptible sleep (D state), even SIGKILL may not terminate it immediately. At that point, the issue is beneath the application layer.

🚨 Phase 4: Recover the Runtime
This is the decisive phase. If Docker repeatedly reports that it cannot stop, kill, or restart the container because it did not receive an exit event, stop treating the container as the primary problem and reconcile the runtime instead.
```shell
# Restart containerd first
sudo systemctl restart containerd

# Restart Docker after containerd
sudo systemctl restart docker

# Re-check runtime and container state
docker ps
```

Restarting containerd and docker forces the runtime to rebuild its view of container state. In practice, this is often the cleanest way to resolve a wedged control plane without rebooting the entire host.
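Before touching any containers, confirm both services came back and the daemon actually answers. A sketch; `verify_runtime` is a hypothetical name, while `systemctl is-active --quiet` and `docker info` are standard commands:

```shell
#!/bin/sh
# Confirm both runtime services are active and the API answers before
# doing anything else. 'docker info' exercises the daemon socket end-to-end.
verify_runtime() {
  for svc in containerd docker; do
    systemctl is-active --quiet "$svc" || { echo "$svc is not active" >&2; return 1; }
  done
  docker info >/dev/null 2>&1 || { echo "docker daemon not answering" >&2; return 1; }
  echo "runtime healthy"
}

# Usage: verify_runtime && docker ps
```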
In the incident described above, restarting containerd and docker immediately reconciled the runtime and restored the container to a healthy state.

🔄 Phase 5: Recreate Cleanly via Compose
Once the runtime is healthy again, prefer a proper container recreation rather than simply trusting the recovered instance indefinitely. This clears stale namespace state and gives the container a clean lifecycle boundary.
```shell
# Move to the service directory
cd <compose-directory>

# Bring the stack down cleanly
docker compose down

# Recreate it
docker compose up -d

# Verify state
docker ps
```

For long-running stateful services such as PostgreSQL, a clean recreation after runtime recovery is a prudent operational step, provided the persistent data lives on a mounted volume.
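To confirm where the data actually lives before recreating, list every mount for the container; named volumes and bind mounts survive `docker compose down`, while the container's writable layer does not. A sketch (`show_mounts` is a hypothetical name; the Go template fields `.Mounts`, `.Type`, `.Name`, `.Source`, `.Destination` are standard `docker inspect` output):

```shell
#!/bin/sh
# Show every mount for a container: type (volume/bind), name, source,
# and destination inside the container.
show_mounts() {
  docker inspect -f \
    '{{range .Mounts}}{{.Type}} {{.Name}} {{.Source}} -> {{.Destination}}{{"\n"}}{{end}}' \
    "$1"
}

# Usage: show_mounts <container-name>
```

Note that `docker compose down -v` additionally deletes named volumes; never add `-v` here unless you intend to destroy the data.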
Do not run `docker compose down` casually on stateful services unless you understand precisely where the data lives and which resources are ephemeral versus persistent.

✅ Phase 6: Verify Process Hygiene
Once the service is back up, verify both health and process cleanliness. The absence of zombies matters just as much as the service reporting itself healthy.
```shell
# Host-side zombie check
ps aux | grep defunct

# Container state
docker ps

# Optional: check inside the container as well
docker exec <container-name> ps aux | grep defunct
```

The expected end state is straightforward:
- Container reports `healthy`.
- No significant zombie buildup on the host.
- No repeated `<defunct>` entries inside the container.
- `docker exec` works normally again.
🧠 Phase 7: Root Cause Analysis
In the observed case, the root cause was not PostgreSQL itself. It was the combination of a shell-based healthcheck and a container running without a proper init process to reap child processes cleanly over time.
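You can check directly whether a container has a real init at PID 1. A sketch (`show_pid1` is a hypothetical name; with Docker's `--init`/`init: true` the reaper shows up as `docker-init`):

```shell
#!/bin/sh
# Show what is running as PID 1 inside the container. With 'init: true'
# this is docker-init (tini); without it, it is the application itself,
# which may or may not reap orphaned children.
show_pid1() {
  docker exec "$1" ps -o pid=,comm= -p 1
}

# Usage: show_pid1 <container-name>
#   'docker-init' or 'tini' -> an init is reaping children
#   'postgres' (or similar) -> the application itself is PID 1
```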
Inspect the container healthcheck configuration:
```shell
docker inspect <container-name> --format '{{json .Config.Healthcheck}}'
```

A result like this is a warning sign:
```json
{
  "Test": ["CMD-SHELL", "pg_isready -U tachyon -d tachyon_db"],
  "Interval": 10000000000,
  "Timeout": 5000000000,
  "Retries": 5
}
```

(The interval and timeout are reported in nanoseconds: 10s and 5s respectively.) `CMD-SHELL` means Docker spawns `/bin/sh -c` to run the healthcheck. On a long-lived container, especially one without a proper init process, that repeated shell spawning can become a liability if child reaping ever goes wrong.
🛡️ Phase 8: Permanent Hardening
Two changes dramatically reduce the likelihood of recurrence for this class of issue.
First: add an init process so PID 1 in the container can reap orphaned children properly.
```yaml
services:
  postgres:
    image: postgres:18
    container_name: postgres
    init: true
```

Second: replace shell-based healthchecks with direct command execution wherever possible.
```yaml
# Less desirable
healthcheck:
  test: ["CMD-SHELL", "pg_isready -U tachyon -d tachyon_db"]
  interval: 10s
  timeout: 5s
  retries: 5

# Preferred
healthcheck:
  test: ["CMD", "pg_isready", "-U", "tachyon", "-d", "tachyon_db"]
  interval: 10s
  timeout: 5s
  retries: 5
```

These changes remove /bin/sh from the healthcheck path and give the container a proper reaper at PID 1.
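After recreating the container, both hardening changes can be verified from the host. A sketch (`verify_hardening` is a hypothetical name; `.HostConfig.Init` and `.Config.Healthcheck.Test` are standard `docker inspect` fields):

```shell
#!/bin/sh
# Verify the hardening took effect: Init should be true, and the
# healthcheck test should begin with "CMD" rather than "CMD-SHELL".
verify_hardening() {
  init=$(docker inspect -f '{{.HostConfig.Init}}' "$1")
  form=$(docker inspect -f '{{index .Config.Healthcheck.Test 0}}' "$1")
  echo "init=$init healthcheck_form=$form"
  [ "$init" = "true" ] && [ "$form" = "CMD" ]
}

# Usage: verify_hardening <container-name> && echo "hardened"
```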
📋 Operational Checklist
When this failure mode appears again, the response can be condensed into the following operational sequence:
```shell
# 1) Quick diagnosis
ps aux | grep defunct
docker ps
docker inspect -f '{{.State.Pid}}' <container-name>

# 2) Try graceful stop
docker stop -t 30 <container-name> || docker kill <container-name>

# 3) Escalate to host PID
PID=$(docker inspect -f '{{.State.Pid}}' <container-name>)
sudo kill -TERM $PID
sleep 2
sudo kill -KILL $PID

# 4) If Docker still says "did not receive an exit event"
sudo systemctl restart containerd
sudo systemctl restart docker

# 5) Recreate cleanly
cd <compose-directory>
docker compose down
docker compose up -d

# 6) Verify
ps aux | grep defunct
docker ps
```

🎯 Final State & Lessons Learned
A container runtime wedge is not always a dramatic crash. Sometimes it is the cumulative result of tiny lifecycle mistakes repeated for days or weeks: shell-wrapped healthchecks, weak PID 1 behavior, unreaped children, and a control plane that slowly drifts away from the truth.
The end goal of this runbook is not merely to recover a sick container. It is to restore alignment between:
- The application process
- The container namespace
- The Docker control plane
- The underlying runtime state
When all four agree again, the system is truly healthy.
Changelog
Initial release documenting recovery and hardening procedures for container zombie buildup and Docker/containerd runtime wedge scenarios.
Filed under: Docker, containerd, PostgreSQL, Runbook, Linux, SRE, Process Management, Containers, Troubleshooting, Operational Wisdom