Knowledge Base
Infrastructure
2026-03-24
14 min

Container Zombie Buildup & Runtime Wedge Runbook

A phased recovery runbook for container zombie buildup, unhealthy containers, and Docker/containerd runtime wedges on long-running Linux hosts. Covers rapid triage, graceful and forceful recovery, runtime reconciliation, clean recreation, root cause analysis, and permanent hardening with init processes and direct-exec healthchecks.

Docker
containerd
PostgreSQL
Runbook
Linux
SRE
Process Management
Containers
Troubleshooting
Operational Wisdom

This entry documents a practical recovery runbook for container zombie buildup, unhealthy containers, and Docker/containerd runtime wedges. It is intended for long-running Linux hosts and SBC-style systems where subtle process-handling issues can accumulate over time and eventually desynchronize the container runtime from the actual process state.

The scenario that inspired this runbook involved a PostgreSQL container that remained running but unhealthy, accumulated repeated [sh] <defunct> zombie processes, and eventually reached a point where Docker could no longer stop, kill, restart, or reliably exec into the container. The runtime began returning the familiar and dreaded message:

runtime-wedge.txt
did not receive an exit event

This guide turns that experience into a repeatable operational response: how to detect the symptoms quickly, recover the runtime cleanly, and harden the container definition to reduce the likelihood of recurrence.

🧭Phase 0: Trigger Conditions

Use this runbook when one or more of the following symptoms are present on a Docker host:

  • docker stop, docker kill, or docker restart fails with did not receive an exit event.
  • A container remains Up but is marked (unhealthy).
  • docker exec begins failing for a container that was previously reachable.
  • The host or container shows repeated zombie processes such as [sh] <defunct>.
  • containerd-shim appears to be consuming an unexpectedly large amount of memory for a single container.
ℹ️
Once Docker reports that it did not receive an exit event, you are no longer debugging only the application. You are debugging the container runtime's view of reality.
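
The shim memory symptom above can be checked directly. A minimal sketch, assuming the modern shim binary name containerd-shim-runc-v2 (containerd 1.4+); older installs may name it containerd-shim instead:

shim-memory-check.sh
```shell
# List shim processes with resident memory (RSS, in KB), sorted largest
# first. One shim per container; an outlier here usually corresponds to
# the wedged container.
ps -o pid,rss,etime,cmd -C containerd-shim-runc-v2 --sort=-rss \
  || echo "no containerd-shim-runc-v2 processes found"
```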

🔎Phase 1: Rapid Triage

Begin with a fast, low-friction diagnostic pass. The aim is not to explain everything immediately, but to confirm whether the container is merely sick or whether the runtime itself is beginning to wedge.

rapid-triage.sh
# Check for defunct processes on the host
ps aux | grep '[d]efunct'   # bracket pattern keeps grep from matching its own command line

# Check container state
docker ps

# Get the host PID for the container
docker inspect -f '{{.State.Pid}}' <container-name>

# Inspect the host-side process state for that PID
ps -o pid,ppid,stat,cmd -p <PID>

Healthy containers should not accumulate large numbers of zombies over time. Likewise, a container process that cannot be stopped or killed cleanly is a warning that the runtime and process state may already be drifting apart.

⚠️
A container can still be running while being operationally incorrect. Responsiveness is not the same thing as process hygiene.
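
For a slightly more targeted zombie census than raw grep, a sketch using only ps and awk; the PPID column identifies the process that is failing to reap its children:

zombie-census.sh
```shell
# Count zombie processes and show their parents. The parent (PPID) is
# the process that is failing to call wait() on its children.
ps -eo pid,ppid,stat,comm | awk 'NR > 1 && $3 ~ /^Z/ {print; n++} END {print n+0, "zombie(s) found"}'
```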

🩺Phase 2: Attempt Graceful Recovery

If the runtime is still mostly coherent, start with a normal container stop. This gives the application a chance to exit cleanly and is always preferable to jumping straight to forceful termination.

graceful-recovery.sh
# Attempt a graceful stop
docker stop -t 30 <container-name>

# If that fails, try a direct Docker kill
docker kill <container-name>

If either command succeeds and the container exits normally, you can usually proceed with a clean restart or a controlled recreation via Compose.

ℹ️
A successful graceful stop strongly suggests that the runtime is still authoritative and the failure is likely contained within the application or its process tree.

💣Phase 3: Escalate to Host-Level Termination

If Docker can no longer stop or kill the container, move to the host PID directly. This bypasses the Docker CLI's normal lifecycle path and attempts to terminate the actual process representing the container.

host-level-kill.sh
# Resolve the host PID of the container
PID=$(docker inspect -f '{{.State.Pid}}' <container-name>)

# Try SIGTERM first
sudo kill -TERM $PID
sleep 2

# Escalate to SIGKILL if needed
sudo kill -KILL $PID

# Re-check container state
docker ps

If the process disappears and Docker updates accordingly, the issue was recoverable without deeper runtime intervention. If the process dies but Docker still claims the container is up, runtime bookkeeping has likely become desynchronized.
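
Desynchronization can be detected mechanically by comparing Docker's recorded PID against the host process table. A sketch, with CONTAINER as a placeholder name to substitute:

detect-desync.sh
```shell
# Compare Docker's recorded PID with the host process table. A container
# shown as Up whose PID no longer exists means runtime bookkeeping has
# drifted from reality. CONTAINER is a placeholder; substitute your own.
CONTAINER="${CONTAINER:-postgres}"
if command -v docker >/dev/null 2>&1; then
  PID=$(docker inspect -f '{{.State.Pid}}' "$CONTAINER" 2>/dev/null || echo 0)
  if [ "$PID" -gt 0 ] && kill -0 "$PID" 2>/dev/null; then
    echo "PID $PID is alive; Docker and the kernel agree"
  else
    echo "no live process for $CONTAINER; if docker ps still shows Up, state is desynchronized"
  fi
fi
```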

⚠️
If a process is stuck in a kernel D state, even SIGKILL may not terminate it immediately. At that point, the issue is beneath the application layer.
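
The D state can be confirmed from the STAT column before concluding that SIGKILL has failed. A sketch, with PID standing in for the container's host PID resolved earlier:

check-d-state.sh
```shell
# Check for uninterruptible sleep (D state). A process stuck in D is
# blocked inside the kernel, usually on I/O, and will not receive any
# signal, including SIGKILL, until the blocking call returns.
PID="${PID:-$$}"   # substitute the container's host PID
state=$(ps -o stat= -p "$PID")
case "$state" in
  D*) echo "PID $PID is in D state; the kernel is holding it" ;;
  Z*) echo "PID $PID is already a zombie; only its parent can reap it" ;;
  *)  echo "PID $PID state: $state (signals should be deliverable)" ;;
esac
```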

🚨Phase 4: Recover the Runtime

This is the decisive phase. If Docker repeatedly reports that it cannot stop, kill, or restart the container because it did not receive an exit event, stop treating the container as the primary problem and reconcile the runtime instead.

runtime-recovery.sh
# Restart containerd first
sudo systemctl restart containerd

# Restart Docker after containerd
sudo systemctl restart docker

# Re-check runtime and container state
docker ps

Restarting containerd and docker forces the runtime to rebuild its view of container state. In practice, this is often the cleanest way to resolve a wedged control plane without rebooting the entire host.

ℹ️
In the incident that informed this runbook, restarting containerd and docker immediately reconciled the runtime and restored the container to a healthy state.

🔄Phase 5: Recreate Cleanly via Compose

Once the runtime is healthy again, prefer a proper container recreation rather than simply trusting the recovered instance indefinitely. This clears stale namespace state and gives the container a clean lifecycle boundary.

compose-recreate.sh
# Move to the service directory
cd <compose-directory>

# Bring the stack down cleanly
docker compose down

# Recreate it
docker compose up -d

# Verify state
docker ps

For long-running stateful services such as PostgreSQL, a clean recreation after runtime recovery is a prudent operational step, provided the persistent data lives on a mounted volume.

⚠️
Never treat docker compose down casually on stateful services unless you understand precisely where the data lives and which resources are ephemeral versus persistent.
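
One way to answer "where does the data live" before running down, sketched with CONTAINER as a placeholder: named volumes survive docker compose down, bind mounts live on the host filesystem, and anything only in the container layer is lost on recreation.

verify-mounts.sh
```shell
# List the container's mounts and classify them. CONTAINER is a
# placeholder; substitute your own service name.
CONTAINER="${CONTAINER:-postgres}"
if command -v docker >/dev/null 2>&1; then
  docker inspect "$CONTAINER" \
    --format '{{range .Mounts}}{{.Type}} {{.Source}} -> {{.Destination}}{{"\n"}}{{end}}' 2>/dev/null |
    awk '$1 == "volume" {print "persistent volume:", $2, "->", $4}
         $1 == "bind"   {print "host bind mount: ", $2, "->", $4}'
fi
```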

✅Phase 6: Verify Process Hygiene

Once the service is back up, verify both health and process cleanliness. The absence of zombies matters just as much as the service reporting itself healthy.

verify-recovery.sh
# Host-side zombie check
ps aux | grep '[d]efunct'   # bracket pattern keeps grep from matching its own command line

# Container state
docker ps

# Optional: check inside the container as well
docker exec <container-name> ps aux | grep defunct

The expected end state is straightforward:

  • Container reports healthy.
  • No significant zombie buildup on the host.
  • No repeated <defunct> entries inside the container.
  • docker exec works normally again.
ℹ️
If the container is healthy but zombies begin accumulating again within minutes or hours, the recovery succeeded but the root cause remains active.
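
To tell whether the leak is still active, the zombie count can be sampled over time. A sketch; the interval and sample count here are arbitrary choices, and a real watch would stretch the interval to 60s or more:

zombie-watch.sh
```shell
# Sample the host zombie count a few times. A count that climbs between
# samples means the root cause is still active despite recovery.
SAMPLES="${SAMPLES:-3}"
INTERVAL="${INTERVAL:-1}"
i=0
while [ "$i" -lt "$SAMPLES" ]; do
  n=$(ps -eo stat= | grep -c '^Z' || true)
  echo "$(date +%T) zombies: $n"
  sleep "$INTERVAL"
  i=$((i + 1))
done
```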

🧠Phase 7: Root Cause Analysis

In the observed case, the root cause was not PostgreSQL itself. It was the combination of a shell-based healthcheck and a container running without a proper init process to reap child processes cleanly over time.

Inspect the container healthcheck configuration:

inspect-healthcheck.sh
docker inspect <container-name> --format '{{json .Config.Healthcheck}}'

A result like this is a warning sign:

healthcheck-cmd-shell.json
{
  "Test": ["CMD-SHELL", "pg_isready -U tachyon -d tachyon_db"],
  "Interval": 10000000000,
  "Timeout": 5000000000,
  "Retries": 5
}

CMD-SHELL means Docker spawns /bin/sh -c to run the healthcheck. On a long-lived container, especially one without a proper init process, that repeated shell spawning can become a liability if child reaping ever goes wrong.

⚠️
The service may not be the thing leaking zombies. Sometimes the supporting lifecycle machinery around the service is the real offender.
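
A quick way to check whether the container even has a reaper, sketched with CONTAINER as a placeholder; note that some minimal images do not ship ps, hence the fallback message:

check-pid1.sh
```shell
# See what runs as PID 1 inside the container. If it's the bare service
# binary (e.g. postgres) rather than an init such as docker-init or tini,
# orphaned children have no reaper.
CONTAINER="${CONTAINER:-postgres}"
if command -v docker >/dev/null 2>&1; then
  docker exec "$CONTAINER" ps -o pid,comm -p 1 2>/dev/null \
    || echo "container not running, or ps not available in the image"
fi
```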

🛡️Phase 8: Permanent Hardening

Two changes dramatically reduce the likelihood of recurrence for this class of issue.

First: add an init process so PID 1 in the container can reap orphaned children properly.

compose-init-true.yaml
services:
  postgres:
    image: postgres:18
    container_name: postgres
    init: true

Second: replace shell-based healthchecks with direct command execution wherever possible.

healthcheck-before-after.yaml
# Less desirable
healthcheck:
  test: ["CMD-SHELL", "pg_isready -U tachyon -d tachyon_db"]
  interval: 10s
  timeout: 5s
  retries: 5

# Preferred
healthcheck:
  test: ["CMD", "pg_isready", "-U", "tachyon", "-d", "tachyon_db"]
  interval: 10s
  timeout: 5s
  retries: 5

These changes remove /bin/sh from the healthcheck path and give the container a proper reaper at PID 1.

ℹ️
On small boards and long-running edge hosts, deterministic process handling matters more than clever convenience. Fewer wrappers means fewer surprises.
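
After recreating the hardened container, the healthcheck change can be verified mechanically. A sketch, with CONTAINER as a placeholder; it distinguishes a direct-exec "CMD" test from a "CMD-SHELL" wrapper in the inspected config:

verify-healthcheck.sh
```shell
# Confirm the recreated container uses a direct-exec healthcheck (CMD)
# rather than a shell wrapper (CMD-SHELL).
CONTAINER="${CONTAINER:-postgres}"
if command -v docker >/dev/null 2>&1; then
  docker inspect "$CONTAINER" --format '{{json .Config.Healthcheck.Test}}' 2>/dev/null \
    | grep -q '"CMD"' && echo "direct-exec healthcheck confirmed" \
    || echo "healthcheck still shell-wrapped or container not found"
fi
```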

📋Operational Checklist

When this failure mode appears again, the response can be condensed into the following operational sequence:

container-wedge-checklist.sh
# 1) Quick diagnosis
ps aux | grep '[d]efunct'
docker ps
docker inspect -f '{{.State.Pid}}' <container-name>

# 2) Try graceful stop
docker stop -t 30 <container-name> || docker kill <container-name>

# 3) Escalate to host PID
PID=$(docker inspect -f '{{.State.Pid}}' <container-name>)
sudo kill -TERM $PID
sleep 2
sudo kill -KILL $PID

# 4) If Docker still says "did not receive an exit event"
sudo systemctl restart containerd
sudo systemctl restart docker

# 5) Recreate cleanly
cd <compose-directory>
docker compose down
docker compose up -d

# 6) Verify
ps aux | grep '[d]efunct'
docker ps

🎯Final State & Lessons Learned

A container runtime wedge is not always a dramatic crash. Sometimes it is the cumulative result of tiny lifecycle mistakes repeated for days or weeks: shell-wrapped healthchecks, weak PID 1 behavior, unreaped children, and a control plane that slowly drifts away from the truth.

The end goal of this runbook is not merely to recover a sick container. It is to restore alignment between:

  • The application process
  • The container namespace
  • The Docker control plane
  • The underlying runtime state

When all four agree again, the system is truly healthy.

ℹ️
Build containers as though they will run for months. Long-lived systems are brutally honest about the quality of their process hygiene.

Changelog

2026-03-24, v1.0

Initial release documenting recovery and hardening procedures for container zombie buildup and Docker/containerd runtime wedge scenarios.

Filed under: Docker, containerd, PostgreSQL, Runbook, Linux, SRE, Process Management, Containers, Troubleshooting, Operational Wisdom

Last updated: 2026-03-24