Container Escape Is Not a Hypothetical — It's Your Tuesday Afternoon

It Happened on a Tuesday

Not a dramatic zero-day drop at DEF CON. Not a nation-state attack that made the front page of Wired. A Tuesday. A dev pushed a containerized workload with a mounted Docker socket because "it needed to talk to the orchestrator," nobody reviewed it closely, and by Wednesday morning that container had become a full host compromise. I've seen variations of this story more times than I care to count.

Here's the thing about container escapes — the threat model isn't theoretical. It's not a "someday maybe" risk you put on a roadmap and revisit during your next annual review. The primitives are well understood, the CVEs are documented, and the misconfigurations are embarrassingly common in environments that should know better. If your containers are running with --privileged or with the Docker socket mounted as a volume, you've already lost. You just haven't found out yet.

The Kernel Is the Problem (It Always Was)

Containers aren't VMs. I know you know that, but let's really sit with it for a second. When you spin up a container, you're sharing the host kernel. Namespaces and cgroups give you the illusion of isolation — process trees, network stacks, filesystems scoped to a container. But the kernel syscall interface is shared. Every process in every container on that host is talking to the same kernel. That surface is enormous, and historically, it's been leaky.

CVE-2019-5736 is probably the cleanest illustration of how bad this can get. The runc vulnerability — discovered by Adam Iwaniuk and Borys Popławski and disclosed publicly by Aleksa Sarai — allowed a malicious container to overwrite the host's runc binary itself. The attack vector was elegant in a terrifying way: if an attacker could execute a specially crafted binary inside a container (or modify an existing executable), they could cause the host's runc process to open /proc/self/exe while it was still running. The container could then write to that file descriptor, replacing the runc binary on the host with whatever they wanted. Next time runc runs, the host executes attacker code. Full escape. Root on the host.
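The /proc/self/exe primitive at the heart of that attack is easy to see from userspace. Here's a hedged Python sketch (Linux only, and purely illustrative — this is not exploit code from the advisory) showing that the magic link is a live handle to the running process's own binary, which is exactly the handle a compromised container tricked runc into exposing:

```python
import os
import sys

# /proc/self/exe is a magic symlink to the executable of the calling
# process. For this script, that is the Python interpreter itself.
exe = os.readlink("/proc/self/exe")
print(exe)

# It resolves to the same file as sys.executable: every process holds
# a usable reference to its own binary. CVE-2019-5736 abused runc's
# version of this handle to overwrite the runc binary on the host.
print(os.path.realpath(exe) == os.path.realpath(sys.executable))
```

The takeaway for defenders: the vulnerable handle exists by design; the fix was in how runc protects its own binary, which is why patching the runtime (not the images) was the only remediation.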

What made 5736 particularly nasty wasn't just the technical mechanism — it was how many organizations had no idea it applied to them. If you were running Docker, runc was underneath it. If you were running containerd directly, same deal. Kubernetes? Same deal. The blast radius was enormous, and patching required updating the container runtime itself, not just pulling new images. A lot of teams found out the hard way that their "immutable infrastructure" wasn't so immutable when the runtime layer was out of date.

Leaky Vessels Wasn't a Surprise to Anyone Paying Attention

Fast forward to early 2024. Snyk drops research on what they brand "Leaky Vessels" — a set of vulnerabilities in runc and BuildKit. The headliner is CVE-2024-21626, another runc escape. This one abuses a working directory race condition. If a container image specifies a WORKDIR that resolves to a path like /proc/self/fd/<n> before the container fully initializes, an attacker can get a file descriptor that points outside the container's filesystem namespace — back to the host.
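The malicious image pattern boils down to a single Dockerfile line. A hedged sketch — the fd number here is illustrative; a real exploit probes for whichever descriptor the vulnerable runc happens to leak:

```dockerfile
# Illustrative only -- on a patched runtime (runc >= 1.1.12) this
# container fails to start.
FROM alpine:3.19

# On vulnerable runc, a file descriptor inherited from the runtime can
# still be open when the working directory is set, so this path can
# resolve to a directory on the HOST (e.g. under /sys/fs/cgroup),
# outside the container's root filesystem. The fd number is a guess
# an attacker iterates over.
WORKDIR /proc/self/fd/8
```

The fix landed in runc 1.1.12, alongside the BuildKit patches for the related Leaky Vessels CVEs.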

The frustrating part? The underlying pattern — file descriptor leaks across namespace boundaries — isn't new. Security researchers have been poking at this class of bug in container runtimes for years. CVE-2024-21626 is a reminder that the runc codebase is complex, handles a genuinely difficult problem (setting up namespace boundaries correctly during container initialization), and that complexity reliably produces edge cases. We got a similar wake-up call with CVE-2020-15257, the containerd "Shim API" vulnerability, where containerd's abstract Unix socket was accessible from containers sharing the host network namespace. An attacker with access to a container using host networking could interact with containerd's control plane directly. That's... not great.

I'm not trying to pile on these projects. runc and containerd are maintained by talented engineers doing genuinely hard work. But organizations need to stop treating "we're running containers" as equivalent to "we have strong isolation." The isolation you get depends entirely on the runtime version, the configuration, and the kernel version underneath everything. And most organizations aren't keeping all three of those current simultaneously.

The Misconfigurations That Actually Get People

Forget the CVEs for a moment. The most common container escapes I see in the wild aren't sophisticated runtime exploits — they're misconfigurations that hand an attacker the keys without requiring any cleverness at all.

Privileged containers are the canonical example. Running a container with --privileged strips away almost all the security boundaries. The container can load kernel modules, interact with devices, manipulate cgroups, and mount the host filesystem. It's not really a container at that point — it's a process on the host that happens to be described by a container image. I've done security assessments where I found privileged containers in production Kubernetes clusters because a developer needed CAP_NET_ADMIN for some network configuration task and just escalated to privileged rather than figuring out the right capability set. That drives me nuts. You don't need a sledgehammer.
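The fix for that CAP_NET_ADMIN case is a targeted capability grant, not privileged mode. A minimal Kubernetes sketch (pod and image names are illustrative):

```yaml
# Grant only the capability the workload needs; drop everything else.
apiVersion: v1
kind: Pod
metadata:
  name: net-config-task          # illustrative name
spec:
  containers:
    - name: app
      image: example/net-tool:latest   # illustrative image
      securityContext:
        privileged: false
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
          add: ["NET_ADMIN"]     # the one capability actually required
```

Dropping ALL and adding back the single capability you need is the whole pattern — it takes five minutes longer than `--privileged` and removes almost the entire escape surface.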

The Docker socket mount is arguably worse because it's so seductive. Mount /var/run/docker.sock into a container and that container can now issue Docker API calls to the host daemon. Create a new container with a bind mount of / from the host. Chroot into it. You're on the host. This isn't theoretical — it's a two-command escape that any attacker who lands in such a container will execute within seconds. And yet I keep seeing it in CI/CD pipelines, in monitoring agents, in tooling where someone wanted the convenience of container introspection without thinking through what that access actually means.
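To make the "two-command escape" concrete, here is roughly what it looks like from inside a container with the socket mounted — shown so defenders recognize it, and assuming the docker CLI is available in the container:

```shell
# From inside a container that has /var/run/docker.sock mounted:
# 1. Ask the HOST daemon to start a new container with the host's
#    root filesystem bind-mounted at /host.
docker run -it --rm -v /:/host alpine sh

# 2. Inside that new container, pivot into the host filesystem.
chroot /host sh
# You are now effectively root on the host.
```

Note that nothing here is an exploit — every step is the Docker API working exactly as designed. That's why no patch will ever fix a mounted socket.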

Here's my hot take: if your CI/CD pipeline mounts the Docker socket, you should treat your entire build infrastructure as potentially compromised and design accordingly. Use Docker-in-Docker with appropriate isolation, or switch to something like Kaniko or Buildah that builds images without requiring privileged access to the host daemon. Yes, it's more work to set up. No, that's not an excuse to leave the socket mounted.
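As one example of the daemonless route, a Kaniko build runs entirely in an unprivileged container. A hedged sketch (in real CI this typically runs as a pod rather than via `docker run`; paths and registry are illustrative):

```shell
# Build an image with Kaniko instead of talking to a Docker daemon.
# The executor needs no privileged access to the host.
docker run --rm \
  -v "$PWD":/workspace \
  gcr.io/kaniko-project/executor:latest \
  --dockerfile=/workspace/Dockerfile \
  --context=dir:///workspace \
  --no-push    # or --destination=registry.example.com/app:tag to push
```

The build context and Dockerfile go in as a volume; the image comes out as a registry push. No socket anywhere in the pipeline.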

What Defense Actually Looks Like

Syscall filtering is underused and I genuinely don't understand why. Seccomp profiles let you define exactly which syscalls a container is permitted to make. Docker ships a default seccomp profile that blocks around 44 syscalls — including keyctl, add_key, request_key, and several others that have been leveraged in container escapes over the years. It's not perfect, but it meaningfully reduces the kernel attack surface. The default profile is a reasonable starting point, but for production workloads you should be restricting further based on what the application actually needs.
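A custom profile doesn't have to be exotic. This hedged sketch uses the allow-by-default shape of Docker's profile format and explicitly denies the key-management syscalls mentioned above — a real production profile should go further, ideally deny-by-default with an explicit allowlist:

```json
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "names": ["keyctl", "add_key", "request_key"],
      "action": "SCMP_ACT_ERRNO"
    }
  ]
}
```

Apply it per container with `docker run --security-opt seccomp=deny-keyring.json ...`, or via the pod's `securityContext.seccompProfile` in Kubernetes.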

Getting to a tight seccomp profile takes effort. You need to know which syscalls your application legitimately uses. Tools like strace and audit logging can help you enumerate this, and there are projects that try to automate profile generation. It's not glamorous work, but it's the kind of defense-in-depth that actually makes runtime exploits harder to weaponize even after they're disclosed.
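Enumerating with strace is the low-tech version of that work. A quick sketch (the binary name is illustrative):

```shell
# Run the app under strace, following child processes (-f), and write
# a per-syscall summary table (-c) instead of a full trace.
strace -f -c -o syscall-summary.txt ./myapp --normal-workload

# The summary lists every syscall the run actually made -- the raw
# material for a deny-by-default seccomp allowlist.
cat syscall-summary.txt
```

Run it against representative workloads, not just startup — the syscalls an app makes under load are the ones that matter.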

AppArmor and SELinux play a similar role at the MAC layer. The standard Docker AppArmor profile, docker-default, restricts access to /proc and /sys paths, prevents certain mount operations, and limits write access to sensitive kernel tunables. It's another layer that makes post-exploitation harder. The challenge is that both AppArmor and SELinux require expertise to configure correctly, and the path of least resistance when a container breaks is to disable the policy rather than debug it. Please don't do that.

Runtime security monitoring is where I'd put serious investment if I were building out a container security program today. Falco — the CNCF project originally from Sysdig — uses eBPF (or, historically, a kernel module) to observe syscalls in real time and alert on suspicious patterns. It ships with rules for things like: a process spawning a shell inside a container, sensitive file access, unexpected outbound connections, privilege escalation attempts. Is it noisy? It can be. Does tuning it take time? Absolutely. But having runtime visibility into what's actually happening inside your containers is the difference between detecting an escape in minutes versus finding out from a threat intelligence feed three months later.
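Falco rules are short YAML. A hedged sketch in the shape of the shipped ruleset, alerting when a shell starts inside any container (the `container` and `spawned_process` condition macros come from Falco's default rules):

```yaml
- rule: Shell Spawned in Container
  desc: Detect an interactive shell process starting inside a container
  condition: >
    container and spawned_process and proc.name in (bash, sh, zsh)
  output: >
    Shell spawned in container (user=%user.name
    container=%container.name command=%proc.cmdline)
  priority: WARNING
```

Expect to add exceptions for legitimate tooling (debug containers, entrypoint scripts) — that's the tuning work, and it's worth doing rather than muting the rule.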

Kubernetes Pod Security Standards are the current best-practice mechanism for enforcing baseline container security in k8s clusters. The Restricted profile blocks privileged containers, requires non-root UIDs, blocks host network/PID/IPC namespace sharing, and requires seccomp profiles. Enforce it. Not in audit mode — enforce it. Audit mode gives you visibility but no protection. I understand there's operational pain involved in migrating workloads to meet Restricted requirements, but that pain is the whole point: it surfaces the bad configurations that would otherwise sit quietly in production waiting to be exploited.
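Enforcement is just namespace labels. A sketch (namespace name illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: payments   # illustrative
  labels:
    # Reject non-compliant pods outright -- don't just log them.
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    # warn/audit labels can additionally track a stricter target.
    pod-security.kubernetes.io/warn: restricted
```

A common migration path is to set `warn` and `audit` to restricted first, fix what surfaces, then flip `enforce` — the mistake is stopping at step one.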

The Sandbox Runtimes Conversation

If you're running workloads where the isolation requirements genuinely demand stronger guarantees than Linux namespaces can provide — multi-tenant environments, untrusted code execution, anything handling highly sensitive data — you need to be thinking about sandbox runtimes. And before you dismiss this as "too complex for us," let me push back a little.

gVisor (from Google) implements a user-space kernel — the Sentry — that intercepts and implements syscalls in Go, sitting between containerized applications and the host kernel. A guest process's syscalls never reach the host kernel directly. The attack surface collapses dramatically. You still have some host kernel exposure because gVisor itself runs as a process, but the syscall path through an application bug to host kernel is severed. The tradeoff is performance: gVisor introduces overhead, particularly for syscall-heavy workloads and I/O. It's not the right choice for everything. But for running untrusted workloads or workloads with extreme security requirements, it's a compelling option, and major cloud providers (Google Cloud Run, in particular) have deployed it at scale.
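Wiring gVisor into Docker is a daemon-level change plus a per-container flag. A sketch, assuming runsc is installed at the path shown:

```json
{
  "runtimes": {
    "runsc": {
      "path": "/usr/local/bin/runsc"
    }
  }
}
```

Merge that into /etc/docker/daemon.json, restart the daemon, and opt individual workloads in with `docker run --runtime=runsc ...` — which makes it easy to sandbox only the containers that need it.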

Kata Containers takes a different approach — running containers inside lightweight VMs using hardware virtualization. You get real VM-level isolation with a separate kernel per workload. The performance story has improved considerably as the project matured and as hardware support for nested virtualization became more common, but you're still paying a cost compared to standard containers. For some organizations, that cost is exactly the right trade to make.
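In Kubernetes, opting a workload into Kata is a RuntimeClass reference. A hedged sketch — the handler name must match what your nodes' CRI runtime is configured with; `kata` is typical, and the other names are illustrative:

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata-vm      # illustrative
handler: kata        # must match the configured containerd/CRI-O handler
---
apiVersion: v1
kind: Pod
metadata:
  name: sensitive-workload   # illustrative
spec:
  runtimeClassName: kata-vm
  containers:
    - name: app
      image: example/app:latest   # illustrative
```

The nice property of RuntimeClass is granularity: you pay the VM overhead only for the pods whose threat model justifies it.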

Here's the honest reality: most teams don't need sandbox runtimes for everything. Standard containers with proper configuration, up-to-date runtimes, seccomp profiles, and runtime monitoring get you a long way. But "we don't need that" should be a deliberate risk decision based on your threat model, not a reflexive "sounds complicated."

The Bit That's Actually Hard

Technical controls are solvable. The hard part — and I say this having watched it play out in multiple organizations — is the organizational dynamic around container security. Development teams move fast. Security teams are often consulted late or not at all. The path to a working deployment is "add --privileged and figure it out later." Later never comes.

The teams I've seen get this right treat container security as part of the platform, not as a gate. They build golden base images with hardened configurations. They enforce Pod Security Standards through admission control so developers can't accidentally deploy non-compliant workloads — they get an error at deploy time with a clear explanation of what needs to change. They run Falco in production and actually have a process for responding to alerts. They keep their runtime up to date as a non-negotiable operational requirement, not something they get to when there's a slow sprint.

And when CVE-2024-21626 drops, they don't scramble. They know their runtime version across all clusters, they have a tested process for rolling out runtime updates, and they can go from "vulnerability disclosed" to "patched in production" in hours, not weeks. That's the goal. Not zero CVEs — that's not achievable — but a security posture that makes exploitation difficult and detection fast.

Container escape is real, it's happening, and it's going to keep happening. The kernel attack surface isn't going away. What you control is how hard you make it for an attacker who lands in one of your containers to go anywhere useful with that access. Make it hard. Keep it current. Watch what's happening at runtime. And for the love of everything, stop mounting the Docker socket.

Tags: Container Security, Cloud Security, Kubernetes, Docker, runc, CVE-2019-5736, CVE-2024-21626, Leaky Vessels, Falco, seccomp, AppArmor, gVisor, Kata Containers, Pod Security Standards, CISSP, Container Escape, Runtime Security, eBPF
