On May 31st, the Kubernetes Product Security Committee announced a security regression in Kubernetes, to which it assigned CVE-2019-11245. The regression caused containers whose images are supposed to run as a non-root user to run as root instead, from the second time they are used or upon restart of the container.
Before elaborating on this particular security issue, let’s first clarify why running a program as root in a container is even a concern at all.
When running applications in a non-containerized Linux environment, e.g. directly on the host machine, it is commonly understood why isolation between the root user and non-privileged users is desired. If run as root, any breached or misbehaving application could easily wreak havoc on the system by modifying system files, stopping or launching privileged processes and so on. Many popular Linux daemons drop privileges in their early setup stages. For example, the nginx daemon on Ubuntu forks worker processes that run as the unprivileged www-data user.
With containers, privilege separation was made even simpler. Each container runs in its own runtime, isolated by Linux namespaces and a dropped set of capabilities. Each application is meant to run in its own container, and root is assumed to be safe to use in the container.
It has therefore become commonplace for applications in containers to run with root privileges, and most images do not change to a non-privileged user. Unfortunately, it is not commonly understood that by doing so these images may be missing out on an important security layer. That is especially true given many of these images have no real use for running as the root user.
To their credit, some container platforms run all their containers as a non-root user by default. OpenShift, for example, requires its users to use images that support running as a random, non-root user. It then runs each of its containers as an arbitrary non-root user.1
In addition, some containerized applications drop root privileges by changing to a non-root user after setup, allowing them to rely on user based file permissions to prevent access to sensitive files (e.g. configurations) or processes in the containers. This limits the damage an attacker can do in a breached container. Jenkins is a good example of an image with this design.
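The privilege drop described above can be sketched in a few lines. This is a minimal illustration, not how nginx or Jenkins actually implement it; the function name `drop_privileges` and the target IDs are assumptions made for the example.

```python
import os


def drop_privileges(uid: int, gid: int) -> bool:
    """Switch the current process from root to an unprivileged uid/gid.

    Returns False, changing nothing, if the process is not running as root.
    """
    if os.geteuid() != 0:
        return False  # nothing to drop
    os.setgroups([])  # clear supplementary groups while still root
    os.setgid(gid)    # change the group first; not permitted after setuid
    os.setuid(uid)    # when called by root this cannot be undone
    return True
```

A daemon would typically call something like `drop_privileges(65534, 65534)` once its privileged setup (binding low ports, reading protected configuration) is done. The ID 65534 maps to the `nobody` user on many distributions, but that is an assumption to verify on the target image.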
Nevertheless, the main reason why using a non-root user in containers is a valuable security precaution is that it helps prevent container breakout.
Most container runtimes share the same root user between the host and containers. By itself this is not a problem, since the container process is sandboxed using Linux capabilities and Linux namespaces such as PID, net, mount and IPC. But if the kernel makes no distinction between the user or group IDs of the host and the container, any vulnerability in the container runtime that exposes anything from the host to a container process invites serious trouble. Put simply, succeeding in container breakout is much more probable for processes running as root.
In such a scenario, a malicious process would be able to modify any file on the system without having to further escalate privileges to the root user. Moreover, some container escape vulnerabilities simply require root privileges to exploit.
Over the years, multiple breakout vulnerabilities have been revealed in container engines including Kubernetes and Docker. One example is CVE-2016-9962, a vulnerability in runc2 that allowed container processes to escape to the host by grabbing file descriptors from the host before their capabilities were dropped. Exploitation of this flaw required running the process as root. CVE-2019-5736, another runc vulnerability of the same nature that was found earlier this year, also required the malicious container process to run as the root user, as explained by Yuval Avrahami in his exploitation writeup.
User namespaces and rootless containers
A recent Linux kernel development that was designed to solve the issue of root in containers is user namespaces. User namespaces allow for isolation of the container’s user IDs and group IDs. This kernel feature uses a mapping of UIDs and GIDs to differentiate the users and groups on the host from those in the container. This allows a process to run as UID 0 (root) inside the container while being UID 100000 (or any other UID) on the host. If a process then breaks out of a container, the kernel will treat its effective user as the mapped non-root UID, and that process won’t have root privileges on the host. Great!
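To make the mapping concrete, here is a small sketch of how a container UID translates to a host UID. It assumes the three-column format the kernel exposes in /proc/<pid>/uid_map (first ID inside the namespace, first ID outside it, range length). The helper names and the example range are illustrative only.

```python
from typing import List, Optional, Tuple

# Each entry mirrors one line of /proc/<pid>/uid_map:
# (first ID inside the namespace, first ID outside it, length of the range)
UidMap = List[Tuple[int, int, int]]


def parse_uid_map(text: str) -> UidMap:
    """Parse the text of a uid_map file into a list of ranges."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, length = (int(field) for field in line.split())
            entries.append((inside, outside, length))
    return entries


def host_uid(container_uid: int, uid_map: UidMap) -> Optional[int]:
    """Return the host UID behind a container UID, or None if it is unmapped."""
    for inside, outside, length in uid_map:
        if inside <= container_uid < inside + length:
            return outside + (container_uid - inside)
    return None


# Example mapping: container UIDs 0..65535 map to host UIDs 100000..165535.
example_map = parse_uid_map("0 100000 65536\n")
```

With this example mapping, `host_uid(0, example_map)` yields 100000, so root inside the container is an ordinary unprivileged user as far as the host is concerned.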
Yet the main container runtimes currently refrain from using user namespaces as the default. The feature was added to the kernel in Linux 3.8, which dates back to 2013, and Docker first integrated user namespaces back in 2015. However, since then it’s never been enabled in Docker as a default feature, presumably due to the limitations its use imposes. For example, external volume or storage drivers are unaware of the user mappings, and mounting files from the host may become complex due to file permissions.
At the same time, all the current implementations of rootless containers rely on user namespaces at their core. Not to be confused with what is referred to as non-root containers in this article, rootless containers are containers that can be run and managed by unprivileged users on the host. While Docker and other runtimes require a daemon running as root, rootless containers can be run by any user without additional capabilities.
Rootless containers have some benefits over traditional containers, such as allowing users on a shared machine to run containers without admin permission (e.g. in university labs) and making it possible to run nested containers (even for unprivileged containers). However, the main motivation for designing rootless containers was, in fact, mitigating the aforementioned vulnerabilities in container runtimes3.
The adoption of rootless containers is growing. Major container tools like runc and Docker are releasing support for rootless mode, and new container engines and image builders have been released to allow for native rootless containers. Podman, a daemonless container engine, is seemingly mature enough to replace Docker with rootless containers alone.
Although user namespaces are not the focus of this article, it is worth mentioning that they do provide additional security benefits to traditional non-root containers. Without user namespaces, even if a container process runs without root, any privilege escalation vulnerability in the container could still compromise the host. For example, a malicious container process could exploit a vulnerable setuid binary to become root. This would not be possible if the root user inside the container is mapped to a non-root user on the host.
Building and running non-root containers
In Docker, the USER instruction can be used in a Dockerfile to change to any desired UID for all subsequent commands. The last USER instruction also determines which UID the container will run as by default. Many images run all their build logic as root and add this instruction just before the end of the Dockerfile.
Docker also lets users run images or execute commands in a container with a particular UID by using the --user flag.
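The "last USER instruction wins, unless --user overrides it" rule can be illustrated with a short sketch. This is not Docker's parser: it ignores line continuations and variable substitution, and the function names are made up for the example.

```python
from typing import Optional


def default_user(dockerfile: str) -> Optional[str]:
    """Return the value of the last USER instruction in the final build stage.

    None means no USER was set, so the user is inherited from the base image
    (which is root for many images).
    """
    user = None
    for raw in dockerfile.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        instruction = parts[0].upper()
        if instruction == "FROM":
            user = None  # a new stage starts from the base image's user
        elif instruction == "USER" and len(parts) == 2:
            user = parts[1].strip()
    return user


def runtime_user(dockerfile: str, user_flag: Optional[str] = None) -> Optional[str]:
    """Model `docker run --user`: an explicit flag overrides the image default."""
    return user_flag if user_flag is not None else default_user(dockerfile)


EXAMPLE_DOCKERFILE = """\
FROM ubuntu:18.04
RUN apt-get update
USER 1000
CMD ["id"]
"""
```

For the example file, `default_user(EXAMPLE_DOCKERFILE)` returns `"1000"`, while `runtime_user(EXAMPLE_DOCKERFILE, "0")` returns `"0"`, mirroring how `--user 0` would put the container back to root regardless of what the image declares.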
With Kubernetes, pods or containers can be further restricted to run as a specific UID through the runAsUser field of either a Security Context or a Pod Security Policy. With Pod Security Policies it is also possible to restrict containers to run as UIDs within a selected range with the MustRunAs rule, or simply prevent containers from running as root with the MustRunAsNonRoot rule.
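As a rough illustration of these settings, the sketch below builds a pod manifest with a pod-level Security Context, plus the runAsUser stanza of a Pod Security Policy. Kubernetes accepts manifests as JSON as well as YAML, so the dictionary can be serialized with `json.dumps`. The helper names, pod name, image and UID values are placeholders, and the policy stanza is only a fragment of a full PodSecurityPolicy object.

```python
import json


def non_root_pod(name: str, image: str, uid: int) -> dict:
    """Build a pod manifest that pins the UID and refuses to start as root."""
    if uid == 0:
        raise ValueError("UID 0 is root; choose a non-zero UID")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Pod-level Security Context: applies to every container in the pod
            "securityContext": {"runAsUser": uid, "runAsNonRoot": True},
            "containers": [{"name": name, "image": image}],
        },
    }


def psp_run_as_user_range(min_uid: int, max_uid: int) -> dict:
    """Build only the runAsUser stanza of a Pod Security Policy spec."""
    return {"rule": "MustRunAs", "ranges": [{"min": min_uid, "max": max_uid}]}


manifest_json = json.dumps(non_root_pod("demo", "example/app:1.0", 1000), indent=2)
```

Setting `runAsUser` explicitly means the UID no longer depends on what the image declares, which is relevant to the regression discussed next.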
Kubernetes issue CVE-2019-11245
Following the release of Kubelet v1.13.6 and v1.14.2, multiple Kubernetes users complained that their non-root containers were being run as root from their second execution4. Upon investigation of this issue, it was discovered to be a problem specific to dockershim, the Kubernetes component that runs Docker containers.
Docker containers built to run as a non-root user with the USER instruction were being run as root by Kubernetes, starting from their second execution.
This was, of course, a security issue. Besides the previously mentioned dangers of running as root in containers, users may have relied on the configured user in their designs. For instance, users could have exposed volumes to containers on the assumption that the containers do not have root privileges.
The issue was quickly fixed by reverting a certain commit in the Kubelet code that broke the detection of an image UID. The broken logic was run only if the Docker image was previously pulled to the node, explaining why the bug only took place from the second execution or later.
The two Kubelet releases with this commit (v1.13.6 and v1.14.2) were the only ones affected by this vulnerability.
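Checking a node against the two affected releases is straightforward. The sketch below compares version strings exactly, after stripping a leading "v". How the Kubelet version is obtained, and how vendor-specific build suffixes should be treated, is left open here. The function name is hypothetical.

```python
# The only two Kubelet releases named as affected by CVE-2019-11245.
AFFECTED_KUBELET_VERSIONS = {"1.13.6", "1.14.2"}


def is_affected(kubelet_version: str) -> bool:
    """Return True if the version string is exactly one of the affected releases."""
    normalized = kubelet_version.strip().lstrip("vV")
    return normalized in AFFECTED_KUBELET_VERSIONS
```

For example, `is_affected("v1.14.2")` is true, while `is_affected("v1.14.3")` is not.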
The use of containers follows the principle of least privilege by its nature, as containers usually have limited responsibilities and capabilities. By restricting containers from using root when it is not strictly needed, we can further increase their security and prevent attackers from exploiting flaws that may be found in container engines.
There is good reason to believe that in the not-too-distant future container runtimes will enable user namespace support by default and rootless containers will become mainstream. Sure, current implementations have their handful of limitations, but soon enough developers will eliminate them as the reach of these projects grows.
2 runc is currently the most common OCI implementation for container runtimes and is used in Docker; see the project on GitHub
3 Rootless Containers, DevConf.CZ (Jan 25, 2019), Giuseppe Scrivano and Akihiro Suda
4 See https://github.com/kubernetes/kubernetes/pull/78178, https://github.com/kubernetes/kubernetes/issues/78175 and https://github.com/kubernetes/kubernetes/issues/78308