From the beginning, Cloud Foundry was designed to run applications inside Linux containers. Containers restrict an application's access to the host and to other containers, which limits the damage a misbehaving application can do. By taking advantage of features in the Linux kernel, applications can run side by side on the same host without interfering with each other.
This post explores the features of container technology and how they are implemented in Cloud Foundry.
Warden and Garden
Cloud Foundry’s container technology is provided by Warden, which was created by VMware’s Pieter Noordhuis and others. Warden is a subtle combination of Ruby code, a core written in C, and shell scripts to configure the host and containers.
Warden provides a service for managing a collection of containers and defines a protocol for clients to send requests to and receive responses from the server. Each DEA host in a Cloud Foundry deployment runs the Warden service, which manages cgroups, namespaces, and the process life cycle, and provides telemetry about the state of the host and each container. The Warden protocol is defined using Google protocol buffers. Warden creates a root process, called “wshd” (for Warden shell daemon), in each container. This root process is responsible for managing the container, launching application processes, and streaming their standard output and standard error back to the client.
More recently, Pivotal’s Alex Suraci and others have rewritten the Ruby portions of Warden in Go to produce Garden. Garden provides the container technology for the Diego project (the future architecture for Cloud Foundry). Garden separates the server and protocol buffer handling (in a separate Garden repository) from a Garden Linux backend that maps the protocol requests onto Linux operating system primitives. The protocol is sufficiently platform-agnostic for a Windows backend to be developed. Garden also supports a REST API closely modeled on the protocol buffer definition. The REST API is particularly useful during experimentation and testing.
Applications necessarily consume named resources that appear on the host in various global namespaces. For example, an application might listen on a particular port, and this port is visible in the global network namespace of the host. To avoid applications clashing with each other and with other programs running in the host, the application is executed inside a collection of container-specific namespaces. For example, a network namespace isolates the IP addresses and ports used by an application from those in other network namespaces.
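To make namespace membership concrete, here is a minimal Python sketch that inspects the namespaces of a process through the Linux `/proc/<pid>/ns` symbolic links. It is not part of Warden or Garden; the helper names are hypothetical, and it assumes a Linux host where each link target has the form `net:[4026531992]`. Two processes share a namespace when the bracketed numbers match.

```python
import os
import re

# Link targets under /proc/<pid>/ns look like "net:[4026531992]".
NS_LINK = re.compile(r"^(?P<kind>\w+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target):
    """Split a namespace link target into (kind, inode number)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group("kind"), int(match.group("inode"))


def namespaces_of(pid="self"):
    """Return {entry name: inode} for the namespaces of a process (Linux only)."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        _kind, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        result[name] = inode
    return result


def shared_namespaces(a, b):
    """Names of the namespaces that two processes have in common."""
    return sorted(name for name in a if a[name] == b.get(name))
```

Calling `shared_namespaces(namespaces_of(), namespaces_of(some_pid))` on a host would be expected to omit `net` and `mnt` when `some_pid` runs inside a container that has its own network and mount namespaces. Reading another process's links may require elevated privileges.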
Linux has been adding support for namespaces for several years, starting with a namespace for file system mount points and, most recently, adding a namespace for users. Michael Kerrisk has written an excellent series of articles on Linux namespaces over on LWN.
Garden currently uses all the available Linux namespaces for containers except the user namespace, which is under discussion. Garden relies heavily on the mount namespace: it uses the Linux pivot_root operation to replace the host’s root file system with one specified by the user (more below). It then unmounts the host’s root file system so the container cannot access it directly.
Applications running inside namespaces are still free to consume anonymous operating system resources such as memory and CPU. They can also operate directly on devices. Linux provides control groups to enable processes to be partitioned into hierarchical groups and subjected to constraints. Constraints are applied by resource controllers which participate in control groups and interface to the corresponding Linux kernel subsystems. For instance, the memory resource controller can limit the number of pages of real memory that may be consumed by a particular control group and can ensure processes are killed when the limit is about to be exceeded.
Garden uses five resource controllers: cpuset (CPUs and memory nodes), cpu (CPU bandwidth), cpuacct (CPU accounting), devices (device access), and memory (memory usage). It creates a control group for each container. The processes in the container are then subject to the constraints imposed by those resource controllers (except that cpuacct does not impose any constraints but accounts for CPU usage).
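One way to see which control group a process belongs to for each resource controller is to read `/proc/<pid>/cgroup`. The following Python sketch parses that file's cgroup v1 format (`hierarchy-ID:controller-list:path`). The function name and the `/instance-1` group name in the sample are hypothetical, chosen for illustration only; they are not taken from Garden.

```python
def parse_proc_cgroup(text):
    """Map each resource controller to the control group path of a process.

    Expects the contents of /proc/<pid>/cgroup, one
    "hierarchy-ID:controller-list:path" entry per line.
    """
    membership = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # The path may itself contain colons, so split at most twice.
        _hierarchy, controllers, path = line.split(":", 2)
        # Controllers mounted together appear as a comma-separated list.
        for controller in controllers.split(","):
            if controller:
                membership[controller] = path
    return membership


# Hypothetical sample for a process inside a container's control group.
SAMPLE = """\
5:memory:/instance-1
4:cpu,cpuacct:/instance-1
3:devices:/instance-1
2:cpuset:/instance-1
"""

groups = parse_proc_cgroup(SAMPLE)
print(groups["memory"])  # prints /instance-1
```

On a Linux host, passing the contents of `/proc/self/cgroup` to `parse_proc_cgroup` shows the groups of the current process.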
In addition to control groups, Garden uses a couple of other Linux features to impose limits on the processes in a container. Specifically, setrlimit restricts the consumption of certain resources by processes in the container and setquota restricts the consumption of certain other resources by users in the container.
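As an illustration of the setrlimit mechanism (not of Garden's own code), this Python sketch lowers the soft limit on open file descriptors for a child process before it starts, using the standard `resource` module. The function name, the choice of `RLIMIT_NOFILE`, and the limit value are assumptions made for the example; it requires a Unix-like system.

```python
import resource
import subprocess
import sys


def run_with_nofile_limit(limit, code):
    """Run a Python snippet in a child whose open-file soft limit is lowered."""

    def apply_limit():
        # Runs in the child just before exec; keep the existing hard limit.
        _soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        if hard == resource.RLIM_INFINITY:
            soft = limit
        else:
            soft = min(limit, hard)
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))

    result = subprocess.run(
        [sys.executable, "-c", code],
        preexec_fn=apply_limit,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Snippet the child runs: report the soft limit it actually received.
REPORT_LIMIT = (
    "import resource;"
    "print(resource.getrlimit(resource.RLIMIT_NOFILE)[0])"
)
```

Calling `run_with_nofile_limit(64, REPORT_LIMIT)` returns the soft limit seen by the child. The limit is inherited by any processes the child spawns, which is what makes this mechanism useful for constraining everything started inside a container.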
Since each container runs in a separate network namespace, Garden provides a way for network traffic to flow into and out of a container. It creates a pair of virtual ethernet devices, allocates an IP address for each device, and moves one of the devices into the container’s network namespace. Garden then sets up suitable IP routing table entries to ensure IP packets are routed correctly to and from the container. Finally, packet filtering rules create a firewall for the container. The firewall allows inbound and outbound traffic to be restricted.
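The address allocation step can be sketched with Python's `ipaddress` module. This example assumes, purely for illustration, that each container receives its own /30 subnet carved from a pool, with the first usable address going to the host end of the virtual ethernet pair and the second to the container end. The pool value and the function name are hypothetical, and the sketch does not create any devices or routes.

```python
import ipaddress


def allocate_veth_addresses(pool, index):
    """Pick the index-th /30 subnet from a pool and split its two usable hosts.

    Returns (subnet, host_side_ip, container_side_ip). The assignment of the
    first address to the host side is an assumption of this sketch.
    """
    network = ipaddress.ip_network(pool)
    subnets = list(network.subnets(new_prefix=30))
    subnet = subnets[index]
    host_ip, container_ip = list(subnet.hosts())
    return subnet, host_ip, container_ip


subnet, host_ip, container_ip = allocate_veth_addresses("10.254.0.0/22", 1)
print(subnet, host_ip, container_ip)  # prints 10.254.4.0/30? No: see note below
```

For index 1 in the pool `10.254.0.0/22`, the function returns the subnet `10.254.0.4/30` with addresses `10.254.0.5` and `10.254.0.6`. A /30 has exactly two usable addresses, which matches the two ends of a virtual ethernet pair.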
Root File System
Warden allows the user to configure a directory which is used as the root file system of all containers. Garden extends this behaviour so that each container can either use the configured directory as its root file system or can have a root file system built from a Docker image. Either way, a read-write layer is added to the root file system so that the container can update the root file system without affecting other containers.
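The effect of the read-write layer can be modeled in a few lines of Python. This is a toy model only: it uses `collections.ChainMap` to stand in for a layered file system and says nothing about how Garden actually builds the layers. The file names and contents are made up for the example.

```python
from collections import ChainMap

# A shared, read-only base root file system modeled as path -> contents.
BASE_ROOTFS = {"/etc/hostname": "base", "/bin/sh": "<shell binary>"}


def new_container_rootfs(base):
    """Return a view whose writes land in a private layer above the base."""
    return ChainMap({}, base)


container_a = new_container_rootfs(BASE_ROOTFS)
container_b = new_container_rootfs(BASE_ROOTFS)

# Container A updates a file; the change goes to its own top layer only.
container_a["/etc/hostname"] = "container-a"

print(container_a["/etc/hostname"])  # prints container-a
print(container_b["/etc/hostname"])  # prints base
print(BASE_ROOTFS["/etc/hostname"])  # prints base
```

Reads fall through to the shared base when the private layer has no entry, while writes never touch the base, so containers sharing one root file system cannot see each other's changes.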
The Garden API, as defined in Google protocol buffers, has the following operations:
- Capacity – returns the memory and disk capacity of the host machine
- Create – creates a container and returns its handle (a string which identifies the container)
- Info – returns information about a specified container such as its IP address and a list of processes running in the container
- Run – spawns a process in the container and streams its output back to the client
- Attach – starts streaming the output of a specified process in a specified container back to the client
- List – lists all container handles
- LimitBandwidth, LimitCpu, LimitDisk, LimitMemory – adjusts the limits of a specified container for network bandwidth, CPU shares, disk usage, and memory usage, respectively
- NetIn – maps a port on the host machine to a port in the specified container
- NetOut – whitelists outbound network traffic from the specified container to a specified network and/or port
- StreamIn – copies data into a specified file in the specified container’s file system
- StreamOut – copies data out of a specified file in the specified container’s file system
- Ping – checks that the Garden server is running
- Stop – terminates all processes in a specified container but leaves the container around (in stopped state)
- Destroy – destroys the specified container
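The container life cycle implied by Create, Run, Info, List, Stop, and Destroy can be sketched with an in-memory stand-in. The `ToyGarden` class below is entirely hypothetical: it is not the Garden server or a real client library, its method names merely mirror the operations listed above, and the handle format is invented.

```python
import itertools


class ToyGarden:
    """In-memory stand-in for the life cycle operations; not the real Garden."""

    def __init__(self):
        self._containers = {}
        self._ids = itertools.count(1)

    def create(self):
        """Create a container and return its handle (an identifying string)."""
        handle = f"container-{next(self._ids)}"
        self._containers[handle] = {"state": "active", "processes": []}
        return handle

    def run(self, handle, command):
        """Record a process as running in the container."""
        self._containers[handle]["processes"].append(command)

    def info(self, handle):
        """Return the state and process list of a container."""
        container = self._containers[handle]
        return {
            "state": container["state"],
            "processes": list(container["processes"]),
        }

    def list_handles(self):
        """List all container handles."""
        return sorted(self._containers)

    def stop(self, handle):
        """Terminate all processes but keep the container, in stopped state."""
        container = self._containers[handle]
        container["processes"].clear()
        container["state"] = "stopped"

    def destroy(self, handle):
        """Remove the container entirely."""
        del self._containers[handle]
```

The difference between Stop and Destroy shows up in `list_handles`: a stopped container is still listed, with no processes, whereas a destroyed one disappears.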
Cloud Foundry’s container technology has some other interesting applications. BOSH Lite enables Cloud Foundry to run in a single virtual machine using a collection of containers instead of separate virtual machines for the various Cloud Foundry components. Another application is the Concourse continuous integration system, which runs each CI job in a fresh container, isolated from other jobs. Supporting these additional use cases helps to keep Cloud Foundry’s container technology suitable for general-purpose use.
Cloud Foundry’s container technology, Garden, comprises a mostly platform-agnostic front end and a platform-specific backend. The Linux backend relies heavily on standard Linux containers and operating system features such as namespaces, control groups, and various resource control and networking facilities to isolate containers from each other and limit their impact on the host virtual machine. Garden is designed to create containers, provide telemetry, and manage the container life cycle. Garden supports using a Docker image as the root file system of a container (as demonstrated at this year’s VMworld). A Windows backend is being investigated and other platforms are also feasible.
About the Author: Glyn Normington