Linux Isolation Basics
In the complex world of modern application deployment, containers have been gaining traction as a popular distribution method. But what are they, and why are people so excited about them? This two-part series looks at some of the benefits they offer.
First, we’ll look at how isolation is generally used to solve a whole class of problems. Next, we’ll look at how containers, specifically, make isolation more manageable. An intermediate familiarity with UNIX-like systems is assumed throughout.
User/Group Privilege Isolation
User and group access restrictions are one of the most basic forms of isolating what a particular application can do.
Users are often warned to avoid running applications as the admin user (in most cases root) as much as possible. To accomplish this, privileges can be dropped to an unprivileged user at execution time through setuid and setgid system calls.
For example, take this simple echo server in C, modified to bind to a port that requires root privileges:
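servaddr.sin_family = AF_INET;
servaddr.sin_addr.s_addr = htonl(INADDR_ANY); /* 32-bit address, so htonl rather than htons */
servaddr.sin_port = htons(7); /* echo service port; ports below 1024 require root */
bind(listen_fd, (struct sockaddr *) &servaddr, sizeof(servaddr));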
If this is run as-is, it will require root access to bind to a low-numbered port, and the server will then keep running as root. To improve on this, the program can drop its privileges once the port has been bound:
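servaddr.sin_family = AF_INET;
servaddr.sin_addr.s_addr = htonl(INADDR_ANY);
servaddr.sin_port = htons(7);
bind(listen_fd, (struct sockaddr *) &servaddr, sizeof(servaddr));
/* Drop root now that the privileged port is bound; never ignore a failure here. */
if (setuid(65534) != 0) {
    perror("setuid");
    return 1;
}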
The 65534 is the user ID for the nobody user on the particular system this is running on. Practical (non-example) code should accept a user name as a string as well as a numeric user ID, because the ID assigned to nobody varies between systems.
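A minimal sketch of what accepting a name or an ID could look like. The helper names resolve_user and drop_privileges are hypothetical and are not part of the original echo server; the sketch assumes Linux with glibc. It clears the supplementary groups and changes the group ID before the user ID, because once the user ID has been dropped the process no longer has permission to change the others.

```c
#define _GNU_SOURCE
#include <ctype.h>
#include <errno.h>
#include <grp.h>
#include <pwd.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: resolve "nobody" or "65534" to a UID and primary GID.
 * A spec starting with a digit is treated as a numeric ID.
 * Returns 0 on success, -1 if no matching passwd entry exists. */
int resolve_user(const char *spec, uid_t *uid, gid_t *gid)
{
    struct passwd *pw;

    if (spec == NULL || *spec == '\0')
        return -1;

    if (isdigit((unsigned char) spec[0])) {
        char *end;
        unsigned long id;

        errno = 0;
        id = strtoul(spec, &end, 10);
        if (errno != 0 || *end != '\0')
            return -1;              /* trailing garbage such as "12abc" */
        pw = getpwuid((uid_t) id);
    } else {
        pw = getpwnam(spec);
    }
    if (pw == NULL)
        return -1;

    *uid = pw->pw_uid;
    *gid = pw->pw_gid;
    return 0;
}

/* Hypothetical helper: permanently drop from root to uid/gid.
 * Must be called while still root. Returns 0 on success, -1 on error. */
int drop_privileges(uid_t uid, gid_t gid)
{
    if (setgroups(0, NULL) != 0)    /* clear supplementary groups */
        return -1;
    if (setgid(gid) != 0)           /* group first, while still root */
        return -1;
    if (setuid(uid) != 0)
        return -1;
    if (uid != 0 && setuid(0) == 0) /* sanity check: root must be gone */
        return -1;
    return 0;
}
```

In the echo server, these would be called right after bind(), for example resolve_user("nobody", &uid, &gid) followed by drop_privileges(uid, gid), exiting if either call fails.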
When the program is run with this new setuid addition:
# ./echo
Echoing back - Testing
# ps aux
...
nobody 4031 0.0 0.0 4044 344 pts/3 S+ 16:07 0:00 ./echo
This process is now running as the nobody user while still being able to accept connections properly. However, as it stands, the server can still access any file on the system that the nobody user can access. While not much of a concern for this simple echo server, it could cause security issues in a production environment should the server be compromised.
Filesystem isolation can be used to prevent this.
Filesystem Isolation
chroot is a basic form of isolation at the filesystem level; the name is an abbreviation of “change root”. Essentially, it changes the root of the filesystem for one process and all of its child processes.
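The same operation is available to a program directly through the chroot(2) system call. The following is a minimal sketch, not code from the echo server; the function name enter_chroot is hypothetical. The call requires root, so in a server it has to happen before privileges are dropped, and the working directory is moved inside the new root so that no reference to the old filesystem is left behind.

```c
#define _GNU_SOURCE
#include <unistd.h>

/* Hypothetical helper: confine the calling process to `dir`.
 * Returns 0 on success, -1 on error (errno is set by the failing call).
 * Requires root privileges. */
int enter_chroot(const char *dir)
{
    if (chdir(dir) != 0)    /* step inside the new root first */
        return -1;
    if (chroot(".") != 0)   /* make the current directory the root */
        return -1;
    if (chdir("/") != 0)    /* working directory is now the new "/" */
        return -1;
    return 0;
}
```

A process that remains root after this call can escape the chroot again, which is one reason to combine it with the privilege drop described earlier.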
A few practical uses of chroots include:
- Development environments
- Isolation of services
- Restricted user SSH access
Let’s continue working with our echo server. First, we need to create a basic directory structure, and then copy over the shared libraries the binary depends on:
# mkdir chroot
# mkdir chroot/bin
# mkdir chroot/lib64
# cp echo chroot/bin
# ldd chroot/bin/echo
linux-vdso.so.1 (0x00007fff0adff000)
libc.so.6 => /lib64/libc.so.6 (0x00007f516c7cf000)
/lib64/ld-linux-x86-64.so.2 (0x00007f516cb74000)
# cp /lib64/libc.so.6 chroot/lib64/
# cp /lib64/ld-linux-x86-64.so.2 chroot/lib64/
Now that these files are copied over, it’s time to attempt to run the program in the isolated filesystem:
# chroot chroot /bin/echo
Echoing back - Testing
# ps aux
...
nobody 15285 0.0 0.0 4048 388 pts/3 S+ 16:51 0:00 /bin/echo
Even though filesystem isolation is present, there’s more we can do. The server is still using the host’s resources without any real limitation. This might be a problem if there are other processes running on this host that are competing for resources. What we need to do now is to isolate those resources.
Control Group Resource Isolation
Control groups, or cgroups for short, are a way to isolate shared resources. These resources include block IO, memory, CPU, and so on.
Let’s look at IO for a second. For a disk on AWS EC2, hdparm shows:
# hdparm --direct -t /dev/xvda
/dev/xvda:
Timing O_DIRECT disk reads: 382 MB in 3.02 seconds = 126.68 MB/sec
So IO is around 126 MB/sec.
Now, let’s throttle that to 1 MB/sec using control groups.
First a control group needs to be created:
# cgcreate -g blkio:throttled-io
Here, blkio is the name of the subsystem (block IO) we’re going to restrict and throttled-io is the name of the control group we’re creating.
Throttling works on specific devices, so the major/minor identifier of the device needs to be obtained:
# ls -l /dev/xvda
brw-rw---- 1 root disk 202, 0 Apr 24 10:06 /dev/xvda
In this case, the major number is 202 and the minor number is 0.
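The same pair of numbers can be read programmatically with stat(2). This is a minimal sketch assuming Linux with glibc; the function name device_numbers is hypothetical. For convenience it accepts character devices as well as block devices, although throttling only applies to block devices.

```c
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/* Hypothetical helper: read the major/minor numbers of a device node.
 * Returns 0 on success, -1 if `path` cannot be examined or is not a
 * block or character device. */
int device_numbers(const char *path, unsigned int *maj, unsigned int *min)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    if (!S_ISBLK(st.st_mode) && !S_ISCHR(st.st_mode))
        return -1;

    /* st_rdev identifies the device that the node refers to. */
    *maj = major(st.st_rdev);
    *min = minor(st.st_rdev);
    return 0;
}
```

Called with "/dev/xvda" on the system shown above, this would be expected to report the same 202 and 0 that the directory listing displays.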
Next, cgset is used to set the actual throttling, expressed in bytes per second (1 MB = 1048576 bytes):
# cgset -r blkio.throttle.read_bps_device="202:0 1048576" throttled-io
Now we can run hdparm with this new control group using cgexec:
# cgexec -g blkio:throttled-io hdparm --direct -t /dev/xvda
/dev/xvda:
Timing O_DIRECT disk reads: 4 MB in 4.00 seconds = 1023.59 kB/sec
As shown, the IO rate is now throttled to around 1 MB/sec. Success!
This is just one example of the many cgroup subsystems available for resource management. Read more about the others in the Red Hat docs.
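Under the hood, a setting like this is a value written to a file in the cgroup filesystem. The sketch below builds the same "major:minor bytes-per-second" rule and writes it to a file. The function names are hypothetical, and the path mentioned in the comment assumes a cgroup v1 hierarchy mounted under /sys/fs/cgroup/blkio, which may differ on a given system.

```c
#include <stdio.h>

/* Hypothetical helper: build a rule such as "202:0 1048576".
 * Returns 0 on success, -1 if the buffer is too small. */
int format_throttle_rule(char *buf, size_t len, unsigned int maj,
                         unsigned int min, unsigned long long bytes_per_sec)
{
    int n = snprintf(buf, len, "%u:%u %llu", maj, min, bytes_per_sec);

    return (n < 0 || (size_t) n >= len) ? -1 : 0;
}

/* Hypothetical helper: write a value to a cgroup control file.
 * Assumed path for the example in this article (cgroup v1 layout):
 *   /sys/fs/cgroup/blkio/throttled-io/blkio.throttle.read_bps_device
 * Returns 0 on success, -1 on error. */
int write_cgroup_value(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");

    if (f == NULL)
        return -1;
    if (fputs(value, f) == EOF) {
        fclose(f);
        return -1;
    }
    /* A rejected value may only surface when the buffer is flushed. */
    return fclose(f) == 0 ? 0 : -1;
}
```

For the 1 MB/sec limit used here, the rule would be built with format_throttle_rule(buf, sizeof buf, 202, 0, 1024ULL * 1024ULL) and then passed to write_cgroup_value together with the control file path.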
However, the service has the potential to see process information it really shouldn’t. This leads into the next form of isolation: namespaces.
Linux Namespaces
Namespaces are a way to isolate areas like network and process space.
Due to the rather complex nature of network namespaces with isolated applications and chroots, we’ll discuss that in more detail in part two of this series instead. For now we’ll focus on process space isolation.
To do this, the server code needs to be modified. The full code is available in this gist, but the important parts are here:
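/* In the child process: report the PID as seen from inside the namespace. */
printf("Child: PID=%ld PPID=%ld\n", (long) getpid(), (long) getppid());
...
char *stack;
char *stackTop;
stack = malloc(STACK_SIZE);
if (stack == NULL) {
    printf("malloc failed\n");
    return 1;
}
signal(SIGCHLD, SIG_IGN); /* let the kernel reap the child automatically */
stackTop = stack + STACK_SIZE; /* the stack grows downward on most architectures */
/* CLONE_NEWPID starts echo_server as the first process of a new PID namespace. */
pid_t child_pid = clone(echo_server, stackTop, CLONE_NEWPID | SIGCHLD, NULL);
if (child_pid == -1) {
    perror("clone");
    return 1;
}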
First, in the child process, the PID is printed so that we can verify the process namespace is working properly. If it is, the PID will show as 1, which is normally the init process on the host system but will be our top-level process once it’s isolated. The clone() function creates the new namespace and runs the server inside it. This new namespace has an entirely isolated process space.
We can see that by running:
# chroot chroot /bin/echo
Child: PID=1 PPID=0
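The process has a PID of 1, showing that the process space isolation is working. The PPID of 0 appears because the parent lives outside the new namespace and is therefore not visible from inside it.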
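For readers who want to try this without the full server, here is a self-contained sketch of the same technique. It is not the code from the gist: the function names are hypothetical, and unlike the excerpt above it does not ignore SIGCHLD, so that the parent can wait for the child and read its exit status. Creating a PID namespace needs root (more precisely, CAP_SYS_ADMIN), so the function returns -1 when run without it.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

/* Runs as the first process of the new namespace and reports the PID
 * it sees for itself through its exit status. */
static int report_pid(void *arg)
{
    (void) arg;
    return (int) getpid();
}

/* Hypothetical helper: returns the PID the child saw for itself
 * (expected to be 1), or -1 if the namespace could not be created. */
int pid_in_new_namespace(void)
{
    char *stack = malloc(STACK_SIZE);
    int status;
    pid_t child;

    if (stack == NULL)
        return -1;

    /* Pass the top of the stack; it grows downward on most architectures. */
    child = clone(report_pid, stack + STACK_SIZE,
                  CLONE_NEWPID | SIGCHLD, NULL);
    if (child == -1) {
        free(stack);                /* typically EPERM when not root */
        return -1;
    }
    if (waitpid(child, &status, 0) == -1) {
        free(stack);
        return -1;
    }
    free(stack);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

From the parent’s point of view, the value returned by clone() is the child’s PID in the host namespace, while the child itself sees 1.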
Conclusion
We built a fairly simple echo server and took steps to isolate its privileges, filesystem, allocated resources, and process space. Of course, the methods used here serve as an introduction only. If you’re looking for more information on what was presented, you can read more about: setuid, chroot, Control Groups, and Linux Namespaces.
In part two, we’ll look at network isolation. We’ll also introduce containers and discuss how they make all of this isolation much easier to manage.
P.S. What sorts of things would you like to see me cover in part two? If you have any questions, comments, or requests, leave us a comment below.
Addendum: I’ve received some comments regarding various access control systems as a form of isolation. While technologies such as AppArmor and SELinux do provide fine-tunable access controls, their proper implementation is a more involved discussion than the simplified isolation overview that this article was meant to provide. For those looking at more advanced ways to secure systems, I recommend starting with the Wikipedia article on security models. With that in mind, these sorts of systems will not be touched on in detail in this series.