Docker review: Linux Namespace

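A simplified look at how Docker is implemented

The book I am following was written in 2017. I will first go through the relevant material from the book, and then show what a more current, concrete implementation looks like.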

Namespace

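What follows is a digest of the Linux man page namespaces(7). I recommend reading the official documentation when you have time, rather than absorbing second-hand knowledge.

If you run the following: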

➜  ~ ls -l /proc/$$/ns
total 0
lrwxrwxrwx 1 mwish mwish 0 Jan 11 04:10 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 mwish mwish 0 Jan 11 04:10 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 mwish mwish 0 Jan 11 04:10 mnt -> 'mnt:[4026531840]'
lrwxrwxrwx 1 mwish mwish 0 Jan 11 04:10 net -> 'net:[4026531992]'
lrwxrwxrwx 1 mwish mwish 0 Jan 11 04:10 pid -> 'pid:[4026531836]'
lrwxrwxrwx 1 mwish mwish 0 Jan 11 04:10 pid_for_children -> 'pid:[4026531836]'
lrwxrwxrwx 1 mwish mwish 0 Jan 11 04:10 user -> 'user:[4026531837]'
lrwxrwxrwx 1 mwish mwish 0 Jan 11 04:10 uts -> 'uts:[4026531838]'

The man page describes what a namespace is:

A namespace wraps a global system resource in an abstraction that
makes it appear to the processes within the namespace that they have
their own isolated instance of the global resource. Changes to the
global resource are visible to other processes that are members of
the namespace, but are invisible to other processes. One use of
namespaces is to implement containers.

This page provides pointers to information on the various namespace
types, describes the associated /proc files, and summarizes the APIs
for working with namespaces.

It also lists the namespace types:

Namespace types
The following table shows the namespace types available on Linux.
The second column of the table shows the flag value that is used to
specify the namespace type in various APIs. The third column
identifies the manual page that provides details on the namespace
type. The last column is a summary of the resources that are
isolated by the namespace type.

Namespace   Flag              Page                    Isolates
Cgroup      CLONE_NEWCGROUP   cgroup_namespaces(7)    Cgroup root directory
IPC         CLONE_NEWIPC      ipc_namespaces(7)       System V IPC, POSIX message queues
Network     CLONE_NEWNET      network_namespaces(7)   Network devices, stacks, ports, etc.
Mount       CLONE_NEWNS       mount_namespaces(7)     Mount points
PID         CLONE_NEWPID      pid_namespaces(7)       Process IDs
User        CLONE_NEWUSER     user_namespaces(7)      User and group IDs
UTS         CLONE_NEWUTS      uts_namespaces(7)       Hostname and NIS domain name

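In plain terms:

  • Mount: isolates filesystem mount points
  • UTS: isolates the hostname and domain name
  • IPC: isolates interprocess communication
  • PID: isolates process IDs
  • Network: isolates network resources
  • User: isolates user and group IDs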

Other sources also point out a limitation worth keeping in mind:

Note that namespaces do not restrict access to physical resources such as CPU, memory and disk. That access is metered and restricted by a kernel feature called ‘cgroups’.

We will come back to cgroups later.

It is worth reading the namespaces(7) man page all the way through. For example:

Each process has a /proc/[pid]/ns/ subdirectory containing one entry
for each namespace that supports being manipulated by setns(2):

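The ls -l /proc/$$/ns we ran earlier reads exactly this information. Note that each namespace listed there has an id. If the namespace entries of several processes show the same id, we can conclude that those processes are in the same namespace. The commands below keep our own uts namespace (or that of any process id we care about) alive. For details, see the "Namespace lifetime" part of the man page.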

➜  ~ touch uts
➜ ~ mount --bind /proc/$$/ns/uts uts
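
The id comparison described above can also be done from C. This is only a sketch: same_namespace is a hypothetical helper name, not an existing API. It relies on the rule from namespaces(7) that two processes share a namespace when the device ID and inode number of their /proc/[pid]/ns entries match.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if processes a and b are in the same
 * namespace of the given type ("uts", "pid", "net", ...), 0 if they are
 * not, and -1 if either /proc entry cannot be inspected.
 */
int same_namespace(pid_t a, pid_t b, const char *type)
{
    char path_a[64], path_b[64];
    struct stat st_a, st_b;

    snprintf(path_a, sizeof(path_a), "/proc/%ld/ns/%s", (long)a, type);
    snprintf(path_b, sizeof(path_b), "/proc/%ld/ns/%s", (long)b, type);

    /* stat() follows the symlink to the namespace object itself. */
    if (stat(path_a, &st_a) == -1 || stat(path_b, &st_b) == -1)
        return -1;

    return st_a.st_dev == st_b.st_dev && st_a.st_ino == st_b.st_ino;
}
```

For a shell and a program started from it, same_namespace(getpid(), getppid(), "uts") would be expected to return 1, as long as neither has moved to a different UTS namespace.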

Having bind-mounted the uts entry above, recall what a UTS namespace does: it isolates the hostname and domain name.

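Next, a few syscalls. The API section of namespaces(7) lists clone(2), unshare(2) and setns(2).

You can think of them as creating a new namespace (clone, unshare) and switching into an existing one (setns).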
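
As a rough illustration of those two roles, the sketch below uses unshare(2) to create a namespace and setns(2) to switch into an existing one. The helper names new_uts_namespace and join_namespace are made up for this example. Both calls normally need root (CAP_SYS_ADMIN), so expect them to fail with EPERM otherwise.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: move the calling process into a new UTS namespace.
 * Returns 0 on success, -1 on failure. */
int new_uts_namespace(void)
{
    if (unshare(CLONE_NEWUTS) == -1) {
        perror("unshare");
        return -1;
    }
    return 0;
}

/* Hypothetical helper: join the namespace that ns_path refers to, e.g.
 * /proc/<pid>/ns/uts or a file that such an entry was bind-mounted onto.
 * nstype is the matching CLONE_NEW* flag. Returns 0 on success, -1 on failure. */
int join_namespace(const char *ns_path, int nstype)
{
    int fd = open(ns_path, O_RDONLY | O_CLOEXEC);
    if (fd == -1) {
        perror("open");
        return -1;
    }

    int rc = setns(fd, nstype);
    if (rc == -1)
        perror("setns");

    close(fd);
    return rc;
}
```

With the bind mount from earlier in place, join_namespace("uts", CLONE_NEWUTS) would re-enter the namespace that the uts file keeps alive, assuming that file is in the current directory.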

The sections below cover the individual namespaces. Many of the code samples come from here: http://crosbymichael.com/creating-containers-part-1.html

Skeleton

#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACKSIZE (1024*1024)
static char child_stack[STACKSIZE];

struct clone_args {
    char **argv;
};

// child_exec is the func that will be executed as the result of clone
static int child_exec(void *stuff)
{
    struct clone_args *args = (struct clone_args *)stuff;
    // execvp only returns if it failed
    if (execvp(args->argv[0], args->argv) != 0) {
        fprintf(stderr, "failed to execvp arguments %s\n",
                strerror(errno));
        exit(-1);
    }
    // we should never reach here!
    exit(EXIT_FAILURE);
}

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    struct clone_args args;
    args.argv = &argv[1];

    // namespace flags (CLONE_NEW*) get OR-ed in here by the later examples
    int clone_flags = SIGCHLD;

    // the result of this call is that our child_exec will be run in another
    // process returning its pid; the stack grows down, so pass its top
    pid_t pid =
        clone(child_exec, child_stack + STACKSIZE, clone_flags, &args);
    if (pid < 0) {
        fprintf(stderr, "clone failed WTF!!!! %s\n", strerror(errno));
        exit(EXIT_FAILURE);
    }
    // lets wait on our child process here before we, the parent, exits
    if (waitpid(pid, NULL, 0) == -1) {
        fprintf(stderr, "failed to wait pid %d\n", pid);
        exit(EXIT_FAILURE);
    }
    exit(EXIT_SUCCESS);
}

We need a skeleton like this so that the later examples only have to fill in the namespace-specific logic.

Unix Time-sharing Namespace

The UTS namespace isolates the hostname and domain name. Modeled on the code above, I wrote a version on my own Manjaro machine:

#define _GNU_SOURCE
#include <sys/types.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <stdio.h>
#include <sched.h>
#include <signal.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>

#define STACK_SIZE (1024 * 1024)

static char child_stack[STACK_SIZE];
char* const child_args[] = {
    "/bin/bash",
    NULL,
};

// Runs in the child: change the hostname, then replace the child with a shell.
int child_main(void* args) {
    printf("in sub process\n");
    if (sethostname("mname", strlen("mname")) != 0) {
        fprintf(stderr, "failed to set hostname: %s\n",
                strerror(errno));
        exit(-1);
    }
    execv(child_args[0], child_args);
    // only reached if execv failed
    return 1;
}


int main() {
    printf("start\n");
    // CLONE_NEWUTS puts the child in its own UTS namespace, so the
    // sethostname call above does not touch the host's hostname.
    int child_pid = clone(child_main, child_stack + STACK_SIZE,
                          CLONE_NEWUTS | SIGCHLD, NULL);
    if (child_pid == -1) {
        fprintf(stderr, "clone failed: %s\n", strerror(errno));
        return 1;
    }
    waitpid(child_pid, NULL, 0);
    printf("exit\n");
    return 0;
}

You need sudo to run this program.

IPC Namespace

The IPC namespace is used for isolating interprocess communication, things like SysV message queues. Let’s make a copy of skeleton.c for this namespace.
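
A minimal sketch of what such a copy might look like, written as a function rather than a full program. The names queue_visible_in_new_ipc_ns and DEMO_KEY are invented for this example, and the key value is arbitrary. The parent creates a System V message queue, and a child cloned with CLONE_NEWIPC then checks whether it can still find that queue. Creating the namespace normally needs root.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/wait.h>

#define IPC_STACK_SIZE (1024 * 1024)
#define DEMO_KEY ((key_t)0x6e734950) /* arbitrary key, chosen for this sketch */

static char ipc_child_stack[IPC_STACK_SIZE] __attribute__((aligned(16)));

/* Child: exit status 0 if the parent's queue is invisible, 1 if visible. */
static int ipc_child(void *arg)
{
    (void)arg;
    return msgget(DEMO_KEY, 0) == -1 ? 0 : 1;
}

/*
 * Returns 0 if a queue created by the parent is hidden from a child placed
 * in a new IPC namespace, 1 if the child could still see it, and -1 if the
 * queue or the namespace could not be created (e.g. not running as root).
 */
int queue_visible_in_new_ipc_ns(void)
{
    int status = 0, result = -1;
    int qid = msgget(DEMO_KEY, IPC_CREAT | IPC_EXCL | 0600);
    if (qid == -1)
        return -1;

    pid_t pid = clone(ipc_child, ipc_child_stack + IPC_STACK_SIZE,
                      CLONE_NEWIPC | SIGCHLD, NULL);
    if (pid != -1 && waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        result = WEXITSTATUS(status);

    msgctl(qid, IPC_RMID, NULL); /* clean up the parent's queue */
    return result;
}
```

Dropping CLONE_NEWIPC from the flags should make the function return 1, because the child then shares the parent's IPC namespace and finds the queue.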

PID Namespace

This one is fun. The PID namespace is a way to carve up the PIDs that one process can view and interact with. When we create a new PID namespace the first process will get to be the loved PID 1. If this process exits the kernel kills everyone else within the namespace. Let’s start by making a copy of skeleton.c for our changes.

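A PID namespace renumbers process IDs: each PID namespace has its own PID numbering. The kernel maintains a tree of PID namespaces, from the root namespace down to child namespaces, in which:

  • A parent namespace can see the processes in its child namespaces
  • A child namespace cannot see the processes in its parent namespace
  • PID 1 of a namespace has init privileges
  • A process cannot kill or ptrace processes in a parent or sibling namespace
  • If the /proc filesystem is remounted inside the new namespace, it shows only the processes that belong to that namespace

PID 1 of a namespace can shield itself from some signals. However, of the signals that terminate or stop a process, SIGKILL and SIGSTOP still take effect when they are sent by a process in a parent namespace.

Also, ps reads the in-memory /proc virtual filesystem, so /proc may need to be remounted inside the new namespace before ps shows the expected processes.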
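
The renumbering can be observed with a small variation on the skeleton. In this sketch the function name pid_seen_in_new_pid_ns is invented for illustration. The child is cloned with CLONE_NEWPID and reports the PID it sees for itself through its exit status. Creating the namespace normally needs root.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

#define PID_STACK_SIZE (1024 * 1024)

static char pid_child_stack[PID_STACK_SIZE] __attribute__((aligned(16)));

/* Child: report the PID it sees for itself through its exit status. */
static int pid_child(void *arg)
{
    (void)arg;
    return (int)getpid();
}

/*
 * Returns the PID the child observes for itself inside a new PID namespace
 * (expected to be 1), or -1 if the namespace could not be created.
 */
int pid_seen_in_new_pid_ns(void)
{
    int status = 0;
    pid_t pid = clone(pid_child, pid_child_stack + PID_STACK_SIZE,
                      CLONE_NEWPID | SIGCHLD, NULL);
    if (pid == -1)
        return -1;

    /* In the parent, pid is the child's PID as seen from the outer namespace. */
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;

    return WEXITSTATUS(status);
}
```

The parent still sees the child under an ordinary PID, while the child sees itself as PID 1. To make ps work inside such a child, /proc would also have to be remounted there, ideally together with CLONE_NEWNS so that the host's /proc is left alone.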

Mount Namespace

(Historically, this was the first namespace type to be implemented.)

CLONE_NEWNS

The mount propagation types are:

  • shared
  • slave
  • shared and slave
  • private
  • unbindable
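
The propagation types above correspond to mount(2) flags. This sketch assumes the names from mount_namespaces(7) (shared, slave, private, unbindable). The helpers propagation_flag and remount_root_in_new_mount_ns are hypothetical, written for this example only. The second one forks a child, gives it a new mount namespace with unshare(CLONE_NEWNS), and recursively changes the propagation type of everything under /. It normally needs root.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/wait.h>
#include <unistd.h>

/* Map a propagation type name to its mount(2) flag; 0 if unknown. */
unsigned long propagation_flag(const char *name)
{
    if (strcmp(name, "shared") == 0)     return MS_SHARED;
    if (strcmp(name, "slave") == 0)      return MS_SLAVE;
    if (strcmp(name, "private") == 0)    return MS_PRIVATE;
    if (strcmp(name, "unbindable") == 0) return MS_UNBINDABLE;
    return 0;
}

/*
 * In a forked child, create a new mount namespace and recursively change
 * the propagation type of every mount under "/". Returns 0 on success and
 * -1 on failure (for example when not running as root).
 */
int remount_root_in_new_mount_ns(unsigned long flag)
{
    int status = 0;
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* Child: changes made here stay inside the new mount namespace. */
        if (unshare(CLONE_NEWNS) == -1)
            _exit(1);
        if (mount(NULL, "/", NULL, MS_REC | flag, NULL) == -1)
            _exit(1);
        _exit(0);
    }

    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 0 ? 0 : -1;
}
```

There is no single flag for "shared and slave". As far as I understand mount_namespaces(7), that state arises when a slave mount is additionally marked shared, so it would take two mount calls rather than one.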