Linux Namespace

2020-09-01 2499 words 12 minutes

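Namespaces are a common concept in programming languages: C++ has the namespace keyword and Clojure has ns, both used to organize code into modules. Modules do not pollute one another: module A can have a helloworld function, and module B can have its own helloworld function too. Packages in Java and Go serve the same purpose.

Linux namespaces are the kernel's mechanism for isolating kernel resources: a process that uses its own isolated set of resources stays independent of other processes. They are also the technology underlying Docker containers (Docker is written mainly in Go, while container creation relies on C code).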

Namespaces are a feature of the Linux kernel that partitions kernel resources such that one set of processes sees one set of resources while another set of processes sees a different set of resources. The feature works by having the same namespace for a set of resources and processes, but those namespaces refer to distinct resources. Resources may exist in multiple spaces. Examples of such resources are process IDs, hostnames, user IDs, file names, and some names associated with network access, and interprocess communication.

Namespaces are a fundamental aspect of containers on Linux.

Kernel resources

man namespaces lists:

       Linux provides the following namespaces:

       Namespace   Constant          Isolates
       Cgroup      CLONE_NEWCGROUP   Cgroup root directory
       IPC         CLONE_NEWIPC      System V IPC, POSIX message queues
       Network     CLONE_NEWNET      Network devices, stacks, ports, etc. (a container gets its own network stack and IP, like a small host)
       Mount       CLONE_NEWNS       Mount points (the first namespace to be implemented, hence the overly generic name)
       PID         CLONE_NEWPID      Process IDs
       User        CLONE_NEWUSER     User and group IDs
       UTS         CLONE_NEWUTS      Hostname and NIS domain name

man clone describes each of these flags as follows:

       CLONE_NEWCGROUP (since Linux 4.6)
              Create the process in a new cgroup namespace.  If this flag is not set, then (as with fork(2))  the
              process is created in the same cgroup namespaces as the calling process.  This flag is intended for
              the implementation of containers.

              For further information on cgroup namespaces, see cgroup_namespaces(7).

              Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWCGROUP.

       CLONE_NEWIPC (since Linux 2.6.19)
              If CLONE_NEWIPC is set, then create the process in a new IPC namespace.  If this flag is  not  set,
              then  (as  with  fork(2)), the process is created in the same IPC namespace as the calling process.
              This flag is intended for the implementation of containers.

              An IPC namespace provides an isolated view of System V IPC objects (see svipc(7)) and (since  Linux
              2.6.30)  POSIX  message queues (see mq_overview(7)).  The common characteristic of these IPC mecha-
              nisms is that IPC objects are identified by mechanisms other than filesystem pathnames.

              Objects created in an IPC namespace are visible to all other processes that  are  members  of  that
              namespace, but are not visible to processes in other IPC namespaces.

              When  an  IPC namespace is destroyed (i.e., when the last process that is a member of the namespace
              terminates), all IPC objects in the namespace are automatically destroyed.

              Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWIPC.  This flag can't be specified in
              conjunction with CLONE_SYSVSEM.

              For further information on IPC namespaces, see namespaces(7).

       CLONE_NEWNET (since Linux 2.6.24)
              (The implementation of this flag was completed only by about kernel version 2.6.29.)

              If  CLONE_NEWNET  is  set, then create the process in a new network namespace.  If this flag is not
              set, then (as with fork(2)) the process is created in the same network  namespace  as  the  calling
              process.  This flag is intended for the implementation of containers.

              A  network  namespace provides an isolated view of the networking stack (network device interfaces,
              IPv4 and IPv6 protocol stacks, IP routing tables, firewall rules, the /proc/net and  /sys/class/net
              directory  trees, sockets, etc.).  A physical network device can live in exactly one network names-
              pace.  A virtual network (veth(4)) device pair provides a pipe-like abstraction that can be used to
              create tunnels between network namespaces, and can be used to create a bridge to a physical network
              device in another namespace.

              When a network namespace is freed (i.e., when the last process in the  namespace  terminates),  its
              physical  network devices are moved back to the initial network namespace (not to the parent of the
              process).  For further information on network namespaces, see namespaces(7).

              Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWNET.

       CLONE_NEWNS (since Linux 2.4.19)
              If CLONE_NEWNS is set, the cloned child is started in a new mount  namespace,  initialized  with  a
              copy  of the namespace of the parent.  If CLONE_NEWNS is not set, the child lives in the same mount
              namespace as the parent.

              Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWNS.  It is not permitted  to  specify
              both CLONE_NEWNS and CLONE_FS in the same clone() call.

              For further information on mount namespaces, see namespaces(7) and mount_namespaces(7).

       CLONE_NEWPID (since Linux 2.6.24)
              If  CLONE_NEWPID  is set, then create the process in a new PID namespace.  If this flag is not set,
              then (as with fork(2)) the process is created in the same PID namespace  as  the  calling  process.
              This flag is intended for the implementation of containers.

              For further information on PID namespaces, see namespaces(7) and pid_namespaces(7).

              Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWPID.  This flag can't be specified in
              conjunction with CLONE_THREAD or CLONE_PARENT.

       CLONE_NEWUSER
              (This flag first became meaningful for clone() in Linux 2.6.23, the current clone() semantics  were
              merged in Linux 3.5, and the final pieces to make the user namespaces completely usable were merged
              in Linux 3.8.)

              If CLONE_NEWUSER is set, then create the process in a new user namespace.  If this flag is not set,
              then (as with fork(2)) the process is created in the same user namespace as the calling process.

              Before  Linux  3.8,  use  of  CLONE_NEWUSER  required  that  the  caller  have  three capabilities:
              CAP_SYS_ADMIN, CAP_SETUID, and CAP_SETGID.  Starting with Linux 3.8, no privileges  are  needed  to
              create a user namespace.

              This  flag  can't be specified in conjunction with CLONE_THREAD or CLONE_PARENT.  For security rea-
              sons, CLONE_NEWUSER cannot be specified in conjunction with CLONE_FS.

              For further information on user namespaces, see namespaces(7) and user_namespaces(7).

       CLONE_NEWUTS (since Linux 2.6.19)
              If CLONE_NEWUTS is set, then create the process in a new UTS namespace, whose identifiers are  ini-
              tialized  by  duplicating  the  identifiers from the UTS namespace of the calling process.  If this
              flag is not set, then (as with fork(2)) the process is created in the same  UTS  namespace  as  the
              calling process.  This flag is intended for the implementation of containers.

              A  UTS  namespace  is the set of identifiers returned by uname(2); among these, the domain name and
              the hostname can be modified by setdomainname(2) and sethostname(2), respectively.  Changes made to
              the  identifiers  in  a UTS namespace are visible to all other processes in the same namespace, but
              are not visible to processes in other UTS namespaces.

              Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWUTS.

              For further information on UTS namespaces, see namespaces(7).

The /proc filesystem

Each process exposes its namespaces as files under /proc/<pid>/ns/<ns_kind>:

$ ls -l /proc/$$/ns

total 0
lrwxrwxrwx 1 vagrant vagrant 0 Sep  1 06:45 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 vagrant vagrant 0 Sep  1 06:45 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 vagrant vagrant 0 Sep  1 06:45 mnt -> 'mnt:[4026531840]'
lrwxrwxrwx 1 vagrant vagrant 0 Sep  1 06:45 net -> 'net:[4026531993]'
lrwxrwxrwx 1 vagrant vagrant 0 Sep  1 06:45 pid -> 'pid:[4026531836]'
lrwxrwxrwx 1 vagrant vagrant 0 Sep  1 06:45 pid_for_children -> 'pid:[4026531836]'
lrwxrwxrwx 1 vagrant vagrant 0 Sep  1 06:45 user -> 'user:[4026531837]'
lrwxrwxrwx 1 vagrant vagrant 0 Sep  1 06:45 uts -> 'uts:[4026531838]'

$$ expands to the PID of the current shell.

lsns lists the namespaces currently in use on the system, including the ones the kernel set up for a mysqld that I started with Docker.

$ sudo lsns

        NS TYPE   NPROCS   PID USER            COMMAND
4026531835 cgroup    192     1 root            /lib/systemd/systemd --system --deserialize 18
4026531836 pid       191     1 root            /lib/systemd/systemd --system --deserialize 18
4026531837 user      193     1 root            /lib/systemd/systemd --system --deserialize 18
4026531838 uts       189     1 root            /lib/systemd/systemd --system --deserialize 18
4026531839 ipc       190     1 root            /lib/systemd/systemd --system --deserialize 18
4026531840 mnt       178     1 root            /lib/systemd/systemd --system --deserialize 18
4026531861 mnt         1    31 root            kdevtmpfs
4026531993 net       190     1 root            /lib/systemd/systemd --system --deserialize 18
4026532160 mnt         1  4988 root            /lib/systemd/systemd-udevd
4026532162 mnt         1  4449 systemd-network /lib/systemd/systemd-networkd
4026532181 mnt         1  4481 systemd-resolve /lib/systemd/systemd-resolved
4026532182 mnt         1   954 root            /usr/sbin/ModemManager --filter-policy=strict
4026532184 mnt         1  1118 root            /usr/sbin/NetworkManager --no-daemon
4026532185 mnt         1 25318 root            sh
4026532190 uts         1 25318 root            sh
4026532191 ipc         1 25318 root            sh
4026532192 pid         1 25318 root            sh
4026532194 net         1 25318 root            sh
4026532241 mnt         6  1782 root            /usr/sbin/apache2 -k start
4026532254 mnt         1 25816 999             mysqld
4026532255 uts         1 25816 999             mysqld
4026532256 ipc         1 25816 999             mysqld
4026532257 pid         1 25816 999             mysqld
4026532259 net         1 25816 999             mysqld
4026532313 uts         1 32225 root            ./newuts helloworld

Three syscalls

Linux provides three system calls for working with namespaces.

clone

The clone system call lets a parent process create a child process that runs in its own (new) namespaces.

clone() creates a new process, in a manner similar to fork(2).

The example from man clone creates a child process in a new UTS namespace (CLONE_NEWUTS), which isolates the hostname:

#define _GNU_SOURCE
#include <sys/wait.h>
#include <sys/utsname.h>
#include <sched.h>
#include <signal.h> /* SIGCHLD */
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define errExit(msg)        \
    do                      \
    {                       \
        perror(msg);        \
        exit(EXIT_FAILURE); \
    } while (0)

static int /* Start function for cloned child */
childFunc(void *arg)
{
    struct utsname uts;

    /* Change hostname in UTS namespace of child */

    if (sethostname(arg, strlen(arg)) == -1)
        errExit("sethostname");

    /* Retrieve and display hostname */

    if (uname(&uts) == -1)
        errExit("uname");
    printf("uts.nodename in child:  %s\n", uts.nodename);

    /* Keep the namespace open for a while, by sleeping.
       This allows some experimentation--for example, another
       process might join the namespace. */

    sleep(200);

    return 0; /* Child terminates now */
}

#define STACK_SIZE (1024 * 1024) /* Stack size for cloned child */
int main(int argc, char *argv[])
{
    char *stack;    /* Start of stack buffer */
    char *stackTop; /* End of stack buffer */
    pid_t pid;
    struct utsname uts;

    if (argc < 2)
    {
        fprintf(stderr, "Usage: %s <child-hostname>\n", argv[0]);
        exit(EXIT_SUCCESS);
    }

    /* Allocate stack for child */

    stack = malloc(STACK_SIZE);
    if (stack == NULL)
        errExit("malloc");
    stackTop = stack + STACK_SIZE; /* Assume stack grows downward */

    /* Create child that has its own UTS namespace;
       child commences execution in childFunc() */

    pid = clone(childFunc, stackTop, CLONE_NEWUTS | SIGCHLD, argv[1]);
    if (pid == -1)
        errExit("clone");
    printf("clone() returned %ld\n", (long)pid);

    /* Parent falls through to here */

    sleep(1); /* Give child time to change its hostname */

    /* Display hostname in parent's UTS namespace. This will be
       different from hostname in child's UTS namespace. */

    if (uname(&uts) == -1)
        errExit("uname");
    printf("uts.nodename in parent: %s\n", uts.nodename);

    if (waitpid(pid, NULL, 0) == -1) /* Wait for child */
        errExit("waitpid");
    printf("child has terminated\n");

    exit(EXIT_SUCCESS);
}

Compile:

gcc newuts.c -o newuts

Run:

$ sudo ./newuts helloworld &

[1] 25052
clone() returned 25059                                                     
uts.nodename in child:  helloworld # Changing the child's hostname does not affect the global hostname.
uts.nodename in parent: ubuntu-bionic

$ sudo ls -l /proc/25052/ns

total 0
lrwxrwxrwx 1 root root 0 Sep  1 02:29 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 root root 0 Sep  1 02:29 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 root root 0 Sep  1 02:29 mnt -> 'mnt:[4026531840]'
lrwxrwxrwx 1 root root 0 Sep  1 02:29 net -> 'net:[4026531993]'
lrwxrwxrwx 1 root root 0 Sep  1 02:29 pid -> 'pid:[4026531836]'
lrwxrwxrwx 1 root root 0 Sep  1 02:29 pid_for_children -> 'pid:[4026531836]'
lrwxrwxrwx 1 root root 0 Sep  1 02:29 user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 Sep  1 02:29 uts -> 'uts:[4026531838]'

$ sudo ls -l /proc/25059/ns

total 0
lrwxrwxrwx 1 root root 0 Sep  1 02:30 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 root root 0 Sep  1 02:30 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 root root 0 Sep  1 02:30 mnt -> 'mnt:[4026531840]'
lrwxrwxrwx 1 root root 0 Sep  1 02:30 net -> 'net:[4026531993]'
lrwxrwxrwx 1 root root 0 Sep  1 02:30 pid -> 'pid:[4026531836]'
lrwxrwxrwx 1 root root 0 Sep  1 02:30 pid_for_children -> 'pid:[4026531836]'
lrwxrwxrwx 1 root root 0 Sep  1 02:30 user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 Sep  1 02:30 uts -> 'uts:[4026532313]' # the child's uts namespace differs from the parent's

unshare

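The unshare system call does not create a new process; it changes the namespaces of the calling thread (process). As the name suggests, the caller stops sharing a namespace with the process it inherited it from, which means a new namespace is created for it.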

unshare() allows a process (or thread) to disassociate parts of its execution context that are currently being shared with other processes (or threads). The main use of unshare() is to allow a process to control its shared execution context without creating a new process. The flags argument is a bit mask that specifies which parts of the execution context should be unshared.

The example from man 2 unshare, which uses unshare to create new namespaces:


#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

/* A simple error-handling function: print an error message based
   on the value in 'errno' and terminate the calling process */

#define errExit(msg)        \
    do                      \
    {                       \
        perror(msg);        \
        exit(EXIT_FAILURE); \
    } while (0)

static void
usage(char *pname)
{
    fprintf(stderr, "Usage: %s [options] program [arg...]\n", pname);
    fprintf(stderr, "Options can be:\n");
    fprintf(stderr, "    -i   unshare IPC namespace\n");
    fprintf(stderr, "    -m   unshare mount namespace\n");
    fprintf(stderr, "    -n   unshare network namespace\n");
    fprintf(stderr, "    -p   unshare PID namespace\n");
    fprintf(stderr, "    -u   unshare UTS namespace\n");
    fprintf(stderr, "    -U   unshare user namespace\n");
    exit(EXIT_FAILURE);
}

int main(int argc, char *argv[])
{
    int flags, opt;

    flags = 0;

    while ((opt = getopt(argc, argv, "imnpuU")) != -1)
    {
        switch (opt)
        {
        case 'i':
            flags |= CLONE_NEWIPC;
            break;
        case 'm':
            flags |= CLONE_NEWNS;
            break;
        case 'n':
            flags |= CLONE_NEWNET;
            break;
        case 'p':
            flags |= CLONE_NEWPID;
            break;
        case 'u':
            flags |= CLONE_NEWUTS;
            break;
        case 'U':
            flags |= CLONE_NEWUSER;
            break;
        default:
            usage(argv[0]);
        }
    }

    if (optind >= argc)
        usage(argv[0]);

    if (unshare(flags) == -1)
        errExit("unshare");

    execvp(argv[optind], &argv[optind]);
    errExit("execvp");
}

Compile:

gcc unshare.c -o unshare

Run:

$ readlink /proc/$$/ns/mnt
mnt:[4026531840]

$ sudo ./unshare -m /bin/bash

# readlink /proc/$$/ns/mnt
mnt:[4026532313]

setns

Given a file descriptor referring to a namespace, reassociate the calling thread (current process) with that namespace.

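The example from man setns, used here to join the UTS namespace of the child process from the clone example (pass that child's PID in the /proc path):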

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

#define errExit(msg)        \
    do                      \
    {                       \
        perror(msg);        \
        exit(EXIT_FAILURE); \
    } while (0)

int main(int argc, char *argv[])
{
    int fd;

    if (argc < 3)
    {
        fprintf(stderr, "%s /proc/PID/ns/FILE cmd args...\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    fd = open(argv[1], O_RDONLY); /* Get file descriptor for namespace */
    if (fd == -1)
        errExit("open");

    if (setns(fd, 0) == -1) /* Join that namespace */
        errExit("setns");

    execvp(argv[2], &argv[2]); /* Execute a command in namespace */
    errExit("execvp");
}

Compile:

gcc ns_exec.c -o ns_exec

Run:

$ sudo ./ns_exec /proc/25164/ns/uts /bin/bash

# hostname

helloworld

Two commands

unshare

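A wrapper around the unshare system call.

See man unshare for the command reference.

For example, to run bash in its own PID namespace. The --fork, --mount-proc and --pid options all serve the PID namespace.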

$ sudo unshare --fork --pid --mount-proc bash

# echo $$

1

# ls /proc/

1/                 devices            irq/               mdstat             schedstat          timer_list
15                 diskstats          kallsyms           meminfo            scsi/              tty/
16                 dma                kcore              misc               self/              uptime
acpi/              driver/            key-users          modules            slabinfo           version
buddyinfo          execdomains        keys               mounts             softirqs           version_signature
bus/               fb                 kmsg               mpt/               stat               vmallocinfo
cgroups            filesystems        kpagecgroup        mtrr               swaps              vmstat
cmdline            fs/                kpagecount         net/               sys/               zoneinfo
consoles           interrupts         kpageflags         pagetypeinfo       sysrq-trigger
cpuinfo            iomem              loadavg            partitions         sysvipc/
crypto             ioports            locks              sched_debug        thread-self/

nsenter

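See man nsenter for the command reference.

Production containers are kept as lightweight as possible, so they usually lack diagnostic tools. From the host, nsenter can run a shell inside a container's namespaces, for example to track down a network problem in the container.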

$ docker run -p 3306:3306 -e MYSQL_ROOT_PASSWORD=root -d mysql:5

85fa62779aad

$ docker inspect -f '{{.State.Pid}}' 85fa62779aad

25816

$ sudo nsenter -t 25816 -n zsh

# Now inside the network namespace of the mysqld container
$ ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.17.0.2  netmask 255.255.0.0  broadcast 172.17.255.255
        ether 02:42:ac:11:00:02  txqueuelen 0  (Ethernet)
        RX packets 130  bytes 14337 (14.3 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Summary

A container is essentially just a process.
The core API of container technology is the clone/unshare/setns system calls together with the 7 CLONE_NEW* flags.

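References

The Linux man(ual) pages; their examples are quite detailed.
Docker Basics: Linux Namespace (Part 1) https://coolshell.cn/articles/17010.html
Docker Basics: Linux Namespace (Part 2) https://coolshell.cn/articles/17029.html

See the coolshell articles for more hands-on examples; they are not repeated here.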