Linux Namespace
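Namespaces are a familiar concept in programming languages: C++ has a namespace keyword and Clojure an ns macro for organizing code into modules. Modules do not pollute one another, so module A can define a helloworld method and module B can define its own helloworld as well. Packages in Java and Go serve the same purpose.
Linux namespaces are the kernel's mechanism for isolating kernel resources: a process that uses its own isolated set of resources stays independent of other processes. They are also the technology underlying Docker containers (Docker is written mainly in Go, with C used for container creation).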
Namespaces are a feature of the Linux kernel that partitions kernel resources such that one set of processes sees one set of resources while another set of processes sees a different set of resources. The feature works by having the same namespace for a set of resources and processes, but those namespaces refer to distinct resources. Resources may exist in multiple spaces. Examples of such resources are process IDs, hostnames, user IDs, file names, and some names associated with network access, and interprocess communication.
Namespaces are a fundamental aspect of containers on Linux.
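Kernel resources
man namespaces lists the available namespace types:
Linux provides the following namespaces:
Namespace   Constant          Isolates
Cgroup      CLONE_NEWCGROUP   Cgroup root directory
IPC         CLONE_NEWIPC      System V IPC, POSIX message queues
Network     CLONE_NEWNET      Network devices, stacks, ports, etc.
Mount       CLONE_NEWNS       Mount points
PID         CLONE_NEWPID      Process IDs
User        CLONE_NEWUSER     User and group IDs
UTS         CLONE_NEWUTS      Hostname and NIS domain name
Notes: with its own network namespace a container has its own network stack and IP address, much like a small host. The mount namespace was the first one implemented, which is why its flag got the generic name CLONE_NEWNS ("new namespace") rather than something mount-specific.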
man clone describes each of these flags as follows:
CLONE_NEWCGROUP (since Linux 4.6)
Create the process in a new cgroup namespace. If this flag is not set, then (as with fork(2)) the
process is created in the same cgroup namespaces as the calling process. This flag is intended for
the implementation of containers.
For further information on cgroup namespaces, see cgroup_namespaces(7).
Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWCGROUP.
CLONE_NEWIPC (since Linux 2.6.19)
If CLONE_NEWIPC is set, then create the process in a new IPC namespace. If this flag is not set,
then (as with fork(2)), the process is created in the same IPC namespace as the calling process.
This flag is intended for the implementation of containers.
An IPC namespace provides an isolated view of System V IPC objects (see svipc(7)) and (since Linux
2.6.30) POSIX message queues (see mq_overview(7)). The common characteristic of these IPC
mechanisms is that IPC objects are identified by mechanisms other than filesystem pathnames.
Objects created in an IPC namespace are visible to all other processes that are members of that
namespace, but are not visible to processes in other IPC namespaces.
When an IPC namespace is destroyed (i.e., when the last process that is a member of the namespace
terminates), all IPC objects in the namespace are automatically destroyed.
Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWIPC. This flag can't be specified in
conjunction with CLONE_SYSVSEM.
For further information on IPC namespaces, see namespaces(7).
CLONE_NEWNET (since Linux 2.6.24)
(The implementation of this flag was completed only by about kernel version 2.6.29.)
If CLONE_NEWNET is set, then create the process in a new network namespace. If this flag is not
set, then (as with fork(2)) the process is created in the same network namespace as the calling
process. This flag is intended for the implementation of containers.
A network namespace provides an isolated view of the networking stack (network device interfaces,
IPv4 and IPv6 protocol stacks, IP routing tables, firewall rules, the /proc/net and /sys/class/net
directory trees, sockets, etc.). A physical network device can live in exactly one network
namespace. A virtual network (veth(4)) device pair provides a pipe-like abstraction that can be used to
create tunnels between network namespaces, and can be used to create a bridge to a physical network
device in another namespace.
When a network namespace is freed (i.e., when the last process in the namespace terminates), its
physical network devices are moved back to the initial network namespace (not to the parent of the
process). For further information on network namespaces, see namespaces(7).
Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWNET.
CLONE_NEWNS (since Linux 2.4.19)
If CLONE_NEWNS is set, the cloned child is started in a new mount namespace, initialized with a
copy of the namespace of the parent. If CLONE_NEWNS is not set, the child lives in the same mount
namespace as the parent.
Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWNS. It is not permitted to specify
both CLONE_NEWNS and CLONE_FS in the same clone() call.
For further information on mount namespaces, see namespaces(7) and mount_namespaces(7).
CLONE_NEWPID (since Linux 2.6.24)
If CLONE_NEWPID is set, then create the process in a new PID namespace. If this flag is not set,
then (as with fork(2)) the process is created in the same PID namespace as the calling process.
This flag is intended for the implementation of containers.
For further information on PID namespaces, see namespaces(7) and pid_namespaces(7).
Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWPID. This flag can't be specified in
conjunction with CLONE_THREAD or CLONE_PARENT.
CLONE_NEWUSER
(This flag first became meaningful for clone() in Linux 2.6.23, the current clone() semantics were
merged in Linux 3.5, and the final pieces to make the user namespaces completely usable were merged
in Linux 3.8.)
If CLONE_NEWUSER is set, then create the process in a new user namespace. If this flag is not set,
then (as with fork(2)) the process is created in the same user namespace as the calling process.
Before Linux 3.8, use of CLONE_NEWUSER required that the caller have three capabilities:
CAP_SYS_ADMIN, CAP_SETUID, and CAP_SETGID. Starting with Linux 3.8, no privileges are needed to
create a user namespace.
This flag can't be specified in conjunction with CLONE_THREAD or CLONE_PARENT. For security
reasons, CLONE_NEWUSER cannot be specified in conjunction with CLONE_FS.
For further information on user namespaces, see namespaces(7) and user_namespaces(7).
CLONE_NEWUTS (since Linux 2.6.19)
If CLONE_NEWUTS is set, then create the process in a new UTS namespace, whose identifiers are
initialized by duplicating the identifiers from the UTS namespace of the calling process. If this
flag is not set, then (as with fork(2)) the process is created in the same UTS namespace as the
calling process. This flag is intended for the implementation of containers.
A UTS namespace is the set of identifiers returned by uname(2); among these, the domain name and
the hostname can be modified by setdomainname(2) and sethostname(2), respectively. Changes made to
the identifiers in a UTS namespace are visible to all other processes in the same namespace, but
are not visible to processes in other UTS namespaces.
Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWUTS.
For further information on UTS namespaces, see namespaces(7).
The /proc filesystem
Each process exposes its namespaces as files under /proc/<pid>/ns/<ns_kind>:
$ ls -l /proc/$$/ns
total 0
lrwxrwxrwx 1 vagrant vagrant 0 Sep 1 06:45 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 vagrant vagrant 0 Sep 1 06:45 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 vagrant vagrant 0 Sep 1 06:45 mnt -> 'mnt:[4026531840]'
lrwxrwxrwx 1 vagrant vagrant 0 Sep 1 06:45 net -> 'net:[4026531993]'
lrwxrwxrwx 1 vagrant vagrant 0 Sep 1 06:45 pid -> 'pid:[4026531836]'
lrwxrwxrwx 1 vagrant vagrant 0 Sep 1 06:45 pid_for_children -> 'pid:[4026531836]'
lrwxrwxrwx 1 vagrant vagrant 0 Sep 1 06:45 user -> 'user:[4026531837]'
lrwxrwxrwx 1 vagrant vagrant 0 Sep 1 06:45 uts -> 'uts:[4026531838]'
$$ expands to the PID of the current shell.
lsns lists the namespaces the kernel is currently maintaining, including the ones it set up for a mysqld that I started with Docker:
$ sudo lsns
NS TYPE NPROCS PID USER COMMAND
4026531835 cgroup 192 1 root /lib/systemd/systemd --system --deserialize 18
4026531836 pid 191 1 root /lib/systemd/systemd --system --deserialize 18
4026531837 user 193 1 root /lib/systemd/systemd --system --deserialize 18
4026531838 uts 189 1 root /lib/systemd/systemd --system --deserialize 18
4026531839 ipc 190 1 root /lib/systemd/systemd --system --deserialize 18
4026531840 mnt 178 1 root /lib/systemd/systemd --system --deserialize 18
4026531861 mnt 1 31 root kdevtmpfs
4026531993 net 190 1 root /lib/systemd/systemd --system --deserialize 18
4026532160 mnt 1 4988 root /lib/systemd/systemd-udevd
4026532162 mnt 1 4449 systemd-network /lib/systemd/systemd-networkd
4026532181 mnt 1 4481 systemd-resolve /lib/systemd/systemd-resolved
4026532182 mnt 1 954 root /usr/sbin/ModemManager --filter-policy=strict
4026532184 mnt 1 1118 root /usr/sbin/NetworkManager --no-daemon
4026532185 mnt 1 25318 root sh
4026532190 uts 1 25318 root sh
4026532191 ipc 1 25318 root sh
4026532192 pid 1 25318 root sh
4026532194 net 1 25318 root sh
4026532241 mnt 6 1782 root /usr/sbin/apache2 -k start
4026532254 mnt 1 25816 999 mysqld
4026532255 uts 1 25816 999 mysqld
4026532256 ipc 1 25816 999 mysqld
4026532257 pid 1 25816 999 mysqld
4026532259 net 1 25816 999 mysqld
4026532313 uts 1 32225 root ./newuts helloworld
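Three syscalls
Linux provides three system calls for working with namespaces.
clone
The clone system call lets a parent create a child process that runs in its own (new) namespaces.
clone() creates a new process, in a manner similar to fork(2).
The example from man clone creates a child process in a new UTS namespace (CLONE_NEWUTS), which isolates the hostname: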
#define _GNU_SOURCE
#include <sys/wait.h>
#include <sys/utsname.h>
#include <sched.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define errExit(msg)        \
    do                      \
    {                       \
        perror(msg);        \
        exit(EXIT_FAILURE); \
    } while (0)

static int /* Start function for cloned child */
childFunc(void *arg)
{
    struct utsname uts;

    /* Change hostname in UTS namespace of child */
    if (sethostname(arg, strlen(arg)) == -1)
        errExit("sethostname");

    /* Retrieve and display hostname */
    if (uname(&uts) == -1)
        errExit("uname");
    printf("uts.nodename in child: %s\n", uts.nodename);

    /* Keep the namespace open for a while, by sleeping.
       This allows some experimentation--for example, another
       process might join the namespace. */
    sleep(200);

    return 0; /* Child terminates now */
}

#define STACK_SIZE (1024 * 1024) /* Stack size for cloned child */

int main(int argc, char *argv[])
{
    char *stack;    /* Start of stack buffer */
    char *stackTop; /* End of stack buffer */
    pid_t pid;
    struct utsname uts;

    if (argc < 2)
    {
        fprintf(stderr, "Usage: %s <child-hostname>\n", argv[0]);
        exit(EXIT_SUCCESS);
    }

    /* Allocate stack for child */
    stack = malloc(STACK_SIZE);
    if (stack == NULL)
        errExit("malloc");
    stackTop = stack + STACK_SIZE; /* Assume stack grows downward */

    /* Create child that has its own UTS namespace;
       child commences execution in childFunc() */
    pid = clone(childFunc, stackTop, CLONE_NEWUTS | SIGCHLD, argv[1]);
    if (pid == -1)
        errExit("clone");
    printf("clone() returned %ld\n", (long)pid);

    /* Parent falls through to here */
    sleep(1); /* Give child time to change its hostname */

    /* Display hostname in parent's UTS namespace. This will be
       different from hostname in child's UTS namespace. */
    if (uname(&uts) == -1)
        errExit("uname");
    printf("uts.nodename in parent: %s\n", uts.nodename);

    if (waitpid(pid, NULL, 0) == -1) /* Wait for child */
        errExit("waitpid");
    printf("child has terminated\n");

    exit(EXIT_SUCCESS);
}
Compile:
gcc newuts.c -o newuts
Run:
$ sudo ./newuts helloworld &
[1] 25052
clone() returned 25059
uts.nodename in child: helloworld # changing the child's hostname does not affect the global hostname
uts.nodename in parent: ubuntu-bionic
$ sudo ls -l /proc/25052/ns
total 0
lrwxrwxrwx 1 root root 0 Sep 1 02:29 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 root root 0 Sep 1 02:29 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 root root 0 Sep 1 02:29 mnt -> 'mnt:[4026531840]'
lrwxrwxrwx 1 root root 0 Sep 1 02:29 net -> 'net:[4026531993]'
lrwxrwxrwx 1 root root 0 Sep 1 02:29 pid -> 'pid:[4026531836]'
lrwxrwxrwx 1 root root 0 Sep 1 02:29 pid_for_children -> 'pid:[4026531836]'
lrwxrwxrwx 1 root root 0 Sep 1 02:29 user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 Sep 1 02:29 uts -> 'uts:[4026531838]'
$ sudo ls -l /proc/25059/ns
total 0
lrwxrwxrwx 1 root root 0 Sep 1 02:30 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 root root 0 Sep 1 02:30 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 root root 0 Sep 1 02:30 mnt -> 'mnt:[4026531840]'
lrwxrwxrwx 1 root root 0 Sep 1 02:30 net -> 'net:[4026531993]'
lrwxrwxrwx 1 root root 0 Sep 1 02:30 pid -> 'pid:[4026531836]'
lrwxrwxrwx 1 root root 0 Sep 1 02:30 pid_for_children -> 'pid:[4026531836]'
lrwxrwxrwx 1 root root 0 Sep 1 02:30 user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 Sep 1 02:30 uts -> 'uts:[4026532313]' # the child's uts namespace differs from the parent's
unshare
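The unshare system call does not create a new process; it changes the namespaces of the calling thread (process). As the name suggests, the caller stops sharing a namespace with its parent, which means a new namespace is created for it.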
unshare() allows a process (or thread) to disassociate parts of its execution context that are currently being shared with other processes (or threads). The main use of unshare() is to allow a process to control its shared execution context without creating a new process. The flags argument is a bit mask that specifies which parts of the execution context should be unshared.
The example from man 2 unshare uses unshare to create new namespaces:
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

/* A simple error-handling function: print an error message based
   on the value in 'errno' and terminate the calling process */
#define errExit(msg)        \
    do                      \
    {                       \
        perror(msg);        \
        exit(EXIT_FAILURE); \
    } while (0)

static void
usage(char *pname)
{
    fprintf(stderr, "Usage: %s [options] program [arg...]\n", pname);
    fprintf(stderr, "Options can be:\n");
    fprintf(stderr, " -i unshare IPC namespace\n");
    fprintf(stderr, " -m unshare mount namespace\n");
    fprintf(stderr, " -n unshare network namespace\n");
    fprintf(stderr, " -p unshare PID namespace\n");
    fprintf(stderr, " -u unshare UTS namespace\n");
    fprintf(stderr, " -U unshare user namespace\n");
    exit(EXIT_FAILURE);
}

int main(int argc, char *argv[])
{
    int flags, opt;

    flags = 0;

    /* Translate each option into the matching CLONE_NEW* flag */
    while ((opt = getopt(argc, argv, "imnpuU")) != -1)
    {
        switch (opt)
        {
        case 'i':
            flags |= CLONE_NEWIPC;
            break;
        case 'm':
            flags |= CLONE_NEWNS;
            break;
        case 'n':
            flags |= CLONE_NEWNET;
            break;
        case 'p':
            flags |= CLONE_NEWPID;
            break;
        case 'u':
            flags |= CLONE_NEWUTS;
            break;
        case 'U':
            flags |= CLONE_NEWUSER;
            break;
        default:
            usage(argv[0]);
        }
    }

    if (optind >= argc)
        usage(argv[0]);

    if (unshare(flags) == -1)
        errExit("unshare");

    /* Run the requested program inside the new namespaces */
    execvp(argv[optind], &argv[optind]);
    errExit("execvp");
}
Compile:
gcc unshare.c -o unshare
Run:
$ readlink /proc/$$/ns/mnt
mnt:[4026531840]
$ sudo ./unshare -m /bin/bash
# readlink /proc/$$/ns/mnt
mnt:[4026532313]
setns
Given a file descriptor referring to a namespace, reassociate the calling thread (current process) with that namespace.
The example from man setns joins the UTS namespace of the child process from the clone example (pass the /proc path for that child's PID):
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

#define errExit(msg)        \
    do                      \
    {                       \
        perror(msg);        \
        exit(EXIT_FAILURE); \
    } while (0)

int main(int argc, char *argv[])
{
    int fd;

    if (argc < 3)
    {
        fprintf(stderr, "%s /proc/PID/ns/FILE cmd args...\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    fd = open(argv[1], O_RDONLY); /* Get file descriptor for namespace */
    if (fd == -1)
        errExit("open");

    if (setns(fd, 0) == -1) /* Join that namespace */
        errExit("setns");

    execvp(argv[2], &argv[2]); /* Execute a command in namespace */
    errExit("execvp");
}
Compile:
gcc ns_exec.c -o ns_exec
Run:
$ sudo ./ns_exec /proc/25164/ns/uts /bin/bash
# hostname
helloworld
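Two commands
unshare
The unshare command is a wrapper around the unshare system call; see man unshare for the command reference.
For example, to run bash in its own PID namespace: --pid creates the new PID namespace, --fork makes bash run as a forked child inside it, and --mount-proc mounts a fresh /proc that reflects it.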
$ sudo unshare --fork --pid --mount-proc bash
# echo $$
1
# ls /proc/
1/ devices irq/ mdstat schedstat timer_list
15 diskstats kallsyms meminfo scsi/ tty/
16 dma kcore misc self/ uptime
acpi/ driver/ key-users modules slabinfo version
buddyinfo execdomains keys mounts softirqs version_signature
bus/ fb kmsg mpt/ stat vmallocinfo
cgroups filesystems kpagecgroup mtrr swaps vmstat
cmdline fs/ kpagecount net/ sys/ zoneinfo
consoles interrupts kpageflags pagetypeinfo sysrq-trigger
cpuinfo iomem loadavg partitions sysvipc/
crypto ioports locks sched_debug thread-self/
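nsenter
See man nsenter for the command reference.
Production containers are kept as lightweight as possible, so they often lack diagnostic tools. From the host, nsenter can run a shell inside a container's namespaces, which is useful for tasks such as troubleshooting the container's network.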
$ docker run -p 3306:3306 -e MYSQL_ROOT_PASSWORD=root -d mysql:5
85fa62779aad
$ docker inspect -f '{{.State.Pid}}' 85fa62779aad
25816
$ sudo nsenter -t 25816 -n zsh
# now inside the network namespace of the mysqld container
$ ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.17.0.2 netmask 255.255.0.0 broadcast 172.17.255.255
ether 02:42:ac:11:00:02 txqueuelen 0 (Ethernet)
RX packets 130 bytes 14337 (14.3 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
loop txqueuelen 1000 (Local Loopback)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
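Summary
- A container is really just a process.
- The core API of container technology is the clone/unshare/setns system calls plus the 7 CLONE_NEW* flags listed above.
References
- The Linux man(ual) pages; their examples are quite detailed.
- DOCKER基础技术:LINUX NAMESPACE(上) (Docker Fundamentals: Linux Namespace, part 1) https://coolshell.cn/articles/17010.html
- DOCKER基础技术:LINUX NAMESPACE(下) (Docker Fundamentals: Linux Namespace, part 2) https://coolshell.cn/articles/17029.html
See the coolshell articles for more hands-on examples; they are not repeated here.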
Last modified on 2020-09-01