极客油画

Introduction

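In the virtualization space, Linux has gradually gained features such as Cgroups, Namespace, Seccomp, capability and AppArmor. Docker relies heavily on these features and is now hugely popular. In practice, container technology is a collection of obscure, even somewhat mysterious, system features, so Docker Inc. bundled this low-level plumbing into a single open-source project, runC, which is hosted under the OCI.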

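In June 2015 the Linux Foundation established the OCI (Open Container Initiative), which aims to define an open industry standard for the container format and runtime configuration. It was founded by Docker, Google, IBM, Microsoft, Red Hat and many other partners.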

runC is a lightweight container runtime that contains all of the container-related system-call code used by Docker. Its main features are:

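  1. Full support for Linux namespaces, including user namespaces
  2. Native support for all the security features Linux provides: SELinux, AppArmor, Seccomp, control groups, capability, pivot_root and so on. Whatever Linux can do, runC can do
  3. Native support for container live migration, with the help of the CRIU project
  4. A formal container specification, governed by the Open Container Project under the Linux Foundation; it can fairly be called a real industry standard.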

Put another way, runC's goal is to build standard containers that can run anywhere.
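
To make the namespace support a little more concrete: on Linux, each namespace is requested through a clone flag when the container process is created. The sketch below maps namespace type names to those flags. The helper cloneFlags and its lookup table are illustrative names, not runC's own code, and the constants are copied from the Linux headers so that the snippet also compiles on other platforms.

```go
package main

import "fmt"

// Clone flag values as defined in the Linux kernel headers (<linux/sched.h>).
const (
	cloneNewNS   = 0x00020000 // mount points
	cloneNewUTS  = 0x04000000 // hostname and domain name
	cloneNewIPC  = 0x08000000 // System V IPC and POSIX message queues
	cloneNewUser = 0x10000000 // user and group IDs
	cloneNewPID  = 0x20000000 // process IDs
	cloneNewNet  = 0x40000000 // network stack
)

// nsFlags maps namespace type names (as written in config.json) to clone flags.
var nsFlags = map[string]uintptr{
	"mount":   cloneNewNS,
	"uts":     cloneNewUTS,
	"ipc":     cloneNewIPC,
	"user":    cloneNewUser,
	"pid":     cloneNewPID,
	"network": cloneNewNet,
}

// cloneFlags ORs together the flags for the requested namespace types.
// Hypothetical helper, for illustration only.
func cloneFlags(types []string) (uintptr, error) {
	var flags uintptr
	for _, t := range types {
		f, ok := nsFlags[t]
		if !ok {
			return 0, fmt.Errorf("unknown namespace type %q", t)
		}
		flags |= f
	}
	return flags, nil
}

func main() {
	flags, err := cloneFlags([]string{"pid", "ipc", "uts", "mount"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("clone flags: %#x\n", flags)
}
```

On Linux, a value like this is what a Go program would place in syscall.SysProcAttr.Cloneflags before starting the child process.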

How runC creates a container

Running runc run <container-id> creates a container from the config.json file in the current directory.
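
As a rough illustration of that first step, the sketch below decodes a small config.json document into a Go struct and checks two fields. The type miniSpec and the function parseSpec are hypothetical and cover only a handful of fields; runC itself works with the much larger specs.Spec type that appears in the createContainer signature.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// miniSpec is a hypothetical, heavily trimmed view of config.json.
type miniSpec struct {
	OCIVersion string `json:"ociVersion"`
	Process    struct {
		Args []string `json:"args"`
		Cwd  string   `json:"cwd"`
	} `json:"process"`
	Root struct {
		Path     string `json:"path"`
		Readonly bool   `json:"readonly"`
	} `json:"root"`
	Hostname string `json:"hostname"`
}

// parseSpec decodes the JSON document and applies two basic sanity checks.
func parseSpec(data []byte) (*miniSpec, error) {
	var s miniSpec
	if err := json.Unmarshal(data, &s); err != nil {
		return nil, fmt.Errorf("decoding config.json: %w", err)
	}
	if len(s.Process.Args) == 0 {
		return nil, errors.New("process.args must not be empty")
	}
	if s.Root.Path == "" {
		return nil, errors.New("root.path must be set")
	}
	return &s, nil
}

func main() {
	data := []byte(`{
		"ociVersion": "1.0.2-dev",
		"process": {"args": ["tail", "-f", "/etc/hosts"], "cwd": "/"},
		"root": {"path": "rootfs", "readonly": false},
		"hostname": "runc"
	}`)
	s, err := parseSpec(data)
	if err != nil {
		panic(err)
	}
	fmt.Printf("run %q in rootfs %q\n", strings.Join(s.Process.Args, " "), s.Root.Path)
}
```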

This section walks through the createContainer flow in runC. First, the function definition:

func createContainer(context *cli.Context, id string, spec *specs.Spec) (libcontainer.Container, error) {
  config, err := specconv.CreateLibcontainerConfig(&specconv.CreateOpts{
    CgroupName:          id,
    UseSystemdCgroup:    context.GlobalBool("systemd-cgroup"),
    NoPivotRoot:         context.Bool("no-pivot"),
    NoNewKeyring:        context.Bool("no-new-keyring"),
    Spec:                spec,
  })
  if err != nil {
    return nil, err
  }

  factory, err := loadFactory(context)
  if err != nil {
    return nil, err
  }
  return factory.Create(id, config)
}

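createContainer takes the CLI context and the spec that describes the container, builds the configuration the container needs from that spec, and finally passes the configuration to the factory's Create method. A factory can be implemented for many systems, such as Linux, Solaris, Windows or Unix; here we look at the Linux implementation.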

func (l *LinuxFactory) Create(id string, config *configs.Config) (Container, error) {
  // Validate the configuration
  if err := l.Validator.Validate(config); err != nil {
    return nil, newGenericError(err, ConfigInvalid)
  }
  // Not shown in this excerpt: the code that sets containerRoot, uid and gid
  logrus.Infof("Factory create containerRoot %s", containerRoot)
  // Create the container's root directory
  if err := os.MkdirAll(containerRoot, 0711); err != nil {
    return nil, newGenericError(err, SystemError)
  }
  if err := os.Chown(containerRoot, uid, gid); err != nil {
    return nil, newGenericError(err, SystemError)
  }
  fifoName := filepath.Join(containerRoot, execFifoFilename)
  logrus.Infof("fifoName %s", fifoName)
  oldMask := syscall.Umask(0000)
  // Create the FIFO used for inter-process communication
  if err := syscall.Mkfifo(fifoName, 0622); err != nil {
    syscall.Umask(oldMask)
    return nil, newGenericError(err, SystemError)
  }
  syscall.Umask(oldMask)
  if err := os.Chown(fifoName, uid, gid); err != nil {
    return nil, newGenericError(err, SystemError)
  }
  // Build the struct that holds the container's information
  c := &linuxContainer {
    id:               id,
    root:             containerRoot,
    config:           config,
    initArgs:         l.InitArgs,
    criuPath:         l.CriuPath,
    cgroupManager:    l.NewCgroupsManager(config.Cgroups, nil),
  }
  return c, nil
}

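This is only part of the Create implementation. Its main job is to validate the container configuration, initialise the container's root directory (containerRoot) following the directory layout, and return a struct holding all of the container's information.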

With the container information in place, the container process itself has to be created. The newParentProcess function below does that.

func (c *linuxContainer) newParentProcess(p *Process, doInit bool) (parentProcess, error) {
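  // Create an anonymous pipe for parent/child communication (error checks are omitted in this excerpt)
  parentPipe, childPipe, err := newPipe()
  rootDir, err := os.Open(c.root)
  // Build the command
  cmd, err := c.commandTemplate(p, childPipe, rootDir)
  // Return the newly created init process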
  return c.newInitProcess(p, cmd, parentPipe, childPipe, rootDir)
}

As the code shows, the most important step in newParentProcess is building the command for the container. Here is its implementation in detail:

func (c *linuxContainer) commandTemplate(p *Process, childPipe, rootDir *os.File) (*exec.Cmd, error) {
  // Build the command
  cmd := exec.Command(c.initArgs[0], c.initArgs[1:]...)
  logrus.Infof("command template args1 %s args2 %v", c.initArgs[0], c.initArgs[1:])
  cmd.Stdin = p.Stdin
  cmd.Stdout = p.Stdout
  cmd.Stderr = p.Stderr
  cmd.Dir = c.config.Rootfs
  if cmd.SysProcAttr == nil {
    cmd.SysProcAttr = &syscall.SysProcAttr{}
  }
  cmd.ExtraFiles = append(p.ExtraFiles, childPipe, rootDir)
  // Tell the child which inherited descriptors hold the init pipe and the state directory
  cmd.Env = append(cmd.Env,
    fmt.Sprintf("_LIBCONTAINER_INITPIPE=%d", stdioFdCount+len(cmd.ExtraFiles)-2),
    fmt.Sprintf("_LIBCONTAINER_STATEDIR=%d", stdioFdCount+len(cmd.ExtraFiles)-1))
  // NOTE: when running a container with no PID namespace and the parent
  // process spawning the container is
  // PID1 the pdeathsig is being delivered to the container's init process by
  // the kernel for some reason
  // even with the parent still running
  if c.config.ParentDeathSignal > 0 {
    cmd.SysProcAttr.Pdeathsig = syscall.Signal(c.config.ParentDeathSignal)
  }
  return cmd, nil
}

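This code needs little explanation. Readers of the earlier chapters of this book will recognise it as much the same flow as the one used there to create the container's init process, with a few extra environment variables and arguments. The container-creation code in this book was in fact modelled on runC's implementation.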
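
The two numbers that commandTemplate places in the environment follow from how Go's os/exec package numbers inherited files: entry i of ExtraFiles becomes descriptor 3+i in the child. The sketch below reproduces just that arithmetic; childFds is a name made up for this example.

```go
package main

import "fmt"

// stdioFdCount is the number of descriptors every process starts with:
// stdin (0), stdout (1) and stderr (2).
const stdioFdCount = 3

// childFds returns the descriptor numbers the child sees for the last two
// entries of ExtraFiles, given the total length of that slice.
func childFds(extraFiles int) (secondToLast, last int) {
	return stdioFdCount + extraFiles - 2, stdioFdCount + extraFiles - 1
}

func main() {
	// No caller-supplied files: ExtraFiles = [childPipe, rootDir].
	a, b := childFds(2)
	fmt.Println(a, b) // prints "3 4"

	// One caller-supplied file first: ExtraFiles = [f, childPipe, rootDir].
	a, b = childFds(3)
	fmt.Println(a, b) // prints "4 5"
}
```

Because the pipe and the directory are always appended last, the child can find them from the environment no matter how many extra files the caller passed in.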

Finally, here is how start is implemented:

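func (c *linuxContainer) start(process *Process, isInit bool) error {
  // Create the init process
  parent, err := c.newParentProcess(process, isInit)
  if err != nil {
    return newSystemErrorWithCause(err, "creating new parent process")
  }
  logrus.Infof("libcontainer start %+v", parent)
  // The start call below actually launches the container process
  if err := parent.start(); err != nil {
    // terminate the process to ensure that it properly is reaped.
    if err := parent.terminate(); err != nil {
      logrus.Warn(err)
    }
    return newSystemErrorWithCause(err, "starting container process")
  }
  // The remaining steps of the original function are not shown in this excerpt
  return nil
}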

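At this point the container's init process has been started. That process in turn invokes runC's init command again to finish initialising the container. The argument that triggers this can be seen in factory_linux.go: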

// New returns a linux container factory based in the root directory and 
// configures the factory with the provided option funcs
func New(root string, options ...func(*LinuxFactory) error) (Factory, error) {
  if root != "" {
    if err := os.MkdirAll(root, 0700); err != nil {
      return nil, newGenericError(err, SystemError)
    }
  }
  l := &LinuxFactory{
    Root:         root,
    InitArgs:     []string{"/proc/self/exe", "init"},
    Validator:    validate.New(),
    CriuPath:     "criu",
  }
  return l, nil
}

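In New you can see the familiar /proc/self/exe, followed by the argument init. The architecture is the same as mydocker's: runC re-executes itself with init to initialise the container process.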
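
The re-exec trick is easy to reproduce in a few lines. The sketch below is Linux-only because it relies on /proc/self/exe, and the helper names initCommand and isInit are invented for this example. The parent launches its own binary again with the argument init, and the second copy takes the initialisation branch.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// initCommand builds (but does not start) a command that re-executes the
// current binary with the single argument "init".
func initCommand() *exec.Cmd {
	cmd := exec.Command("/proc/self/exe", "init")
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	return cmd
}

// isInit reports whether this process was started as the re-executed child.
func isInit(args []string) bool {
	return len(args) > 1 && args[1] == "init"
}

func main() {
	if isInit(os.Args) {
		// A real runtime would set up mounts, the hostname and so on here
		// before exec'ing the user's command.
		fmt.Println("child: running container initialisation")
		return
	}
	cmd := initCommand()
	fmt.Println("parent: re-executing as", cmd.Args)
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "re-exec failed:", err)
		os.Exit(1)
	}
}
```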

Having read this far, you should have a rough picture of the whole process by which runC creates a container:

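  1. Read the configuration file
  2. Set up the rootFileSystem
  3. Create the container through the factory; each platform has its own implementation
  4. Create the container's init process
  5. Set up the container's output pipes, mainly Go pipes
  6. Call Container.Start() to start the actual container
  7. Call back into the init method to re-initialise the process
  8. The runC parent process waits for the child to finish initialising, then exits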
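
The parent/child hand-off in the steps above depends on a pipe between the two processes. As a simplified sketch, the code below plays both roles inside one process: one side writes a JSON configuration and closes its end, the other blocks until the document arrives. The initConfig type and the function names are assumptions made for this example, and os.Pipe stands in for whatever newPipe returns in runC.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// initConfig is a hypothetical payload for the init process.
type initConfig struct {
	Args     []string `json:"args"`
	Hostname string   `json:"hostname"`
}

// sendConfig writes cfg as JSON to w and closes it so the reader sees EOF.
func sendConfig(w *os.File, cfg initConfig) error {
	defer w.Close()
	return json.NewEncoder(w).Encode(cfg)
}

// receiveConfig blocks until a complete JSON document arrives on r.
func receiveConfig(r *os.File) (initConfig, error) {
	var cfg initConfig
	err := json.NewDecoder(r).Decode(&cfg)
	return cfg, err
}

// roundTrip sends cfg through a pipe and reads it back, with the writer
// running in a goroutine in place of a separate parent process.
func roundTrip(cfg initConfig) (initConfig, error) {
	r, w, err := os.Pipe()
	if err != nil {
		return initConfig{}, err
	}
	defer r.Close()

	errc := make(chan error, 1)
	go func() { errc <- sendConfig(w, cfg) }()

	got, err := receiveConfig(r)
	if err != nil {
		return initConfig{}, err
	}
	return got, <-errc
}

func main() {
	got, err := roundTrip(initConfig{Args: []string{"tail", "-f", "/etc/hosts"}, Hostname: "runc"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("init received: %+v\n", got)
}
```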

As you can see, the flow involves three concepts: process, container and factory.

The factory creates containers, and the process handles inter-process communication and starts the container.
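
How a factory and a container fit together can be shown with a toy in-memory version. Everything in the sketch below (Config, Container, Factory, memFactory) is a deliberately small stand-in written for this article; libcontainer's real interfaces carry many more methods.

```go
package main

import (
	"errors"
	"fmt"
)

// Config holds the minimum a container needs in this toy model.
type Config struct {
	Rootfs string
	Args   []string
}

// Container is what a factory hands back to the caller.
type Container interface {
	ID() string
	Start() error
}

// Factory validates a configuration and creates containers from it.
type Factory interface {
	Create(id string, cfg *Config) (Container, error)
}

type memContainer struct {
	id      string
	cfg     *Config
	started bool
}

func (c *memContainer) ID() string { return c.id }

// Start only flips a flag here; a real runtime would launch the init process.
func (c *memContainer) Start() error {
	if c.started {
		return errors.New("container already started")
	}
	c.started = true
	return nil
}

// memFactory keeps its containers in a map instead of on disk.
type memFactory struct {
	containers map[string]*memContainer
}

func newMemFactory() *memFactory {
	return &memFactory{containers: make(map[string]*memContainer)}
}

func (f *memFactory) Create(id string, cfg *Config) (Container, error) {
	if cfg == nil || cfg.Rootfs == "" || len(cfg.Args) == 0 {
		return nil, errors.New("invalid config: rootfs and args are required")
	}
	if _, exists := f.containers[id]; exists {
		return nil, fmt.Errorf("container %q already exists", id)
	}
	c := &memContainer{id: id, cfg: cfg}
	f.containers[id] = c
	return c, nil
}

func main() {
	var f Factory = newMemFactory()
	c, err := f.Create("demo", &Config{Rootfs: "rootfs", Args: []string{"sh"}})
	if err != nil {
		panic(err)
	}
	if err := c.Start(); err != nil {
		panic(err)
	}
	fmt.Println("started", c.ID()) // prints "started demo"
}
```

Keeping creation behind an interface is what lets each platform supply its own implementation without the caller changing.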

Appendix A: Running a rootfs on its own

# Copy rootfs.tar.gz and runc to the target machine
# Create a config.json file with the content below.
# Note: CAP_NET_RAW must be present in the capability sets; without it ping fails with
#   ping: socket: Operation not permitted
{
        "ociVersion": "1.0.2-dev",
        "process": {
                "terminal": false,
                "user": {
                        "uid": 0,
                        "gid": 0
                },
                "args": [
                        "tail",
                        "-f",
                        "/etc/hosts"
                ],
                "env": [
                        "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
                        "TERM=xterm"
                ],
                "cwd": "/",
                "capabilities": {
                        "bounding":[
                                "CAP_CHOWN",
                                "CAP_DAC_OVERRIDE",
                                "CAP_FSETID",
                                "CAP_FOWNER",
                                "CAP_MKNOD",
                                "CAP_NET_RAW",
                                "CAP_SETGID",
                                "CAP_SETUID",
                                "CAP_SETFCAP",
                                "CAP_SETPCAP",
                                "CAP_NET_BIND_SERVICE",
                                "CAP_SYS_CHROOT",
                                "CAP_KILL",
                                "CAP_AUDIT_WRITE"
                        ],
                        "effective":[
                                "CAP_CHOWN",
                                "CAP_DAC_OVERRIDE",
                                "CAP_FSETID",
                                "CAP_FOWNER",
                                "CAP_MKNOD",
                                "CAP_NET_RAW",
                                "CAP_SETGID",
                                "CAP_SETUID",
                                "CAP_SETFCAP",
                                "CAP_SETPCAP",
                                "CAP_NET_BIND_SERVICE",
                                "CAP_SYS_CHROOT",
                                "CAP_KILL",
                                "CAP_AUDIT_WRITE"
                        ],
                        "permitted":[
                                "CAP_CHOWN",
                                "CAP_DAC_OVERRIDE",
                                "CAP_FSETID",
                                "CAP_FOWNER",
                                "CAP_MKNOD",
                                "CAP_NET_RAW",
                                "CAP_SETGID",
                                "CAP_SETUID",
                                "CAP_SETFCAP",
                                "CAP_SETPCAP",
                                "CAP_NET_BIND_SERVICE",
                                "CAP_SYS_CHROOT",
                                "CAP_KILL",
                                "CAP_AUDIT_WRITE"
                        ]
                },
                "rlimits": [
                        {
                                "type": "RLIMIT_NOFILE",
                                "hard": 1024,
                                "soft": 1024
                        }
                ],
                "noNewPrivileges": true
        },
        "root": {
                "path": "rootfs",
                "readonly": false
        },
        "hostname": "runc",
        "mounts": [
                {
                        "destination": "/proc",
                        "type": "proc",
                        "source": "proc"
                },
                {
                        "destination": "/dev",
                        "type": "tmpfs",
                        "source": "tmpfs",
                        "options": [
                                "nosuid",
                                "strictatime",
                                "mode=755",
                                "size=65536k"
                        ]
                },
                {
                        "destination": "/dev/pts",
                        "type": "devpts",
                        "source": "devpts",
                        "options": [
                                "nosuid",
                                "noexec",
                                "newinstance",
                                "ptmxmode=0666",
                                "mode=0620",
                                "gid=5"
                        ]
                },
                {
                        "destination": "/dev/shm",
                        "type": "tmpfs",
                        "source": "shm",
                        "options": [
                                "nosuid",
                                "noexec",
                                "nodev",
                                "mode=1777",
                                "size=65536k"
                        ]
                },
                {
                        "destination": "/dev/mqueue",
                        "type": "mqueue",
                        "source": "mqueue",
                        "options": [
                                "nosuid",
                                "noexec",
                                "nodev"
                        ]
                },
                {
                        "destination": "/sys",
                        "type": "sysfs",
                        "source": "sysfs",
                        "options": [
                                "nosuid",
                                "noexec",
                                "nodev",
                                "ro"
                        ]
                },
                {
                        "destination": "/etc/resolv.conf",
                        "options": [
                                "rbind",
                                "rprivate"
                        ],
                        "source": "/etc/resolv.conf",
                        "type": "bind"
                },
                {
                        "destination": "/etc/hostname",
                        "options": [
                                "rbind",
                                "rprivate"
                        ],
                        "source": "/etc/hostname",
                        "type": "bind"
                },
                {
                        "destination": "/etc/hosts",
                        "options": [
                                "rbind",
                                "rprivate"
                        ],
                        "source": "/etc/hosts",
                        "type": "bind"
                },
                {
                        "destination": "/sys/fs/cgroup",
                        "type": "cgroup",
                        "source": "cgroup",
                        "options": [
                                "nosuid",
                                "noexec",
                                "nodev",
                                "relatime",
                                "ro"
                        ]
                }
        ],
        "linux": {
                "resources": {
                        "devices": [
                                {
                                        "allow": false,
                                        "access": "rwm"
                                }
                        ]
                },
                "namespaces": [
                        {
                                "type": "pid"
                        },
                        {
                                "type": "ipc"
                        },
                        {
                                "type": "uts"
                        },
                        {
                                "type": "mount"
                        }
                ],
                "maskedPaths": [
                        "/proc/acpi",
                        "/proc/asound",
                        "/proc/kcore",
                        "/proc/keys",
                        "/proc/latency_stats",
                        "/proc/timer_list",
                        "/proc/timer_stats",
                        "/proc/sched_debug",
                        "/sys/firmware",
                        "/proc/scsi"
                ],
                "readonlyPaths": [
                        "/proc/bus",
                        "/proc/fs",
                        "/proc/irq",
                        "/proc/sys",
                        "/proc/sysrq-trigger"
                ]
        }
}
 
tar -xzf rootfs.tar.gz
./runc run -d --pid-file init.pid   supervisord
./runc exec -t supervisord bash
# yum clean / yum clean all  (= yum clean packages; yum clean oldheaders)
# yum clean headers      removes headers
# yum clean oldheaders   removes old headers from the cache directory (/var/cache/yum)
# yum clean packages     removes downloaded rpm packages
yum clean all; yum clean headers

On arm64 Kylin OS, runc hits a bug:

unable to start container process: error adding pid 2047088 to cgroups: failed to write 2047088: open /sys/fs/cgroup/blkio/system.slice/supervisord/cgroup.procs: no such file or directory

The fix is to patch the runc source and rebuild it. The change is made on the release-1.1 branch, as follows:

diff --git a/Makefile b/Makefile
index e3af9bc..0c0a2fd 100644
--- a/Makefile
+++ b/Makefile
@@ -68,15 +68,16 @@ recvtty sd-helper seccompagent:
 static:
        $(GO_BUILD_STATIC) -o runc .
 
-releaseall: RELEASE_ARGS := "-a arm64 -a armel -a armhf -a ppc64le -a riscv64 -a s390x"
+releaseall: RELEASE_ARGS := "-a arm64"
 releaseall: release
 
-release: runcimage
+release:
        $(CONTAINER_ENGINE) run $(CONTAINER_ENGINE_RUN_FLAGS) \
                --rm -v $(CURDIR):/go/src/$(PROJECT) \
                -e RELEASE_ARGS=$(RELEASE_ARGS) \
+               --pull never \
                $(RUNC_IMAGE) make localrelease
-       script/release_sign.sh -S $(GPG_KEYID) -r release/$(VERSION) -v $(VERSION)
+       #script/release_sign.sh -S $(GPG_KEYID) -r release/$(VERSION) -v $(VERSION)
 
 localrelease: verify-changelog
        script/release_build.sh -r release/$(VERSION) -v $(VERSION) $(RELEASE_ARGS)
diff --git a/libcontainer/cgroups/utils.go b/libcontainer/cgroups/utils.go
index fc4ae44..e84295d 100644
--- a/libcontainer/cgroups/utils.go
+++ b/libcontainer/cgroups/utils.go
@@ -393,6 +393,10 @@ func WriteCgroupProc(dir string, pid int) error {
                return nil
        }
 
+       if err := os.MkdirAll(dir, 0o755); err != nil {
+               return err
+       }
+
        file, err := OpenFile(dir, CgroupProcesses, os.O_WRONLY)
        if err != nil {
                return fmt.Errorf("failed to write %v: %w", pid, err)

Build commands:

# runc_dev-main.tar is produced by make runcimage; this takes a long time, typically around 10 hours
$ docker load -i runc_dev-main.tar
$ docker tag runc_dev:main runc_dev:release-1-1
$ make releaseall

Published on 0001-01-01; last modified on 0001-01-01.

The permanent domain of this site is jiavvc.top; you can also find me by searching for 「极客油画」.

