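Every k8s component can be deployed as a container, except the kubelet, which must run directly on the host.

When the kubelet starts, it automatically loads every Pod YAML file under /etc/kubernetes/manifests/ and starts those Pods on the local machine. Pods started this way are called "Static Pods".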
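
As a rough illustration of the directory scan described above, here is a minimal sketch in Go. The helper names `isManifest` and `listManifests` are hypothetical, and the extension filter is an assumption made for this sketch; the real kubelet also parses each file and keeps watching the directory for changes.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// isManifest reports whether a file name looks like a Pod manifest.
// Assumption for this sketch: hidden files are skipped and only
// .yaml, .yml and .json files are considered.
func isManifest(name string) bool {
	if strings.HasPrefix(name, ".") {
		return false
	}
	switch filepath.Ext(name) {
	case ".yaml", ".yml", ".json":
		return true
	}
	return false
}

// listManifests returns the manifest files found directly under dir.
// os.ReadDir already returns entries sorted by file name.
func listManifests(dir string) ([]string, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	var files []string
	for _, e := range entries {
		if !e.IsDir() && isManifest(e.Name()) {
			files = append(files, filepath.Join(dir, e.Name()))
		}
	}
	return files, nil
}

func main() {
	files, err := listManifests("/etc/kubernetes/manifests")
	if err != nil {
		fmt.Println("cannot read manifest directory:", err)
		return
	}
	for _, f := range files {
		fmt.Println("static pod manifest:", f)
	}
}
```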

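The kubeadm init command generates a bootstrap token for the cluster. The kubeadm join command needs this token: it is the credential that other nodes present when they ask to join the cluster.

Once the token has been generated, kubeadm stores key information about the Master node, such as ca.crt, in Etcd as a ConfigMap, for use when Node machines are deployed later. This ConfigMap is named cluster-info and lives in the kube-public namespace.

Why does kubeadm join need this token?

Any machine that wants to become a node in a Kubernetes cluster must register with the cluster's apiserver, and talking to the apiserver requires the corresponding CA certificate.
To keep installation a one-step operation, we cannot ask users to copy these files from the Master node by hand. kubeadm therefore has to make at least one "insecure mode" request to the apiserver to fetch the cluster-info stored in the ConfigMap, and the bootstrap token is what provides the security check during that request.
Because this request runs in "insecure mode", the bootstrap token has an expiry time.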
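
To make the token and its expiry a little more concrete, the sketch below checks the documented bootstrap token format (a 6-character ID, a dot, and a 16-character secret, all lowercase alphanumerics) and a TTL. The helper names `validTokenFormat` and `expired` are hypothetical, and the token value is only an example in the right format. The 24-hour TTL is kubeadm's default.

```go
package main

import (
	"fmt"
	"regexp"
	"time"
)

// tokenPattern is the bootstrap token format: "<6-char id>.<16-char secret>".
var tokenPattern = regexp.MustCompile(`^[a-z0-9]{6}\.[a-z0-9]{16}$`)

// validTokenFormat reports whether token has the bootstrap token shape.
func validTokenFormat(token string) bool {
	return tokenPattern.MatchString(token)
}

// expired reports whether a token created at createdAt with the given TTL
// is no longer usable at time now.
func expired(createdAt time.Time, ttl time.Duration, now time.Time) bool {
	return !now.Before(createdAt.Add(ttl))
}

func main() {
	token := "abcdef.0123456789abcdef" // example value, not a real secret
	fmt.Println("format ok:", validTokenFormat(token))

	created := time.Now()
	ttl := 24 * time.Hour // kubeadm's default token TTL
	fmt.Println("expired after 1h: ", expired(created, ttl, created.Add(time.Hour)))
	fmt.Println("expired after 25h:", expired(created, ttl, created.Add(25*time.Hour)))
}
```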

1. Architecture

                              .---{etcd}
                             /
+----------------------------|------------------+
|Master  {Controller}    {  API }----{Scheduler}|
|        {  Manager }----{Server}               |
+----------------------------\-\----------------+
                             |\ \
                             \ \ \
+-----------------------------\-\-\-------------+
|        +------------------+  | \ \            |
|        |    Node          |  |  \ *--.        |
|      /(|{Networking}      |  |   \    \__     |
| CNI<|  |                  |  |    \      |    |
|+-----X(|{ kubelet  }------|--+ +----+  +-\--+ |
||CRI<|  |                  |    |Node|  |Node| |
||     \(|{ContainerRuntime}--+  |    |  |    | |
||       |                  | |  +----+  +----+ |
|+-CSI---|{VolumePlugin}    | |                 |
||       |                  | |->OCI            |
|+-grpc--|{DevicePlugin}    | |                 |
|        |                  | |                 |
|        |{LinuxOS}  -------|-+                 |
|        +------------------+                   |
|                                               |
+-----------------------------------------------+

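As the figure above shows, a k8s cluster consists of two kinds of nodes, Master and Node, which correspond to the control-plane node and the compute node respectively. The control-plane node is made up of kube-apiserver, which serves the API; kube-scheduler, which handles scheduling; and kube-controller-manager, which handles container orchestration. All persistent cluster data is processed by kube-apiserver and then stored in Etcd.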

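The core of a compute node is a component called the kubelet. The kubelet:

uses the CRI (Container Runtime Interface) to talk to the various container runtimes;
uses the gRPC protocol to talk to Device Plugins (which manage host physical devices such as GPUs);
uses the CNI (Container Networking Interface) to talk to the various network plugins;
uses the CSI (Container Storage Interface) to talk to the various persistent storage backends.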

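2. Concept notes

Taints and tolerations

Set a taint on a node: kubectl taint nodes node1 foo=bar:NoSchedule
Remove the master taint from all nodes: kubectl taint nodes --all node-role.kubernetes.io/master-

A Pod declares tolerations in its spec, for example to tolerate the master taint: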

apiVersion: v1
kind: Pod
...
spec:
  tolerations:
  - key: "node-role.kubernetes.io/master"
    operator: "Exists"
    effect: "NoSchedule"
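
The matching rule between a toleration and a taint can be sketched as follows. This is a simplified model for illustration, not the scheduler's code: the `Taint` and `Toleration` structs and the `tolerates` function are hypothetical stand-ins that keep only the key, operator, value and effect fields. It follows the documented semantics: an empty effect matches every effect, `Exists` ignores the value, and `Equal` (the default operator) requires equal values.

```go
package main

import "fmt"

// Taint and Toleration are trimmed-down, hypothetical versions of the API types.
type Taint struct {
	Key, Value, Effect string
}

type Toleration struct {
	Key, Operator, Value, Effect string
}

// tolerates reports whether tol matches taint t.
func tolerates(tol Toleration, t Taint) bool {
	// An empty effect in the toleration matches any taint effect.
	if tol.Effect != "" && tol.Effect != t.Effect {
		return false
	}
	// An empty key (used with Exists) matches any taint key.
	if tol.Key != "" && tol.Key != t.Key {
		return false
	}
	switch tol.Operator {
	case "Exists":
		return true
	case "", "Equal": // Equal is the default operator
		return tol.Value == t.Value
	default:
		return false
	}
}

func main() {
	master := Taint{Key: "node-role.kubernetes.io/master", Effect: "NoSchedule"}
	foo := Taint{Key: "foo", Value: "bar", Effect: "NoSchedule"}

	// The toleration from the Pod spec above.
	tol := Toleration{Key: "node-role.kubernetes.io/master", Operator: "Exists", Effect: "NoSchedule"}

	fmt.Println("tolerates master taint:", tolerates(tol, master))
	fmt.Println("tolerates foo=bar taint:", tolerates(tol, foo))
}
```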

3. kubelet

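In the k8s community, everything related to the kubelet and container runtime management falls under SIG-Node. In any case, I would not recommend making large changes to the kubelet code: keeping the kubelet close to upstream matters for the same reason that keeping kube-apiserver close to upstream does. The kubelet itself also works according to the "controller" pattern, which can be illustrated with the following diagram: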

                                                                                             4-sources
                       chooseRuntime                                                        api-server(primary,watch)
                      (dockershim,remote)                                                   http endpoint(pull)
                              |           AddPodAdmitHandler                                http server(push)
                              | NewContainerGC    |         Eviction  .---<----.            file(pull)
                              |         |         |            |     {NodeStatus}
       +-------+------+->--+--+----+----+---------+-------------.     *---->---*.-<---.
       |       |      |    |       |                           . \ **    ./    {Network}   
registerListers|      |    |       |                     . *                * .{ Status}
               |      |    | NewGenericPLEG           .*                        *->--.*--<-.
      diskSpaceManager|    |                         *                            * {Status }
                      |    |                        *                              *{Manager}
                oomWatcher |                       *                                **-->--*
                           |                      *                                  * {PLEG}
                 InitNetworkPlugin               *                                    ->*>-*
                                                (       SyncLoop                       )  .-<---.
                                                *      <-chan kubetypes.PodUpdate*     | {volume }
                                                *       <chan *pleg.PodLifecycleEvent  V\{Manager}
                                                (            periodic sync events      )  *->---*
                                                 *            housekeeping events     * .--<--.
                                                  *                                  * { image }
                                                   *                                * \{Manager}
+------------------------+                          *                              *    *-->--*
|PodUpdateWorker(e.g.ADD)|                           *.                          .*
|*generate pod status    |                             * .                    . *
|*check volume status    |                                 *   .        .   *  \
|call runtime to start   |                                         **           \
|containers              |              HandlePods                               \
+------------------<-----+<----{Add,Update,Remove,Delete,...}---------------------*

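As the figure shows, the core of the kubelet's work is a control loop, SyncLoop (the large circle in the figure). Four kinds of events drive this loop:

Pod update events;
Pod lifecycle changes;
the sync period configured by the kubelet itself;
periodic cleanup (housekeeping) events.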
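
The four event sources map naturally onto a Go `select` over four channels. The sketch below is loosely modeled on that idea and is not the kubelet's actual code: the function `syncLoopIteration` here, its parameters, and the string payloads are simplifications chosen for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// syncLoopIteration blocks until one of the four event sources fires and
// returns a description of the event it handled.
func syncLoopIteration(
	configCh <-chan string, // Pod update events
	plegCh <-chan string, // Pod lifecycle events
	syncCh <-chan time.Time, // periodic sync ticks
	housekeepingCh <-chan time.Time, // periodic cleanup ticks
) string {
	select {
	case u := <-configCh:
		return "pod update: " + u
	case e := <-plegCh:
		return "pod lifecycle event: " + e
	case <-syncCh:
		return "periodic sync"
	case <-housekeepingCh:
		return "housekeeping"
	}
}

func main() {
	configCh := make(chan string, 1)
	plegCh := make(chan string, 1)
	syncTicker := time.NewTicker(20 * time.Millisecond)
	defer syncTicker.Stop()
	housekeepingTicker := time.NewTicker(50 * time.Millisecond)
	defer housekeepingTicker.Stop()

	configCh <- "ADD nginx"
	plegCh <- "ContainerStarted nginx"

	// Run a few iterations; the order of events depends on timing.
	for i := 0; i < 5; i++ {
		fmt.Println(syncLoopIteration(configCh, plegCh, syncTicker.C, housekeepingTicker.C))
	}
}
```

A single loop that multiplexes all sources keeps event handling serialized, so the handlers do not need to coordinate with each other.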

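So, like other controllers, the first thing the kubelet does at startup is set up its Listers, that is, register Informers for the events it cares about. These Informers are the source of the data that SyncLoop processes.

In addition, the kubelet maintains many other sub-control-loops (the small circles in the figure). These loops are usually named "something Manager", for example Volume Manager, Image Manager, and Node Status Manager.

As you might expect, each of these loops uses the controller pattern to carry out one specific kubelet responsibility. The Node Status Manager, for example, responds to changes in the Node's state, collects that state, and reports it to the APIServer as a heartbeat. The CPU Manager maintains information about the Node's CPU cores so that, when a Pod requests cores through cpuset, the usage and availability of those cores are tracked correctly.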

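So how does SyncLoop turn changes to Pod objects into container operations?

In fact, the kubelet also uses the Watch mechanism to observe changes to the Pod objects that concern it. The filter for this Watch is that the Pod's nodeName field equals the kubelet's own node name. The kubelet caches these Pods in memory.

Once a Pod has been scheduled and bound to a Node, changes to that Pod trigger the Handler that the kubelet registered in its control loop, which is the HandlePods part of the figure above. By checking the Pod's state in its in-memory cache, the kubelet can tell that this is a newly scheduled Pod and runs the Handler logic for the ADD event.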
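
The nodeName filter and the "is this Pod new to me?" check can be sketched with a small in-memory cache. Everything here is hypothetical and heavily simplified: the `Pod` struct carries only a few fields, and `podCache.observe` stands in for the kubelet's real bookkeeping.

```go
package main

import "fmt"

// Pod is a hypothetical, minimal stand-in for the Pod API object.
type Pod struct {
	UID, Name, NodeName, Image string
}

// podCache remembers the Pods bound to one node.
type podCache struct {
	nodeName string
	pods     map[string]Pod // keyed by Pod UID
}

func newPodCache(nodeName string) *podCache {
	return &podCache{nodeName: nodeName, pods: make(map[string]Pod)}
}

// observe classifies an incoming Pod object by comparing it with the cache.
func (c *podCache) observe(p Pod) string {
	// The watch filter: only Pods bound to this node are of interest.
	if p.NodeName != c.nodeName {
		return "IGNORE"
	}
	old, seen := c.pods[p.UID]
	c.pods[p.UID] = p
	switch {
	case !seen:
		return "ADD" // first time this node sees the Pod
	case old != p:
		return "UPDATE"
	default:
		return "NOOP"
	}
}

func main() {
	cache := newPodCache("node1")
	web := Pod{UID: "u1", Name: "web", NodeName: "node1", Image: "nginx:1.25"}

	fmt.Println(cache.observe(web)) // newly bound Pod
	web.Image = "nginx:1.26"
	fmt.Println(cache.observe(web)) // same Pod, changed spec
	fmt.Println(cache.observe(Pod{UID: "u2", Name: "db", NodeName: "node2"}))
}
```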

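During processing, the kubelet starts a separate goroutine called a Pod Update Worker to handle the Pod. For an ADD event, for example, the kubelet generates the Pod Status for the new Pod and checks whether the Volumes the Pod declares are ready. It then calls the underlying container runtime (Docker, for example) to start creating the containers the Pod defines. For an UPDATE event, the kubelet calls the container runtime to rebuild containers according to what changed in the Pod object.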
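
A minimal sketch of the "one worker goroutine per Pod" idea follows. The types `update` and `podWorkers` and the helper `runUpdates` are hypothetical and much simpler than the kubelet's real pod workers; the sketch only shows that updates for the same Pod are handled in order by a single goroutine, while different Pods are handled concurrently.

```go
package main

import (
	"fmt"
	"sync"
)

// update is a hypothetical event for one Pod.
type update struct {
	podUID, kind string
}

// podWorkers runs one goroutine per Pod UID.
type podWorkers struct {
	mu     sync.Mutex
	chans  map[string]chan update
	wg     sync.WaitGroup
	handle func(update) // must be safe to call from several goroutines
}

func newPodWorkers(handle func(update)) *podWorkers {
	return &podWorkers{chans: make(map[string]chan update), handle: handle}
}

// dispatch sends u to the Pod's worker, starting the worker on first use.
func (w *podWorkers) dispatch(u update) {
	w.mu.Lock()
	ch, ok := w.chans[u.podUID]
	if !ok {
		ch = make(chan update, 8)
		w.chans[u.podUID] = ch
		w.wg.Add(1)
		go func() {
			defer w.wg.Done()
			for next := range ch {
				w.handle(next)
			}
		}()
	}
	w.mu.Unlock()
	ch <- u
}

// stop closes all worker channels and waits for the workers to drain them.
func (w *podWorkers) stop() {
	w.mu.Lock()
	for _, ch := range w.chans {
		close(ch)
	}
	w.chans = make(map[string]chan update)
	w.mu.Unlock()
	w.wg.Wait()
}

// runUpdates dispatches updates and returns, per Pod, the kinds in handling order.
func runUpdates(updates []update) map[string][]string {
	var mu sync.Mutex
	seen := make(map[string][]string)
	w := newPodWorkers(func(u update) {
		mu.Lock()
		defer mu.Unlock()
		seen[u.podUID] = append(seen[u.podUID], u.kind)
	})
	for _, u := range updates {
		w.dispatch(u)
	}
	w.stop()
	return seen
}

func main() {
	got := runUpdates([]update{
		{podUID: "pod-a", kind: "ADD"},
		{podUID: "pod-b", kind: "ADD"},
		{podUID: "pod-a", kind: "UPDATE"},
	})
	fmt.Println("pod-a:", got["pod-a"])
	fmt.Println("pod-b:", got["pod-b"])
}
```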

Note that when the kubelet calls the underlying container runtime, it does not call the Docker API directly. It goes through a set of gRPC interfaces called CRI (Container Runtime Interface).

+------------------------------------------------------------------------------------------------------------+
|                                                 +---------------------------------------------------------+|
|                                                 |                          Management                     ||
|                                                 |+-<-----+                 +----------+      .----<----.  ||
|                                                 |Vschedul|<pod,node list---|api-server|     /Workloads  \ ||
|                  .------------------------------||-ing   |---------bind--->|  {etcd}  |----VOrchestration|||
|                 /                               |+---->--+                 +----------+     *----->-----* ||
|                /                                +---------------------------------------------------------+|
|            pod/       CRI Spec-----+                                                                       |
|+-------------V---------------------+---------------------------------------+                               |
||                           Kubelet |Sandbox:                               |client api                     |
||  -------        -------           |Create/Delete/List  +--dockershim------+----------->docker             |
|| /kubelet\  pod /Generic\ CRI grpc |Container:          |                  |                               |
||^SyncLoop V---->Runtime |----------|------------------->|                  |                               |
|| \       /      \SyncPod/          |Create/Start/Exec   |                  |                               |
||  -------        -------           |Image:              +--remote(no-op)---|->CRI shim->ContainerRuntime   |
||                                   |Pull/List                              |                               |
|+-----------------------------------+---------------------------------------+                               |
+------------------------------------------------------------------------------------------------------------+

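What makes CRI work is that every container project can now implement its own CRI shim and handle CRI requests itself.

The containerd project in CNCF provides a typical CRI shim capability: it translates the CRI requests issued by Kubernetes into calls to containerd, which then creates runC containers. runC is the component that actually performs the basic operations described earlier, such as setting up the container's Namespaces, Cgroups, and chroot.

The CRI interfaces are defined as follows: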

// RuntimeService interface should be implemented by a container runtime.
// The methods should be thread-safe.
type RuntimeService interface {
        RuntimeVersioner
        ContainerManager
        PodSandboxManager
        ContainerStatsManager

        // UpdateRuntimeConfig updates runtime configuration if specified
        UpdateRuntimeConfig(runtimeConfig *runtimeapi.RuntimeConfig) error
        // Status returns the status of the runtime.
        Status() (*runtimeapi.RuntimeStatus, error)
}
// RuntimeVersioner contains methods for runtime name, version and API version.
type RuntimeVersioner interface {
        // Version returns the runtime name, runtime version and runtime API version
        Version(apiVersion string) (*runtimeapi.VersionResponse, error)
}
// ContainerManager contains methods to manipulate containers managed by a
// container runtime. The methods are thread-safe.
type ContainerManager interface {
        // CreateContainer creates a new container in specified PodSandbox.
        CreateContainer(podSandboxID string, config *runtimeapi.ContainerConfig, sandboxConfig *runtimeapi.PodSandboxConfig) (string, error)
        // StartContainer starts the container.
        StartContainer(containerID string) error
        // StopContainer stops a running container with a grace period (i.e., timeout).
        StopContainer(containerID string, timeout int64) error
        // RemoveContainer removes the container.
        RemoveContainer(containerID string) error
        // ListContainers lists all containers by filters.
        ListContainers(filter *runtimeapi.ContainerFilter) ([]*runtimeapi.Container, error)
        // ContainerStatus returns the status of the container.
        ContainerStatus(containerID string) (*runtimeapi.ContainerStatus, error)
        // UpdateContainerResources updates the cgroup resources for the container.
        UpdateContainerResources(containerID string, resources *runtimeapi.LinuxContainerResources) error
        // ExecSync executes a command in the container, and returns the stdout output.
        // If command exits with a non-zero exit code, an error is returned.
        ExecSync(containerID string, cmd []string, timeout time.Duration) (stdout []byte, stderr []byte, err error)
        // Exec prepares a streaming endpoint to execute a command in the container, and returns the address.
        Exec(*runtimeapi.ExecRequest) (*runtimeapi.ExecResponse, error)
        // Attach prepares a streaming endpoint to attach to a running container, and returns the address.
        Attach(req *runtimeapi.AttachRequest) (*runtimeapi.AttachResponse, error)
        // ReopenContainerLog asks runtime to reopen the stdout/stderr log file
        // for the container. If it returns error, new container log file MUST NOT
        // be created.
        ReopenContainerLog(ContainerID string) error
}
// PodSandboxManager contains methods for operating on PodSandboxes. The methods
// are thread-safe.
type PodSandboxManager interface {
        // RunPodSandbox creates and starts a pod-level sandbox. Runtimes should ensure
        // the sandbox is in ready state.
        RunPodSandbox(config *runtimeapi.PodSandboxConfig, runtimeHandler string) (string, error)
        // StopPodSandbox stops the sandbox. If there are any running containers in the
        // sandbox, they should be force terminated.
        StopPodSandbox(podSandboxID string) error
        // RemovePodSandbox removes the sandbox. If there are running containers in the
        // sandbox, they should be forcibly removed.
        RemovePodSandbox(podSandboxID string) error
        // PodSandboxStatus returns the Status of the PodSandbox.
        PodSandboxStatus(podSandboxID string) (*runtimeapi.PodSandboxStatus, error)
        // ListPodSandbox returns a list of Sandbox.
        ListPodSandbox(filter *runtimeapi.PodSandboxFilter) ([]*runtimeapi.PodSandbox, error)
        // PortForward prepares a streaming endpoint to forward ports from a PodSandbox, and returns the address.
        PortForward(*runtimeapi.PortForwardRequest) (*runtimeapi.PortForwardResponse, error)
}
// ContainerStatsManager contains methods for retrieving the container
// statistics.
type ContainerStatsManager interface {
        // ContainerStats returns stats of the container. If the container does not
        // exist, the call returns an error.
        ContainerStats(containerID string) (*runtimeapi.ContainerStats, error)
        // ListContainerStats returns stats of all running containers.
        ListContainerStats(filter *runtimeapi.ContainerStatsFilter) ([]*runtimeapi.ContainerStats, error)
}

// ImageManagerService interface should be implemented by a container image
// manager.
// The methods should be thread-safe.
type ImageManagerService interface {
        // ListImages lists the existing images.
        ListImages(filter *runtimeapi.ImageFilter) ([]*runtimeapi.Image, error)
        // ImageStatus returns the status of the image.
        ImageStatus(image *runtimeapi.ImageSpec) (*runtimeapi.Image, error)
        // PullImage pulls an image with the authentication config.
        PullImage(image *runtimeapi.ImageSpec, auth *runtimeapi.AuthConfig, podSandboxConfig *runtimeapi.PodSandboxConfig) (string, error)
        // RemoveImage removes the image.
        RemoveImage(image *runtimeapi.ImageSpec) error
        // ImageFsInfo returns information of the filesystem that is used to store images.
        ImageFsInfo() ([]*runtimeapi.FilesystemUsage, error)
}
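
To show how a caller might drive such an interface, here is a small self-contained sketch. It does not use the real `runtimeapi` types: `miniRuntime` is a hypothetical, trimmed-down interface with plain string arguments, and `fakeRuntime` is an in-memory stand-in for a CRI shim. The `startPod` function illustrates the call order suggested by the interface above: create the sandbox first, then create and start each container inside it.

```go
package main

import (
	"errors"
	"fmt"
)

// miniRuntime is a hypothetical subset of the runtime interface above,
// with simplified signatures.
type miniRuntime interface {
	RunPodSandbox(podName string) (string, error)
	CreateContainer(podSandboxID, containerName string) (string, error)
	StartContainer(containerID string) error
}

// fakeRuntime keeps sandboxes and containers in memory.
type fakeRuntime struct {
	next       int
	sandboxes  map[string]string // sandbox ID -> pod name
	containers map[string]string // container ID -> state
}

func newFakeRuntime() *fakeRuntime {
	return &fakeRuntime{sandboxes: map[string]string{}, containers: map[string]string{}}
}

func (f *fakeRuntime) newID(prefix string) string {
	f.next++
	return fmt.Sprintf("%s-%d", prefix, f.next)
}

func (f *fakeRuntime) RunPodSandbox(podName string) (string, error) {
	id := f.newID("sandbox")
	f.sandboxes[id] = podName
	return id, nil
}

func (f *fakeRuntime) CreateContainer(podSandboxID, containerName string) (string, error) {
	if _, ok := f.sandboxes[podSandboxID]; !ok {
		return "", errors.New("unknown sandbox " + podSandboxID)
	}
	id := f.newID(containerName)
	f.containers[id] = "created"
	return id, nil
}

func (f *fakeRuntime) StartContainer(containerID string) error {
	if f.containers[containerID] != "created" {
		return errors.New("container not in created state: " + containerID)
	}
	f.containers[containerID] = "running"
	return nil
}

// startPod creates the sandbox, then creates and starts each container in it.
func startPod(rt miniRuntime, podName string, containers []string) ([]string, error) {
	sandboxID, err := rt.RunPodSandbox(podName)
	if err != nil {
		return nil, err
	}
	var ids []string
	for _, name := range containers {
		id, err := rt.CreateContainer(sandboxID, name)
		if err != nil {
			return ids, err
		}
		if err := rt.StartContainer(id); err != nil {
			return ids, err
		}
		ids = append(ids, id)
	}
	return ids, nil
}

func main() {
	rt := newFakeRuntime()
	ids, err := startPod(rt, "web", []string{"nginx", "sidecar"})
	fmt.Println("started containers:", ids, "error:", err)
}
```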

Published on 0001-01-01; last modified on 0001-01-01.
