Installing an HA Cluster with kubeadm

This page shows how to install a highly available Kubernetes cluster with three master nodes using kubeadm. For information on how to create a cluster with kubeadm after performing this installation process, see the official kubeadm documentation.


Before you begin

  • A compatible Linux host. The Kubernetes project provides generic instructions for Debian- and Ubuntu-based Linux distributions, as well as for some distributions that do not ship a package manager.

  • 8 GB or more of RAM per machine (any less leaves little memory for your applications).

  • 2 CPU cores or more.

  • Full network connectivity between all machines in the cluster (a public or a private network is fine).

  • Unique hostname, MAC address, and product_uuid for every node. See here for more details.

  • Certain ports open on your machines.

  • Swap disabled. You MUST disable swap for the kubelet to work properly (see the example below).
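
A minimal sketch for turning swap off now and keeping it off across reboots (the sed expression assumes each swap entry in /etc/fstab sits on its own line):

sudo swapoff -a
# comment out any swap entries so they are not re-enabled on boot
sudo sed -i '/\sswap\s/ s/^/#/' /etc/fstab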

info

Ubuntu Server 18.04 or later is recommended as the operating system.

Letting iptables see bridged traffic

Make sure the br_netfilter module is loaded. You can check this by running lsmod | grep br_netfilter. To load it explicitly, run sudo modprobe br_netfilter.

For iptables on your Linux nodes to correctly see bridged traffic, make sure net.bridge.bridge-nf-call-iptables is set to 1 in your sysctl configuration. For example:

cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
br_netfilter
EOF

cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
EOF
sudo sysctl --system
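
A quick way to confirm the settings took effect (a minimal check, assuming the commands above completed without error):

lsmod | grep br_netfilter
sysctl net.bridge.bridge-nf-call-iptables net.bridge.bridge-nf-call-ip6tables
# both values should be reported as 1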

Installing a container runtime

To run containers in Pods, Kubernetes uses a container runtime.

By default, Kubernetes uses the Container Runtime Interface (CRI) to interact with the container runtime you have chosen.

If you do not specify a runtime, kubeadm automatically tries to detect an installed container runtime by scanning a list of well-known Unix domain sockets. The table below lists some container runtimes and their socket paths:

Runtime      Domain socket
Docker       /var/run/dockershim.sock
containerd   /run/containerd/containerd.sock
CRI-O        /var/run/crio/crio.sock

If both Docker and containerd are detected, Docker takes precedence. If any other two or more runtimes are detected, kubeadm prints an error message and exits.

tip

The kubelet integrates with Docker through its built-in dockershim CRI implementation.

See Container Runtimes for more information.

Installing kubeadm

You will need to install the following packages on each machine:

  • kubeadm: the command to bootstrap the cluster.

  • kubelet: the component that runs on every node in the cluster and starts Pods and containers.

  • kubectl: the command-line tool to talk to your cluster.

kubeadm will not install or manage kubelet or kubectl for you, so you need to make sure their versions match the version of the control plane installed with kubeadm. Otherwise you risk a version skew that can lead to unexpected, buggy behaviour.

1. Update the apt package index and install the packages needed to use the Kubernetes apt repository:
sudo apt-get update
sudo apt-get install -y apt-transport-https ca-certificates curl

2. Download the Google Cloud public signing key:

sudo curl -fsSLo /usr/share/keyrings/kubernetes-archive-keyring.gpg https://packages.cloud.google.com/apt/doc/apt-key.gpg

3. Add the Kubernetes apt repository:

echo "deb [signed-by=/usr/share/keyrings/kubernetes-archive-keyring.gpg] https://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee /etc/apt/sources.list.d/kubernetes.list

4. Update the apt package index, install kubelet, kubeadm and kubectl, and pin their versions:

sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl
sudo apt-mark hold kubelet kubeadm kubectl
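
To confirm the packages were installed and are pinned, you can check the reported versions (a minimal sanity check; the exact version strings depend on what the repository currently ships):

kubeadm version -o short
kubelet --version
kubectl version --client --short
apt-mark showhold | grep -E 'kubeadm|kubelet|kubectl'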

Configuring containerd and the CRI tools

cat <<EOF | sudo tee /etc/modules-load.d/containerd.conf
overlay
br_netfilter
EOF

sudo modprobe overlay
sudo modprobe br_netfilter

cat << EOF | sudo tee /etc/sysctl.d/99-kubernetes-cri.conf
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1
net.bridge.bridge-nf-call-ip6tables = 1
EOF

# Apply sysctl params without reboot
sudo sysctl --system

cat << EOF | sudo tee /etc/crictl.yaml
runtime-endpoint: unix:///run/containerd/containerd.sock
image-endpoint: unix:///run/containerd/containerd.sock
timeout: 10
debug: false
EOF

sudo mkdir -p /etc/containerd
sudo containerd config default | sudo tee /etc/containerd/config.toml
sudo sed -i 's#root = "/var/lib/containerd"#root = "/data/containerd"#' /etc/containerd/config.toml
sudo systemctl restart containerd
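
With /etc/crictl.yaml in place, crictl should now be able to talk to containerd over the configured socket; a quick check (assuming containerd restarted cleanly):

sudo systemctl status containerd --no-pager
sudo crictl version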

Deploying a highly available etcd cluster

  1. Configure the kubelet to be a service manager for etcd

Because etcd is created first, you must override the kubelet unit file supplied by kubeadm by creating a new drop-in file with higher precedence.

cat << EOF > /etc/systemd/system/kubelet.service.d/20-etcd-service-manager.conf
[Service]
ExecStart=
# Replace "systemd" below with the cgroup driver used by your container runtime.
# The kubelet default is "cgroupfs".
ExecStart=/usr/bin/kubelet --address=127.0.0.1 --pod-manifest-path=/etc/kubernetes/manifests --cgroup-driver=systemd
Restart=always
EOF

systemctl daemon-reload
systemctl restart kubelet

Check the status of the kubelet to make sure it is running:

systemctl status kubelet
  2. Create configuration files for kubeadm.
# Replace the HOST addresses below with the actual addresses of your servers
export HOST0=10.36.0.68
export HOST1=10.36.0.69
export HOST2=10.36.0.70
mkdir -p /tmp/${HOST0}/ /tmp/${HOST1}/ /tmp/${HOST2}/
ETCDHOSTS=(${HOST0} ${HOST1} ${HOST2})
NAMES=("infra0" "infra1" "infra2")
for i in "${!ETCDHOSTS[@]}"; do
HOST=${ETCDHOSTS[$i]}
NAME=${NAMES[$i]}
cat << EOF > /tmp/${HOST}/kubeadmcfg.yaml
apiVersion: "kubeadm.k8s.io/v1beta2"
kind: ClusterConfiguration
etcd:
    local:
        serverCertSANs:
        - "${HOST}"
        peerCertSANs:
        - "${HOST}"
        extraArgs:
            initial-cluster: infra0=https://${ETCDHOSTS[0]}:2380,infra1=https://${ETCDHOSTS[1]}:2380,infra2=https://${ETCDHOSTS[2]}:2380
            initial-cluster-state: new
            name: ${NAME}
            listen-peer-urls: https://${HOST}:2380
            listen-client-urls: https://${HOST}:2379
            advertise-client-urls: https://${HOST}:2379
            initial-advertise-peer-urls: https://${HOST}:2380
EOF
done
  3. Generate the certificates for the etcd cluster
kubeadm init phase certs etcd-ca
kubeadm init phase certs etcd-server --config=/tmp/${HOST2}/kubeadmcfg.yaml
kubeadm init phase certs etcd-peer --config=/tmp/${HOST2}/kubeadmcfg.yaml
kubeadm init phase certs etcd-healthcheck-client --config=/tmp/${HOST2}/kubeadmcfg.yaml
kubeadm init phase certs apiserver-etcd-client --config=/tmp/${HOST2}/kubeadmcfg.yaml
cp -R /etc/kubernetes/pki /tmp/${HOST2}/
find /etc/kubernetes/pki -not -name ca.crt -not -name ca.key -type f -delete
kubeadm init phase certs etcd-server --config=/tmp/${HOST1}/kubeadmcfg.yaml
kubeadm init phase certs etcd-peer --config=/tmp/${HOST1}/kubeadmcfg.yaml
kubeadm init phase certs etcd-healthcheck-client --config=/tmp/${HOST1}/kubeadmcfg.yaml
kubeadm init phase certs apiserver-etcd-client --config=/tmp/${HOST1}/kubeadmcfg.yaml
cp -R /etc/kubernetes/pki /tmp/${HOST1}/
find /etc/kubernetes/pki -not -name ca.crt -not -name ca.key -type f -delete
kubeadm init phase certs etcd-server --config=/tmp/${HOST0}/kubeadmcfg.yaml
kubeadm init phase certs etcd-peer --config=/tmp/${HOST0}/kubeadmcfg.yaml
kubeadm init phase certs etcd-healthcheck-client --config=/tmp/${HOST0}/kubeadmcfg.yaml
kubeadm init phase certs apiserver-etcd-client --config=/tmp/${HOST0}/kubeadmcfg.yaml
find /tmp/${HOST2} -name ca.key -type f -delete
find /tmp/${HOST1} -name ca.key -type f -delete
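
Optionally, spot-check that a generated server certificate carries the expected SANs before distributing the files (a hypothetical check run on the host where the certificates were generated):

openssl x509 -in /etc/kubernetes/pki/etcd/server.crt -noout -text | grep -A1 'Subject Alternative Name'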
  4. Copy the etcd cluster certificates

Copy the files under /tmp/${HOST1} and /tmp/${HOST2} to the /etc/kubernetes directory on the corresponding master nodes, so that the file layout looks like the tree below (a sample scp invocation follows it).

/tmp/${HOST0,1,2}
└── kubeadmcfg.yaml
---
/etc/kubernetes
├── kubeadmcfg.yaml
└── pki
    ├── apiserver-etcd-client.crt
    ├── apiserver-etcd-client.key
    └── etcd
        ├── ca.crt
        ├── ca.key
        ├── healthcheck-client.crt
        ├── healthcheck-client.key
        ├── peer.crt
        ├── peer.key
        ├── server.crt
        └── server.key
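
One way to move the files (only a sketch; it assumes root SSH access to the other etcd hosts and should be repeated with HOST=${HOST2}):

USER=root
HOST=${HOST1}
scp -r /tmp/${HOST}/* ${USER}@${HOST}:
ssh ${USER}@${HOST} "chown -R root:root pki && mv pki kubeadmcfg.yaml /etc/kubernetes/"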
  5. Create the etcd instances
  • On the master1 node
kubeadm init phase etcd local --config=/tmp/${HOST0}/kubeadmcfg.yaml
  • On the master2/master3 nodes
kubeadm init phase etcd local --config=/etc/kubernetes/kubeadmcfg.yaml
  6. Check the status of the etcd cluster
# the cluster runs etcd under containerd, so exec into the running etcd container with crictl
crictl exec -it <etcd_container_id> etcdctl \
--cert /etc/kubernetes/pki/etcd/peer.crt \
--key /etc/kubernetes/pki/etcd/peer.key \
--cacert /etc/kubernetes/pki/etcd/ca.crt \
--endpoints https://10.36.0.68:2379 endpoint health --cluster

Deploying the Master Cluster

  1. Modify the kubelet configuration file
cat << \EOF | sudo tee /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
# Note: This dropin only works with kubeadm and kubelet v1.11+
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --pod-manifest-path=/etc/kubernetes/manifests --cgroup-driver=systemd --container-runtime=remote --container-runtime-endpoint=/run/containerd/containerd.sock"
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"
# This is a file that "kubeadm init" and "kubeadm join" generates at runtime, populating the KUBELET_KUBEADM_ARGS variable dynamically
EnvironmentFile=-/var/lib/kubelet/kubeadm-flags.env
# This is a file that the user can use for overrides of the kubelet args as a last resort. Preferably, the user should use
# the .NodeRegistration.KubeletExtraArgs object in the configuration files instead. KUBELET_EXTRA_ARGS should be sourced from this file.
EnvironmentFile=-/etc/default/kubelet
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS
EOF

systemctl daemon-reload
systemctl restart kubelet
systemctl status kubelet
  2. Generate the kubeadm configuration
cat << EOF | sudo tee /root/kubeadm.conf
apiVersion: kubeadm.k8s.io/v1beta2
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: 10.36.0.68
  bindPort: 6443
nodeRegistration:
  criSocket: /run/containerd/containerd.sock

---
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
kubernetesVersion: v1.20.15
imageRepository: registry.cn-hangzhou.aliyuncs.com/google_containers
FeatureGates:
  EphemeralContainers: true
  ServiceAppProtocol: true
clusterName: cloud-m8
certificatesDir: /etc/kubernetes/pki
dns:
  type: CoreDNS
etcd:
  external:
    endpoints:
    - https://10.36.0.68:2379
    - https://10.36.0.69:2379
    - https://10.36.0.70:2379
    caFile: "/etc/kubernetes/pki/etcd/ca.crt"
    certFile: "/etc/kubernetes/pki/apiserver-etcd-client.crt"
    keyFile: "/etc/kubernetes/pki/apiserver-etcd-client.key"
networking:
  dnsDomain: cluster.local
  serviceSubnet: 10.96.0.0/12
  ### Required: plan this ahead of time if calico will use the BGP protocol later
  podSubnet: "10.59.0.0/16"

apiServer:
  timeoutForControlPlane: 4m0s
  certSANs:
  ### Required: IPs and domain names signed into the API server certificate; adjust to your environment
  - 127.0.0.1
  - 10.36.0.68
  - 10.36.0.69
  - 10.36.0.70
  - 10.96.0.1
  - "*.kubegems.io"
controllerManager: {}
scheduler: {}

---
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration

# enable ipvs mode (iptables is also an option)
mode: "ipvs"

# addresses NodePorts bind to; e.g. 1.1.1.0/24 binds only to that subnet (null means all interfaces)
nodePortAddresses: null

# how often configuration is re-synced (0s keeps the default)
configSyncPeriod: 0s

### Required
# Pod CIDR, used for SNAT address masquerading
clusterCIDR: 10.59.0.0/16

# leave the conntrack table untuned (null keeps system defaults)
conntrack:
  maxPerCore: null
  min: null
  tcpCloseWaitTimeout: null
  tcpEstablishedTimeout: null

# leave the ipvs settings at their defaults; rely on OS-level initialization
ipvs:
  excludeCIDRs: null
  minSyncPeriod: 0s
  scheduler: ""
  strictARP: false
  syncPeriod: 0s
  tcpFinTimeout: 0s
  tcpTimeout: 0s
  udpTimeout: 0s

---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 250
failSwapOn: false
rotateCertificates: true
featureGates:
  EphemeralContainers: true
  ServiceAppProtocol: true
EOF
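
Before initializing, it can help to confirm that kubeadm parses this file and resolves the expected image locations (a minimal check; the output depends on the configured kubernetesVersion and imageRepository):

kubeadm config images list --config /root/kubeadm.conf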
  3. Initialize the Kubernetes master1 node
swapoff -a
kubeadm init \
--config=/root/kubeadm.conf \
--ignore-preflight-errors all \
--cri-socket /var/run/containerd/containerd.sock \
--feature-gates EphemeralContainers=true,ServiceAppProtocol=true
  4. Copy the kubectl configuration file
mkdir /root/.kube
cp /etc/kubernetes/admin.conf /root/.kube/config
chown -R root:root /root/.kube/config

kubeadm token create --ttl 0
  5. Install the master2/master3 nodes
  • Join the Kubernetes cluster
kubeadm join 10.36.0.68:6443 \
--token wi0cmt.0o8xodk3thphulkj \
--discovery-token-ca-cert-hash sha256:ae184f561109c7aecfadaa12a5bf6d871c6a0cced0ff4f77313b4639f0fd8f33 \
--cri-socket /run/containerd/containerd.sock \
--ignore-preflight-errors all
info

The token is printed by kubeadm after initialization.
The discovery-token-ca-cert-hash is also printed by kubeadm after initialization.

  • Package the configuration files on the master1 node
cd /etc/kubernetes
tar cvzf api.config.tar.gz admin.conf audit.yaml controller-manager.conf scheduler.conf pki/sa.key pki/sa.pub pki/front-proxy-client.crt pki/front-proxy-client.key pki/front-proxy-ca.key pki/front-proxy-ca.crt pki/apiserver-kubelet-client.crt pki/apiserver-kubelet-client.key pki/apiserver.crt pki/apiserver.key pki/ca.key
  • Copy api.config.tar.gz to the master2 and master3 nodes and unpack it
mv api.config.tar.gz /etc/kubernetes
cd /etc/kubernetes
tar xf api.config.tar.gz
  6. Copy the manifest files of the API components
cat /etc/kubernetes/manifests/kube-apiserver.yaml
cat /etc/kubernetes/manifests/kube-controller-manager.yaml
cat /etc/kubernetes/manifests/kube-scheduler.yaml
info

When copying kube-apiserver.yaml to master2 and master3, replace the master1 IP addresses in the file with the IP of the target machine.
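
A sketch of one way to do this, assuming root SSH access from master1 and that master2's address is 10.36.0.69 (adjust the IPs to your environment):

scp /etc/kubernetes/manifests/kube-apiserver.yaml \
    /etc/kubernetes/manifests/kube-controller-manager.yaml \
    /etc/kubernetes/manifests/kube-scheduler.yaml \
    root@10.36.0.69:/etc/kubernetes/manifests/
# on master2, swap master1's address for master2's own address
ssh root@10.36.0.69 "sed -i 's/10.36.0.68/10.36.0.69/g' /etc/kubernetes/manifests/kube-apiserver.yaml"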


Deploying the Kubelet (Worker) Nodes

  1. Container data disk (optional)

Plan the data disk mounts according to the disks in the server; typically the sdb device is formatted and mounted under /data:

mkfs.xfs /dev/sdb
mkdir /data
mount /dev/sdb /data
echo "/dev/sdb /data xfs defaults 0 0" >> /etc/fstab
info

Choose the sdb device to suit your hardware. To keep the data safe, use at least RAID 1 if the server has multiple disks.
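
To confirm the mount is active and will persist across reboots (a minimal check):

findmnt /data
grep /dev/sdb /etc/fstab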

  2. Configure nginx as a local apiserver proxy
apt install -y nginx

cat << \EOF | sudo tee /etc/nginx/nginx.conf
user root;
worker_processes 2;
worker_cpu_affinity auto;
pid /run/nginx.pid;
load_module /usr/share/nginx/modules/ngx_stream_module.so;
worker_rlimit_nofile 64512;
worker_shutdown_timeout 10s;
events {
    multi_accept on;
    worker_connections 16384;
    use epoll;
}
stream {
    upstream upstream_balancer {
        least_conn;
        # Kubernetes master API server addresses
        server 10.36.0.68:6443 max_fails=2 fail_timeout=20s;
        server 10.36.0.69:6443 max_fails=2 fail_timeout=20s;
        server 10.36.0.70:6443 max_fails=2 fail_timeout=20s;
    }
    log_format log_stream '[$time_local] $protocol $status $bytes_sent $bytes_received $session_time';
    access_log /var/log/nginx/access.log log_stream;
    error_log /var/log/nginx/error.log;
    server {
        listen 127.0.0.1:6443;
        proxy_timeout 1200s;
        proxy_pass upstream_balancer;
    }
}
EOF

nginx -t && nginx -s reload
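
If the master nodes are already up, the local proxy can be checked from the worker node; /version is served to unauthenticated clients on a default kubeadm cluster, so a JSON response here suggests the stream proxy is working (a minimal check):

curl -k https://127.0.0.1:6443/version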
  3. Install the container runtime
  4. Configure containerd and the CRI tools
cat <<EOF | sudo tee /etc/modules-load.d/containerd.conf
overlay
br_netfilter
EOF

sudo modprobe overlay
sudo modprobe br_netfilter

cat << EOF | sudo tee /etc/sysctl.d/99-kubernetes-cri.conf
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1
net.bridge.bridge-nf-call-ip6tables = 1
EOF

# Apply sysctl params without reboot
sudo sysctl --system

cat << EOF | sudo tee /etc/crictl.yaml
runtime-endpoint: unix:///run/containerd/containerd.sock
image-endpoint: unix:///run/containerd/containerd.sock
timeout: 10
debug: false
EOF

sudo mkdir -p /etc/containerd
sudo containerd config default | sudo tee /etc/containerd/config.toml

# persist the containerd data to the sdb device mounted at /data
sudo sed -i 's#root = "/var/lib/containerd"#root = "/data/containerd"#' /etc/containerd/config.toml
sudo systemctl restart containerd

  5. Join the Kubernetes cluster

kubeadm join 127.0.0.1:6443 \
--token wi0cmt.0o8xodk3thphulkj \
--discovery-token-ca-cert-hash sha256:ae184f561109c7aecfadaa12a5bf6d871c6a0cced0ff4f77313b4639f0fd8f33 \
--cri-socket /run/containerd/containerd.sock \
--ignore-preflight-errors all
info

The token is printed by kubeadm after initialization. If you no longer have it, you can retrieve the list of tokens with the following command:

kubeadm token list

TOKEN                     TTL   EXPIRES                USAGES                    DESCRIPTION                                                EXTRA GROUPS
8ewj1p.9r9hcjoqgajrj4gi   23h   2018-06-12T02:51:28Z   authentication, signing   The default bootstrap token generated by 'kubeadm init'.  system:bootstrappers:kubeadm:default-node-token

The discovery-token-ca-cert-hash is printed by kubeadm after initialization. If you do not have it, you can compute it by running the following command:

openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | \
openssl dgst -sha256 -hex | sed 's/^.* //'
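
After the join completes, the new node should show up when queried from any master node (it may remain NotReady until a CNI plugin such as calico is deployed):

kubectl get nodes -o wide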

Configuring Containerd for GPU

  1. Install the containerd NVIDIA runtime
curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | \
sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list | \
sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
sudo apt-get update

sudo apt install -y nvidia-container-runtime
  2. Configure containerd and the CRI tools
cat << EOF | sudo tee /etc/containerd/config.toml
version = 2
root = "/data/containerd"
state = "/run/containerd"
plugin_dir = ""
disabled_plugins = []
required_plugins = []
oom_score = 0
[grpc]
  address = "/run/containerd/containerd.sock"
  tcp_address = ""
  tcp_tls_cert = ""
  tcp_tls_key = ""
  uid = 0
  gid = 0
  max_recv_message_size = 16777216
  max_send_message_size = 16777216
[ttrpc]
  address = ""
  uid = 0
  gid = 0
[debug]
  address = ""
  uid = 0
  gid = 0
  level = ""
[metrics]
  address = ""
  grpc_histogram = false
[cgroup]
  path = ""
[timeouts]
  "io.containerd.timeout.shim.cleanup" = "5s"
  "io.containerd.timeout.shim.load" = "5s"
  "io.containerd.timeout.shim.shutdown" = "3s"
  "io.containerd.timeout.task.state" = "2s"
[plugins]
  [plugins."io.containerd.gc.v1.scheduler"]
    pause_threshold = 0.02
    deletion_threshold = 0
    mutation_threshold = 100
    schedule_delay = "0s"
    startup_delay = "100ms"
  [plugins."io.containerd.grpc.v1.cri"]
    disable_tcp_service = true
    stream_server_address = "127.0.0.1"
    stream_server_port = "0"
    stream_idle_timeout = "4h0m0s"
    enable_selinux = false
    selinux_category_range = 1024
    sandbox_image = "k8s.gcr.io/pause:3.2"
    stats_collect_period = 10
    systemd_cgroup = false
    enable_tls_streaming = false
    max_container_log_line_size = 16384
    disable_cgroup = false
    disable_apparmor = false
    restrict_oom_score_adj = false
    max_concurrent_downloads = 3
    disable_proc_mount = false
    unset_seccomp_profile = ""
    tolerate_missing_hugetlb_controller = true
    disable_hugetlb_controller = true
    ignore_image_defined_volumes = false
    [plugins."io.containerd.grpc.v1.cri".containerd]
      snapshotter = "overlayfs"
      default_runtime_name = "runc"
      no_pivot = false
      disable_snapshot_annotations = true
      discard_unpacked_layers = false
      [plugins."io.containerd.grpc.v1.cri".containerd.default_runtime]
        runtime_type = ""
        runtime_engine = ""
        runtime_root = ""
        privileged_without_host_devices = false
        base_runtime_spec = ""
      [plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime]
        runtime_type = ""
        runtime_engine = ""
        runtime_root = ""
        privileged_without_host_devices = false
        base_runtime_spec = ""
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_type = "io.containerd.runtime.v1.linux"
          runtime_engine = ""
          runtime_root = ""
          privileged_without_host_devices = false
          base_runtime_spec = ""
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
    [plugins."io.containerd.grpc.v1.cri".cni]
      bin_dir = "/opt/cni/bin"
      conf_dir = "/etc/cni/net.d"
      max_conf_num = 1
      conf_template = ""
    [plugins."io.containerd.grpc.v1.cri".registry]
      [plugins."io.containerd.grpc.v1.cri".registry.mirrors]
        [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
          endpoint = ["https://registry-1.docker.io"]
    [plugins."io.containerd.grpc.v1.cri".image_decryption]
      key_model = ""
    [plugins."io.containerd.grpc.v1.cri".x509_key_pair_streaming]
      tls_cert_file = ""
      tls_key_file = ""
  [plugins."io.containerd.internal.v1.opt"]
    path = "/opt/containerd"
  [plugins."io.containerd.internal.v1.restart"]
    interval = "10s"
  [plugins."io.containerd.metadata.v1.bolt"]
    content_sharing_policy = "shared"
  [plugins."io.containerd.monitor.v1.cgroups"]
    no_prometheus = false
  [plugins."io.containerd.runtime.v1.linux"]
    shim = "containerd-shim"
    runtime = "nvidia-container-runtime"
    runtime_root = ""
    no_shim = false
    shim_debug = false
  [plugins."io.containerd.runtime.v2.task"]
    platforms = ["linux/amd64"]
  [plugins."io.containerd.service.v1.diff-service"]
    default = ["walking"]
  [plugins."io.containerd.snapshotter.v1.devmapper"]
    root_path = ""
    pool_name = ""
    base_image_size = ""
    async_remove = false
EOF

systemctl restart containerd
  3. Verify the installation
ctr image pull docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04
ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04 cuda-11.0.3-base-ubuntu20.04 nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   34C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+