Automatic Failover for Kubernetes Clusters

张开发
2026/6/7 18:46:43 · 15 min read
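Hardcore intro

Alright, tech folks, today we're talking about automatic failover for Kubernetes clusters. Skip the theory, let's go straight to the practical stuff. In the cloud-native era, high availability is the top priority. Without automatic failover, a single node failure can take your cluster down, interrupt your services, and cost you users.

Core concepts

What is automatic failover?

Automatic failover means that when a component or node in the cluster fails, the system detects the failure on its own and moves the workload to healthy nodes so the service keeps running. In Kubernetes, automatic failover spans several layers: the control plane, etcd, the worker nodes, and the application.

Core principles of automatic failover

- Fast detection: notice a failure as soon as it happens
- Automatic transfer: move workloads to healthy nodes without manual intervention
- Minimal impact: keep the effect of the failure on the service as small as possible
- Reliability: the failover process itself must be dependable
- Observability: make every failover visible and traceable

Practical guide

1. Control plane high availability

Multi-master configuration

    # kubeadm configuration file
    apiVersion: kubeadm.k8s.io/v1beta2
    kind: ClusterConfiguration
    kubernetesVersion: v1.21.0
    controlPlaneEndpoint: "load-balancer:6443"
    apiServer:
      certSANs:
        - load-balancer
        - 192.168.1.100
    etcd:
      external:
        endpoints:
          - https://etcd-1:2379
          - https://etcd-2:2379
          - https://etcd-3:2379
        caFile: /etc/kubernetes/pki/etcd/ca.crt
        certFile: /etc/kubernetes/pki/apiserver-etcd-client.crt
        keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key

Load balancer configuration

    # Nginx load balancer configuration
    # Note: this is an HTTP-level (L7) proxy. API clients speak TLS to port 6443
    # and often authenticate with client certificates, so many setups use plain
    # TCP passthrough (the nginx stream module) instead.
    upstream kubernetes-apiserver {
        server master-1:6443 max_fails=3 fail_timeout=30s;
        server master-2:6443 max_fails=3 fail_timeout=30s;
        server master-3:6443 max_fails=3 fail_timeout=30s;
    }

    server {
        listen 6443;
        server_name load-balancer;

        location / {
            proxy_pass https://kubernetes-apiserver;
            proxy_ssl_certificate /etc/nginx/ssl/cert.pem;
            proxy_ssl_certificate_key /etc/nginx/ssl/key.pem;
            proxy_ssl_session_cache shared:SSL:10m;
            proxy_ssl_session_timeout 10m;
        }
    }

2. etcd high availability

etcd cluster configuration

    # etcd cluster configuration
    # (the etcd.database.coreos.com API group requires the etcd-operator CRD)
    apiVersion: etcd.database.coreos.com/v1beta2
    kind: EtcdCluster
    metadata:
      name: etcd-cluster
      namespace: kube-system
    spec:
      size: 3
      version: "3.4.13"
      repository: gcr.io/etcd-development/etcd
      pod:
        resources:
          requests:
            cpu: 100m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi
      storage:
        class: standard
        size: 10Gi

etcd backup configuration

    # etcd backup CronJob: takes a snapshot every 6 hours
    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: etcd-backup
      namespace: kube-system
    spec:
      schedule: "0 */6 * * *"
      jobTemplate:
        spec:
          template:
            spec:
              containers:
                - name: etcd-backup
                  image: bitnami/etcd:3.4.13
                  command:
                    - /bin/sh
                    - -c
                    - |
                      ETCDCTL_API=3 etcdctl --endpoints=https://etcd-1:2379,https://etcd-2:2379,https://etcd-3:2379 \
                        --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                        --cert=/etc/kubernetes/pki/apiserver-etcd-client.crt \
                        --key=/etc/kubernetes/pki/apiserver-etcd-client.key \
                        snapshot save /backup/etcd-snapshot-$(date +%Y%m%d%H%M%S).db
                  volumeMounts:
                    - name: backup
                      mountPath: /backup
                    - name: etcd-certs
                      mountPath: /etc/kubernetes/pki/etcd
                      readOnly: true
                    - name: apiserver-certs
                      mountPath: /etc/kubernetes/pki
                      readOnly: true
              volumes:
                - name: backup
                  persistentVolumeClaim:
                    claimName: etcd-backup-pvc
                - name: etcd-certs
                  hostPath:
                    path: /etc/kubernetes/pki/etcd
                - name: apiserver-certs
                  hostPath:
                    path: /etc/kubernetes/pki
              restartPolicy: OnFailure

3. Worker node failover

PodDisruptionBudget configuration

    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: app-pdb
      namespace: default
    spec:
      minAvailable: 2
      selector:
        matchLabels:
          app: web

Affinity and anti-affinity

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: web-app
      namespace: default
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: web
      template:
        metadata:
          labels:
            app: web
        spec:
          affinity:
            podAntiAffinity:
              # Hard rule: no two "web" Pods on the same node,
              # so 3 replicas need at least 3 schedulable nodes.
              requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                      - key: app
                        operator: In
                        values:
                          - web
                  topologyKey: kubernetes.io/hostname
          containers:
            - name: web
              image: your-registry/web-app:v1.0
              ports:
                - containerPort: 8080

4. Application failover

Health check configuration

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: web-app
      namespace: default
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: web
      template:
        metadata:
          labels:
            app: web
        spec:
          containers:
            - name: web
              image: your-registry/web-app:v1.0
              ports:
                - containerPort: 8080
              # Liveness: restart the container if /health keeps failing
              livenessProbe:
                httpGet:
                  path: /health
                  port: 8080
                initialDelaySeconds: 30
                periodSeconds: 10
                timeoutSeconds: 5
                failureThreshold: 3
                successThreshold: 1
              # Readiness: take the Pod out of Service endpoints while /ready fails
              readinessProbe:
                httpGet:
                  path: /ready
                  port: 8080
                initialDelaySeconds: 15
                periodSeconds: 5
                timeoutSeconds: 3
                failureThreshold: 3
                successThreshold: 1

Automatic restart policy

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: web-app
      namespace: default
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: web
      template:
        metadata:
          labels:
            app: web
        spec:
          restartPolicy: Always
          containers:
            - name: web
              image: your-registry/web-app:v1.0
              ports:
                - containerPort: 8080

5. Cluster autoscaling

Cluster Autoscaler configuration

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: cluster-autoscaler
      namespace: kube-system
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: cluster-autoscaler
      template:
        metadata:
          labels:
            app: cluster-autoscaler
        spec:
          containers:
            - name: cluster-autoscaler
              image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.21.0
              command:
                - ./cluster-autoscaler
                - --v=4
                # --nodes=<min>:<max>:<node group name>
                - --nodes=3:10:node-group-1
                - --nodes=2:5:node-group-2
                - --expander=least-waste
                - --balance-similar-node-groups
                - --skip-nodes-with-system-pods=false
                - --skip-nodes-with-local-storage=false
              resources:
                requests:
                  cpu: 100m
                  memory: 300Mi
                limits:
                  cpu: 200m
                  memory: 500Mi

Best practices

1. Control plane high availability

- Multiple master nodes: deploy at least 3 master nodes to keep the control plane highly available
- Load balancer: put a load balancer in front of the API servers to distribute requests
- etcd cluster: deploy 3 or 5 etcd members to keep data consistent and available
- Regular backups: back up etcd on a schedule so you can recover quickly after a disaster

2. Worker node failover

- PodDisruptionBudget: set a PodDisruptionBudget for critical applications so that a minimum number of Pods stays up during voluntary disruptions such as node drains (a PDB does not protect against an unplanned node crash; that is what replicas and anti-affinity are for)
- Affinity and anti-affinity: use Pod anti-affinity to spread Pods across different nodes
- Health checks: configure sensible health checks for Pods so that unhealthy Pods are detected and handled promptly
- Automatic restart: configure a suitable restart policy so that Pods restart automatically after a failure

3. Application failover

- Multiple replicas: run several replicas so that one failed Pod does not interrupt the service
- Health checks: configure detailed health checks, including liveness and readiness probes
- Rolling updates: use a rolling update strategy so the service stays up during updates
- Graceful termination: configure a termination grace period so Pods can finish in-flight requests before they exit

4. Monitoring and alerting

- Node monitoring: watch node health, including CPU, memory, and disk metrics
- Pod monitoring: watch Pod status, including restart counts and health check results
- Control plane monitoring: watch the health of the control plane components
- Alerting: set sensible alert rules so failures are detected and handled promptly

5. Disaster recovery

- Cross-region deployment: run clusters in multiple regions so you can switch quickly when a whole region fails
- Data backups: back up cluster data regularly, including etcd data and application data
- Disaster recovery drills: run drills regularly to make sure recovery is fast when a real disaster hits
- Documentation: keep the disaster recovery runbook complete so that recovery can follow it step by step

Case study

Case: Kubernetes cluster failover at an e-commerce platform

Background: the e-commerce platform's Kubernetes cluster had to stay highly available through node failures and other abnormal conditions.

Solution:

- Control plane high availability: deployed 3 master nodes and 3 etcd nodes, with a load balancer distributing API server requests
- Worker node failover: set PodDisruptionBudgets for critical applications and used Pod anti-affinity to spread Pods across different nodes
- Application failover: configured detailed health checks, ran multiple replicas, and used a rolling update strategy
- Monitoring and alerting: deployed Prometheus and Grafana to monitor cluster health, with sensible alert rules
- Disaster recovery: backed up etcd regularly and drew up a disaster recovery plan

Results:

- Cluster availability reached 99.99%
- Service interruption during a node failure dropped from minutes to seconds
- System stability improved noticeably
- The operations team's workload went down

Common pitfalls

- etcd cluster failures: a misconfigured etcd cluster leads to inconsistent data
- Load balancer misconfiguration: a wrong load balancer configuration makes request distribution fail
- Unreasonable health checks: badly tuned health checks cause Pods to restart constantly
- Insufficient resources: failover fails because the cluster has no spare capacity
- Network problems: a misconfigured network breaks communication between nodes
- Insufficient monitoring: without cluster monitoring, failures are not noticed in time
- Incomplete disaster recovery plan: without a complete plan, recovery is slow when disaster strikes

Summary

Automatic failover is key to keeping a Kubernetes cluster highly available. A well-configured highly available control plane, worker node failover, application failover, and monitoring with alerting together raise the cluster's availability and stability considerably. Remember, automatic failover is not a one-time configuration; it needs ongoing maintenance and tuning. Only continuous testing and improvement can ensure the cluster fails over quickly and reliably across all kinds of failures.

One last line for you: high availability isn't designed, it's tested. Only by repeatedly simulating failures and running drills can you be sure the cluster's automatic failover actually works. Keep at it, folks!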
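A postscript on the application failover best practices: rolling updates and graceful termination are recommended above but never shown. Here is a minimal sketch of how they might look on the web-app Deployment from the guide. The values for maxUnavailable, maxSurge, terminationGracePeriodSeconds, and the preStop sleep are all assumptions for illustration, not settings from the original setup, and the preStop hook assumes the image ships /bin/sh.

```yaml
# Sketch only: rolling update and graceful termination for web-app.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
  namespace: default
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # assumed value: never drop below the desired replica count
      maxSurge: 1         # assumed value: bring up one extra Pod at a time
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      # Assumed value; Kubernetes defaults to 30 seconds.
      terminationGracePeriodSeconds: 60
      containers:
        - name: web
          image: your-registry/web-app:v1.0
          ports:
            - containerPort: 8080
          lifecycle:
            preStop:
              exec:
                # Assumed: a short pause so the Pod is removed from Service
                # endpoints before the container receives SIGTERM.
                command: ["/bin/sh", "-c", "sleep 10"]
```

If you combine this with the hard Pod anti-affinity rule from the guide, keep in mind that maxSurge needs a spare node to place the extra Pod on; with exactly 3 nodes and 3 replicas you would likely have to allow maxUnavailable: 1 instead.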
