runnable_avg in Linux CFS: Load Tracking for the Runnable State

张开发
2026/6/8 2:44:21 · 15 min read
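1. Introduction

On modern multi-core processors the scheduler has to solve several hard problems at once: balancing load efficiently while staying fair, trading energy against performance, and predicting each task's resource demand well enough to keep the system responsive. The Linux Completely Fair Scheduler (CFS) addresses these with the Per-Entity Load Tracking (PELT) algorithm.

runnable_avg is one of the core PELT signals. It tracks the fraction of time a scheduling entity (sched_entity) spends in the runnable state. Unlike util_avg, which tracks time actually spent running, and load_avg, which tracks weighted load, runnable_avg also counts the time a task waits on the runqueue. That makes it a key input when the load balancer judges how busy a CPU is, and a useful signal when looking for performance bottlenecks.

Understanding runnable_avg is most useful for:

- Data-center operations: identifying performance regressions caused by scheduling latency
- Real-time system tuning: assessing the response-time guarantees of tasks
- Cloud-native development: tuning CPU quota allocation for containerised applications
- Academic research: analysing how scheduling algorithms behave in practice

2. Core Concepts

2.1 PELT basics

PELT (Per-Entity Load Tracking) is the exponentially weighted moving average the kernel uses to track the load of scheduling entities. Time is divided into periods of 1024 µs (about 1 ms), and the contribution of each past period decays geometrically. The decay factor y satisfies y^32 = 0.5, so a contribution from 32 periods (about 32 ms) ago counts half as much as a current one. Recent history therefore dominates the estimate while the longer-term trend is still retained.

2.2 Three key load metrics

CFS maintains three load metrics in struct sched_avg:

| Metric | Definition | Meaning | Used for |
|---|---|---|---|
| util_avg | running% × SCHED_CAPACITY_SCALE | Fraction of time the task actually occupies the CPU | CPU frequency selection (schedutil governor) |
| runnable_avg | runnable% × SCHED_CAPACITY_SCALE | Fraction of time the task is runnable (running or waiting) | Load balancing, scheduling-latency analysis |
| load_avg | runnable% × scale_load_down(weight) | Runnable time weighted by priority | Task migration decisions, group-scheduling shares |

What sets runnable_avg apart is that it includes the time spent waiting on the runqueue, so it reflects how contended the CPU is. When runnable_avg is significantly higher than util_avg, the task is often ready but cannot run immediately, which suggests the CPU may be overloaded.

2.3 Data structure

The field names below are those used since Linux 5.7. Kernels up to 5.6 carry runnable_load_sum and runnable_load_avg instead; that older signal is weight-scaled and is not the same quantity as runnable_avg.

```c
// File: include/linux/sched.h (Linux 5.7 and later)
struct sched_avg {
	u64		last_update_time;  // timestamp of the last update (ns)
	u64		load_sum;          // decayed sum used for load_avg
	u64		runnable_sum;      // decayed sum of runnable time
	u32		util_sum;          // decayed sum of running time
	u32		period_contrib;    // part of the current 1024 us period already accounted
	unsigned long	load_avg;          // weighted average load
	unsigned long	runnable_avg;      // average runnable fraction
	unsigned long	util_avg;          // average utilization
	struct util_est	util_est;          // utilization estimate used at wakeup
	                                   // (a plain unsigned int from 6.8 on)
} ____cacheline_aligned;
```

For a task entity, load_sum and runnable_sum accumulate the same on-runqueue time; runnable_sum is additionally scaled by SCHED_CAPACITY_SCALE, and the task weight is applied only when load_avg is computed. For a CFS runqueue (cfs_rq), load_sum is the queue weight multiplied by runnable time, and the runnable signal scales with the number of runnable tasks, so the runqueue-level runnable_avg can exceed 1024.

3. Environment Setup

3.1 Hardware and software requirements

| Component | Minimum | Recommended | Notes |
|---|---|---|---|
| CPU architecture | x86_64 or ARM64 | x86_64 multi-core | SMP is needed to observe load balancing |
| Memory | 2 GB | 4 GB or more | For building the kernel and running the tests |
| Operating system | Linux 5.4 | Linux 6.2 | PELT saw significant optimizations in 5.13; the runnable_avg field itself exists from 5.7, earlier kernels expose runnable_load_avg |
| Kernel source | Matching version | Latest stable | CONFIG_SCHED_DEBUG must be enabled |
| Compiler | gcc 9 | gcc 11 | Must be able to build the kernel |
| Debug tools | perf, ftrace | bpftrace, trace-cmd | For dynamic tracing |

3.2 Kernel configuration

Make sure the kernel is built with the following options:

```
# Basic scheduler configuration
CONFIG_SMP=y               # symmetric multiprocessing; PELT depends on it
CONFIG_SCHED_DEBUG=y       # scheduler debug interface (essential)
CONFIG_SCHEDSTATS=y        # scheduler statistics
CONFIG_FAIR_GROUP_SCHED=y  # CFS group scheduling

# Debug filesystem
CONFIG_DEBUG_FS=y          # debugfs

# Optional, for advanced analysis
CONFIG_FUNCTION_TRACER=y   # function tracer
CONFIG_DYNAMIC_FTRACE=y    # dynamic ftrace
CONFIG_BPF=y               # eBPF
```

3.3 Lab environment

```bash
# 1. Mount debugfs (required)
sudo mount -t debugfs none /sys/kernel/debug

# 2. Check the scheduler debug interface
sudo ls /sys/kernel/debug/sched/
# On 5.13+ this lists files such as: debug features latency_ns min_granularity_ns
# (before 5.13 the equivalents are /proc/sched_debug and /sys/kernel/debug/sched_features)

# 3. Install the analysis tools (Ubuntu/Debian example)
sudo apt-get update
sudo apt-get install -y linux-tools-common linux-tools-generic \
    trace-cmd kernelshark bpftrace

# 4. Check perf
perf --version

# 5. Create a working directory
mkdir -p ~/cfs-runnable-avg-lab
cd ~/cfs-runnable-avg-lab
```

4. Application Scenarios

Scenario 1: Troubleshooting microservice latency

An API gateway on a cloud-native platform showed sporadic latency spikes. Monitoring revealed that for one container runnable_avg stayed more than 3× above util_avg, while CPU usage (util_avg) was only 40%. Further analysis showed that sched_min_granularity_ns on that node was set too low (1 ms), so many short tasks switched constantly and piled up on the runqueue. After sched_latency_ns was raised to 48 ms, the runnable_avg/util_avg ratio fell to 1.5 and P99 latency dropped by 60%.

Scenario 2: Load balancing on heterogeneous CPUs

On an ARM big.LITTLE server, if a lightweight task's runnable_avg on a LITTLE core stays above a threshold (for example 80%), the load balancer can migrate it to a big core, whose higher CPU capacity shortens the time spent waiting. The kernel's cpu_util() helper combines util_avg and util_est to assess CPU capacity, while runnable_avg offers an independent view of how crowded the CPU is and avoids misjudgements based on utilization alone.

Scenario 3: Analysing interference with real-time tasks

CFS tasks never preempt SCHED_FIFO tasks, so in an industrial control system a high runnable_avg on CFS background tasks mainly shows that they are queued behind other work, typically real-time tasks occupying the CPU. After promoting critical tasks to a real-time class with sched_setscheduler(), watching how the runnable_avg of the remaining CFS tasks changes quantifies the effect of the policy change.

5. Worked Examples

5.1 Experiment 1: Observing runnable_avg of a single task

Goal: understand how runnable_avg behaves under different load patterns.

```bash
# Step 1: a CPU-bound task (runs continuously)
cat > cpu_intensive.c << 'EOF'
#define _GNU_SOURCE   /* needed for cpu_set_t and sched_setaffinity() */
#include <stdio.h>
#include <unistd.h>
#include <sched.h>
#include <sys/types.h>

int main(void) {
    pid_t pid = getpid();
    printf("CPU intensive task started, PID: %d\n", pid);

    // Pin to CPU 0 so the task is easy to observe
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(0, &cpuset);
    sched_setaffinity(pid, sizeof(cpu_set_t), &cpuset);

    // Pure compute loop, no I/O
    volatile long counter = 0;
    while (1) {
        counter++;
        // Print only every 100 million iterations to keep output from interfering
        if (counter % 100000000 == 0) {
            printf("Progress: %ld\n", counter);
        }
    }
    return 0;
}
EOF
gcc -o cpu_intensive cpu_intensive.c -O2

# Step 2: a task that alternates between running and sleeping
cat > io_alternate.c << 'EOF'
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sched.h>
#include <sys/types.h>

int main(void) {
    pid_t pid = getpid();
    printf("I/O alternate task started, PID: %d\n", pid);

    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(0, &cpuset);
    sched_setaffinity(pid, sizeof(cpu_set_t), &cpuset);

    volatile long counter = 0;
    while (1) {
        // Busy phase: 50 million increments (intended as roughly 50 ms; machine dependent)
        long start = counter;
        while (counter - start < 50000000) {
            counter++;
        }
        // Sleep phase: 50 ms
        usleep(50000);
    }
    return 0;
}
EOF
gcc -o io_alternate io_alternate.c -O2

# Step 3: monitoring script that reads /proc/PID/sched
cat > monitor_runnable_avg.sh << 'EOF'
#!/bin/bash
# Monitor runnable_avg and util_avg of one PID
# Usage: ./monitor_runnable_avg.sh PID [interval_seconds]
PID=$1
INTERVAL=${2:-1}   # default: one sample per second

if [ -z "$PID" ]; then
    echo "Usage: $0 PID [interval_seconds]"
    exit 1
fi

echo "Monitoring PID $PID, interval ${INTERVAL}s" >&2
echo "Timestamp,PID,runnable_avg,util_avg,load_avg,sum_exec_runtime"

while true; do
    if [ -f "/proc/$PID/sched" ]; then
        TIMESTAMP=$(date +%s.%N)
        SCHED_INFO=$(cat "/proc/$PID/sched" 2>/dev/null)
        # Extract the key fields (before 5.7 the first one is se.avg.runnable_load_avg)
        RUNNABLE_AVG=$(echo "$SCHED_INFO" | grep "se.avg.runnable_avg" | awk '{print $3}')
        UTIL_AVG=$(echo "$SCHED_INFO" | grep "se.avg.util_avg" | awk '{print $3}')
        LOAD_AVG=$(echo "$SCHED_INFO" | grep "se.avg.load_avg" | awk '{print $3}')
        EXEC_TIME=$(echo "$SCHED_INFO" | grep "se.sum_exec_runtime" | awk '{print $3}')
        echo "$TIMESTAMP,$PID,$RUNNABLE_AVG,$UTIL_AVG,$LOAD_AVG,$EXEC_TIME"
    else
        echo "Process $PID not found" >&2
        break
    fi
    sleep "$INTERVAL"
done
EOF
chmod +x monitor_runnable_avg.sh

# Step 4: run the experiment (all in one shell, since $! is per shell)
./cpu_intensive > /dev/null &
CPU_PID=$!
echo "CPU task PID: $CPU_PID"

# Monitor for 30 seconds
./monitor_runnable_avg.sh "$CPU_PID" 0.5 > cpu_task_metrics.csv &
MONITOR_PID=$!
sleep 30
kill "$MONITOR_PID"

# Step 5: inspect the result
echo "CPU-bound task statistics"
tail -20 cpu_task_metrics.csv

# Clean up
kill "$CPU_PID" 2>/dev/null
```

Expected result: for the CPU-bound task, runnable_avg should be close to util_avg (a ratio of roughly 1.0 to 1.2), because the task almost never sleeps and, with no competitors, runs whenever it is runnable. For a default-priority task load_avg is numerically close to runnable_avg, because the nice 0 weight after scale_load_down() is 1024, the same as SCHED_CAPACITY_SCALE.

5.2 Experiment 2: runnable_avg inflation under contention

Goal: observe how runnable_avg reflects runqueue delay when several tasks compete for the same CPU.

```bash
# Step 1: a configurable number of CPU-bound tasks
cat > cpu_contention.c << 'EOF'
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(int argc, char *argv[]) {
    int num_tasks = argc > 1 ? atoi(argv[1]) : 4;
    int duration = argc > 2 ? atoi(argv[2]) : 30;
    printf("Creating %d CPU-intensive tasks on CPU 0 for %d seconds\n",
           num_tasks, duration);

    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(0, &cpuset);

    pid_t pids[100];
    int started = 0;
    for (int i = 0; i < num_tasks && i < 100; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            // Child: pin to CPU 0 and spin until the alarm fires
            sched_setaffinity(0, sizeof(cpu_set_t), &cpuset);
            volatile long counter = 0;
            alarm(duration);   // terminates the child automatically
            while (1) counter++;
            return 0;
        } else if (pid > 0) {
            pids[started++] = pid;
            printf("Started task %d, PID: %d\n", i, pid);
        }
    }

    // Parent: wait only for the children that were actually started
    for (int i = 0; i < started; i++) {
        waitpid(pids[i], NULL, 0);
    }
    printf("All tasks completed\n");
    return 0;
}
EOF
gcc -o cpu_contention cpu_contention.c -O2

# Step 2: system-level monitoring script
cat > monitor_system_load.sh << 'EOF'
#!/bin/bash
# Monitor the root cfs_rq of CPU 0. Run as root: the data comes from debugfs.
# Output: timestamp,cpu,runnable_avg,util_avg,nr_running,load_avg
INTERVAL=${1:-0.5}
DURATION=${2:-30}
DEBUG_FILE=/sys/kernel/debug/sched/debug   # /proc/sched_debug before 5.13

echo "timestamp,cpu,runnable_avg,util_avg,nr_running,load_avg"
END_TIME=$(($(date +%s) + DURATION))
while [ "$(date +%s)" -lt "$END_TIME" ]; do
    TIMESTAMP=$(date +%s.%N)
    # Read the fields of the root "cfs_rq[0]:" section.
    # The layout and field names of this file differ between kernel versions,
    # so fields that are not found are reported as NA.
    awk -v ts="$TIMESTAMP" '
        BEGIN { r = u = n = l = "NA" }
        /^cfs_rq\[0\]:\/?$/ { in_rq = 1; next }
        /^(cfs_rq|rt_rq|dl_rq)\[|^cpu#/ { in_rq = 0 }
        in_rq && $1 == ".runnable_avg" { r = $3 }
        in_rq && $1 == ".util_avg"     { u = $3 }
        in_rq && $1 == ".nr_running"   { n = $3 }
        in_rq && $1 == ".load_avg"     { l = $3 }
        END { printf "%s,0,%s,%s,%s,%s\n", ts, r, u, n, l }
    ' "$DEBUG_FILE"
    sleep "$INTERVAL"
done
EOF
chmod +x monitor_system_load.sh

# Step 3: record scheduling events with perf sched
# Terminal 1: start recording
sudo perf sched record -- sleep 30
# Terminal 2: start 4 competing tasks
./cpu_contention 4 30

# Step 4: analyse the perf data
sudo perf sched latency --sort max
# Watch the "Max delay" column: the longest time a task waited on the runqueue
sudo perf sched map   # scheduling timeline

# Step 5: collect wait times with trace-cmd
# The sched_stat_* events only fire when schedstats are enabled
sudo sysctl kernel.sched_schedstats=1
sudo trace-cmd start -e sched:sched_stat_runtime \
    -e sched:sched_stat_wait \
    -e sched:sched_stat_sleep
./cpu_contention 4 10
sudo trace-cmd stop
sudo trace-cmd show > sched_trace.txt

# Average wait time, taken from the delay= field (nanoseconds)
grep sched_stat_wait sched_trace.txt | \
    awk '{ for (i = 1; i <= NF; i++)
               if ($i ~ /^delay=/) { sub("delay=", "", $i); sum += $i; count++ } }
         END { if (count) print "Average wait time:", sum / count / 1000000, "ms" }'
```

Key observation: when 4 tasks compete for 1 CPU, each task's runnable_avg approaches SCHED_CAPACITY_SCALE (normally 1024) while its util_avg is only about 256 (25%). The runnable_avg to util_avg ratio of about 4:1 directly reflects the share of time each task spends waiting on the runqueue.

5.3 Experiment 3: Reading runnable_avg from a kernel module

Goal: write a kernel module that reads struct sched_avg directly to obtain the exact values.

```c
// File: runnable_avg_kmod.c
// Exports the PELT averages of one task through /proc/runnable_avg/
// Written against 5.14+ field names (p->__state, avg.runnable_avg).
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/sched/signal.h>
#include <linux/sched/task.h>
#include <linux/pid.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
#include <linux/uaccess.h>

static int target_pid = 0;
module_param(target_pid, int, 0644);
MODULE_PARM_DESC(target_pid, "Target PID to monitor");

static struct proc_dir_entry *proc_entry;

static void print_sched_avg(struct seq_file *m, struct task_struct *p)
{
	struct sched_entity *se = &p->se;
	struct sched_avg *avg = &se->avg;
	unsigned int state = READ_ONCE(p->__state);   // p->state before 5.14

	// Note: the values are read without the runqueue lock, so they are a
	// best-effort snapshot. Production code needs stricter synchronization.
	seq_printf(m, "Task: %s (PID: %d)\n", p->comm, p->pid);
	seq_printf(m, "\n");
	seq_printf(m, "last_update_time: %llu ns\n", avg->last_update_time);
	seq_printf(m, "load_sum:         %llu\n", avg->load_sum);
	seq_printf(m, "runnable_sum:     %llu\n", avg->runnable_sum);
	seq_printf(m, "util_sum:         %u\n", avg->util_sum);
	seq_printf(m, "period_contrib:   %u\n", avg->period_contrib);
	seq_printf(m, "load_avg:         %lu\n", avg->load_avg);
	seq_printf(m, "runnable_avg:     %lu\n", avg->runnable_avg);
	seq_printf(m, "util_avg:         %lu\n", avg->util_avg);
	// util_est is a struct up to 6.7; from 6.8 it is a plain unsigned int
	seq_printf(m, "util_est.enqueued: %u\n", avg->util_est.enqueued);
	seq_printf(m, "util_est.ewma:     %u\n", avg->util_est.ewma);

	// Key ratio
	if (avg->util_avg > 0) {
		unsigned long ratio = (avg->runnable_avg * 100) / avg->util_avg;

		seq_printf(m, "\nrunnable/util ratio: %lu%%\n", ratio);
		seq_printf(m, "Interpretation: ");
		if (ratio < 110)
			seq_printf(m, "Low contention, task runs immediately when ready\n");
		else if (ratio < 200)
			seq_printf(m, "Moderate contention, occasional scheduling delays\n");
		else
			seq_printf(m, "High contention, significant time spent in runqueue\n");
	}

	// Task state
	seq_printf(m, "\nCurrent state: ");
	if (state == TASK_RUNNING)
		seq_printf(m, "RUNNING (on CPU or in runqueue)\n");
	else if (state & TASK_INTERRUPTIBLE)
		seq_printf(m, "SLEEPING (interruptible)\n");
	else if (state & TASK_UNINTERRUPTIBLE)
		seq_printf(m, "SLEEPING (uninterruptible)\n");
	else
		seq_printf(m, "Other (%u)\n", state);
	seq_printf(m, "on_rq: %u\n", se->on_rq);
}

static int runnable_avg_show(struct seq_file *m, void *v)
{
	struct task_struct *p;

	if (target_pid <= 0) {
		seq_printf(m, "Usage: echo PID > /proc/runnable_avg/target\n");
		seq_printf(m, "Then cat /proc/runnable_avg/info\n");
		return 0;
	}

	// find_task_by_vpid() is not exported to modules; use find_vpid() + pid_task()
	rcu_read_lock();
	p = pid_task(find_vpid(target_pid), PIDTYPE_PID);
	if (!p) {
		rcu_read_unlock();
		seq_printf(m, "PID %d not found\n", target_pid);
		return 0;
	}
	get_task_struct(p);
	rcu_read_unlock();

	print_sched_avg(m, p);
	put_task_struct(p);
	return 0;
}

static int runnable_avg_open(struct inode *inode, struct file *file)
{
	return single_open(file, runnable_avg_show, NULL);
}

static ssize_t runnable_avg_target_write(struct file *file,
		const char __user *buffer, size_t count, loff_t *pos)
{
	char buf[32];
	int pid;

	if (count >= sizeof(buf))
		count = sizeof(buf) - 1;
	if (copy_from_user(buf, buffer, count))
		return -EFAULT;
	buf[count] = '\0';
	if (kstrtoint(buf, 10, &pid))
		return -EINVAL;

	target_pid = pid;
	printk(KERN_INFO "runnable_avg_kmod: Now monitoring PID %d\n", pid);
	return count;
}

static const struct proc_ops runnable_avg_info_ops = {
	.proc_open    = runnable_avg_open,
	.proc_read    = seq_read,
	.proc_lseek   = seq_lseek,
	.proc_release = single_release,
};

static const struct proc_ops runnable_avg_target_ops = {
	.proc_write = runnable_avg_target_write,
};

static int __init runnable_avg_init(void)
{
	proc_entry = proc_mkdir("runnable_avg", NULL);
	if (!proc_entry)
		return -ENOMEM;
	proc_create("info", 0444, proc_entry, &runnable_avg_info_ops);
	proc_create("target", 0222, proc_entry, &runnable_avg_target_ops);
	printk(KERN_INFO "runnable_avg_kmod: Module loaded\n");
	return 0;
}

static void __exit runnable_avg_exit(void)
{
	remove_proc_subtree("runnable_avg", NULL);
	printk(KERN_INFO "runnable_avg_kmod: Module unloaded\n");
}

MODULE_LICENSE("GPL");
MODULE_AUTHOR("CFS Researcher");
MODULE_DESCRIPTION("Export runnable_avg information for analysis");
module_init(runnable_avg_init);
module_exit(runnable_avg_exit);
```

Build and use:

```bash
# Create the Makefile (recipe lines must start with a tab)
cat > Makefile << 'EOF'
obj-m += runnable_avg_kmod.o
KDIR ?= /lib/modules/$(shell uname -r)/build

all:
	make -C $(KDIR) M=$(PWD) modules

clean:
	make -C $(KDIR) M=$(PWD) clean
EOF

# Build the module
make

# Load the module
sudo insmod runnable_avg_kmod.ko

# Usage example
# 1. Start a test task
./cpu_intensive > /dev/null &
TEST_PID=$!

# 2. Set the monitoring target
echo "$TEST_PID" | sudo tee /proc/runnable_avg/target

# 3. Read the load information
cat /proc/runnable_avg/info

# 4. Compare with the /proc interface
grep avg "/proc/$TEST_PID/sched"

# Clean up
sudo rmmod runnable_avg_kmod
kill "$TEST_PID"
```

5.4 Experiment 4: Tracing runnable_avg changes with bpftrace

Goal: use eBPF to follow changes of runnable_avg at run time. update_load_avg() is a static inline function in kernel/sched/fair.c and is normally not available as a kprobe target, so the script below does not probe it. Instead it samples the target task's se.avg fields each time the task is switched out, which requires a kernel with BTF.

```bash
# Install bpftrace if needed
sudo apt-get install bpftrace

# Create the tracing script
cat > trace_runnable_avg.bt << 'EOF'
#!/usr/bin/env bpftrace
// Usage: bpftrace trace_runnable_avg.bt PID

BEGIN
{
    printf("Tracing runnable_avg of PID %d\n", $1);
    printf("Hit Ctrl-C to stop.\n");
}

// Wakeups of the target: the moment it becomes runnable again
tracepoint:sched:sched_wakeup
/args->pid == $1/
{
    printf("[WAKEUP] PID %d -> CPU %d at %llu\n",
           args->pid, args->target_cpu, nsecs);
}

// At sched_switch the current task is still the one being switched out
tracepoint:sched:sched_switch
/args->prev_pid == $1/
{
    $task = (struct task_struct *)curtask;
    $new = (uint64)$task->se.avg.runnable_avg;
    $util = (uint64)$task->se.avg.util_avg;
    $delta = (int64)$new - (int64)@old_runnable_avg;
    printf("PID %d (%s): runnable_avg %lu -> %lu (delta: %ld), util_avg %lu at %llu ns\n",
           args->prev_pid, args->prev_comm, @old_runnable_avg, $new, $delta,
           $util, nsecs);
    @old_runnable_avg = $new;
}

END
{
    clear(@old_runnable_avg);
}
EOF

# Usage example
# 1. Start a test task
./cpu_intensive > /dev/null &
TEST_PID=$!
echo "Test PID: $TEST_PID"

# 2. Trace for 10 seconds (SIGINT lets bpftrace exit cleanly)
sudo timeout -s INT 10 bpftrace -o runnable_avg_trace.log trace_runnable_avg.bt "$TEST_PID"

# 3. Analyse the output
echo "Trace results"
grep "runnable_avg" runnable_avg_trace.log | head -20

# Number of samples
echo "Total samples"
grep -c "runnable_avg" runnable_avg_trace.log

# Clean up
kill "$TEST_PID"
```

6. Frequently Asked Questions

Q1: Why is my task's runnable_avg high while its util_avg is low?

Symptom: runnable_avg is close to 1024 (full scale) while util_avg is only 100 to 200.

Cause: the task is frequently ready to run but does not get CPU time. Possible reasons:

- CPU overload: higher-priority tasks share the CPU, such as real-time tasks or CFS tasks with a lower nice value
- Affinity restrictions: the task is pinned to a CPU that is busy
- Group-scheduling limits: the task's cgroup has exhausted its cpu.cfs_quota_us

Commands for investigation:

```bash
# CPU affinity
taskset -pc PID

# cgroup limits (cgroup v1 paths; on cgroup v2 read cpu.max instead)
cat /proc/PID/cgroup
cat /sys/fs/cgroup/cpu$(awk -F: '$2 ~ /(^|,)cpu(,|$)/ {print $3}' /proc/PID/cgroup)/cpu.cfs_quota_us

# Interference from real-time tasks
chrt -p PID                                        # scheduling policy of the task
ps -eo pid,comm,rtprio,cls | grep -E ' (FF|RR)$'   # list all real-time tasks
```

Q2: What is the difference between runnable_avg and load_avg?

- runnable_avg = runnable_sum / divider. It depends only on how long the task has been runnable, not on its weight.
- load_avg = load_sum × weight / divider. It takes the priority weight of the task into account.

For an ordinary task (nice 0, weight 1024) the two values are numerically close. For a high-priority task (nice -10, weight 9548, about 9.3× the nice 0 weight) load_avg is far higher than runnable_avg, reflecting the greater pressure it puts on the system.

Q3: How can runnable_avg be cleared or reset?

runnable_avg is an accumulated history and cannot be cleared directly. The following events affect it indirectly:

- Task migration: when a task moves to another CPU, its averages are carried along and attach_entity_load_avg() re-synchronises them with the PELT clock of the new runqueue. The history is realigned rather than discarded.
- Long sleeps: the history halves for every 32 ms spent asleep, so after several such intervals it has decayed to nearly 0.
- Explicit yielding: sched_yield() gives up the CPU, but the task stays runnable, so this raises runnable_avg rather than lowering it.

Note: do not try to reset load metrics in production; doing so undermines the scheduler's predictions.

Q4: Why does cat /proc/PID/sched not show a runnable_avg field?

The kernel may have been built without CONFIG_SCHED_DEBUG, or it may predate 5.7, in which case the field is called se.avg.runnable_load_avg. To check the configuration:

```bash
# Check the kernel configuration
zcat /proc/config.gz | grep CONFIG_SCHED_DEBUG
# or
grep CONFIG_SCHED_DEBUG /boot/config-$(uname -r)

# If the option is off, rebuild the kernel with it enabled,
# or read the values from memory with a kernel module (Experiment 3)
```

Q5: What exactly does runnable_avg do in load balancing?

In older kernels, load_balance() in kernel/sched/fair.c obtained a CPU's load through weighted_cpuload(), which returned the weight-scaled runnable_load_avg. Since 5.7 that signal is gone: load comparisons use load_avg, and runnable_avg is read through cpu_runnable() as one of the inputs when deciding whether a scheduling group still has spare capacity or is overloaded.

```c
// Older kernels (simplified): weight-scaled runnable load
static unsigned long weighted_cpuload(struct rq *rq)
{
	return cfs_rq_runnable_load_avg(&rq->cfs);
}

// 5.7 and later: unweighted runnable signal of the root cfs_rq
static inline unsigned long cpu_runnable(struct rq *rq)
{
	return cfs_rq_runnable_avg(&rq->cfs);
}
```

A group whose runnable signal is high relative to its capacity is treated as crowded even when its utilization looks moderate, which makes it a candidate source for migration. runnable_avg complements util_avg here because it:

- includes runqueue waiting time, so it reflects how crowded the CPU is
- is more sensitive to bursts of short tasks and reacts quickly to load changes
- is decoupled from task weight (load_avg includes the weight, runnable_avg does not), which makes absolute waiting time comparable across priorities

7. Recommendations and Best Practices

7.1 Performance tuning

1. Monitor the runnable_avg/util_avg ratio

```bash
# Create a continuous monitoring script
cat > watch_contention.sh << 'EOF'
#!/bin/bash
# Show the runnable/util ratio of the test tasks to spot high-latency ones
while true; do
    clear
    echo "CFS Task Contention Monitor"
    echo "Timestamp: $(date)"
    echo
    printf "%-10s %-20s %-12s %-12s %-8s %s\n" \
        "PID" "COMM" "RUNNABLE_AVG" "UTIL_AVG" "RATIO" "STATUS"
    echo "------------------------------------------------------------------------"
    for pid in $(pgrep -f "cpu_intensive|io_alternate" | head -20); do
        if [ -f "/proc/$pid/sched" ]; then
            sched=$(cat "/proc/$pid/sched" 2>/dev/null)
            comm=$(cat "/proc/$pid/comm" 2>/dev/null)
            runnable=$(echo "$sched" | grep "se.avg.runnable_avg" | awk '{print $3}')
            util=$(echo "$sched" | grep "se.avg.util_avg" | awk '{print $3}')
            if [ -n "$runnable" ] && [ -n "$util" ] && [ "$util" -gt 0 ]; then
                ratio=$((runnable * 100 / util))
                status="Normal"
                [ "$ratio" -gt 150 ] && status="High"
                [ "$ratio" -gt 300 ] && status="Critical"
                printf "%-10s %-20s %-12s %-12s %-7s%% %s\n" \
                    "$pid" "$comm" "$runnable" "$util" "$ratio" "$status"
            fi
        fi
    done
    sleep 2
done
EOF
chmod +x watch_contention.sh
./watch_contention.sh
```

2. Adjust scheduler parameters to reduce latency

When a high runnable_avg/util_avg ratio is detected, the following can be tried:

```bash
# Lengthen the scheduling period to reduce switching overhead (suits batch work)
echo 48000000 | sudo tee /sys/kernel/debug/sched/latency_ns          # default 24 ms -> 48 ms

# Raise the minimum granularity to avoid overly short time slices
echo 6000000 | sudo tee /sys/kernel/debug/sched/min_granularity_ns   # default 3 ms -> 6 ms

# Note: these parameters moved from sysctl to debugfs in kernel 5.13.
# From 6.6 (EEVDF) they no longer exist; base_slice_ns is the remaining knob.
```

3. Combine with the schedutil governor

The schedutil CPU frequency governor uses util_avg directly, but a custom policy can also take runnable_avg into account:

```c
// Illustrative logic only, not an existing kernel or eBPF interface
if (runnable_avg > util_avg * 2) {
	// High runqueue delay: raise the frequency even though utilization is modest
	suggested_freq = max_freq * 8 / 10;
}
```

7.2 Debugging tips

Tip 1: Trace PELT updates with ftrace

Mainline kernels define the PELT tracepoints only as bare tracepoints without tracefs events, so the event names below are available only on kernels that carry a helper module or vendor patches exposing them.

```bash
# Enable the PELT trace events
sudo trace-cmd start -e sched:sched_pelt_se -e sched:sched_pelt_cfs

# Run the test program
./cpu_contention 4 10

# Inspect the result
sudo trace-cmd show | grep pelt
```

Tip 2: Parse the scheduler debug output

```bash
# Detailed CFS runqueue information
sudo cat /sys/kernel/debug/sched/debug | grep -A 20 "cfs_rq"

# Fields to watch:
# - .runnable_avg: runnable signal at cfs_rq level
# - .util_avg: actual CPU utilization
# - .h_nr_running: number of runnable tasks
```

Tip 3: Analyse scheduling latency with perf sched

```bash
# Record scheduling events
sudo perf sched record -- sleep 10

# Produce the latency report
sudo perf sched latency --sort max
# Watch "Max delay" and "Avg delay"; both correlate strongly with runnable_avg
```

7.3 Common mistakes and how to avoid them

| Mistake | Consequence | Better approach |
|---|---|---|
| Judging load only by CPU% in top | Runqueue delay goes unnoticed | Combine runnable_avg with perf sched latency |
| Blindly raising real-time priorities | CFS tasks starve and their runnable_avg is inflated | Use SCHED_DEADLINE or restrict the CPU affinity of real-time tasks |
| Ignoring sched_min_granularity_ns | A value that is too small causes frequent switching and raises runnable_avg | Tune per workload: 3 ms for interactive tasks, 6 to 10 ms for batch tasks |
| Using only util_avg for load balancing | Crowded CPUs with modest utilization are missed | Give runnable_avg priority in balancing decisions |

8. Summary and Use Cases

8.1 Key points

This article examined the runnable_avg mechanism of the Linux CFS scheduler:

- Theory: runnable_avg is built on PELT and tracks the runnable fraction of a task with a geometrically decaying average. Together with util_avg (actual running) and load_avg (weighted load) it forms a complementary set of load signals.
- Data structure: struct sched_avg holds runnable_sum (the accumulated sum) and runnable_avg (the average), updated by __update_load_avg_se() and update_cfs_rq_load_avg() on every tick and on task state changes.
- Practice: runnable_avg can be observed through /proc/PID/sched, a kernel module, or eBPF tracing; the article provided code for the contention experiment, the kernel module and the tracing script.
- Tuning: the runnable_avg/util_avg ratio is a key indicator of scheduling latency and can guide scheduler parameter changes, CPU frequency selection and load-balancing policy.

8.2 Typical use cases

Scenario A: Cloud-native scheduling. In a Kubernetes cluster, kubelet could be extended to expose per-container runnable_avg. When a Pod's runnable_avg stays more than 200% above its util_avg, the Pod is moved to a less loaded node instead of deciding on CPU usage (util_avg) alone, which improves tail latency.

Scenario B: Games and interactive applications. A game engine can watch the runnable_avg of its main thread and lower the priority of background work or raise a warning when the value climbs abnormally, preventing stutter. Mainline sched_getattr() does not return this value, so doing this through a system call needs custom kernel support; reading /proc/self/task/TID/sched is the alternative that exists today.

Scenario C: Academic research. runnable_avg is an objective measure of scheduling delay. Researchers can change the decay factor y in kernel/sched/pelt.c to study how different history weights affect prediction accuracy, citing the PELT implementation in the kernel source when publishing.

8.3 Further study

- Source reading: study the update_load_avg() family in kernel/sched/pelt.c and kernel/sched/fair.c, and the hierarchical propagation in propagate_entity_load_avg().
- Group scheduling: on a kernel with CONFIG_FAIR_GROUP_SCHED, examine how a task group's runnable_avg propagates through the hierarchy on each CPU and how h_load (hierarchical load) is computed.
- WALT comparison: compare PELT with WALT (Window Assisted Load Tracking, common in Android kernels) in how they track runnable time, and analyse where each fits.
- Scheduler development: implement a custom scheduling policy that uses runnable_avg as an auxiliary input for task selection, and submit patches to the Linux kernel mailing list to take part in the community discussion.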
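
As a closing illustration, the 4:1 ratio from Experiment 2 can be reproduced with a few lines of arithmetic. The sketch below is a simplified floating-point model of the PELT decay from section 2.1, not the kernel's fixed-point implementation. It assumes that all tasks are always runnable and share one CPU in strict round-robin slices; the names simulate and slice_periods are hypothetical and chosen only for this example.

```python
# Simplified PELT model: one update per 1024 us period, decay y with y**32 == 0.5.
Y = 0.5 ** (1 / 32)
SCALE = 1024  # SCHED_CAPACITY_SCALE


def simulate(n_tasks, slice_periods=3, periods=2000):
    """Return (runnable_avg, util_avg) of task 0 among n_tasks CPU-bound tasks.

    Assumption: the tasks take turns in fixed slices of `slice_periods`
    periods each, and none of them ever sleeps.
    """
    runnable_sum = 0.0
    util_sum = 0.0
    for t in range(periods):
        running = (t // slice_periods) % n_tasks == 0  # is it task 0's turn?
        # Old history decays by y, then the current period is added.
        runnable_sum = runnable_sum * Y + SCALE                # always on the runqueue
        util_sum = util_sum * Y + (SCALE if running else 0)    # only while on the CPU
    # The geometric series 1 + y + y^2 + ... converges to 1 / (1 - y),
    # so multiplying by (1 - y) normalizes a fully busy signal to SCALE.
    norm = 1 - Y
    return runnable_sum * norm, util_sum * norm


if __name__ == "__main__":
    for n in (1, 2, 4):
        runnable, util = simulate(n)
        print(f"{n} task(s): runnable_avg={runnable:.0f} util_avg={util:.0f} "
              f"ratio={runnable / util:.2f}")
```

With a single task the two signals coincide. With four tasks runnable_avg still saturates near 1024 while util_avg oscillates around a quarter of that, because the 32 ms half-life is long compared with the slice length but short enough to leave a visible ripple.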
