
Perf: Principles, Compilation, and Usage


2023-10-21 07:31 | Source: compiled from the web

1. Background

1.1 Performance analysis

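System-level performance optimization usually has two phases: performance profiling and code optimization. Profiling aims to locate performance bottlenecks by identifying the causes of performance problems and the hot code paths. Code optimization then targets each specific problem by changing the code or the compiler options to improve software performance. In day-to-day work, the main concern is usually the bottleneck itself, particularly in algorithms.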


1.2 Terms and abbreviations

Perf

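perf is a Linux performance analysis tool. Linux performance counters (perf_events) is a kernel-based subsystem that provides a performance analysis framework covering hardware features (CPU, PMU (Performance Monitoring Unit)) and software features (software counters, tracepoints). Through perf, an application can use the PMU, tracepoints, and the counters in the kernel to collect performance statistics. perf can analyze the performance of a specified application (per thread) as well as the performance of the kernel itself.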

2. Perf tool overview

2.1 Background knowledge

2.1.1 tracepoints



2.1.2 Hardware feature: cache


2.2 Tuning directions

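A Hardware Event is generated by the PMU. Under specific conditions it detects whether a performance event occurred and how many times.

A Software Event is generated by the kernel. These events are spread across the kernel's functional modules and count performance events related to the operating system, such as process switches and ticks.

A Tracepoint Event is triggered by a static tracepoint in the kernel. Tracepoints expose details of kernel behavior while a program runs, such as the number of allocations made by the slab allocator.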
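perf list tags every event with its type in square brackets, so the three categories above can be told apart mechanically. The following is a minimal Python sketch, assuming the listing has already been captured as text. The function name group_events and the sample lines are illustrative assumptions, not part of perf itself.

```python
import re
from collections import defaultdict

# One listing line looks like: "  cpu-cycles OR cycles      [Hardware event]"
EVENT_RE = re.compile(r"^\s*(.+?)\s+\[([^\]]+)\]\s*$")

def group_events(perf_list_text):
    """Group event names by the bracketed type tag at the end of each line."""
    groups = defaultdict(list)
    for line in perf_list_text.splitlines():
        m = EVENT_RE.match(line)
        if m:
            names, kind = m.groups()
            # Keep only the primary name when an alias is listed ("a OR b").
            groups[kind].append(names.split(" OR ")[0])
    return dict(groups)

# Hypothetical excerpt in the style of `perf list` output.
SAMPLE = """\
  cpu-cycles OR cycles                               [Hardware event]
  cache-misses                                       [Hardware event]
  context-switches OR cs                             [Software event]
  L1-dcache-load-misses                              [Hardware cache event]
  sched:sched_switch                                 [Tracepoint event]
"""

if __name__ == "__main__":
    for kind, names in group_events(SAMPLE).items():
        print(f"{kind}: {', '.join(names)}")
```

Lines without a trailing bracketed tag (such as the listing header) are simply skipped.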

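2.3 Flame graphs

The flame graph (FlameGraph) was invented by Linux performance expert Brendan Gregg. Unlike other tracing and profiling presentations, a flame graph gives a global view of where time is spent: it lays out every sampled call stack from bottom to top. Other presentations can generally show only a single call stack or a non-hierarchical time distribution.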




3.1 PMU


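The Performance Monitoring Unit (PMU) is a hardware unit provided by the CPU. Reading its registers yields performance data about the CPU, and most current CPUs provide a PMU. Its events include various core, offcore, and uncore events.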


3.2 Enabling perf on a PC running Ubuntu


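sudo apt install linux-tools-common

This installs the common package; the related Linux kernel tools packages are needed as well.

sudo apt install linux-tools-5.4.0-56-generic

The version suffix should match the running kernel (see uname -r). After this, perf can be used directly for preliminary experiments.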



cross_compile=arm-linux-gnxxxxx-
make CROSS_COMPILE=$cross_compile ARCH=arm tools/perf clean
make CROSS_COMPILE=$cross_compile ARCH=arm tools/perf

When the build completes, it produces the perf tool. Note that using the tool requires several libraries, mainly:

elfutils, which can usually be found in the cross toolchain
zlib, which can also be found in the cross toolchain
libunwind




5.1 perf list


List of pre-defined events (to be used in -e):

  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  cache-misses                                       [Hardware event]
  cache-references                                   [Hardware event]
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]

  alignment-faults                                   [Software event]
  bpf-output                                         [Software event]
  context-switches OR cs                             [Software event]
  cpu-clock                                          [Software event]
  cpu-migrations OR migrations                       [Software event]
  dummy                                              [Software event]
  emulation-faults                                   [Software event]
  major-faults                                       [Software event]
  minor-faults                                       [Software event]
  page-faults OR faults                              [Software event]
  task-clock                                         [Software event]

  L1-dcache-load-misses                              [Hardware cache event]
  L1-dcache-loads                                    [Hardware cache event]
  L1-dcache-store-misses                             [Hardware cache event]
  L1-dcache-stores                                   [Hardware cache event]
  L1-icache-load-misses                              [Hardware cache event]
  branch-load-misses                                 [Hardware cache event]
  branch-loads                                       [Hardware cache event]
  dTLB-load-misses                                   [Hardware cache event]
  dTLB-store-misses                                  [Hardware cache event]
  iTLB-load-misses                                   [Hardware cache event]

  armv7_cortex_a9/br_immed_retired/                  [Kernel PMU event]
  armv7_cortex_a9/br_mis_pred/                       [Kernel PMU event]
  armv7_cortex_a9/br_pred/                           [Kernel PMU event]
  armv7_cortex_a9/br_return_retired/                 [Kernel PMU event]
  armv7_cortex_a9/cid_write_retired/                 [Kernel PMU event]
  armv7_cortex_a9/cpu_cycles/                        [Kernel PMU event]
  armv7_cortex_a9/exc_return/                        [Kernel PMU event]
  armv7_cortex_a9/exc_taken/                         [Kernel PMU event]
  armv7_cortex_a9/inst_retired/                      [Kernel PMU event]
  armv7_cortex_a9/l1d_cache/                         [Kernel PMU event]
  armv7_cortex_a9/l1d_cache_refill/                  [Kernel PMU event]
  armv7_cortex_a9/l1d_tlb_refill/                    [Kernel PMU event]
  armv7_cortex_a9/l1i_cache_refill/                  [Kernel PMU event]
  armv7_cortex_a9/l1i_tlb_refill/                    [Kernel PMU event]
  armv7_cortex_a9/ld_retired/                        [Kernel PMU event]
  armv7_cortex_a9/pc_write_retired/                  [Kernel PMU event]
  armv7_cortex_a9/st_retired/                        [Kernel PMU event]
  armv7_cortex_a9/sw_incr/                           [Kernel PMU event]
  armv7_cortex_a9/unaligned_ldst_retired/            [Kernel PMU event]

  rNNN                                               [Raw hardware event descriptor]
  cpu/t1=v1[,t2=v2,t3 ...]/modifier                  [Raw hardware event descriptor]

Notes: a Hardware event is tied to the hardware.

A Software event is tied to the kernel.

A Hardware cache event is tied to the cache.

5.2 perf bench all

perf ships with built-in benchmarks, mainly used to establish a baseline of system performance. It currently includes two suites, targeting the scheduler and the memory-management subsystem.

# Executed 1000000 pipe operations between two processes
Total time: 41.530 [sec]
41.530077 usecs/op
24078 ops/sec

# Running mem/memcpy benchmark...
# function 'default' (Default memcpy() provided by glibc)
# Copying 1MB bytes ...
1.511707 GB/sec

# Running mem/memset benchmark...
# function 'default' (Default memset() provided by glibc)
# Copying 1MB bytes ...
1.603551 GB/sec

# Running futex/hash benchmark...
Run summary [PID 1217]: 2 threads, each operating on 1024 [private] futexes for 10 secs.
Averaged 635903 operations/sec (+- 4.24%), total secs = 10

# Running futex/wake benchmark...
Run summary [PID 1217]: blocking on 0 threads (at [private] futex 0x222574), waking up 1 at a time.
[Run 1]: Wokeup 0 of 0 threads in 0.0010 ms
[Run 2]: Wokeup 0 of 0 threads in 0.0020 ms
[Run 3]: Wokeup 0 of 0 threads in 0.0010 ms
[Run 4]: Wokeup 0 of 0 threads in 0.0020 ms
[Run 5]: Wokeup 0 of 0 threads in 0.0020 ms
[Run 6]: Wokeup 0 of 0 threads in 0.0010 ms
[Run 7]: Wokeup 0 of 0 threads in 0.0010 ms
[Run 8]: Wokeup 0 of 0 threads in 0.0020 ms
[Run 9]: Wokeup 0 of 0 threads in 0.0020 ms
[Run 10]: Wokeup 0 of 0 threads in 0.0010 ms
Wokeup 0 of 0 threads in 0.0015 ms (+-11.11%)

# Running futex/wake-parallel benchmark...
Run summary [PID 1217]: blocking on 2 threads (at [private] futex 0x22265c), 2 threads waking up 1 at a time.
[Run 1]: Avg per-thread latency (waking 1/2 threads) in 0.0105 ms (+-14.29%)
[Run 2]: Avg per-thread latency (waking 1/2 threads) in 0.1815 ms (+-95.04%)
[Run 3]: Avg per-thread latency (waking 1/2 threads) in 0.0075 ms (+-6.67%)
[Run 4]: Avg per-thread latency (waking 1/2 threads) in 0.2210 ms (+-8.60%)
[Run 5]: Avg per-thread latency (waking 1/2 threads) in 0.0090 ms (+-11.11%)
[Run 6]: Avg per-thread latency (waking 1/2 threads) in 0.0755 ms (+-89.40%)
[Run 7]: Avg per-thread latency (waking 1/2 threads) in 0.0080 ms (+-12.50%)
[Run 8]: Avg per-thread latency (waking 1/2 threads) in 0.0870 ms (+-87.36%)
[Run 9]: Avg per-thread latency (waking 1/2 threads) in 0.1920 ms (+-95.31%)
[Run 10]: Avg per-thread latency (waking 1/2 threads) in 0.0090 ms (+-22.22%)
Avg per-thread latency (waking 1/2 threads) in 0.0801 ms (+-34.14%)

# Running futex/requeue benchmark...
Run summary [PID 1217]: Requeuing 2 threads (from [private] 0x222764 to 0x222774), 1 at a time.
[Run 1]: Requeued 2 of 2 threads in 0.0130 ms
[Run 2]: Requeued 2 of 2 threads in 0.0150 ms
[Run 3]: Requeued 2 of 2 threads in 0.0150 ms
[Run 4]: Requeued 2 of 2 threads in 0.0130 ms
[Run 5]: Requeued 2 of 2 threads in 0.0130 ms
[Run 6]: Requeued 2 of 2 threads in 0.0130 ms
[Run 7]: Requeued 2 of 2 threads in 0.0130 ms
[Run 8]: Requeued 2 of 2 threads in 0.0120 ms
[Run 9]: Requeued 2 of 2 threads in 0.0140 ms
[Run 10]: Requeued 2 of 2 threads in 0.0140 ms
Requeued 2 of 2 threads in 0.0135 ms (+-2.28%)

# Running futex/lock-pi benchmark...
Run summary [PID 1217]: 2 threads doing pi lock/unlock pairing for 10 secs.
Averaged 73 operations/sec (+- 0.00%), total secs = 10

5.3 perf top


PerfTop: 43 irqs/sec kernel:100.0% exact: 0.0% [4000Hz cpu-clock:pppH], (all, 2 CPUs)
-------------------------------------------------------------------------------
95.50% [kernel] [k] arch_cpu_idle
1.29% [kernel] [k] read_current_ti
0.58% [kernel] [k] gt_read_long
0.58% [kernel] [k] _raw_spin_unloc
0.48% [kernel] [k] console_unlock
0.45% [kernel] [k] __timer_delay
0.25% [kernel] [k] _raw_spin_unlock_irqres
0.09% [kernel] [k] do_raw_spin_lock
0.09% doris (deleted) pa:14004000~14725000 [.] 0x000cce08
0.07% [kernel] [k] hrtimer_nanosleep
0.07% [kernel] [k] walk_stackframe
0.07% pa:04f7b000~050a8000 [.] 0x00076abc
0.05% [kernel] [k] memset
0.04% [kernel] [k] __fget
0.04% doris (deleted) pa:14004000~14725000 [.] 0x000539f8
0.04% [kernel] [k] n_tty_poll
0.04% [kflow_videoprocess] [k] $a
0.03% [kernel] [k] _raw_write_unlock_irq
0.03% [kernel] [k] do_select
0.03% [kernel] [k] put_timespec64

Mapped keys:
[d] display refresh delay. (2)
[e] display entries (lines). (20)
[f] profile display filter (count). (5)
[F] annotate display filter (percent). (5%)
[s] annotate symbol. (NULL)
[S] stop annotation.
[K] hide kernel symbols. (no)
[U] hide user symbols. (no)
[z] toggle sample zeroing. (0)
[qQ] quit.
Enter selection, or unmapped key to continue

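Column 1: the share of CPU samples attributed to the symbol.

Column 2: the DSO (Dynamic Shared Object) that contains the symbol. It can be an application, the kernel, a shared library, or a module.

Column 3: the attribute. [k] means kernel space and [.] means user space.

Column 4: the name of the running function.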
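The column layout described above can be made concrete with a small parser. This is a hedged Python sketch for lines shaped like the perf top output shown earlier. The name parse_top_line and the regular expression are illustrative assumptions, and only the two markers explained here ([k] and [.]) are handled.

```python
import re

# "<overhead>%  <dso>  [k|.]  <symbol>"; the DSO field may contain spaces.
LINE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)%\s+(.+?)\s+\[([k.])\]\s+(\S+)\s*$")

def parse_top_line(line):
    """Split one perf top row into its four columns, or return None."""
    m = LINE_RE.match(line)
    if m is None:
        return None  # header, separator, or a row with another marker
    overhead, dso, marker, symbol = m.groups()
    return {
        "overhead_pct": float(overhead),
        "dso": dso,
        "space": "kernel" if marker == "k" else "user",
        "symbol": symbol,
    }

if __name__ == "__main__":
    rows = [
        "95.50% [kernel] [k] arch_cpu_idle",
        "0.09% doris (deleted) pa:14004000~14725000 [.] 0x000cce08",
    ]
    for row in rows:
        print(parse_top_line(row))
```

The second sample row shows why the DSO field is matched lazily: a deleted binary is reported with extra text before the [.] marker.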




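perf kmem: analyzes the performance of the slab subsystem

perf kvm: analyzes KVM virtualization

perf lock: analyzes lock performance

perf mem: analyzes memory access performance

perf sched: analyzes kernel scheduler performance

perf trace: records system call traces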

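5.4 perf record and perf report

perf record saves its data to perf.data, which can then be analyzed with perf report.

Together, perf record and perf report allow a more precise analysis of an application. The results resolve down to the function level, and within a function the assembly and source code can be displayed interleaved.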



perf record -F 99 -ag --call-graph dwarf -o /mnt/mmc01/ -- sleep 30
perf report --stdio --no-children -g graph,0.5,callee -i > perfreport.txt

perf record captures the samples, and perf report then generates the report.


65.10%  swapper  [kernel.kallsyms]  [k] arch_cpu_idle
        |
        ---arch_cpu_idle
           default_idle_call
           do_idle
           cpu_startup_entry
           |
           |--36.12%--secondary_start_kernel
           |          0x10248c
           |
            --28.97%--rest_init
                      start_kernel
                      0

3.78%  ALG  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
       |
       ---_raw_spin_unlock_irqrestore
          |
           --3.77%--grph_platform_spin_unlock
                     graph_enqueue
                     kdrv_grph_trigger
                     gfx_dma_copy
                     $a
                     proc_ioctl
                     proc_reg_unlocked_ioctl
                     vfs_ioctl
                     do_vfs_ioctl
                     ksys_ioctl
                     __se_sys_ioctl
                     __hyp_idmap_text_start

2.25%  ctl_ipp_buf_tsk  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
       |
       ---_raw_spin_unlock_irqrestore
          |
           --1.86%--vk_spin_unlock_irqrestore
                     |
                      --0.51%--hwclock_get_longcounter

1.39%  NMR_VdoTrig_D2D  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
       |
       ---_raw_spin_unlock_irqrestore
          |
           --1.15%--vk_spin_unlock_irqrestore

1.32%  kdrv_ise_proc_t  [kernel.kallsyms]  [k] v7_dma_clean_range
       |
       ---v7_dma_clean_range
          fmem_dcache_sync
          vos_cpu_dcache_sync
          $a
          kdrv_ise_job_process_ll
          kdrv_ise_proc_task
          kthread
          ret_from_fork

......


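Explanation of perf record -F 99 -p 13204 -g -- sleep 30: perf record starts recording; -F 99 samples 99 times per second; -p 13204 is the process ID, i.e. the process to analyze; -g records call stacks; and sleep 30 keeps the recording running for 30 seconds.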
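To make the role of each option explicit, the command line can be assembled from its parts. The Python sketch below only builds and prints the argument list and a rough sample budget; it does not run perf. The helper names build_perf_record_cmd and max_samples_per_thread are assumptions made for illustration.

```python
import shlex

def build_perf_record_cmd(freq_hz, pid, duration_s):
    """Return the argv list for a fixed-duration, per-process perf record run."""
    return [
        "perf", "record",
        "-F", str(freq_hz),              # sampling frequency in Hz
        "-p", str(pid),                  # attach to this existing process
        "-g",                            # record call stacks
        "--", "sleep", str(duration_s),  # recording ends when sleep exits
    ]

def max_samples_per_thread(freq_hz, duration_s):
    """Upper bound on samples for one thread that never leaves the CPU."""
    return freq_hz * duration_s

if __name__ == "__main__":
    cmd = build_perf_record_cmd(99, 13204, 30)
    print(shlex.join(cmd))
    print(max_samples_per_thread(99, 30))
```

With the values from the text, a thread that stays on-CPU for the whole run yields at most 99 × 30 = 2970 samples; a mostly idle process yields far fewer, which is worth keeping in mind when reading the percentages in the report.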



6.1 Record the data

Use perf record to capture the sample data.

6.2 Parse the recorded data

perf script -i &> perf.unfold

6.3 Fold the symbols in perf.unfold

perf.unfold &> perf.folded
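Folding means turning each multi-line sample into a single line of the form comm;root;...;leaf followed by a count. In Brendan Gregg's FlameGraph repository this is normally done by the stackcollapse-perf.pl script. The following is a simplified Python stand-in that shows the idea, not that script. It assumes the basic perf script layout (one header line, indented frames listed leaf first, a blank line between samples), and the sample input is made up.

```python
from collections import Counter

def fold_stacks(perf_script_text):
    """Collapse perf script-style samples into 'comm;root;...;leaf' -> count."""
    folded = Counter()
    comm, frames = None, []

    def flush():
        nonlocal comm, frames
        if comm is not None and frames:
            # perf script prints the leaf first; folded lines start at the root.
            folded[";".join([comm] + frames[::-1])] += 1
        comm, frames = None, []

    for line in perf_script_text.splitlines():
        if not line.strip():          # blank line ends a sample
            flush()
        elif line[0].isspace():       # frame: "<addr> <symbol+offset> (<dso>)"
            parts = line.split()
            frames.append(parts[1].split("+")[0] if len(parts) > 1 else "[unknown]")
        else:                         # header: "<comm> <pid> ... <event>:"
            flush()
            comm = line.split()[0]
    flush()
    return folded

# Made-up input in the style of perf script output (addresses are invented).
SAMPLE = """\
swapper     0 [000]  100.000000:  10101010 cpu-clock:
\tc0108a10 arch_cpu_idle+0x20 ([kernel.kallsyms])
\tc0164f5c do_idle+0xd4 ([kernel.kallsyms])
\tc01652a8 cpu_startup_entry+0x18 ([kernel.kallsyms])

swapper     0 [001]  100.010101:  10101010 cpu-clock:
\tc0108a10 arch_cpu_idle+0x20 ([kernel.kallsyms])
\tc0164f5c do_idle+0xd4 ([kernel.kallsyms])
\tc01652a8 cpu_startup_entry+0x18 ([kernel.kallsyms])

ALG  1234 [000]  100.020202:  10101010 cpu-clock:
\tc0a0b0c0 _raw_spin_unlock_irqrestore+0x10 ([kernel.kallsyms])
\tbf001000 graph_enqueue+0x44 ([kernel.kallsyms])
"""

if __name__ == "__main__":
    for stack, count in sorted(fold_stacks(SAMPLE).items()):
        print(stack, count)
```

Identical stacks are merged into one line with a count, which is why the folded file is much smaller than the raw perf script output. Process names containing spaces and other perf script field layouts are not handled by this sketch.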

6.4 Finally, generate the SVG

perf.folded > perf.svg

This gives an overview of the whole performance picture.

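The y axis shows the call stack, and each layer is a function. The deeper the call stack, the taller the flame. The function at the top is the one that was executing, and everything beneath it is its chain of parent functions.

The x axis shows the sample population. The wider a function is on the x axis, the more often it was sampled, meaning it ran for longer. Note that the x axis does not represent the passage of time: after all call stacks are merged, they are laid out in alphabetical order.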
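The width rule can be checked by hand on folded stacks: a function's total width is the fraction of all samples whose stack contains it. The Python sketch below computes that fraction. The function name frame_widths and the folded data are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
from collections import Counter

def frame_widths(folded):
    """Map each function to the fraction of all samples whose stack contains it."""
    total = sum(folded.values())
    inclusive = Counter()
    for stack, count in folded.items():
        # set(): a recursive function is counted once per sample, not per frame.
        for frame in set(stack.split(";")):
            inclusive[frame] += count
    return {frame: n / total for frame, n in inclusive.items()}

# Hypothetical folded stacks: "root;...;leaf" -> number of samples.
FOLDED = {
    "main;parse;read_file": 30,
    "main;parse;tokenize": 50,
    "main;render": 20,
}

widths = frame_widths(FOLDED)

if __name__ == "__main__":
    for frame in sorted(widths):  # alphabetical, like sibling frames on the x axis
        print(f"{frame:10s} {widths[frame]:.0%}")
```

Here parse spans 80 of the 100 samples, so it would occupy 80% of the graph's width, with tokenize taking most of the space above it. If a function is reached through several different call paths, a real flame graph draws it as several separate boxes; this sketch reports their combined width.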


The colors carry no special meaning. Because a flame graph shows how busy the CPU is, warm hues are usually chosen.



