[Zhihu] Benchmarking Mainstream CPUs (Intel/AMD/Kunpeng/Hygon/Phytium)





PolarDB-X (verified account)

Author: 蛰剑

Preface

This article benchmarks several CPUs under practical workloads such as Sysbench and TPCC, examines each CPU's hardware metrics, and then relates the measured performance in each scenario to the core parameters. By the end you will see who is really stronger between Intel and AMD, how far ARM still trails X86, and where domestic Chinese CPUs actually stand.

Companion piece to this article: 十年后数据库还是不敢拥抱NUMA? (Ten years on, do databases still not dare embrace NUMA?). It describes how different NUMA configurations of the same CPU can produce severalfold performance differences; the latency measurements at the end of this article corroborate those differences, so the two are best read together.

The key parameters of these CPUs are listed below:

| CPU model | Hygon 7280 | AMD EPYC 7H12 | Intel 8163 | Kunpeng 920 | Phytium 2500 |
|---|---|---|---|---|---|
| Physical cores | 32 | 32 | 24 | 48 | 64 |
| Threads per core (HT) | 2 | 2 | 2 | 1 | 1 |
| Sockets | 2 | 2 | 2 | 2 | 2 |
| NUMA nodes | 8 | 2 | 2 | 4 | 16 |
| L1d | 32K | 32K | 32K | 64K | 32K |
| L2 | 512K | 512K | 1024K | 512K | 2048K |

Defining performance

Within the same platform (X86 and ARM are two different platforms), a compiled program can be assumed to execute the same number of instructions everywhere, so execution efficiency comes down to how many instructions can be executed per clock cycle.

How many instructions get executed depends first on CPU clock frequency, but mainstream CPUs today all sit around 2.5GHz. After that come single-core parallelism (multi-issue) and core count, then branch prediction and the like, and these factors largely come down to memory access latency.

X86 and ARM differ first in their instruction sets, and then in the clock frequency and memory latency just mentioned.

About IPC:

IPC: instructions per cycle (insn/cycles), the number of instructions executed per clock cycle; the higher, the faster the program runs.

Program execution time = instruction count / (frequency × IPC)   (single core; for multiple cores, divide again by the core count)
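As a concrete way to see these numbers, perf can report IPC directly. A minimal sketch (the sysbench invocation mirrors the one used later in this article; the event names are perf's generic aliases and may map differently per kernel/CPU):

```bash
# Pin the benchmark to core 1 and count cycles/instructions; perf prints
# "insn per cycle", which is exactly the IPC used in the formula above.
taskset -c 1 perf stat -e cycles,instructions \
    sysbench --num-threads=1 --test=cpu --cpu-max-prime=50000 run
```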

Parameters of the CPUs under comparison

First, the key specs of the CPUs used in these tests; pay attention to clock frequency, the cache sizes at each level, and the NUMA topology.

Hygon 7280

The Hygon 7280 is essentially the AMD Zen architecture, with a maximum IPC of 5.

```
架构:                           x86_64
CPU 运行模式:                   32-bit, 64-bit
字节序:                         Little Endian
Address sizes:                  43 bits physical, 48 bits virtual
CPU:                            128
在线 CPU 列表:                  0-127
每个核的线程数:                 2
每个座的核数:                   32
座:                             2
NUMA 节点:                      8
厂商 ID:                        HygonGenuine
CPU 系列:                       24
型号:                           1
型号名称:                       Hygon C86 7280 32-core Processor
步进:                           1
CPU MHz:                        2194.586
BogoMIPS:                       3999.63
虚拟化:                         AMD-V
L1d 缓存:                       2 MiB
L1i 缓存:                       4 MiB
L2 缓存:                        32 MiB
L3 缓存:                        128 MiB
NUMA 节点0 CPU:                 0-7,64-71
NUMA 节点1 CPU:                 8-15,72-79
NUMA 节点2 CPU:                 16-23,80-87
NUMA 节点3 CPU:                 24-31,88-95
NUMA 节点4 CPU:                 32-39,96-103
NUMA 节点5 CPU:                 40-47,104-111
NUMA 节点6 CPU:                 48-55,112-119
NUMA 节点7 CPU:                 56-63,120-127
```

Note that L1d is shown as 2 MiB, which cannot be right: L1 and L2 are private to each core, and the latency tests show a jump at 32K, so the real L1d here is 32K per core (2 MiB is just 64 physical cores × 32K summed up; the OS presentation is careless). We can verify this directly:

```
[root@hygon3 15:56 /sys]
#cat ./devices/system/cpu/cpu0/cache/index0/size    --- L1d 32K
[root@hygon3 15:56 /sys]
#cat ./devices/system/cpu/cpu0/cache/index1/size    --- L1i 64K
[root@hygon3 15:57 /sys]
#cat ./devices/system/cpu/cpu0/cache/index2/size    --- L2 512K
[root@hygon3 15:57 /sys]
#cat ./devices/system/cpu/cpu0/cache/index3/size    --- L3 8192K
[root@hygon3 15:57 /sys]
#cat ./devices/system/cpu/cpu0/cache/index1/type
Instruction
[root@hygon3 16:00 /sys]
#cat ./devices/system/cpu/cpu0/cache/index0/type
Data
[root@hygon3 16:00 /sys]
#cat ./devices/system/cpu/cpu0/cache/index3/shared_cpu_map
00000000,0000000f,00000000,0000000f   ---- 8M of L3 shared by 4 physical cores
[root@hygon3 16:00 /sys]
#cat devices/system/cpu/cpu0/cache/index3/shared_cpu_list
0-3,64-67
[root@hygon3 16:01 /sys]
#cat ./devices/system/cpu/cpu0/cache/index2/shared_cpu_map
00000000,00000001,00000000,00000001
```
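To dump the whole cache hierarchy in one go instead of one file at a time, a quick loop over the standard sysfs attributes works on any recent Linux kernel (a sketch, not from the original article):

```bash
# level/type/size/shared_cpu_list are standard attributes under
# /sys/devices/system/cpu/cpuN/cache/indexM.
for idx in /sys/devices/system/cpu/cpu0/cache/index*; do
  echo "$idx: L$(cat "$idx/level") $(cat "$idx/type") $(cat "$idx/size") shared_by=$(cat "$idx/shared_cpu_list")"
done
```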

AMD EPYC 7H12

AMD EPYC 7H12 64-Core (an ECS cloud instance, not bare metal), with a maximum IPC of 5:

```
# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                64
On-line CPU(s) list:   0-63
Thread(s) per core:    2
Core(s) per socket:    16
座:                    2
NUMA 节点:             2
厂商 ID:               AuthenticAMD
CPU 系列:              23
型号:                  49
型号名称:              AMD EPYC 7H12 64-Core Processor
步进:                  0
CPU MHz:               2595.124
BogoMIPS:              5190.24
虚拟化:                AMD-V
超管理器厂商:          KVM
虚拟化类型:            完全
L1d 缓存:              32K
L1i 缓存:              32K
L2 缓存:               512K
L3 缓存:               16384K
NUMA 节点0 CPU:        0-31
NUMA 节点1 CPU:        32-63
```

Intel

Two Intel CPUs were used in this comparison, the 8163 and the 8269. Their details are below; the maximum IPC is 4:

```
#lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                96
On-line CPU(s) list:   0-95
Thread(s) per core:    2
Core(s) per socket:    24
Socket(s):             2
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
Stepping:              4
CPU MHz:               2499.121
CPU max MHz:           3100.0000
CPU min MHz:           1000.0000
BogoMIPS:              4998.90
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              33792K
NUMA node0 CPU(s):     0-95

-----8269CY
#lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                104
On-line CPU(s) list:   0-103
Thread(s) per core:    2
Core(s) per socket:    26
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
Stepping:              7
CPU MHz:               3200.000
CPU max MHz:           3800.0000
CPU min MHz:           1200.0000
BogoMIPS:              4998.89
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              36608K
NUMA node0 CPU(s):     0-25,52-77
NUMA node1 CPU(s):     26-51,78-103
```

Kunpeng 920

```
[root@ARM 19:15 /root/lmbench3]
#numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
node 0 size: 192832 MB
node 0 free: 146830 MB
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 193533 MB
node 1 free: 175354 MB
node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 2 size: 193533 MB
node 2 free: 175718 MB
node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 3 size: 193532 MB
node 3 free: 183643 MB
node distances:
node   0   1   2   3
  0:  10  12  20  22
  1:  12  10  22  24
  2:  20  22  10  12
  3:  22  24  12  10

#lscpu
Architecture:          aarch64
Byte Order:            Little Endian
CPU(s):                96
On-line CPU(s) list:   0-95
Thread(s) per core:    1
Core(s) per socket:    48
Socket(s):             2
NUMA node(s):          4
Model:                 0
CPU max MHz:           2600.0000
CPU min MHz:           200.0000
BogoMIPS:              200.00
L1d cache:             64K
L1i cache:             64K
L2 cache:              512K
L3 cache:              24576K
NUMA node0 CPU(s):     0-23
NUMA node1 CPU(s):     24-47
NUMA node2 CPU(s):     48-71
NUMA node3 CPU(s):     72-95
Flags:                 fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm
```

Phytium 2500

Running nops on the Phytium 2500 only reaches an IPC of 1, while other code reaches 2.33; the theoretical maximum is said to be 4, but I never got there.

```
#lscpu
Architecture:          aarch64
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    1
Core(s) per socket:    64
Socket(s):             2
NUMA node(s):          16
Model:                 3
BogoMIPS:              100.00
L1d cache:             32K
L1i cache:             32K
L2 cache:              2048K
L3 cache:              65536K
NUMA node0 CPU(s):     0-7
NUMA node1 CPU(s):     8-15
NUMA node2 CPU(s):     16-23
NUMA node3 CPU(s):     24-31
NUMA node4 CPU(s):     32-39
NUMA node5 CPU(s):     40-47
NUMA node6 CPU(s):     48-55
NUMA node7 CPU(s):     56-63
NUMA node8 CPU(s):     64-71
NUMA node9 CPU(s):     72-79
NUMA node10 CPU(s):    80-87
NUMA node11 CPU(s):    88-95
NUMA node12 CPU(s):    96-103
NUMA node13 CPU(s):    104-111
NUMA node14 CPU(s):    112-119
NUMA node15 CPU(s):    120-127
Flags:                 fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
```

Single-core and hyper-threaded prime computation

Test command below. On every CPU tested, this command takes half as long on 2 physical cores as on 1, so the computation parallelizes perfectly.

```
taskset -c 1 /usr/bin/sysbench --num-threads=1 --test=cpu --cpu-max-prime=50000 run
# bind one core for the single-core test; for 2 threads, bind an HT pair
```
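To find which logical CPU is the HT sibling of a core before binding a pair, sysfs exposes the sibling list. A sketch (on the Hygon box above, core 1's sibling is 65, per the NUMA listing earlier):

```bash
# The sibling pair of core 1; on this Hygon topology it prints "1,65".
cat /sys/devices/system/cpu/cpu1/topology/thread_siblings_list
# Then run the 2-thread test bound to that HT pair:
taskset -c 1,65 /usr/bin/sysbench --num-threads=2 --test=cpu --cpu-max-prime=50000 run
```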

Results are elapsed time in seconds:

| Test | AMD EPYC 7H12 2.5G, CentOS 7.9 | Hygon 7280 2.1GHz, CentOS | Hygon 7280 2.1GHz, Kylin | Intel 8269 2.50G | Intel 8163 2.50G | Intel E5-2682 v4 2.50G |
|---|---|---|---|---|---|---|
| 1 core, prime 50000 | 59 s, IPC 0.56 | 77 s, IPC 0.55 | 89 s, IPC 0.56 | 83 s, IPC 0.41 | 105 s, IPC 0.41 | 109 s, IPC 0.39 |
| HT pair, prime 50000 | 57 s, IPC 0.31 | 74 s, IPC 0.29 | 87 s, IPC 0.29 | 48 s, IPC 0.35 | 60 s, IPC 0.36 | 74 s, IPC 0.29 |

From these results, in a simple pure-compute scenario AMD/Hygon single-core capability is quite strong, but their hyper-threading contributes almost nothing (in database scenarios it does help); Intel's hyper-threading is very effective, with an HT pair reaching about 1.8x a single physical core, and there is clear improvement from the E5 to the 8269. The ARM parts have no hyper-threading, so Kunpeng and Phytium were not tested here.

Computing primes is too simple a task, though; let's look at their real capabilities under a database.

MySQL Sysbench and TPCC performance

MySQL is the 5.7.34 community edition unless noted, and the OS defaults to CentOS. In every test mysqld was pinned to cores, with identical load settings pushing the CPU as close to 100% as possible; "HT" means mysqld was bound to one hyper-thread pair.
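The exact binding commands are not given in the original; a minimal sketch of what "pinning mysqld" looks like, assuming the Intel 8269CY topology shown later (cores 0-25 with HT siblings 52-77 on node 0) and a hypothetical config path:

```bash
# Pin mysqld to 4 physical cores plus their HT siblings on NUMA node 0,
# and allocate its memory from the same node; adjust core IDs per `lscpu`.
numactl -C 0-3,52-55 -m 0 mysqld --defaults-file=/etc/my3307.cnf &
```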

Sysbench point select

The test command looks like this:

```
sysbench --test='/usr/share/doc/sysbench/tests/db/select.lua' --oltp_tables_count=1 --report-interval=1 \
  --oltp-table-size=10000000 --mysql-port=3307 --mysql-db=sysbench_single --mysql-user=root \
  --mysql-password='Bj6f9g96!@#' --max-requests=0 --oltp_skip_trx=on --oltp_auto_inc=on \
  --oltp_range_size=5 --mysql-table-engine=innodb --rand-init=on --max-time=300 \
  --mysql-host=x86.51 --num-threads=4 run
```

Each cell below shows QPS / IPC. (Environment differences during testing: the AMD and Hygon CPUs ran CentOS 7.9, while the Intel CPUs and Kunpeng 920 ran AliOS; "XDB" means Alibaba's XDB replaced the community MySQL Server; Kylin is a domestic OS.)

| Cores | AMD EPYC 7H12 2.5G | Hygon 7280 2.1G | Hygon 7280 2.1G Kylin | Intel 8269 2.50G | Intel 8163 2.50G | Intel 8163 2.50G XDB5.7 | Kunpeng 920-4826 2.6G | Kunpeng 920-4826 2.6G XDB8.0 | FT2500 ALISQL 8.0 local --socket |
|---|---|---|---|---|---|---|---|---|---|
| 1 core | 24674 / 0.54 | 13441 / 0.46 | 10236 / 0.39 | 28208 / 0.75 | 25474 / 0.84 | 29376 / 0.89 | 9694 / 0.49 | 8301 / 0.46 | 3602 / 0.53 |
| 1 HT pair | 36157 / 0.42 | 21747 / 0.38 | 19417 / 0.37 | 36754 / 0.49 | 35894 / 0.60 | 40601 / 0.65 | no HT | no HT | no HT |
| 4 physical cores | 94132 / 0.52 | 49822 / 0.46 | 38033 / 0.37 | 90434 / 0.69 (350%) | 87254 / 0.73 | 106472 / 0.83 | 34686 / 0.42 | 28407 / 0.39 | 14232 / 0.53 |
| 16 physical cores | 325409 / 0.48 | 171630 / 0.38 | 134980 / 0.34 | 371718 / 0.69 (1500%) | 332967 / 0.72 | 446290 / 0.85 (16 cores scale better than 4!) | 116122 / 0.35 | 94697 / 0.33 | 59199 / 0.60 (8 cores: 31210 / 0.59) |
| 32 physical cores | 542192 / 0.43 | 298716 / 0.37 | 255586 / 0.33 | 642548 / 0.64 (2700%) | 588318 / 0.67 | 598637 / 0.81 (CPU 2400%) | 228601 / 0.36 | 177424 / 0.32 | 114020 / 0.65 |

Note: under the Kylin OS the CPU is hard to saturate, reaching only about 90%-95%; the Kylin boxes ran community MySQL-5.7.29. On Phytium, pay special attention to which socket mysqld sits on; all the Phytium numbers above were taken over the local --socket, and over the network the 32-core QPS is 99496 (about 15% network overhead).

The single-physical-core gap between ARM and X86 is quite obvious from these results.

TPCC, 1000 warehouses

Results (the Hygon 7280 ran on both CentOS 7.9 and Kylin; the Kunpeng and Intel CPUs ran on AliOS; Kylin is a domestic OS):

TPCC results at 1000 warehouses, in tpmC (NewOrders); where no CPU utilization is annotated, the CPU was saturated:

| Cores | Intel 8269 2.50G | Intel 8163 2.50G | Hygon 7280 2.1GHz Kylin | Hygon 7280 2.1G CentOS 7.9 | Kunpeng 920-4826 2.6G | Kunpeng 920-4826 2.6G XDB8.0 |
|---|---|---|---|---|---|---|
| 1 physical core | 12392 | 9902 | 4706 | 7011 | 6619 | 4653 |
| 1 HT pair | 17892 | 15324 | 8950 | 11778 | no HT | no HT |
| 4 physical cores | 51525 | 40877 | 19387 (380%) | 30046 | 23959 | 20101 |
| 8 physical cores | 100792 | 81799 | 39664 (750%) | 60086 | 42368 | 40572 |
| 16 physical cores | 160798 (jitter) | 140488 (CPU jitter) | 75013 (1400%) | 106419 (1300-1550%) | 70581 (1200%) | 79844 |
| 24 physical cores | 188051 | 164757 (1600-2100%) | 100841 (1800-2000%) | 130815 (1600-2100%) | 88204 (1600%) | 115355 |
| 32 physical cores | 195292 | 185171 (2000-2500%) | 116071 (1900-2400%) | 142746 (1800-2400%) | 102089 (1900%) | 143567 |
| 48 physical cores | 199691 | 195730 (2100-2600%) | 128188 (2100-2800%) | 149782 (2000-2700%) | 116374 (2500%) | 206055 (4500%) |

The CPU was saturated during these tests (exceptions are annotated). When IPC cannot ramp up, performance is necessarily low. Hyper-threading improves total throughput but lowers per-thread IPC (recall the formula above); for workloads whose IPC is already low, enabling hyper-threading generally brings a bigger relative gain.

Beyond a certain concurrency, TPCC is throttled mainly by lock contention, so piling on more cores does not help much; a PolarDB-X-style deployment (natively distributed MySQL) is an option instead.

For example, on the Hygon 7280 2.1GHz under Kylin, running two mysqld instances with each bound to its own 32 physical cores exactly doubles performance; a sketch of such a split follows.
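The launch commands are not shown in the original; a sketch on the Hygon 7280 topology above (NUMA nodes 0-3 are socket 0, nodes 4-7 are socket 1; ports and paths are hypothetical):

```bash
# One mysqld per socket: each gets 32 physical cores (plus HT siblings) and
# memory from its own socket's NUMA nodes.
numactl -C 0-31,64-95   -m 0-3 mysqld --defaults-file=/etc/my3306.cnf &
numactl -C 32-63,96-127 -m 4-7 mysqld --defaults-file=/etc/my3307.cnf &
```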

At 32 cores, comparing community MySQL on the Hygon 7280 and the Intel 8163, the IPC gap is obvious and tracks the TPS gap almost exactly.

From the Sysbench and TPCC results, AMD and Intel are close, ARM and X86 differ substantially, and domestic CPUs still have a lot of room to improve. As asked earlier: setting aside instruction-set differences, with similar clock frequencies and ample memory, why is there still such a large performance gap?

Hardware performance metrics of these CPUs

Let's go back to the hardware-level data to answer this question.

First, keep in mind the classic chart of CPU access latencies to registers, L1 cache, L2 cache and so on; what matters is the relative differences.

The numbers vary between CPU models, but not by an order of magnitude. L2 latency is about 3x L1, and one L2 read costs the CPU roughly 10 cycles; one memory access is about 120x L1 latency, costing roughly 400 cycles. In the time of a single memory read you could have done i++ 400 times. Memory access is thus the CPU's biggest bottleneck, which is why CPUs pile on so much cache; cache can account for over 50% of a CPU's cost.

Also note that L1 and L2 are private to each core, while L3 is shared by all cores in a socket. Pinning an application to cores is about keeping its L1/L2 contents valid as much as possible: a 1% L1 cache miss rate can cause more than 10% performance degradation (a more than 10x increase in average latency).
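To check how a pinned process is actually behaving cache-wise, perf's generic cache events give a rough read. A sketch (it assumes a single mysqld process, and these aliased events may be missing or mapped differently on some of the CPUs tested here):

```bash
# Sample L1d and last-level-cache hit behaviour of a running mysqld for 10s.
perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses \
    -p "$(pidof mysqld)" sleep 10
```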

When different cores access different memory, performance also drops by 30% up to 200% or more depending on the distance between core and memory (the NUMA topology). Most of the tests below measure the latency of different cores accessing different memory.

Next we use lmbench to measure memory latency on each CPU. stream mainly measures bandwidth, and the latency it reports is the latency with bandwidth saturated; lat_mem_rd measures latency across different working-set sizes.

Phytium 2500

Using stream to test bandwidth and latency: bandwidth keeps dropping and latency keeps rising as NUMA distance increases. The nearest remote NUMA node costs about 10%, exactly matching the distances numactl reports. Cross-socket latency is 3x the intra-node latency and bandwidth is one third, though Socket 1 on its own performs identically to Socket 0. Given these latencies, a single 32-core instance would not perform well and would jitter badly.

```
time for i in $(seq 7 8 128); do echo $i; numactl -C $i -m 0 ./bin/stream -W 5 -N 5 -M 64M; done

#numactl -C 7 -m 0 ./bin/stream -W 5 -N 5 -M 64M
STREAM copy latency: 2.84 nanoseconds
STREAM copy bandwidth: 5638.21 MB/sec
STREAM scale latency: 2.72 nanoseconds
STREAM scale bandwidth: 5885.97 MB/sec
STREAM add latency: 2.26 nanoseconds
STREAM add bandwidth: 10615.13 MB/sec
STREAM triad latency: 4.53 nanoseconds
STREAM triad bandwidth: 5297.93 MB/sec

#numactl -C 7 -m 1 ./bin/stream -W 5 -N 5 -M 64M
STREAM copy latency: 3.16 nanoseconds
STREAM copy bandwidth: 5058.71 MB/sec
STREAM scale latency: 3.15 nanoseconds
STREAM scale bandwidth: 5074.78 MB/sec
STREAM add latency: 2.35 nanoseconds
STREAM add bandwidth: 10197.36 MB/sec
STREAM triad latency: 5.12 nanoseconds
STREAM triad bandwidth: 4686.37 MB/sec

#numactl -C 7 -m 2 ./bin/stream -W 5 -N 5 -M 64M
STREAM copy latency: 3.85 nanoseconds
STREAM copy bandwidth: 4150.98 MB/sec
STREAM scale latency: 3.95 nanoseconds
STREAM scale bandwidth: 4054.30 MB/sec
STREAM add latency: 2.64 nanoseconds
STREAM add bandwidth: 9100.12 MB/sec
STREAM triad latency: 6.39 nanoseconds
STREAM triad bandwidth: 3757.70 MB/sec

#numactl -C 7 -m 3 ./bin/stream -W 5 -N 5 -M 64M
STREAM copy latency: 3.69 nanoseconds
STREAM copy bandwidth: 4340.24 MB/sec
STREAM scale latency: 3.62 nanoseconds
STREAM scale bandwidth: 4422.18 MB/sec
STREAM add latency: 2.47 nanoseconds
STREAM add bandwidth: 9704.82 MB/sec
STREAM triad latency: 5.74 nanoseconds
STREAM triad bandwidth: 4177.85 MB/sec

[root@101a05001 /root/lmbench3]
#numactl -C 7 -m 7 ./bin/stream -W 5 -N 5 -M 64M
STREAM copy latency: 3.95 nanoseconds
STREAM copy bandwidth: 4051.51 MB/sec
STREAM scale latency: 3.94 nanoseconds
STREAM scale bandwidth: 4060.63 MB/sec
STREAM add latency: 2.54 nanoseconds
STREAM add bandwidth: 9434.51 MB/sec
STREAM triad latency: 6.13 nanoseconds
STREAM triad bandwidth: 3913.36 MB/sec

[root@101a05001 /root/lmbench3]
#numactl -C 7 -m 10 ./bin/stream -W 5 -N 5 -M 64M
STREAM copy latency: 8.80 nanoseconds
STREAM copy bandwidth: 1817.78 MB/sec
STREAM scale latency: 8.59 nanoseconds
STREAM scale bandwidth: 1861.65 MB/sec
STREAM add latency: 5.55 nanoseconds
STREAM add bandwidth: 4320.68 MB/sec
STREAM triad latency: 13.94 nanoseconds
STREAM triad bandwidth: 1721.76 MB/sec

[root@101a05001 /root/lmbench3]
#numactl -C 7 -m 11 ./bin/stream -W 5 -N 5 -M 64M
STREAM copy latency: 9.27 nanoseconds
STREAM copy bandwidth: 1726.52 MB/sec
STREAM scale latency: 9.31 nanoseconds
STREAM scale bandwidth: 1718.10 MB/sec
STREAM add latency: 5.65 nanoseconds
STREAM add bandwidth: 4250.89 MB/sec
STREAM triad latency: 14.09 nanoseconds
STREAM triad bandwidth: 1703.66 MB/sec

// testing the local node from the other socket: identical to Node0 performance
[root@101a0500 /root/lmbench3]
#numactl -C 88 -m 11 ./bin/stream -W 5 -N 5 -M 64M
STREAM copy latency: 2.93 nanoseconds
STREAM copy bandwidth: 5454.67 MB/sec
STREAM scale latency: 2.96 nanoseconds
STREAM scale bandwidth: 5400.03 MB/sec
STREAM add latency: 2.28 nanoseconds
STREAM add bandwidth: 10543.42 MB/sec
STREAM triad latency: 4.52 nanoseconds
STREAM triad bandwidth: 5308.40 MB/sec

[root@101a0500 /root/lmbench3]
#numactl -C 7 -m 15 ./bin/stream -W 5 -N 5 -M 64M
STREAM copy latency: 8.73 nanoseconds
STREAM copy bandwidth: 1831.77 MB/sec
STREAM scale latency: 8.81 nanoseconds
STREAM scale bandwidth: 1815.13 MB/sec
STREAM add latency: 5.63 nanoseconds
STREAM add bandwidth: 4265.21 MB/sec
STREAM triad latency: 13.09 nanoseconds
STREAM triad bandwidth: 1833.68 MB/sec
```

Comparing lat_mem_rd run from core 7 against Node0 versus Node15: latency grows with the working set, reaching a 3x gap at 64M, consistent with the stream results above.

In its output, the first column is the working-set size (MB) and the second is the access latency (nanoseconds). You can usually see a latency jump around the L1/L2/L3 sizes; well beyond the L3 size, the latency is simply the memory latency.

Test command:

```
numactl -C 7 -m 0 ./bin/lat_mem_rd -W 5 -N 5 -t 64M   # -C 7: cpu 7, -m 0: node0, -W: warmup, -t: stride
```

On the same machine model, enabling NUMA cut latency severalfold and raised bandwidth severalfold compared to disabling it, so always enable NUMA.
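A quick way to approximate the NUMA-on/NUMA-off comparison on a NUMA-enabled box is to contrast local binding with forced interleaving (a sketch, using the same stream binary as above):

```bash
# Local allocation: core 7 reads memory from its own node 0.
numactl -C 7 --membind=0 ./bin/stream -W 5 -N 5 -M 64M
# Interleaved allocation spreads pages across all nodes, approximating what a
# NUMA-disabled BIOS does, and should show clearly worse latency.
numactl -C 7 --interleave=all ./bin/stream -W 5 -N 5 -M 64M
```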

Kunpeng 920

```
#for i in $(seq 0 15); do echo core:$i; numactl -N $i -m 7 ./bin/stream -W 5 -N 5 -M 64M; done
STREAM copy latency: 1.84 nanoseconds
STREAM copy bandwidth: 8700.75 MB/sec
STREAM scale latency: 1.86 nanoseconds
STREAM scale bandwidth: 8623.60 MB/sec
STREAM add latency: 2.18 nanoseconds
STREAM add bandwidth: 10987.04 MB/sec
STREAM triad latency: 3.03 nanoseconds
STREAM triad bandwidth: 7926.87 MB/sec

#numactl -C 7 -m 1 ./bin/stream -W 5 -N 5 -M 64M
STREAM copy latency: 2.05 nanoseconds
STREAM copy bandwidth: 7802.45 MB/sec
STREAM scale latency: 2.08 nanoseconds
STREAM scale bandwidth: 7681.87 MB/sec
STREAM add latency: 2.19 nanoseconds
STREAM add bandwidth: 10954.76 MB/sec
STREAM triad latency: 3.17 nanoseconds
STREAM triad bandwidth: 7559.86 MB/sec

#numactl -C 7 -m 2 ./bin/stream -W 5 -N 5 -M 64M
STREAM copy latency: 3.51 nanoseconds
STREAM copy bandwidth: 4556.86 MB/sec
STREAM scale latency: 3.58 nanoseconds
STREAM scale bandwidth: 4463.66 MB/sec
STREAM add latency: 2.71 nanoseconds
STREAM add bandwidth: 8869.79 MB/sec
STREAM triad latency: 5.92 nanoseconds
STREAM triad bandwidth: 4057.12 MB/sec

[root@ARM 19:14 /root/lmbench3]
#numactl -C 7 -m 3 ./bin/stream -W 5 -N 5 -M 64M
STREAM copy latency: 3.94 nanoseconds
STREAM copy bandwidth: 4064.25 MB/sec
STREAM scale latency: 3.82 nanoseconds
STREAM scale bandwidth: 4188.67 MB/sec
STREAM add latency: 2.86 nanoseconds
STREAM add bandwidth: 8390.70 MB/sec
STREAM triad latency: 4.78 nanoseconds
STREAM triad bandwidth: 5024.25 MB/sec

#numactl -C 24 -m 3 ./bin/stream -W 5 -N 5 -M 64M
STREAM copy latency: 4.10 nanoseconds
STREAM copy bandwidth: 3904.63 MB/sec
STREAM scale latency: 4.03 nanoseconds
STREAM scale bandwidth: 3969.41 MB/sec
STREAM add latency: 3.07 nanoseconds
STREAM add bandwidth: 7816.08 MB/sec
STREAM triad latency: 5.06 nanoseconds
STREAM triad bandwidth: 4738.66 MB/sec
```

Hygon 7280

Cross-NUMA access (with each socket configured as one NUMA node, this is equivalent to cross-socket) raises RT from about 1.5 to 2.5 ns, considerably better than the Kunpeng 920. We also test how configuring different NUMA node counts on the same chip affects performance, so the NUMA node count is noted in the tests below.

```
[root@hygon8 14:32 /root/lmbench-master]
#lscpu
架构:                           x86_64
CPU 运行模式:                   32-bit, 64-bit
字节序:                         Little Endian
Address sizes:                  43 bits physical, 48 bits virtual
CPU:                            128
在线 CPU 列表:                  0-127
每个核的线程数:                 2
每个座的核数:                   32
座:                             2
NUMA 节点:                      8
厂商 ID:                        HygonGenuine
CPU 系列:                       24
型号:                           1
型号名称:                       Hygon C86 7280 32-core Processor
步进:                           1
CPU MHz:                        2194.586
BogoMIPS:                       3999.63
虚拟化:                         AMD-V
L1d 缓存:                       2 MiB
L1i 缓存:                       4 MiB
L2 缓存:                        32 MiB
L3 缓存:                        128 MiB
NUMA 节点0 CPU:                 0-7,64-71
NUMA 节点1 CPU:                 8-15,72-79
NUMA 节点2 CPU:                 16-23,80-87
NUMA 节点3 CPU:                 24-31,88-95
NUMA 节点4 CPU:                 32-39,96-103
NUMA 节点5 CPU:                 40-47,104-111
NUMA 节点6 CPU:                 48-55,112-119
NUMA 节点7 CPU:                 56-63,120-127

// core 7 is clearly faster than cores 15/23/31: it accesses Node 0 memory
// locally, without crossing a NUMA node (crossing dies)
[root@hygon8 14:32 /root/lmbench-master]
#time for i in $(seq 7 8 64); do echo $i; numactl -C $i -m 0 ./bin/stream -W 5 -N 5 -M 64M; done
7
STREAM copy latency: 1.38 nanoseconds
STREAM copy bandwidth: 11559.53 MB/sec
STREAM scale latency: 1.16 nanoseconds
STREAM scale bandwidth: 13815.87 MB/sec
STREAM add latency: 1.40 nanoseconds
STREAM add bandwidth: 17145.85 MB/sec
STREAM triad latency: 1.44 nanoseconds
STREAM triad bandwidth: 16637.18 MB/sec
15
STREAM copy latency: 1.67 nanoseconds
STREAM copy bandwidth: 9591.77 MB/sec
STREAM scale latency: 1.56 nanoseconds
STREAM scale bandwidth: 10242.50 MB/sec
STREAM add latency: 1.45 nanoseconds
STREAM add bandwidth: 16581.00 MB/sec
STREAM triad latency: 2.00 nanoseconds
STREAM triad bandwidth: 12028.83 MB/sec
23
STREAM copy latency: 1.65 nanoseconds
STREAM copy bandwidth: 9701.49 MB/sec
STREAM scale latency: 1.53 nanoseconds
STREAM scale bandwidth: 10427.98 MB/sec
STREAM add latency: 1.42 nanoseconds
STREAM add bandwidth: 16846.10 MB/sec
STREAM triad latency: 1.97 nanoseconds
STREAM triad bandwidth: 12189.72 MB/sec
31
STREAM copy latency: 1.64 nanoseconds
STREAM copy bandwidth: 9742.86 MB/sec
STREAM scale latency: 1.52 nanoseconds
STREAM scale bandwidth: 10510.80 MB/sec
STREAM add latency: 1.45 nanoseconds
STREAM add bandwidth: 16559.86 MB/sec
STREAM triad latency: 1.92 nanoseconds
STREAM triad bandwidth: 12490.01 MB/sec
39
STREAM copy latency: 2.55 nanoseconds
STREAM copy bandwidth: 6286.25 MB/sec
STREAM scale latency: 2.51 nanoseconds
STREAM scale bandwidth: 6383.11 MB/sec
STREAM add latency: 1.76 nanoseconds
STREAM add bandwidth: 13660.83 MB/sec
STREAM triad latency: 3.68 nanoseconds
STREAM triad bandwidth: 6523.02 MB/sec
```

If Die interleaving is enabled for this chip in the BIOS, the 4 dies in a socket are presented to the OS as a single NUMA node:

```
#lscpu
架构:                           x86_64
CPU 运行模式:                   32-bit, 64-bit
字节序:                         Little Endian
Address sizes:                  43 bits physical, 48 bits virtual
CPU:                            128
在线 CPU 列表:                  0-127
每个核的线程数:                 2
每个座的核数:                   32
座:                             2
NUMA 节点:                      2
厂商 ID:                        HygonGenuine
CPU 系列:                       24
型号:                           1
型号名称:                       Hygon C86 7280 32-core Processor
步进:                           1
CPU MHz:                        2108.234
BogoMIPS:                       3999.45
虚拟化:                         AMD-V
L1d 缓存:                       2 MiB
L1i 缓存:                       4 MiB
L2 缓存:                        32 MiB
L3 缓存:                        128 MiB
// Note: the BIOS here has Die Interleaving enabled, i.e. memory is interleaved
// across the dies within each socket, so a whole socket behaves as one big die
NUMA 节点0 CPU:                 0-31,64-95
NUMA 节点1 CPU:                 32-63,96-127

// stream again with Die interleaving enabled:
// cores 7/15/23/31 now perform identically, since they sit in one NUMA node with
// interleaved memory; the four dies' latencies are averaged out, and none is as
// fast as local access under the 8-node setup
[root@hygon3 16:09 /root/lmbench-master]
#time for i in $(seq 7 8 64); do echo $i; numactl -C $i -m 0 ./bin/stream -W 5 -N 5 -M 64M; done
7
STREAM copy latency: 1.48 nanoseconds
STREAM copy bandwidth: 10782.58 MB/sec
STREAM scale latency: 1.20 nanoseconds
STREAM scale bandwidth: 13364.38 MB/sec
STREAM add latency: 1.46 nanoseconds
STREAM add bandwidth: 16408.32 MB/sec
STREAM triad latency: 1.53 nanoseconds
STREAM triad bandwidth: 15696.00 MB/sec
15
STREAM copy latency: 1.51 nanoseconds
STREAM copy bandwidth: 10601.25 MB/sec
STREAM scale latency: 1.24 nanoseconds
STREAM scale bandwidth: 12855.87 MB/sec
STREAM add latency: 1.46 nanoseconds
STREAM add bandwidth: 16382.42 MB/sec
STREAM triad latency: 1.53 nanoseconds
STREAM triad bandwidth: 15691.48 MB/sec
23
STREAM copy latency: 1.50 nanoseconds
STREAM copy bandwidth: 10700.61 MB/sec
STREAM scale latency: 1.27 nanoseconds
STREAM scale bandwidth: 12634.63 MB/sec
STREAM add latency: 1.47 nanoseconds
STREAM add bandwidth: 16370.67 MB/sec
STREAM triad latency: 1.55 nanoseconds
STREAM triad bandwidth: 15455.75 MB/sec
31
STREAM copy latency: 1.50 nanoseconds
STREAM copy bandwidth: 10637.39 MB/sec
STREAM scale latency: 1.25 nanoseconds
STREAM scale bandwidth: 12778.99 MB/sec
STREAM add latency: 1.46 nanoseconds
STREAM add bandwidth: 16420.65 MB/sec
STREAM triad latency: 1.61 nanoseconds
STREAM triad bandwidth: 14946.80 MB/sec
39
STREAM copy latency: 2.35 nanoseconds
STREAM copy bandwidth: 6807.09 MB/sec
STREAM scale latency: 2.32 nanoseconds
STREAM scale bandwidth: 6906.93 MB/sec
STREAM add latency: 1.63 nanoseconds
STREAM add bandwidth: 14729.23 MB/sec
STREAM triad latency: 3.36 nanoseconds
STREAM triad bandwidth: 7151.67 MB/sec
47
STREAM copy latency: 2.31 nanoseconds
STREAM copy bandwidth: 6938.47 MB/sec
```

Intel 8269CY

```
lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                104
On-line CPU(s) list:   0-103
Thread(s) per core:    2
Core(s) per socket:    26
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
Stepping:              7
CPU MHz:               3200.000
CPU max MHz:           3800.0000
CPU min MHz:           1200.0000
BogoMIPS:              4998.89
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              36608K
NUMA node0 CPU(s):     0-25,52-77
NUMA node1 CPU(s):     26-51,78-103

[root@numaopen /home/ren/lmbench3]
#time for i in $(seq 0 8 51); do echo $i; numactl -C $i -m 0 ./bin/stream -W 5 -N 5 -M 64M; done
0
STREAM copy latency: 1.15 nanoseconds
STREAM copy bandwidth: 13941.80 MB/sec
STREAM scale latency: 1.16 nanoseconds
STREAM scale bandwidth: 13799.89 MB/sec
STREAM add latency: 1.31 nanoseconds
STREAM add bandwidth: 18318.23 MB/sec
STREAM triad latency: 1.56 nanoseconds
STREAM triad bandwidth: 15356.72 MB/sec
16
STREAM copy latency: 1.12 nanoseconds
STREAM copy bandwidth: 14293.68 MB/sec
STREAM scale latency: 1.13 nanoseconds
STREAM scale bandwidth: 14162.47 MB/sec
STREAM add latency: 1.31 nanoseconds
STREAM add bandwidth: 18293.27 MB/sec
STREAM triad latency: 1.53 nanoseconds
STREAM triad bandwidth: 15692.47 MB/sec
32
STREAM copy latency: 1.52 nanoseconds
STREAM copy bandwidth: 10551.71 MB/sec
STREAM scale latency: 1.52 nanoseconds
STREAM scale bandwidth: 10508.33 MB/sec
STREAM add latency: 1.38 nanoseconds
STREAM add bandwidth: 17363.22 MB/sec
STREAM triad latency: 2.00 nanoseconds
STREAM triad bandwidth: 12024.52 MB/sec
40
STREAM copy latency: 1.49 nanoseconds
STREAM copy bandwidth: 10758.50 MB/sec
STREAM scale latency: 1.50 nanoseconds
STREAM scale bandwidth: 10680.17 MB/sec
STREAM add latency: 1.34 nanoseconds
STREAM add bandwidth: 17948.34 MB/sec
STREAM triad latency: 1.98 nanoseconds
STREAM triad bandwidth: 12133.22 MB/sec
48
STREAM copy latency: 1.49 nanoseconds
STREAM copy bandwidth: 10736.56 MB/sec
STREAM scale latency: 1.50 nanoseconds
STREAM scale bandwidth: 10692.93 MB/sec
STREAM add latency: 1.34 nanoseconds
STREAM add bandwidth: 17902.85 MB/sec
STREAM triad latency: 1.96 nanoseconds
STREAM triad bandwidth: 12239.44 MB/sec
```

Intel(R) Xeon(R) CPU E5-2682 v4

```
#time for i in $(seq 0 8 51); do echo $i; numactl -C $i -m 0 ./bin/stream -W 5 -N 5 -M 64M; done
0
STREAM copy latency: 1.59 nanoseconds
STREAM copy bandwidth: 10092.31 MB/sec
STREAM scale latency: 1.57 nanoseconds
STREAM scale bandwidth: 10169.16 MB/sec
STREAM add latency: 1.31 nanoseconds
STREAM add bandwidth: 18360.83 MB/sec
STREAM triad latency: 2.28 nanoseconds
STREAM triad bandwidth: 10503.81 MB/sec
8
STREAM copy latency: 1.55 nanoseconds
STREAM copy bandwidth: 10312.14 MB/sec
STREAM scale latency: 1.56 nanoseconds
STREAM scale bandwidth: 10283.70 MB/sec
STREAM add latency: 1.30 nanoseconds
STREAM add bandwidth: 18416.26 MB/sec
STREAM triad latency: 2.23 nanoseconds
STREAM triad bandwidth: 10777.08 MB/sec
16
STREAM copy latency: 2.02 nanoseconds
STREAM copy bandwidth: 7914.25 MB/sec
STREAM scale latency: 2.02 nanoseconds
STREAM scale bandwidth: 7919.85 MB/sec
STREAM add latency: 1.39 nanoseconds
STREAM add bandwidth: 17276.06 MB/sec
STREAM triad latency: 2.92 nanoseconds
STREAM triad bandwidth: 8231.18 MB/sec
24
STREAM copy latency: 1.99 nanoseconds
STREAM copy bandwidth: 8032.18 MB/sec
STREAM scale latency: 1.98 nanoseconds
STREAM scale bandwidth: 8061.12 MB/sec
STREAM add latency: 1.39 nanoseconds
STREAM add bandwidth: 17313.94 MB/sec
STREAM triad latency: 2.88 nanoseconds
STREAM triad bandwidth: 8318.93 MB/sec

#lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                64
On-line CPU(s) list:   0-63
Thread(s) per core:    2
Core(s) per socket:    16
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz
Stepping:              1
CPU MHz:               2500.000
CPU max MHz:           3000.0000
CPU min MHz:           1200.0000
BogoMIPS:              5000.06
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              40960K
NUMA node0 CPU(s):     0-15,32-47
NUMA node1 CPU(s):     16-31,48-63
```

stream comparison data

Summarizing each CPU's stream memory-access RT, its spread, and the bandwidth:

| CPU | Min RT (ns) | Max RT (ns) | Max copy bandwidth | Min copy bandwidth |
|---|---|---|---|---|
| Sunway 3231 (2 NUMA nodes) | 7.09 | 8.75 | 2256.59 MB/sec | 1827.88 MB/sec |
| Phytium 2500 (16 NUMA nodes) | 2.84 | 10.34 | 5638.21 MB/sec | 1546.68 MB/sec |
| Kunpeng 920 (4 NUMA nodes) | 1.84 | 3.87 | 8700.75 MB/sec | 4131.81 MB/sec |
| Hygon 7280 (8 NUMA nodes) | 1.38 | 2.58 | 11591.48 MB/sec | 6206.99 MB/sec |
| Hygon 5280 (4 NUMA nodes) | 1.22 | 2.52 | 13166.34 MB/sec | 6357.71 MB/sec |
| Intel 8269CY (2 NUMA nodes) | 1.12 | 1.52 | 14293.68 MB/sec | 10551.71 MB/sec |
| Intel E5-2682 (2 NUMA nodes) | 1.58 | 2.02 | 10092.31 MB/sec | 7914.25 MB/sec |

From this data, the CPUs get progressively better down the table (the older-generation E5-2682 aside): on its slowest cores, the Phytium 2500's latency approaches 10x that of the Intel 8269, and its average latency is more than 5x. These latency figures line up closely with the single-core Sysbench TPS results.

lat_mem_rd comparison data

Using cores on different NUMA nodes, run lat_mem_rd against Node0 memory and take only the latency at the largest size, 64M. The latencies match the node distances exactly, so the raw data is omitted here.

64M read latency (ns) by source core, memory on Node0:

- Phytium 2500 (16 NUMA nodes): core 0: 149.976; core 8: 168.805; core 16: 191.415; core 24: 178.283; core 32: 170.814; core 40: 185.699; core 48: 212.281; core 56: 202.479; core 64: 426.176; core 72: 444.367; core 80: 465.894; core 88: 452.245; core 96: 448.352; core 104: 460.603; core 112: 485.989; core 120: 490.402
- Kunpeng 920 (4 NUMA nodes): core 0: 117.323; core 24: 135.337; core 48: 197.782; core 72: 219.416
- Hygon 7280 (8 NUMA nodes): core 0: 106.839; core 8: 168.583; core 16: 163.925; core 24: 163.690; core 32: 289.628; core 40: 288.632; core 48: 236.615; core 56: 291.880
- Hygon 7280 (2 NUMA nodes, Die interleaving enabled): core 0: 153.005; core 16: 152.458; core 32: 272.057; core 48: 269.441
- Hygon 5280 (4 NUMA nodes): core 0: 102.574; core 8: 160.989; core 16: 286.850; core 24: 231.197
- Intel 8269CY (2 NUMA nodes): core 0: 69.792; core 26: 93.107
- Sunway 3231 (2 NUMA nodes): core 0: 215.146; core 32: 282.443

Test command:

```
for i in $(seq 0 8 127); do echo core:$i; numactl -C $i -m 0 ./bin/lat_mem_rd -W 5 -N 5 -t 64M; done >lat.log 2>&1
```
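To pull just the 64M data point for each core out of lat.log, a small awk works, assuming lmbench's usual two-column "size-in-MB latency-in-ns" output:

```bash
# Remember the current core label, then print it next to the 64M latency line.
awk '/^core:/ {core=$0} /^64\./ {print core, $2}' lat.log
```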

The results match the node distances shown by numactl -H exactly; the chip vendors presumably ran this same kind of measurement and wrote the latency ratios in as the distances.
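You can read the distance matrix the kernel uses straight from sysfs and compare it against these measurements (a sketch):

```bash
# Each node's distance vector to every other node; these are the same numbers
# that numactl -H prints in its "node distances" matrix.
for n in /sys/devices/system/node/node*/distance; do
  echo "$n: $(cat "$n")"
done
```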

Finally, to drive the point home, a latency plot measured on an Intel E5 shows the L1/L2/L3 cache latencies; you can see a latency jump near each cache-size boundary.

The y-axis is access latency in nanoseconds and the x-axis is working-set size in MB. Memory latency is omitted from the plot because it is so large that including it would flatten the L1/L2 steps.

Conclusions

X86 outperforms ARM.

AMD and Intel are close in single-core performance; Intel suits large many-core instances, while AMD suits being carved into smaller instances for cloud resale.

Domestic CPUs still have considerable room to improve.

In database scenarios, the performance differences come down mainly to the CPU's memory-access latency.

Cross-NUMA-node latency differs hugely; always enable NUMA so the CPU accesses nearby memory, especially on domestic CPUs.

In database scenarios, lock contention makes it hard for a large instance to saturate the CPU; run multiple mysqld instances instead.

If you must size up a CPU with one number, look first at memory latency, not clock frequency. Vendors' own chart-topping benchmarks are usually simple compute scenarios with little memory access, whereas real workloads spend most of their time hitting memory.

References

十年后数据库还是不敢拥抱NUMA? (Ten years on, do databases still not dare embrace NUMA?)

Intel PAUSE指令变化是如何影响自旋锁以及MySQL的性能的 (How the change to Intel's PAUSE instruction affects spinlocks and MySQL performance)

lmbench测试要考虑cache等 (lmbench testing must account for caches)

CPU的制造和概念 (CPU manufacturing and concepts)

CPU 性能和Cache Line (CPU performance and cache lines)

Perf IPC以及CPU性能 (Perf, IPC, and CPU performance)

CPU性能和CACHE (CPU performance and cache)



Comments (20)

卡拉克西 (2022-07-13): No objection to the first two, but the last three have the nerve to call themselves mainstream?

小子 (2022-09-15): Stirring things up like that is how you show off what you're capable of.

Bill (04-06): The US produces roughly 200,000 graduates a year; China produces 4.2 million, more than 20x as many. That's what unnerves the Americans. Give it two more years and the situation will change; give it five and it will be a different world.

Bill (04-06): To clarify, by graduates I mean engineering graduates, engineers.

Zeratulll (2022-07-19): In the xinchuang (domestic IT substitution) market, the last three really are mainstream.

vage (2022-07-12): An Intel 8163 IPC of 0.67 basically matches what I measured under PostgreSQL; Oracle can reach a higher IPC. From the 8163 perf results you can't tell what share of total cycles goes to memory access; add a few events such as cycle_activity.cycles_l1d_miss and cycle_activity.stalls_mem_any to see the share of cycles spent waiting on memory.

龚佶敏 (2022-11-17): Thanks for sharing, especially this point: "Cross-NUMA-node latency differs hugely; always enable NUMA so the CPU accesses nearby memory, especially on domestic CPUs." Some of the performance problems we've hit seem related to this. When we told the vendor's support about a NUMA problem they didn't seem to follow; after we forwarded this article, they got serious.

赵欣 (yesterday 10:31): Isn't the 7H12 64 physical cores?

swordholder (2022-09-22): No Loongson test?

plantegg (2022-11-07): "Loongson gets added once this passes 200 upvotes" — isn't that what every uploader likes to say?

未设置 (02-08): Loongson's latest server CPUs (3C5000, 3D5000) have rarely been available to individuals; unless you're on good terms with a big Loongson sales channel who can help you get one, it's a real hassle for an ordinary person to buy.

羽生扶苏 (2022-09-14): How is Kunpeng doing these days?

常成 (2022-07-24): Looking forward to test results for the Yitian 710.

南昌之星售票系统 (2022-07-23): Can NUMA-aware scheduling reduce the impact of this memory-access latency?

plantegg, replying to 南昌之星售票系统 (2022-07-27): You mean morsel-driven, the 15721.courses.cs.cmu.edu paper? Morsel-driven doesn't actually depend on NUMA awareness. It's a lot like coroutines: set the number of worker threads to the number of CPU cores, one thread per core, and avoid switching; in practice that yields a higher cache hit rate and fewer context switches. It would still help a lot under numa-interleave; NUMA awareness just makes it even better. Conversely, without morsel-driven/coroutines, numa-aware still beats numa-interleave. Interesting paper you mention; has it been used in production anywhere yet?

南昌之星售票系统, replying to plantegg (2022-07-26): Oh, I meant user-level/upper-layer scheduling like morsel. I get your point now, thanks.

plantegg (2022-07-26): No. Go read it again: PolarDB-X: 十年后数据库还是不敢拥抱NUMA?

南昌之星售票系统, replying to plantegg (2022-07-26): I wasn't talking about reducing the access latency itself; I meant raising the cache hit rate to reduce its impact.

plantegg, replying to 南昌之星售票系统 (2022-07-26): NUMA awareness changes which memory gets accessed; the cache hit rate lives inside the core, so as I understand it, no. The two are unrelated; cache hit rate mostly comes down to core pinning.

sstack (2022-07-13): Mainstream?


