Troubleshooting Slow GPU Speed


# Troubleshooting Slow Multi-GPU Speed

# 1. Check how the GPUs communicate with each other (per reference [1])

```
> nvidia-smi topo --matrix
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      NODE    0-15,32-47      0
GPU1    NODE     X      0-15,32-47      0

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

On the dual-GPU TITAN RTX workstation, GPU0 and GPU1 communicate via NODE.

```
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0     X      PIX     SYS     SYS     0-19            N/A
GPU1    PIX      X      SYS     SYS     0-19            N/A
GPU2    SYS     SYS      X      PIX     0-19            N/A
GPU3    SYS     SYS     PIX      X      0-19            N/A

Legend: (same as above)
```

The output above is from the four-GPU RTX 2080 Ti server, where:

- GPU0 and GPU1 communicate via PIX, as do GPU2 and GPU3
- GPU0 communicates with GPU2 and GPU3 via SYS
- GPU1 communicates with GPU2 and GPU3 via SYS
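A pairwise check can also be done from code: the CUDA runtime reports, for every pair of devices, whether direct P2P access is supported and a relative performance rank of the link. Below is a minimal sketch added for illustration (the file name and output format are assumptions; error checking is omitted):

```cpp
// p2p_topology_check.cu -- programmatic counterpart to nvidia-smi topo --matrix.
// Build: nvcc p2p_topology_check.cu -o p2p_topology_check
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    printf("CUDA devices: %d\n", n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int access = 0, rank = 0;
            // Can device i directly read/write device j's memory (P2P)?
            cudaDeviceGetP2PAttribute(&access, cudaDevP2PAttrAccessSupported, i, j);
            // Relative performance rank of the path between the two devices.
            cudaDeviceGetP2PAttribute(&rank, cudaDevP2PAttrPerformanceRank, i, j);
            printf("GPU%d -> GPU%d : P2P %s, performance rank %d\n",
                   i, j, access ? "supported" : "not supported", rank);
        }
    }
    return 0;
}
```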

# 2. Test the communication bandwidth

Use the P2P bandwidth test provided by the CUDA Samples:

https://github.com/NVIDIA/cuda-samples/tree/master/Samples/p2pBandwidthLatencyTest

Test procedure:

Find the CUDA Samples release matching your CUDA version at https://github.com/NVIDIA/cuda-samples/releases (check the version with nvcc -V; mine is CUDA 10.2). Download and unzip it, then build and run:

```
> cd cuda-samples-10.2/Samples/p2pBandwidthLatencyTest
> make
> ./p2pBandwidthLatencyTest
```

On the dual-GPU machine:

```
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0   556.92   8.65
     1     8.58 457.95
```

On the four-GPU machine:

```
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0   526.19   8.27  10.05  16.41
     1     8.37 527.06  10.12  10.53
     2    10.47   9.67 526.39   5.80
     3    12.30   8.93   8.20 436.79
```
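The measurement that p2pBandwidthLatencyTest performs essentially boils down to timed peer-to-peer copies. Here is a minimal sketch of the same idea using the CUDA runtime API (buffer size, repeat count, device IDs, and file name are illustrative assumptions; error checking is omitted):

```cpp
// peer_bw.cu -- rough GPU0 -> GPU1 copy bandwidth.
// Build: nvcc peer_bw.cu -o peer_bw
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256ull << 20;   // 256 MiB per buffer
    const int repeats = 20;

    void *src = nullptr, *dst = nullptr;
    cudaSetDevice(0);
    cudaMalloc(&src, bytes);
    cudaSetDevice(1);
    cudaMalloc(&dst, bytes);

    // Try to enable direct peer access in both directions. If the platform
    // does not support P2P, the copies below are staged through host memory.
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);

    cudaSetDevice(0);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < repeats; ++i)
        cudaMemcpyPeerAsync(dst, 1, src, 0, bytes, 0);  // GPU0 memory -> GPU1 memory
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU0 -> GPU1: %.2f GB/s\n", (double)bytes * repeats / (ms / 1e3) / 1e9);
    return 0;
}
```

Without working P2P, the result should stay in the same order of magnitude as the low numbers in the matrices above, since every copy has to cross PCIe and the host.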

The communication bandwidth between the GPUs is indeed quite low; the next step is to try fixing it by adding NVLink.

------------ Update 2021/09/19 ------------

Re-test after moving the cards to different PCIe slots:

```
> nvidia-smi topo --matrix
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      SYS     0-15,32-47      0
GPU1    SYS      X      16-31,48-63     1

> ./p2pBandwidthLatencyTest
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0   554.55  11.42
     1    11.30 558.10
```

After the slot change the two GPUs bind to different CPU/NUMA nodes, and the communication bandwidth has improved somewhat.

# 3. Check the NVLink status

```
> nvidia-smi nvlink --status
GPU 0: TITAN RTX (UUID: GPU-486a9b0f-d80e-668e-fa93-cf5988109248)
     Link 0:
     Link 1:
GPU 1: TITAN RTX (UUID: GPU-726723cc-04b1-d7a2-7638-0b73154449bc)
     Link 0:
     Link 1:
```

Both links are in the inactive state.
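The same link state can be read programmatically through NVML, the library nvidia-smi itself is built on. A minimal sketch for illustration (the assumption of 2 links per TITAN RTX and the file name are mine; error checking is mostly omitted):

```cpp
// nvlink_state.cpp -- query NVLink link state via NVML.
// Build (library paths may vary): nvcc nvlink_state.cpp -o nvlink_state -lnvidia-ml
#include <cstdio>
#include <nvml.h>

int main() {
    nvmlInit();
    unsigned int count = 0;
    nvmlDeviceGetCount(&count);
    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        nvmlDeviceGetHandleByIndex(i, &dev);
        // TITAN RTX exposes 2 NVLink links; the loop bound is an assumption.
        for (unsigned int link = 0; link < 2; ++link) {
            nvmlEnableState_t active;
            if (nvmlDeviceGetNvLinkState(dev, link, &active) == NVML_SUCCESS)
                printf("GPU%u link %u: %s\n", i, link,
                       active == NVML_FEATURE_ENABLED ? "active" : "inactive");
        }
    }
    nvmlShutdown();
    return 0;
}
```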

# 4. Install NCCL


4.1 Download NCCL

https://developer.nvidia.com/nccl

4.2 Install NCCL

```
sudo dpkg -i nccl-local-repo-ubuntu1804-2.10.3-cuda10.2_1.0-1_amd64.deb
sudo apt update
sudo apt install libnccl2 libnccl-dev
```

4.3 Check the NCCL installation

```
> dpkg -l | grep nccl
ii  libnccl-dev                                  2.10.3-1+cuda10.2       amd64   NVIDIA Collective Communication Library (NCCL) Development Files
ii  libnccl2                                     2.10.3-1+cuda10.2       amd64   NVIDIA Collective Communication Library (NCCL) Runtime
ii  libvncclient1:amd64                          0.9.11+dfsg-1ubuntu1.4  amd64   API to write one's own VNC server - client library
ii  nccl-local-repo-ubuntu1804-2.10.3-cuda10.2   1.0-1                   amd64   nccl-local repository configuration files
ii  nccl-repo-ubuntu1804-2.7.8-ga-cuda10.2       1-1                     amd64   nccl repository configuration files
```
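To sanity-check that the installed NCCL can actually drive both GPUs, a tiny all-reduce is enough. A minimal sketch, assuming the dual-GPU setup from above (buffer size and file name are illustrative; error checking is omitted):

```cpp
// nccl_check.cu -- two-GPU all-reduce driven by a single process.
// Build: nvcc nccl_check.cu -o nccl_check -lnccl
#include <cstdio>
#include <cuda_runtime.h>
#include <nccl.h>

int main() {
    const int nDev = 2;            // assumes the dual-GPU machine from above
    const int count = 1 << 20;     // 1M floats per GPU
    int devs[nDev] = {0, 1};

    float* buf[nDev];
    cudaStream_t streams[nDev];
    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(devs[i]);
        cudaMalloc(&buf[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    // One communicator per GPU, all owned by this process.
    ncclComm_t comms[nDev];
    ncclCommInitAll(comms, nDev, devs);

    // In-place sum across both GPUs. If this completes, NCCL and its
    // transport (P2P or shared-memory/host fallback) are working.
    ncclGroupStart();
    for (int i = 0; i < nDev; ++i)
        ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum, comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(devs[i]);
        cudaStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    printf("NCCL all-reduce across %d GPUs finished\n", nDev);
    return 0;
}
```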

Testing the communication bandwidth again shows no change, so NCCL is not the cause:

```
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0   555.65   8.89
     1     8.74 453.78
```

# 5. Possible causes

For practical reasons I could not pin down the exact cause. Two likely causes are listed below; I hope they help.

- Overheating: per reference [2], at 85-90 °C the card hits its thermal limit and starts throttling. A possible fix is switching to water cooling or a blade-style server chassis to keep GPU temperatures under control.
- Missing NVLink: as nvidia-smi nvlink --status showed, the links are inactive. A possible fix is buying an NVLink bridge and checking whether it raises the GPU-to-GPU bandwidth.

# 6. Further troubleshooting per reference [3]

------------ Update 2021/09/19 ------------

6.1 Check the NVLink installation status

```
> nvidia-smi topo -p2p n
        GPU0    GPU1
GPU0     X      NS
GPU1    NS       X

Legend:

  X   = Self
  OK  = Status Ok
  CNS = Chipset not supported
  GNS = GPU not supported
  TNS = Topology not supported
  NS  = Not supported
  U   = Unknown
```

The result is Not supported (NS).

6.2 Check P2P communication

```
> cd ~/NVIDIA_CUDA-10.2_Samples/0_Simple/simpleP2P
> make
> ./simpleP2P
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2

Checking GPU(s) for support of peer to peer memory access...
> Peer access from TITAN RTX (GPU0) -> TITAN RTX (GPU1) : No
> Peer access from TITAN RTX (GPU1) -> TITAN RTX (GPU0) : No
Two or more GPUs with Peer-to-Peer access capability are required for ./simpleP2P.
Peer to Peer access is not available amongst GPUs in the system, waiving test.
```

The result shows that without an NVLink bridge, the two cards cannot use P2P communication.
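The check simpleP2P performs essentially boils down to cudaDeviceCanAccessPeer. A minimal sketch of the same probe, assuming devices 0 and 1 (file name is illustrative; error checking is omitted):

```cpp
// peer_access_check.cu -- can the two GPUs map each other's memory?
// Build: nvcc peer_access_check.cu -o peer_access_check
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int can01 = 0, can10 = 0;
    // Non-zero means the first device can directly access the second
    // device's memory over NVLink or a suitable PCIe path.
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    printf("Peer access GPU0 -> GPU1: %s\n", can01 ? "Yes" : "No");
    printf("Peer access GPU1 -> GPU0: %s\n", can10 ? "Yes" : "No");

    if (can01 && can10) {
        // P2P still has to be enabled explicitly before peer copies/loads.
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);
        printf("Peer access enabled in both directions\n");
    }
    return 0;
}
```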

6.3 Enable TCC compute mode

```
nvidia-smi -i 0 -dm TCC
```

(Note that the TCC driver model only exists on Windows, so this step does not apply to the Linux machines above.)

# 7. References

[1] https://www.microway.com/hpc-tech-tips/nvidia-smi_control-your-gpus/

[2] https://bizon-tech.com/blog/bizon-z9000-8gpu-deeplearning-server-rtx2080ti-titan-benchamarks-review#:~:text=Idle%20temperature%3A%2040C%20Max%20load%20noise%20level%20%28deep,%28deep%20learning%29%3A%206X%20times%20lower%20%28150%20vs.%20900%29

[3] https://www.cnblogs.com/devilmaycry812839668/p/13264080.html


