
Troubleshooting a PyTorch multi-GPU DDP training hang


2023-04-22 07:07 | Source: compiled from the web




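Inspecting the GPUs with nvtop shows that GPU 0 is assigned as many processes as the value of nproc_per_node, which does not match the expected layout of N processes on N GPUs (one process per GPU). The code that sets up DDP is shown below: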

    model = MyNet(config).cuda()
    model = torch.nn.parallel.DistributedDataParallel(
        model,
        device_ids=[config.LOCAL_RANK],
        output_device=config.LOCAL_RANK,
        broadcast_buffers=False,
        find_unused_parameters=True,
    )




Option 1: spawn the processes manually with torch.multiprocessing
Option 2: use automatic initialization


    import torch
    import torch.distributed as dist
    import torch.nn as nn
    import torch.optim as optim

    from torch.nn.parallel import DistributedDataParallel as DDP


    class ToyModel(nn.Module):
        def __init__(self):
            super(ToyModel, self).__init__()
            self.net1 = nn.Linear(10, 10)
            self.relu = nn.ReLU()
            self.net2 = nn.Linear(10, 5)

        def forward(self, x):
            return self.net2(self.relu(self.net1(x)))


    def demo_basic():
        dist.init_process_group("nccl")
        rank = dist.get_rank()
        print(f"Start running basic DDP example on rank {rank}.")

        # create model and move it to GPU with id rank
        device_id = rank % torch.cuda.device_count()
        model = ToyModel().to(device_id)
        ddp_model = DDP(model, device_ids=[device_id])

        loss_fn = nn.MSELoss()
        optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

        optimizer.zero_grad()
        outputs = ddp_model(torch.randn(20, 10))
        labels = torch.randn(20, 5).to(device_id)
        loss_fn(outputs, labels).backward()
        optimizer.step()


    if __name__ == "__main__":
        demo_basic()


    $ torchrun --nproc_per_node=8
    *****************************************
    Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
    *****************************************
    Start running basic DDP example on rank 7.
    Start running basic DDP example on rank 4.
    Start running basic DDP example on rank 2.
    Start running basic DDP example on rank 6.
    Start running basic DDP example on rank 1.
    Start running basic DDP example on rank 3.
    Start running basic DDP example on rank 0.
    Start running basic DDP example on rank 5.

Solution

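After comparing with the official example, it is fairly certain that the problem lies in how CUDA devices are assigned. The main function was modified as follows: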

    dist.init_process_group("nccl")
    rank = dist.get_rank()
    print(f"Start running basic DDP example on rank {rank}.")

    # create model and move it to GPU with id rank
    device_id = rank % torch.cuda.device_count()
    model = MyNet(config).to(device_id)
    ddp_model = DDP(model, broadcast_buffers=False, find_unused_parameters=True)
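The modulo mapping used above can be checked by hand without any GPU. The sketch below is only an illustration: `device_index` is a hypothetical helper that mirrors `rank % torch.cuda.device_count()`, and the two-node layout is an assumed scenario in which the launcher numbers ranks contiguously within each node.

```python
def device_index(rank: int, gpus_per_node: int) -> int:
    """Mirror `rank % torch.cuda.device_count()` without needing a GPU."""
    return rank % gpus_per_node


# One node, 8 processes, 8 GPUs: every rank gets its own device.
single_node = [device_index(rank, 8) for rank in range(8)]
print(single_node)  # [0, 1, 2, 3, 4, 5, 6, 7]

# Hypothetical two-node job with 4 GPUs per node: ranks 4..7 wrap back to 0..3.
two_nodes = [device_index(rank, 4) for rank in range(8)]
print(two_nodes)  # [0, 1, 2, 3, 0, 1, 2, 3]
```

With this mapping no two processes on the same node share a device, which is the one-process-per-GPU layout that the original setup failed to produce.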


device_ids (list of python:int or torch.device): CUDA devices. 1) For single-device modules, device_ids can contain exactly one device id, which represents the only CUDA device where the input module corresponding to this process resides. Alternatively, device_ids can also be None. 2) For multi-device modules and CPU modules, device_ids must be None. When device_ids is None for both cases, both the input data for the forward pass and the actual module must be placed on the correct device. (default: None)

output_device (int or torch.device): Device location of output for single-device CUDA modules. For multi-device modules and CPU modules, it must be None, and the module itself dictates the output location. (default: device_ids[0] for single-device modules)


"RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1"

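It appears that some parameters in the model still sit on a different GPU from the inputs. The dataset is fed from pickled NumPy arrays that have to be converted to tensors, so the fix is to pass a device_id argument through the forward function of every layer, which determines the device on which the converted CUDA tensor is stored.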

    def forward(self, input, device):
        input = torch.from_numpy(input).float().cuda(device, non_blocking=True)

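The shorthand input.cuda(), called without a device argument, places the tensor on the current CUDA device, which is cuda:0 unless it has been changed, and that causes the error.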
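An alternative to threading a device argument through every forward call is to change the current CUDA device once per process, so that bare `.cuda()` calls stop landing on cuda:0. The sketch below is not what the original fix did. It assumes the job is launched with torchrun, which exports a `LOCAL_RANK` environment variable to each worker, and the helper names `local_rank_from_env` and `pin_process_to_gpu` are hypothetical.

```python
import os


def local_rank_from_env(env=os.environ) -> int:
    """Return the per-node process index that torchrun exports as LOCAL_RANK."""
    # Fall back to 0 so the script still runs as a plain single process.
    return int(env.get("LOCAL_RANK", 0))


def pin_process_to_gpu(local_rank: int):
    """Make `local_rank` the current CUDA device for this process."""
    import torch  # imported here so the helper above works without torch

    # After this call, .cuda() without arguments targets this GPU, not cuda:0.
    torch.cuda.set_device(local_rank)
    return torch.device("cuda", local_rank)
```

In a training script this would be called once, before the model or any tensor is moved to the GPU, for example `device = pin_process_to_gpu(local_rank_from_env())`.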


Getting Started with Distributed Data Parallel — PyTorch Tutorials 1.12.1+cu102 documentation
python - Stuck at this error "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu" - Stack Overflow



