pytorch

2023-08-21 04:02 | Source: compiled from the web | Views: 265

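I'm new to PyTorch's DistributedDataParallel(), but I've found that most tutorials save the model on local rank 0 during training. This means that if I have 3 machines with 4 GPUs each, I end up with 3 models, one saved by each machine.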

For example, in the PyTorch ImageNet tutorial, on line 252:

    if not args.multiprocessing_distributed or (args.multiprocessing_distributed
            and args.rank % ngpus_per_node == 0):
        save_checkpoint({...})

They save the model if rank % ngpus_per_node == 0.

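To the best of my knowledge, DistributedDataParallel() automatically performs an all-reduce on the loss in the backend; without any further work, every process can sync the loss automatically based on that. The models on the processes will differ only slightly at the end of training. That means saving just one model should be enough.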

So why don't we just save the model on rank == 0, rather than on rank % ngpus_per_node == 0?

And which model should I use if I get multiple models?

If this is the right way to save a model in distributed training, should I merge them, use one of them, or run inference based on all three models?

Please let me know if I'm wrong.

Best answer

What's going on

Please correct me if I'm wrong anywhere.

The change you are referring to was introduced in 2018 via this commit and described as:

in multiprocessing mode, only one process will write the checkpoint

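Previously, checkpoints were saved without any if block, so every process (one per GPU on each node) would save a model. That is indeed wasteful and would most likely overwrite the saved model multiple times on each node.

Now, we are talking about distributed multiprocessing (possibly many workers, each with possibly multiple GPUs).

args.rank for each process is therefore modified inside the script by this line:

    args.rank = args.rank * ngpus_per_node + gpu

which has the following comment: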

For multiprocessing distributed training, rank needs to be the global rank among all the processes
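As a minimal sketch of the rank computation quoted above, assuming the question's setup of 3 nodes with 4 GPUs each (the names n_nodes, global_ranks, savers_per_node and savers_global are illustrative and not taken from the training script), the following lists which global ranks pass each save condition:

```python
# Illustrative setup from the question: 3 machines, 4 GPUs each.
n_nodes = 3
ngpus_per_node = 4

# Same formula as in the script: node rank * GPUs per node + local GPU index.
global_ranks = [
    node_rank * ngpus_per_node + gpu
    for node_rank in range(n_nodes)
    for gpu in range(ngpus_per_node)
]

# Condition used by the script: one saving process per node.
savers_per_node = [r for r in global_ranks if r % ngpus_per_node == 0]

# Alternative condition from the question: one saving process in the whole job.
savers_global = [r for r in global_ranks if r == 0]

print(savers_per_node)  # [0, 4, 8]
print(savers_global)    # [0]
```

Every rank that passes the per-node condition has local GPU index 0, so it is the first process on its node.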

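Hence args.rank is a unique ID among all GPUs across all nodes (or so it seems).

If so, and each node has ngpus_per_node GPUs (this training code assumes every node has the same number of GPUs, from what I've gathered), then the model is saved for only one GPU on each node: the one whose global rank is a multiple of ngpus_per_node, i.e. the first GPU (local index 0). In your example with 3 machines and 4 GPUs each, you would get 3 saved models (hopefully I understand this code correctly, as it's pretty convoluted).

If you used rank == 0, only one model per world (where the world size would be n_gpus * n_nodes) would be saved.

Questions

First question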

So why don't we just save model on rank == 0, but rank % ngpus_per_node == 0 ?

I will start with your assumption, namely:

To the best of my knowledge, DistributedDataParallel() will automatic do all reduce to the loss on the backend, without doing any further job, every process can sync the loss automatically base on that.

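To be precise, it has nothing to do with the loss but rather with gradient accumulation and applying corrections to the weights, as per the documentation (note the last sentence):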

This container parallelizes the application of the given module by splitting the input across the specified devices by chunking in the batch dimension. The module is replicated on each machine and each device, and each such replica handles a portion of the input. During the backwards pass, gradients from each node are averaged.
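To illustrate the averaging the documentation describes, here is a toy sketch in plain Python (no PyTorch and no real communication). It assumes a one-parameter model y = w * x with a squared-error loss, and simulates 16 replicas (4 nodes with 4 GPUs each) that each receive 64 of 1024 samples. All names are made up for illustration; it only shows why replicas that start from the same weight and apply the same averaged gradient end up identical.

```python
import random

random.seed(0)

WORLD_SIZE = 4 * 4                       # 4 nodes x 4 GPUs
TOTAL_BATCH = 1024
PER_REPLICA = TOTAL_BATCH // WORLD_SIZE  # samples handled by each GPU
LR = 0.01

# Synthetic data for y = 3 * x.
xs = [random.uniform(-1.0, 1.0) for _ in range(TOTAL_BATCH)]
ys = [3.0 * x for x in xs]


def local_gradient(w, shard_x, shard_y):
    """Mean gradient of (w*x - y)^2 with respect to w over one shard."""
    total = sum(2.0 * (w * x - y) * x for x, y in zip(shard_x, shard_y))
    return total / len(shard_x)


# Every replica starts from the same initial weight.
weights = [0.5] * WORLD_SIZE

for step in range(5):
    grads = []
    for rank in range(WORLD_SIZE):
        lo, hi = rank * PER_REPLICA, (rank + 1) * PER_REPLICA
        grads.append(local_gradient(weights[rank], xs[lo:hi], ys[lo:hi]))
    # Stand-in for gradient averaging: every replica sees the same mean.
    avg_grad = sum(grads) / WORLD_SIZE
    # Each replica applies the same update to its own copy of the weight.
    weights = [w - LR * avg_grad for w in weights]

print(PER_REPLICA)             # 64
print(len(set(weights)) == 1)  # True
```

Because every replica applies the same update to the same starting value, the copies never diverge, which is why any one of the saved checkpoints can be used.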

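So, when the model is created with some weights, it is replicated on all devices (each GPU on each node). Each GPU then gets a part of the input (say, for a total batch size of 1024 and 4 nodes with 4 GPUs each, every GPU would get 64 elements), calculates the forward pass and the loss, and performs backpropagation via the .backward() tensor method. The gradients are then averaged across all processes (DistributedDataParallel uses an all-reduce for this), and every process applies the same optimizer step to its own replica, so the module's state is always the same across all machines.

Note: I'm not sure how exactly this averaging takes place (I don't see it stated explicitly in the docs), though I assume the gradients are first averaged across the GPUs of a node and then across all nodes, as that would be the most efficient, I think.

Now, why would you save a model for each node in such a case? In principle you could save only one (as all modules will be exactly the same), but that has some downsides:

- Say the node where your model was saved crashes and the file is lost. You have to redo everything. Saving a model is not too costly an operation (it is done once per epoch or less), so it can easily be done for each node/worker.
- You have to resume training. This means the model would have to be copied to each worker (along with some necessary metadata, though I don't think that is the case here).
- Nodes have to wait for every forward pass to finish anyway (so the gradients can be averaged). If saving the model takes a lot of time, GPUs/CPUs would sit idle (or some other synchronization scheme would have to be applied; I don't think there is one in PyTorch). This makes saving on every node somewhat "no-cost" if you look at the overall picture.

Question 2 (and 3)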

And which model should I used for if I get multiple model?

It doesn't matter, as all of them will be exactly the same: the same corrections are applied via the optimizer to models with the same initial weights.

You can use something like this to load your saved .pth model:

    import torch

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    parallel_model = torch.nn.DataParallel(MyModelGoesHere())
    parallel_model.load_state_dict(
        torch.load("my_saved_model_state_dict.pth", map_location=str(device))
    )

    # DataParallel exposes the wrapped model as the .module attribute
    usable_model = parallel_model.module
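One related detail when loading: a state dict saved from a DataParallel or DistributedDataParallel wrapper has every key prefixed with "module.", because the wrapper stores the original model in its module attribute. As a hedged sketch, the helper below (strip_module_prefix is a hypothetical name, not a PyTorch function) removes that prefix so the weights could be loaded into an unwrapped model. It works on a plain dict with placeholder values so that it runs without PyTorch.

```python
def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading wrapper prefix removed from keys.

    Hypothetical helper for illustration; keys without the prefix are kept as-is.
    """
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


# Placeholder values stand in for tensors; the key names are made up.
saved = {"module.fc.weight": [1.0, 2.0], "module.fc.bias": [0.5]}
cleaned = strip_module_prefix(saved)
print(sorted(cleaned))  # ['fc.bias', 'fc.weight']
```

With a real checkpoint, the cleaned dict could then be passed to the bare model's load_state_dict. Saving model.module.state_dict() in the first place avoids the prefix altogether.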

Regarding "pytorch - how to save and load models from DistributedDataParallel training", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/61037819/
