How to Fix Non-Reproducible PyTorch Deep Learning Results


Solution: add the following snippet at the top of your training script:

import random
import numpy as np
import torch

# model reproducibility
seed = 42
random.seed(seed)
# os.environ['PYTHONHASHSEED'] = str(seed)  # disable hash randomization so the experiment is reproducible
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
# torch.backends.cudnn.benchmark = False
# torch.backends.cudnn.deterministic = True
# print(f"Random seed set as {seed}")

* If that works for you, there is no need to read further!

1. Add the snippet below at the top of your train.py or main.py to fix every random seed: Python's built-in random, NumPy, and PyTorch (CPU and GPU). Enabling deterministic = True will slow training down, but it lets the same input produce the same test accuracy or loss.

import os
import random
import numpy as np
import torch

def seed_torch(seed: int = 42) -> None:
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)  # disable hash randomization so the experiment is reproducible
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":16:8"
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # if you are using multi-GPU.
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    # torch.use_deterministic_algorithms(True)  # raises an error on every non-deterministic op, so you can hunt them down one by one
    print(f"Random seed set as {seed}")

seed_torch()
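A quick sanity check that the seeding works, as a minimal sketch built on the seed_torch above (sample_once is a hypothetical helper, not part of the recipe): re-seeding and drawing the same random tensor twice must give bit-identical results.

def sample_once(seed: int) -> torch.Tensor:
    seed_torch(seed)          # re-seed Python, NumPy, and PyTorch
    return torch.randn(8, 8)  # any random draw works for this check

# Same seed: bit-identical tensors. Different seeds: (almost surely) different.
assert torch.equal(sample_once(42), sample_once(42))
assert not torch.equal(sample_once(42), sample_once(43))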

2. Reproducibility test on the official PyTorch quickstart tutorial (FashionMNIST): with the seeding above, the test accuracy and loss reproduce 100%.

""" `Learn the Basics `_ || **Quickstart** || `Tensors `_ || `Datasets & DataLoaders `_ || `Transforms `_ || `Build Model `_ || `Autograd `_ || `Optimization `_ || `Save & Load Model `_ Quickstart =================== This section runs through the API for common tasks in machine learning. Refer to the links in each section to dive deeper. Working with data ----------------- PyTorch has two `primitives to work with data `_: ``torch.utils.data.DataLoader`` and ``torch.utils.data.Dataset``. ``Dataset`` stores the samples and their corresponding labels, and ``DataLoader`` wraps an iterable around the ``Dataset``. """ import torch import random import os import numpy as np from torch import nn from torch.utils.data import DataLoader from torchvision import datasets from torchvision.transforms import ToTensor def seed_torch(seed: int = 42) -> None: random.seed(seed) os.environ['PYTHONHASHSEED'] = str(seed) # 为了禁止hash随机化,使得实验可复现 os.environ["CUDA_LAUNCH_BLOCKING"] = "1" os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":16:8" np.random.seed(seed) torch.manual_seed(seed) torch.cuda.manual_seed(seed) torch.cuda.manual_seed_all(seed) # if you are using multi-GPU. torch.backends.cudnn.benchmark = False torch.backends.cudnn.deterministic = True torch.use_deterministic_algorithms(True) print(f"Random seed set as {seed}") seed_torch() ###################################################################### # PyTorch offers domain-specific libraries such as `TorchText `_, # `TorchVision `_, and `TorchAudio `_, # all of which include datasets. For this tutorial, we will be using a TorchVision dataset. # # The ``torchvision.datasets`` module contains ``Dataset`` objects for many real-world vision data like # CIFAR, COCO (`full list here `_). In this tutorial, we # use the FashionMNIST dataset. Every TorchVision ``Dataset`` includes two arguments: ``transform`` and # ``target_transform`` to modify the samples and labels respectively. # Download training data from open datasets. training_data = datasets.FashionMNIST( root="data", train=True, download=True, transform=ToTensor(), ) # Download test data from open datasets. test_data = datasets.FashionMNIST( root="data", train=False, download=True, transform=ToTensor(), ) ###################################################################### # We pass the ``Dataset`` as an argument to ``DataLoader``. This wraps an iterable over our dataset, and supports # automatic batching, sampling, shuffling and multiprocess data loading. Here we define a batch size of 64, i.e. each element # in the dataloader iterable will return a batch of 64 features and labels. batch_size = 64 # Create data loaders. train_dataloader = DataLoader(training_data, batch_size=batch_size) test_dataloader = DataLoader(test_data, batch_size=batch_size) for X, y in test_dataloader: print(f"Shape of X [N, C, H, W]: {X.shape}") print(f"Shape of y: {y.shape} {y.dtype}") break ###################################################################### # Read more about `loading data in PyTorch `_. # ###################################################################### # -------------- # ################################ # Creating Models # ------------------ # To define a neural network in PyTorch, we create a class that inherits # from `nn.Module `_. We define the layers of the network # in the ``__init__`` function and specify how data will pass through the network in the ``forward`` function. To accelerate # operations in the neural network, we move it to the GPU if available. # Get cpu or gpu device for training. 
device = "cuda" if torch.cuda.is_available() else "cpu" print(f"Using {device} device") # Define model class NeuralNetwork(nn.Module): def __init__(self): super(NeuralNetwork, self).__init__() self.flatten = nn.Flatten() self.linear_relu_stack = nn.Sequential( nn.Linear(28*28, 512), nn.ReLU(), nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10) ) def forward(self, x): x = self.flatten(x) logits = self.linear_relu_stack(x) return logits model = NeuralNetwork().to(device) print(model) ###################################################################### # Read more about `building neural networks in PyTorch `_. # ###################################################################### # -------------- # ##################################################################### # Optimizing the Model Parameters # ---------------------------------------- # To train a model, we need a `loss function `_ # and an `optimizer `_. loss_fn = nn.CrossEntropyLoss() optimizer = torch.optim.SGD(model.parameters(), lr=1e-3) ####################################################################### # In a single training loop, the model makes predictions on the training dataset (fed to it in batches), and # backpropagates the prediction error to adjust the model's parameters. def train(dataloader, model, loss_fn, optimizer): size = len(dataloader.dataset) model.train() for batch, (X, y) in enumerate(dataloader): X, y = X.to(device), y.to(device) # Compute prediction error pred = model(X) loss = loss_fn(pred, y) # Backpropagation optimizer.zero_grad() loss.backward() optimizer.step() if batch % 100 == 0: loss, current = loss.item(), batch * len(X) print(f"loss: {loss:>7f} [{current:>5d}/{size:>5d}]") ############################################################################## # We also check the model's performance against the test dataset to ensure it is learning. def test(dataloader, model, loss_fn): size = len(dataloader.dataset) num_batches = len(dataloader) model.eval() test_loss, correct = 0, 0 with torch.no_grad(): for X, y in dataloader: X, y = X.to(device), y.to(device) pred = model(X) test_loss += loss_fn(pred, y).item() correct += (pred.argmax(1) == y).type(torch.float).sum().item() test_loss /= num_batches correct /= size print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n") ############################################################################## # The training process is conducted over several iterations (*epochs*). During each epoch, the model learns # parameters to make better predictions. We print the model's accuracy and loss at each epoch; we'd like to see the # accuracy increase and the loss decrease with every epoch. run_times = 2 epochs = 5 for t in range(epochs): print(f"Epoch {t + 1}\n-------------------------------") train(train_dataloader, model, loss_fn, optimizer) test(test_dataloader, model, loss_fn) print("Done!") ###################################################################### # Read more about `Training your model `_. # ###################################################################### # -------------- # ###################################################################### # Saving Models # ------------- # A common way to save a model is to serialize the internal state dictionary (containing the model parameters). 
# torch.save(model.state_dict(), "model.pth") # print("Saved PyTorch Model State to model.pth") ###################################################################### # Loading Models # ---------------------------- # # The process for loading a model includes re-creating the model structure and loading # the state dictionary into it. model = NeuralNetwork() model.load_state_dict(torch.load("model.pth")) ############################################################# # This model can now be used to make predictions. classes = [ "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat", "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot", ] model.eval() x, y = test_data[0][0], test_data[0][1] with torch.no_grad(): pred = model(x) predicted, actual = classes[pred[0].argmax(0)], classes[y] print(f'Predicted: "{predicted}", Actual: "{actual}"') ###################################################################### # Read more about `Saving & Loading your model `_. #
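To verify that two runs of this script end in bit-identical weights, one option is to hash the final state dict. This is a minimal sketch, not part of the tutorial, and state_digest is a hypothetical helper name:

import hashlib

def state_digest(model) -> str:
    # Hash every parameter tensor's raw bytes; two bit-identical
    # training runs print the same digest.
    h = hashlib.sha256()
    for name, tensor in sorted(model.state_dict().items()):
        h.update(name.encode())
        h.update(tensor.detach().cpu().numpy().tobytes())
    return h.hexdigest()

print(state_digest(model))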

3. In my own project, even with the seed-fixing code in place, the loss still differed by 1%–2% between training runs. My diagnosis was that the code uses scatter_mean() (from the torch_scatter package), which is a non-deterministic operation: its CUDA kernel accumulates with atomic floating-point adds, so the summation order, and therefore the rounding, can change between runs. But after replacing scatter_mean() with a deterministic implementation, the run-to-run difference was still around 1%. I have not yet found an effective way to eliminate it. I suspect a version mismatch: my system CUDA is 11.7 while the installed PyTorch build targets CUDA 11.3, and I am still working on it.
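For reference, one way to make a scatter-mean deterministic is to express it as a dense one-hot matmul: matmul reduction order is fixed (given the CUBLAS_WORKSPACE_CONFIG setting above), unlike atomic-add scatter kernels, at the cost of an N x num_groups assignment matrix. A minimal sketch, where scatter_mean_deterministic is a hypothetical helper name:

import torch

def scatter_mean_deterministic(src: torch.Tensor, index: torch.Tensor, num_groups: int) -> torch.Tensor:
    # src: [N, D] values; index: [N] group id for each row of src.
    one_hot = torch.zeros(num_groups, src.size(0), device=src.device, dtype=src.dtype)
    one_hot[index, torch.arange(src.size(0), device=src.device)] = 1.0
    counts = one_hot.sum(dim=1, keepdim=True).clamp(min=1.0)  # guard empty groups
    return (one_hot @ src) / counts  # [num_groups, D] per-group means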

As an aside, I can confirm that with DataLoader num_workers = 0 the data loading order is identical across runs, and shuffle = True does not break this, since the shuffle order is itself driven by the seeded RNG.
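If you do need num_workers > 0, the official reproducibility notes recommend seeding each worker and handing the DataLoader a seeded generator; a minimal sketch (training_data stands for whatever Dataset you use):

import random
import numpy as np
import torch
from torch.utils.data import DataLoader

def seed_worker(worker_id):
    # Each worker derives its NumPy/random seed from the main process's torch seed.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(42)

train_dataloader = DataLoader(
    training_data,
    batch_size=64,
    shuffle=True,            # shuffle order is fixed by the seeded generator g
    num_workers=4,
    worker_init_fn=seed_worker,
    generator=g,
)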

References:

1. Official PyTorch reproducibility notes: https://pytorch.org/docs/stable/notes/randomness.html

2. PyTorch non-deterministic operations (torch.use_deterministic_algorithms): https://pytorch.org/docs/stable/generated/torch.use_deterministic_algorithms.html#torch.use_deterministic_algorithms

3. A fairly comprehensive Zhihu article analyzing deep learning reproducibility: https://zhuanlan.zhihu.com/p/109166845


