PyTorch is one of the two major machine learning libraries for implementing deep neural networks. PyTorch errors are not always the most informative, and it is frustrating to have to debug them through extensive searching on Stack Overflow and GitHub issues. This post is a collection of common issues and how to resolve them. I will update it periodically as I hit new issues.
CUDA capability sm_86 not compatible
When using Python virtual environments, pip installs PyTorch built against CUDA 10 by default, which does not support newer GPUs that require sm_86. You will be greeted with the following error:
RTX A5000 with CUDA capability sm_86 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70. If you want to use the RTX A5000 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
You can resolve this by manually requesting a CUDA 11 build when installing with pip:

```shell
pip install --upgrade pip setuptools wheel
pip install torch==1.9.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
```
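After installing the CUDA 11 wheel, you can verify that the build matches your GPU with a quick check (a minimal sketch; the device name will of course differ on your machine):

```python
import torch

# The CUDA version the wheel was compiled against, e.g. "11.1" for +cu111
print(torch.version.cuda)

# True once the compiled CUDA version supports your GPU's capability
print(torch.cuda.is_available())

if torch.cuda.is_available():
    # Name of the first visible GPU, e.g. "NVIDIA RTX A5000"
    print(torch.cuda.get_device_name(0))
```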
PyTorch Lightning multi-GPU DDP killed
PyTorch Lightning is a nifty library that makes PyTorch easy to use without having to worry about hardware, training loops, etc., letting you focus on your research idea. For multi-GPU training, Distributed Data Parallel (DDP) is faster than Data Parallel (DP).
In PyTorch Lightning, when using multiple GPUs you specify the accelerator as ddp. Sometimes the training doesn't start and dies with the message killed, without throwing any error message. This is common when using Kubernetes clusters (I discovered this on the Nautilus cluster used by UCSD).
To fix this, set auto_select_gpus = True in the trainer configuration. For example:

```python
trainer = Trainer(
    gpus=2,
    auto_select_gpus=True,
    plugins=DDPPlugin(find_unused_parameters=False),
)
```
Finding out which device a tensor is on
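Every tensor carries a .device attribute telling you where it lives, and .to() moves it between devices. A minimal sketch:

```python
import torch

t = torch.zeros(3)
print(t.device)  # tensors live on the CPU by default -> cpu

if torch.cuda.is_available():
    t_gpu = t.to("cuda:0")  # copy the tensor to the first GPU
    print(t_gpu.device)     # -> cuda:0
```

This is handy when debugging "expected all tensors to be on the same device" errors: print the .device of each tensor involved in the failing operation.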
Clearing GPU cache
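PyTorch's caching allocator holds on to GPU memory even after tensors are freed, which can make nvidia-smi look fuller than it is. A minimal sketch for releasing cached memory back to the driver (you must drop all references to the tensors first; empty_cache alone does not free live tensors):

```python
import gc
import torch

if torch.cuda.is_available():
    x = torch.zeros(1024, 1024, device="cuda")
    del x                     # drop the Python reference first
    gc.collect()              # make sure the tensor is actually collected
    torch.cuda.empty_cache()  # release cached blocks back to the driver
    # bytes still held by live tensors on the current device
    print(torch.cuda.memory_allocated())
```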
numpy to tensor: expected Double but got Float
You would expect that converting from a NumPy array to a tensor would be simple, but dtypes trip you up: NumPy arrays default to double precision (np.double, i.e. float64), while PyTorch models default to single precision (torch.float, i.e. float32). Converting a default NumPy array therefore yields a torch.double tensor, which mismatches float32 model weights. Cast the array to single precision first:

```python
a = np.array([1, 2, 3]).astype(np.single)
```
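To illustrate the mismatch, compare the dtypes torch.from_numpy produces with and without the cast:

```python
import numpy as np
import torch

a = np.array([1.0, 2.0, 3.0])             # NumPy defaults to float64
print(torch.from_numpy(a).dtype)          # -> torch.float64 (torch.double)

b = a.astype(np.single)                   # cast to float32 first
print(torch.from_numpy(b).dtype)          # -> torch.float32 (torch.float)

# Alternatively, convert on the torch side:
print(torch.from_numpy(a).float().dtype)  # -> torch.float32
```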