As I outlined here, I am stuck using old versions of PyTorch and torchvision because of my hardware, e.g. IBM ppc64le architectures.
For this reason, I am having issues when sending and receiving checkpoints between different computers, clusters and my personal Mac. Is there any way to save or load models that avoids this issue? For example, perhaps saving models in both an old and a new format when using 1.6.x. Of course, the 1.3.1 to 1.6.x case is impossible, but at least I was hoping something would work.
Any advice? Of course my ideal solution is that I don't have to worry about it and I can always load and save my checkpoints and everything I usually pickle uniformly across all my hardware.
The first error I got was a zip jit error:
RuntimeError: /home/miranda9/data/f.pt is a zip archive (did you mean to use torch.jit.load()?)
so I used that (and other pickle libraries):
# %%
import torch
from pathlib import Path

def load(path):
    import torch
    import pickle
    import dill
    path = str(path)
    try:
        db = torch.load(path)
        f = db['f']
    except Exception as e:
        db = torch.jit.load(path)
        f = db['f']
        #with open():
        # db = pickle.load(open(path, "r+"))
        # db = dill.load(open(path, "r+"))
        #raise ValueError(f'FAILED: {e}')
    return db, f

p = "~/data/f.pt"
path = Path(p).expanduser()
db, f = load(path)
Din, nb_examples = 1, 5
x = torch.distributions.Normal(loc=0.0, scale=1.0).sample(sample_shape=(nb_examples, Din))
y = f(x)
print(y)
print('Success!\a')
but I get complaints about the different PyTorch versions I am forced to use:
Traceback (most recent call last):
  File "hal_pg.py", line 27, in <module>
    db, f = load(path)
  File "hal_pg.py", line 16, in load
    db = torch.jit.load(path)
  File "/home/miranda9/.conda/envs/wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/jit/__init__.py", line 239, in load
    cpp_module = torch._C.import_ir_module(cu, f, map_location, _extra_files)
RuntimeError: version_number <= kMaxSupportedFileFormatVersion INTERNAL ASSERT FAILED at /opt/anaconda/conda-bld/pytorch-base_1581395437985/work/caffe2/serialize/inline_container.cc:131, please report a bug to PyTorch. Attempted to read a PyTorch file with version 3, but the maximum supported version for reading is 1. Your PyTorch installation may be too old. (init at /opt/anaconda/conda-bld/pytorch-base_1581395437985/work/caffe2/serialize/inline_container.cc:131)
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xbc (0x7fff7b527b9c in /home/miranda9/.conda/envs/wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: caffe2::serialize::PyTorchStreamReader::init() + 0x1d98 (0x7fff1d293c78 in /home/miranda9/.conda/envs/wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #2: caffe2::serialize::PyTorchStreamReader::PyTorchStreamReader(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x88 (0x7fff1d2950d8 in /home/miranda9/.conda/envs/wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #3: torch::jit::import_ir_module(std::shared_ptr<torch::jit::script::CompilationUnit>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&) + 0x64 (0x7fff1e624664 in /home/miranda9/.conda/envs/wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #4: <unknown function> + 0x70e210 (0x7fff7c0ae210 in /home/miranda9/.conda/envs/wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x28efc4 (0x7fff7bc2efc4 in /home/miranda9/.conda/envs/wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #26: <unknown function> + 0x25280 (0x7fff84b35280 in /lib64/libc.so.6)
frame #27: __libc_start_main + 0xc4 (0x7fff84b35474 in /lib64/libc.so.6)
Any ideas on how to make everything consistent across the clusters? I can't even open the pickle files.
Maybe this is just impossible with the PyTorch version I am forced to use :( I get the same error in another environment:
RuntimeError: version_number <= kMaxSupportedFileFormatVersion INTERNAL ASSERT FAILED at /opt/anaconda/conda-bld/pytorch-base_1581395437985/work/caffe2/serialize/inline_container.cc:131, please report a bug to PyTorch. Attempted to read a PyTorch file with version 3, but the maximum supported version for reading is 1. Your PyTorch installation may be too old. (init at /opt/anaconda/conda-bld/pytorch-base_1581395437985/work/caffe2/serialize/inline_container.cc:131)
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xbc (0x7fff83ba7b9c in /home/miranda9/.conda/envs/automl-meta-learning_wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: caffe2::serialize::PyTorchStreamReader::init() + 0x1d98 (0x7fff25993c78 in /home/miranda9/.conda/envs/automl-meta-learning_wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #2: caffe2::serialize::PyTorchStreamReader::PyTorchStreamReader(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x88 (0x7fff259950d8 in /home/miranda9/.conda/envs/automl-meta-learning_wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #3: torch::jit::import_ir_module(std::shared_ptr<torch::jit::script::CompilationUnit>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&) + 0x64 (0x7fff26d24664 in /home/miranda9/.conda/envs/automl-meta-learning_wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #4: <unknown function> + 0x70e210 (0x7fff8472e210 in /home/miranda9/.conda/envs/automl-meta-learning_wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x28efc4 (0x7fff842aefc4 in /home/miranda9/.conda/envs/automl-meta-learning_wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #23: <unknown function> + 0x25280 (0x7fff8d335280 in /lib64/libc.so.6)
frame #24: __libc_start_main + 0xc4 (0x7fff8d335474 in /lib64/libc.so.6)
when using this code:
from pathlib import Path
import torch
path = '/home/miranda9/data/dataset/'
path = Path(path).expanduser() / 'fi_db.pt'
path = str(path)
# db = torch.load(path)
# torch.jit.load(path)
db = torch.jit.load(str(path))
print(db)
related links:
- How to load checkpoints across different versions of pytorch (1.3.1 and 1.6.x) using ppc64le and x86?
- https://discuss.pytorch.org/t/how-to-load-checkpoints-across-different-versions-of-pytorch-1-3-1-and-1-6-x-using-ppc64le-and-x86/97829
- related gitissue: https://github.com/pytorch/pytorch/issues/43766
- reddit: https://www.reddit.com/r/pytorch/comments/jvza7v/how_to_load_checkpoints_across_different_versions/
This is not an ideal solution, but it works for transferring checkpoints from newer versions to older versions.
I also use ppc64le and face the same problems. It is possible to save the model in text format which can be read by any PyTorch version. I have PyTorch v1.3.0 installed on the ppc64le machine, and v1.7.0 on my notebook (which doesn't need to have a graphics card).
Step 1. Save model via the newer PyTorch version
Prior to saving, I load the model like so
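The code for this step did not survive in the post, so what follows is only a minimal sketch of one way to do it. It assumes the text format is JSON and that every value in the state_dict is a tensor. The function name `save_state_dict_as_json` and the file name `model.json` are hypothetical.

```python
import json


def save_state_dict_as_json(state_dict, path):
    """Write a state_dict to a JSON text file as nested lists of numbers.

    Assumes every value is a torch.Tensor (see the Limitations section).
    """
    # detach() drops autograd history, cpu() moves GPU tensors to host
    # memory, and tolist() converts the tensor to plain Python numbers.
    plain = {name: tensor.detach().cpu().tolist()
             for name, tensor in state_dict.items()}
    with open(path, "w") as fh:
        json.dump(plain, fh)
    return plain
```

On the machine with the newer PyTorch, this could be called as `save_state_dict_as_json(model.state_dict(), "model.json")` once the checkpoint has been loaded into `model`. Expect the JSON file to be considerably larger than the binary checkpoint.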
Step 2. Transfer the text file
Step 3. Load the text file in old PyTorch
The model must be initialized before loading. The empty model is passed into the function.
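The loading code is also missing from the post. As a rough sketch, assuming the text file is JSON that maps each parameter name to nested lists of numbers, the old PyTorch side might look like this. The name `load_json_into_model` is hypothetical; `Tensor.new_tensor`, `state_dict` and `load_state_dict` are standard PyTorch calls.

```python
import json


def load_json_into_model(model, path):
    """Fill an already-initialized model with weights read from a JSON file."""
    with open(path) as fh:
        saved = json.load(fh)
    # The freshly initialized model supplies the expected dtype and device.
    reference = model.state_dict()
    # Tensor.new_tensor builds a tensor from Python data with the same dtype
    # and device as the tensor it is called on. A name that the model does
    # not know raises KeyError here.
    restored = {name: reference[name].new_tensor(values)
                for name, values in saved.items()}
    # load_state_dict raises if entries are missing or shapes do not match.
    model.load_state_dict(restored)
    return model
```

The model class must be defined identically on both machines, since only the numbers travel in the text file, not the architecture.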
Limitations
If your model's state_dict contains anything other than (str: torch.Tensor) entries, this method will not work. You can inspect your state_dict contents with
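The inspection snippet is missing from the post. One possible way to check, sketched here with the hypothetical helpers `describe_state_dict` and `print_state_dict_summary`, is to list the type and shape of every entry:

```python
def describe_state_dict(state_dict):
    """Return (name, type name, shape) for every entry of a state_dict."""
    rows = []
    for name, value in state_dict.items():
        # Tensors expose .shape; anything else is reported with shape None.
        shape = getattr(value, "shape", None)
        rows.append((name, type(value).__name__,
                     tuple(shape) if shape is not None else None))
    return rows


def print_state_dict_summary(state_dict):
    """Print one line per state_dict entry."""
    for name, type_name, shape in describe_state_dict(state_dict):
        print(f"{name}: {type_name} {shape}")
```

Calling `print_state_dict_summary(model.state_dict())` should show `Tensor` for every entry; any other type would need separate handling before the text export can work.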
Read these for understanding:
https://pytorch.org/tutorials/recipes/recipes/saving_and_loading_models_for_inference.html
https://discuss.pytorch.org/t/how-to-load-part-of-pre-trained-model/1113