For GPU profiling purposes, I ran inference with pre-trained models provided by PyTorch. The profiling results showed that a large block of memory was pre-allocated and zero-filled up front, and only part of it was actually used later on. As a result, a significant amount of memory remained unused until the process terminated. I suspect PyTorch's memory management is responsible for this behavior. After all, with a pre-trained model, PyTorch should have a precise estimate of how much memory is required. Why does it choose to allocate memory this way?
To restate: I performed inference using a pre-trained model provided by the PyTorch framework, expecting that only the necessary memory would be allocated. Instead, I observed a large allocation at the very beginning, including memory that was never touched before the process ended.
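For reference, here is a minimal sketch of the kind of measurement I'm describing. The specific model (resnet50 from torchvision) and input shape are just placeholders for whatever pre-trained model is being profiled; the point is the gap between the memory occupied by live tensors and the memory PyTorch has reserved from the driver:

```python
import torch
import torchvision.models as models

# Placeholder model: any pre-trained model should show a similar pattern.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).cuda().eval()

# Placeholder input shape for a single image.
x = torch.randn(1, 3, 224, 224, device="cuda")

with torch.no_grad():
    model(x)

torch.cuda.synchronize()

# Memory currently occupied by live tensors.
print(f"allocated:     {torch.cuda.memory_allocated() / 1024**2:.1f} MiB")
# Memory PyTorch has reserved from the CUDA driver, including the
# caching allocator's cache that is held but not currently in use.
print(f"reserved:      {torch.cuda.memory_reserved() / 1024**2:.1f} MiB")
# Peak values observed during the run.
print(f"max allocated: {torch.cuda.max_memory_allocated() / 1024**2:.1f} MiB")
print(f"max reserved:  {torch.cuda.max_memory_reserved() / 1024**2:.1f} MiB")
```

In my runs, the reserved figure stays well above the allocated figure for the lifetime of the process, which is what prompted the question above.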