I'm training reinforcement learning models with TensorFlow (Python), but for a few weeks now I haven't been able to run my code on my MacBook Air (Monterey 12.5) with an M2 chip.
I get this error:
/AppleInternal/Library/BuildRoots/20d6c351-ee94-11ec-bcaf-7247572f23b4/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:88: failed assertion `[MPSNDArrayDescriptor sliceDimension:withSubrange:] error: the range subRange.start + subRange.length does not fit in dimension[0] (7)'
When I run my code on Google Colab, I notice that a lot of RAM is being used and that usage increases linearly over time. I don't know how memory works on a Mac with the M2 chip, but given the error it looks like some kind of memory issue?
I've tried profiling my code with the Fil profiler and the memory-profiler library, but they can't output anything at the end of the run since the process crashes first. The only output I get is when I Ctrl+C before the crash, and that doesn't tell me much because I never catch the moment the memory leak happens.
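As a fallback, I'm considering instrumenting the loop myself with the standard tracemalloc module and flushing allocation diffs to disk every few iterations, so the data survives the crash. A rough sketch of what I have in mind (train_one_iteration and num_iterations are placeholders for my actual loop, and tracemalloc only sees Python-level allocations, not native/Metal ones):

import tracemalloc

tracemalloc.start()
baseline = tracemalloc.take_snapshot()

for i in range(num_iterations):
    train_one_iteration()  # placeholder: rollout of episodes + forward pass + gradient update
    if i % 10 == 0:
        snapshot = tracemalloc.take_snapshot()
        # Compare against the previous snapshot to see which source lines grew
        top_stats = snapshot.compare_to(baseline, 'lineno')
        # Append to disk immediately so the log survives a hard crash
        with open('tracemalloc_log.txt', 'a') as f:
            f.write(f'--- iteration {i} ---\n')
            for stat in top_stats[:10]:
                f.write(f'{stat}\n')
        baseline = snapshot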
I've also tried debugging the code and manually checking RAM usage at the beginning and end of each training iteration, but I don't see any trend there either. By training iteration, I mean a rollout of episodes + forward pass + gradient update. Usage seems to stay around 75% of RAM (up to 6.1 GB).
The code I'm running at each debug step is this (using the psutil library):
import psutil

# Percentage of virtual memory used (the `percent` field)
print('RAM memory % used:', psutil.virtual_memory().percent)
# Virtual memory used, converted from bytes to GB (the `used` field)
print('RAM Used (GB):', psutil.virtual_memory().used / 1e9)
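To make any trend easier to spot even if the process dies, I'm also thinking of appending these numbers to a file once per iteration instead of printing them, something like this (the log path and iteration counter are just placeholders):

import psutil

def log_ram(iteration, path='ram_log.csv'):
    vm = psutil.virtual_memory()
    # One line per iteration: iteration index, % RAM used, GB used
    with open(path, 'a') as f:
        f.write(f'{iteration},{vm.percent},{vm.used / 1e9:.3f}\n')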
Has anyone encountered this error? Or do you have any hints about what it means and what I should look for?
Thanks a lot!