Unpacking Large Msgpack Dataset in Google Colab and Saving to SQLite3 Database - Memory Issues


I am running into memory issues while unpacking a large Msgpack dataset in Google Colab for training an ML model. Despite attempting to load it in chunks, the unpacking consumes all available RAM. I have included the code I used, which anyone can run on Colab to reproduce the issue. I need help optimizing the unpacking process for large datasets. There is limited documentation on unpacking large data with Msgpack, and I am not sure whether it works better on a local machine. Any help getting it to work on Colab, along with suggestions for saving the data to an SQLite3 database, would be greatly appreciated. Thank you.

# Download the dataset from Google Drive and decompress it.
!gdown "https://drive.google.com/uc?id=1uF5ohoVHWRWprfq7zrtk0rg7-747zEUH"
!gzip -d course_42.msgpack.gz

import msgpack
import gc

# Disable automatic garbage collection during unpacking.
gc.disable()

# Stream records from the file instead of loading everything at once.
file_obj = open('course_42.msgpack', 'rb')
unpacker = msgpack.Unpacker(file_obj, raw=False)

# Inspect only the first record.
for unpacked in unpacker:
    print(unpacked)
    break
file_obj.close()
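
For reference, this is roughly the direction I was considering for streaming the records into SQLite in batches, so that only one batch is held in memory at a time. The table layout, the idea of storing each record as a re-packed BLOB, and the batch size are just assumptions on my part, since I do not know the exact structure of the records in course_42.msgpack, and I am not sure whether this actually keeps memory bounded on Colab:

import msgpack
import sqlite3

BATCH_SIZE = 10_000  # assumed batch size; tune to available RAM

conn = sqlite3.connect('course_42.db')
conn.execute('CREATE TABLE IF NOT EXISTS records (id INTEGER PRIMARY KEY, payload BLOB)')

batch = []
with open('course_42.msgpack', 'rb') as f:
    unpacker = msgpack.Unpacker(f, raw=False)
    for record in unpacker:
        # Re-pack each record and store it as a single BLOB column;
        # if the records are flat dicts, keys could be mapped to real columns instead.
        batch.append((msgpack.packb(record, use_bin_type=True),))
        if len(batch) >= BATCH_SIZE:
            conn.executemany('INSERT INTO records (payload) VALUES (?)', batch)
            conn.commit()
            batch.clear()

# Flush whatever is left after the loop.
if batch:
    conn.executemany('INSERT INTO records (payload) VALUES (?)', batch)
    conn.commit()
conn.close()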