Is there any advantage to switching dtypes mid-pipeline? For example: str -> categorical (memory saving) -> str (for an operation) -> categorical, in order to make a non-streaming operation fit into memory?

Or does converting to categorical once, at the end of my operations, achieve the same thing where possible? I'm dealing with larger-than-memory datasets, and some of the operations aren't supported by streaming yet (e.g. .concat_list), so I want to keep things as small as possible while writing to file (.collect(streaming=True).write_parquet()), because sometimes my lists or strings are bigger than available memory.

Along with that, will sorting my DataFrame mid-workflow (before a non-streaming operation) reduce memory usage?
