What is the best way to train a binary classifier on 1000 parquet files?

I'm training a binary classification model on a huge dataset stored as parquet files. The dataset is too large to fit into memory. Currently I'm doing the following, but I run into an out-of-memory error.

import glob

import dask.dataframe as dd
from dask import delayed
from fastparquet import ParquetFile

files = sorted(glob.glob('data/*.parquet'))

@delayed
def load_chunk(path):
    # Read a single parquet file into a pandas DataFrame
    return ParquetFile(path).to_pandas()

df = dd.from_delayed([load_chunk(f) for f in files])
df = df.compute()  # materialises the whole dataset in memory

X = df.drop(['label'], axis=1)
y = df['label']
# Split the data into training and testing sets

What is the best way to do this without running out of memory?

There is 1 answer below

Answered by Kurumi Tokisaki

You don't need to load all the data at once. Whether you can avoid it depends on whether your classification algorithm supports incremental training. In scikit-learn, every estimator that implements the partial_fit API is a candidate, for example SGDClassifier. If you are using TensorFlow, you can use tfio.experimental.IODataset to stream the parquet files into the DNN you are training.
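
A minimal sketch of the scikit-learn route, reusing the file glob and 'label' column from the question and assuming purely numeric features; SGDClassifier with loss='log_loss' (logistic regression) is just one possible choice of incremental estimator:

import glob

import numpy as np
from fastparquet import ParquetFile
from sklearn.linear_model import SGDClassifier

files = sorted(glob.glob('data/*.parquet'))

# SGDClassifier supports partial_fit; loss='log_loss' trains a logistic regression.
clf = SGDClassifier(loss='log_loss')
classes = np.array([0, 1])  # assumed binary labels; partial_fit needs the full set up front

for path in files:
    # Load one parquet file at a time so only a single chunk is ever in memory
    chunk = ParquetFile(path).to_pandas()
    X = chunk.drop(['label'], axis=1)
    y = chunk['label']
    clf.partial_fit(X, y, classes=classes)

Note that SGD-based estimators are sensitive to feature scaling, so you may want to fit a StandardScaler incrementally (it also has partial_fit) and possibly make several shuffled passes over the files.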