What is the best way to train a binary classifier on 1000 parquet files?

I'm training a binary classification model on a huge dataset stored as parquet files. The dataset is too large to fit into memory. Currently I'm doing the following, but I run into an out-of-memory error.

import glob

import dask.dataframe as dd
from dask import delayed
from fastparquet import ParquetFile

files = sorted(glob.glob('data/*.parquet'))

@delayed
def load_chunk(path):
    # Read a single parquet file into a pandas DataFrame
    return ParquetFile(path).to_pandas()

df = dd.from_delayed([load_chunk(f) for f in files])
df = df.compute()  # materialises the whole dataset in memory

X = df.drop(['label'], axis=1)
y = df['label']
# Split the data into training and testing sets

What is the best way to do this without running out of memory?

There is 1 answer below

Answered by Kurumi Tokisaki

You don't need to load all the data at once. Whether you can avoid it depends on whether your classification algorithm supports incremental training. In scikit-learn, every estimator that implements the partial_fit API is a candidate, for example SGDClassifier. If you are using TensorFlow, you can use tfio.experimental.IODataset to stream the parquet files into the DNN you are training.
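
A minimal sketch of the scikit-learn route, reusing the file glob and 'label' column from the question and assuming purely numeric features; SGDClassifier with loss='log_loss' (logistic regression) is just one possible choice of incremental estimator:

import glob

import numpy as np
from fastparquet import ParquetFile
from sklearn.linear_model import SGDClassifier

files = sorted(glob.glob('data/*.parquet'))

# SGDClassifier supports partial_fit; loss='log_loss' trains a logistic regression.
clf = SGDClassifier(loss='log_loss')
classes = np.array([0, 1])  # assumed binary labels; partial_fit needs the full set up front

for path in files:
    # Load one parquet file at a time so only a single chunk is ever in memory
    chunk = ParquetFile(path).to_pandas()
    X = chunk.drop(['label'], axis=1)
    y = chunk['label']
    clf.partial_fit(X, y, classes=classes)

Note that SGD-based estimators are sensitive to feature scaling, so you may want to fit a StandardScaler incrementally (it also has partial_fit) and possibly make several shuffled passes over the files.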