Good strategy for training an ML model directly on data from HDFS


I want to train a model on a compute node using data (in Parquet format) stored on a separate storage cluster (HDFS), and I cannot copy the whole dataset from HDFS onto my compute node. What would be a workable solution for this (I use Python)?

I did some research and it seems Petastorm is a promising solution.
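For what it's worth, if the data is already plain Parquet on HDFS, Petastorm can also read it directly without Spark via make_batch_reader. A minimal sketch of what I have in mind, assuming a PyTorch training loop; the HDFS URL and column names are my own placeholders:

```python
# Sketch: stream an existing Parquet dataset from HDFS into PyTorch
# without Spark. The HDFS URL and column names are assumptions.
from petastorm import make_batch_reader
from petastorm.pytorch import DataLoader

# make_batch_reader reads plain (non-Petastorm) Parquet datasets and
# yields record batches, fetching files from HDFS as it goes.
with make_batch_reader('hdfs://namenode:8020/datasets/train') as reader:
    with DataLoader(reader, batch_size=64) as loader:
        for batch in loader:
            # batch is a dict of tensors keyed by column name
            features, labels = batch['features'], batch['label']
            # ... run a training step here ...
```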

However, I came across another post saying, quote:

The recommended workflow is:

1. Use Apache Spark to load and optionally preprocess data.
2. Use the Petastorm spark_dataset_converter method to convert data from a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader.
3. Feed data into a DL framework for training or inference.
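For reference, that converter workflow would look roughly like this; a sketch only, where the cache directory, HDFS path, and batch size are assumptions on my part:

```python
# Sketch of the Spark-to-PyTorch converter workflow described above.
# The HDFS path, cache dir, and batch size are assumptions.
from pyspark.sql import SparkSession
from petastorm.spark import SparkDatasetConverter, make_spark_converter

spark = SparkSession.builder.appName('petastorm-demo').getOrCreate()

# Petastorm materializes the DataFrame to this cache dir before training.
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               'file:///tmp/petastorm_cache')

df = spark.read.parquet('hdfs://namenode:8020/datasets/train')
converter = make_spark_converter(df)

with converter.make_torch_dataloader(batch_size=64) as loader:
    for batch in loader:
        # batch is a dict of tensors keyed by column name
        pass  # ... training step ...
```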

I'm not sure why I need PySpark here, so I'm wondering if anyone knows the reason. And if anyone has done a similar use case, could you please also share your solution? Thanks in advance!


There is 1 answer below.


If the documentation says it can use Spark DataFrames, then yes, that implies PySpark.

(Py)Spark itself has machine learning algorithms (MLlib), however, so you may not need a separate DL framework at all.
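For instance, a sketch of training entirely inside Spark with MLlib; the HDFS path and the column names ('f1', 'f2', 'label') are placeholders for your own schema:

```python
# Sketch: train a model inside Spark itself with MLlib.
# The HDFS path and column names are assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName('mllib-train').getOrCreate()

# Spark reads the Parquet files in parallel across the cluster, so the
# dataset never has to fit on a single node.
df = spark.read.parquet('hdfs://namenode:8020/datasets/train')

# Assemble raw feature columns into the single vector column MLlib expects.
assembler = VectorAssembler(inputCols=['f1', 'f2'], outputCol='features')
train_df = assembler.transform(df)

model = LogisticRegression(featuresCol='features', labelCol='label').fit(train_df)
```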

anyone knows why?

Exactly what you said - you cannot load your training dataset directly onto one node. Spark does the distributed reading and preprocessing across the cluster and hands the training framework batches, so the full dataset never has to fit on your compute node.