Good strategy for training an ML model directly on data from HDFS


I want to train a model on a compute node using data (in Parquet format) stored on a separate storage cluster (HDFS), and I cannot copy the whole dataset from HDFS onto my compute node. What would be a workable solution for this (I use Python)?

I did some research and it seems Petastorm is a promising solution.
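For what it's worth, if the data is already plain Parquet on HDFS, Petastorm can also read it directly without Spark via make_batch_reader. A minimal sketch of what I have in mind, assuming a PyTorch training loop; the HDFS URL and column names are my own placeholders:

```python
# Sketch: stream an existing Parquet dataset from HDFS into PyTorch
# without Spark. The HDFS URL and column names are assumptions.
from petastorm import make_batch_reader
from petastorm.pytorch import DataLoader

# make_batch_reader reads plain (non-Petastorm) Parquet datasets and
# yields record batches, fetching files from HDFS as it goes.
with make_batch_reader('hdfs://namenode:8020/datasets/train') as reader:
    with DataLoader(reader, batch_size=64) as loader:
        for batch in loader:
            # batch is a dict of tensors keyed by column name
            features, labels = batch['features'], batch['label']
            # ... run a training step here ...
```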

However, I came across another post saying, quote:

The recommended workflow is:

1. Use Apache Spark to load and optionally preprocess data.
2. Use the Petastorm spark_dataset_converter method to convert data from a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader.
3. Feed data into a DL framework for training or inference.
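For reference, that converter workflow would look roughly like this; a sketch only, where the cache directory, HDFS path, and batch size are assumptions on my part:

```python
# Sketch of the Spark-to-PyTorch converter workflow described above.
# The HDFS path, cache dir, and batch size are assumptions.
from pyspark.sql import SparkSession
from petastorm.spark import SparkDatasetConverter, make_spark_converter

spark = SparkSession.builder.appName('petastorm-demo').getOrCreate()

# Petastorm materializes the DataFrame to this cache dir before training.
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               'file:///tmp/petastorm_cache')

df = spark.read.parquet('hdfs://namenode:8020/datasets/train')
converter = make_spark_converter(df)

with converter.make_torch_dataloader(batch_size=64) as loader:
    for batch in loader:
        # batch is a dict of tensors keyed by column name
        pass  # ... training step ...
```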

I'm not sure why I need PySpark here, so I'm wondering if anyone knows the reason. And if anyone has done a similar use case, could you please also share your solution? Thanks in advance!


There is 1 answer below.


If the documentation says it can use Spark DataFrames, then yes, that implies PySpark.

(Py)Spark itself has machine learning algorithms (MLlib), however, so you may not need a separate DL framework at all.
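For instance, a sketch of training entirely inside Spark with MLlib; the HDFS path and the column names ('f1', 'f2', 'label') are placeholders for your own schema:

```python
# Sketch: train a model inside Spark itself with MLlib.
# The HDFS path and column names are assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName('mllib-train').getOrCreate()

# Spark reads the Parquet files in parallel across the cluster, so the
# dataset never has to fit on a single node.
df = spark.read.parquet('hdfs://namenode:8020/datasets/train')

# Assemble raw feature columns into the single vector column MLlib expects.
assembler = VectorAssembler(inputCols=['f1', 'f2'], outputCol='features')
train_df = assembler.transform(df)

model = LogisticRegression(featuresCol='features', labelCol='label').fit(train_df)
```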

anyone knows why?

Exactly what you said - you cannot load your training dataset directly onto one node. Spark does the distributed reading and preprocessing across the cluster and hands the training framework batches, so the full dataset never has to fit on your compute node.