I am wondering how to make sure that HDFS data access makes the best use of local replicas, to minimize network transfer.
I am hosting HDFS on 3 machines with replication set to 3. Let's call them machines A, B, and C. Machine A is the namenode, and all 3 are datanodes.
Currently, I am reading data with the following code:
# Run this code on machine A, B, C separately
import fsspec
import pandas as pd
with fsspec.open('hdfs://machine_A_ip:9000/path/to/data.parquet', 'rb') as fp:
    df = pd.read_parquet(fp)
I observe heavy network traffic (100+ MB/s upload and download) no matter which machine I run it on (namenode or not).
I also tried hosting Dask and Ray clusters on the same set of machines, but it seems Dask does not support this: Does Dask communicate with HDFS to optimize for data locality? - Stack Overflow
I haven't found any clues in the documentation:
- API Reference — fsspec: Wraps pyarrow
- pyarrow.fs.HadoopFileSystem — Apache Arrow
- Filesystem Interface — Apache Arrow
- Issues · apache/arrow: Seems there is no discussion about locality
You are right, Dask does not know where each file is located in your cluster. You would need to gather this information in some other manner; pyarrow, the current interface to HDFS, does not surface it.
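
One way to gather it yourself is to ask the namenode over WebHDFS which datanodes hold the blocks, and then pin the read task to matching Dask workers. Below is a minimal sketch, not a tested solution: it assumes WebHDFS is enabled on the namenode's HTTP port (9870 by default), that your Hadoop version supports op=GETFILEBLOCKLOCATIONS, that a Dask scheduler is reachable at machine_A_ip:8786, and that the hostnames HDFS reports match your Dask worker hostnames. The exact JSON layout of the response can differ between Hadoop releases.

import requests
from dask.distributed import Client

def block_hosts(namenode_http, hdfs_path):
    # Ask the namenode (via WebHDFS) which datanodes hold replicas of the file.
    # The response shape ({"BlockLocations": {"BlockLocation": [...]}}) may vary
    # between Hadoop versions, so treat this as a starting point.
    url = f"http://{namenode_http}/webhdfs/v1{hdfs_path}"
    resp = requests.get(url, params={"op": "GETFILEBLOCKLOCATIONS"})
    resp.raise_for_status()
    hosts = set()
    for block in resp.json()["BlockLocations"]["BlockLocation"]:
        hosts.update(block["hosts"])
    return hosts

def read_parquet(hdfs_path):
    import pandas as pd
    # The HDFS client still handles the actual read, but the datanode serving
    # the blocks is now on the same machine as the worker running this task.
    return pd.read_parquet(f"hdfs://machine_A_ip:9000{hdfs_path}")

hosts = block_hosts("machine_A_ip:9870", "/path/to/data.parquet")
client = Client("machine_A_ip:8786")  # assumed scheduler address
future = client.submit(read_parquet, "/path/to/data.parquet",
                       workers=list(hosts), allow_other_workers=True)
df = future.result()

Here workers= restricts scheduling to the named workers, and allow_other_workers=True keeps the task runnable if none of those workers happen to be free.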