Read HDFS file splits


With HDFS's Java API, it is straightforward to read a file sequentially, one block at a time; a minimal example of a sequential read is sketched below.
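
For illustration, such a read loop might look like the following sketch (the class name and buffer size are arbitrary choices; the file path is taken from the command line):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class SequentialRead {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());
      Path file = new Path(args[0]); // HDFS path passed on the command line

      // Open the file and stream it end to end; HDFS transparently
      // fetches each successive block from a datanode as the read advances.
      try (FSDataInputStream in = fs.open(file)) {
        byte[] buffer = new byte[64 * 1024];
        int n;
        while ((n = in.read(buffer)) > 0) {
          // process buffer[0..n) here
        }
      }
    }
  }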

I want to be able to read the file one block at a time, using something like HDFS's FileSplits. The end goal is to read a file in parallel across multiple machines, with each machine reading a range of blocks. Given an HDFS Path, how can I get the FileSplits or blocks?

Map-Reduce and other processing frameworks are not involved; this is strictly a file-system-level operation.

2 Answers

BEST ANSWER

This is how you would get the block locations of a file in HDFS:

  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.Path;

  Path dataset = new Path(fs.getHomeDirectory(), <path-to-file>);
  FileStatus datasetFile = fs.getFileStatus(dataset);

  // Ask the namenode for the location of every block in the file.
  BlockLocation[] myBlocks = fs.getFileBlockLocations(datasetFile, 0, datasetFile.getLen());
  for (BlockLocation b : myBlocks) {
    System.out.println("Offset " + b.getOffset());
    System.out.println("Length " + b.getLength());
    for (String host : b.getHosts()) {
      System.out.println("host " + host);
    }
  }
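
To then read one block's worth of data in parallel, each machine can open the file and seek to its assigned block's offset. A minimal sketch under that assumption follows; the helper name readBlockRange and the buffer size are illustrative, not part of the HDFS API:

  import java.io.IOException;

  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  // Illustrative helper: read only the byte range covered by one block.
  // Each machine would be handed a different (offset, length) pair taken
  // from the BlockLocation array above.
  static void readBlockRange(FileSystem fs, Path file, long offset, long length)
      throws IOException {
    try (FSDataInputStream in = fs.open(file)) {
      in.seek(offset); // jump to the start of this machine's block
      byte[] buffer = new byte[64 * 1024];
      long remaining = length;
      while (remaining > 0) {
        int n = in.read(buffer, 0, (int) Math.min(buffer.length, remaining));
        if (n < 0) break; // end of file reached early
        // process buffer[0..n) here
        remaining -= n;
      }
    }
  }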
ANOTHER ANSWER

This is internal HDFS code that is used to calculate file checksums; it does exactly what you need:

http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-hdfs/2.4.0/org/apache/hadoop/hdfs/DFSClient.java#1863