I'm working on a job that processes a nested directory structure containing files on multiple levels:
one/
├── three/
│   └── four/
│       ├── baz.txt
│       ├── bleh.txt
│       └── foo.txt
└── two/
    ├── bar.txt
    └── gaa.txt
When I add one/ as an input path, no files are processed, since none are immediately available at the root level.

I read about job.addInputPathRecursively(..), but this seems to have been deprecated in more recent releases (I'm using Hadoop 1.0.2). I've written some code to walk the folders and add each directory with job.addInputPath(dir), which worked until the job crashed when, for some reason, it tried to process a directory as an input file, e.g. calling fs.open(split.getPath()) when split.getPath() is a directory (this happens inside LineRecordReader.java).
I'm trying to convince myself there has to be a simpler way to provide a job with a nested directory structure. Any ideas?
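Roughly what my walk looks like, sketched from memory (the class and method names here are placeholders of mine, not anything from the Hadoop API):

    import java.io.IOException;

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class NestedInputWalker {  // placeholder name
        // Add every directory under root as an input path. This appears to work,
        // but since the listing of an added directory also includes its
        // subdirectories, a split can end up pointing at a directory, and
        // LineRecordReader then fails on fs.open(split.getPath()).
        public static void addNestedInputPaths(Job job, FileSystem fs, Path dir)
                throws IOException {
            FileInputFormat.addInputPath(job, dir);
            for (FileStatus status : fs.listStatus(dir)) {
                if (status.isDir()) {
                    addNestedInputPaths(job, fs, status.getPath());
                }
            }
        }
    }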
EDIT - apparently there's an open bug on this.
I don't know if this is still relevant, but at least in Hadoop 2.4.0 you can set the property mapreduce.input.fileinputformat.input.dir.recursive to true and it will solve your problem.
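For example, a minimal driver sketch (the class name, job name, and input path are just placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class NestedDirsDriver {  // placeholder name
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Let FileInputFormat descend into subdirectories of each input path.
            conf.setBoolean("mapreduce.input.fileinputformat.input.dir.recursive", true);

            Job job = Job.getInstance(conf, "nested-dirs");
            // Adding just the root is now enough; files under two/ and
            // three/four/ are picked up by the recursive listing.
            FileInputFormat.addInputPath(job, new Path("one/"));
            // ... set mapper, reducer, output path, etc., then submit the job.
        }
    }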