I'm working on a job that processes a nested directory structure containing files on multiple levels:
one/
├── three/
│   └── four/
│       ├── baz.txt
│       ├── bleh.txt
│       └── foo.txt
└── two/
    ├── bar.txt
    └── gaa.txt
When I add one/ as an input path, no files are processed, since none are immediately available at the root level.
I read about job.addInputPathRecursively(..), but that seems to have been deprecated in more recent releases (I'm using Hadoop 1.0.2). I've written some code to walk the folders and add each directory with job.addInputPath(dir). This worked until the job crashed when it tried to process a directory as an input file, e.g. calling fs.open(split.getPath()) when split.getPath() is a directory (this happens inside LineRecordReader.java). A sketch of the file-only variant I've moved to is below.
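For what it's worth, the variant that avoids the fs.open() crash adds the files themselves rather than the directories. This is a minimal sketch, assuming the new mapreduce API; addInputsRecursively is my own helper name, not a Hadoop method:

import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Recursively register every plain file under root as an input path,
// skipping the directories themselves so LineRecordReader never
// tries to fs.open() a directory.
public static void addInputsRecursively(Job job, FileSystem fs, Path root)
        throws IOException {
    for (FileStatus status : fs.listStatus(root)) {
        if (status.isDir()) {
            addInputsRecursively(job, fs, status.getPath());
        } else {
            FileInputFormat.addInputPath(job, status.getPath());
        }
    }
}

It works, but it's exactly the kind of boilerplate I'd hoped the framework would handle for me.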
I'm trying to convince myself there has to be a simpler way to provide a job with a nested directory structure. Any ideas?
EDIT - apparently there's an open bug on this.
I don't know if this is still relevant, but at least in Hadoop 2.4.0 you can set the property mapreduce.input.fileinputformat.input.dir.recursive to true and it will solve your problem (see the sketch below).
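For example, something like this (a minimal sketch assuming the new mapreduce API; one/ is the root directory from the question, and the job name is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

Configuration conf = new Configuration();
// Make FileInputFormat descend into subdirectories instead of
// treating them as (unreadable) input files.
conf.setBoolean("mapreduce.input.fileinputformat.input.dir.recursive", true);

Job job = Job.getInstance(conf, "nested-input-job");
FileInputFormat.addInputPath(job, new Path("one/"));

I believe 2.4 also has a FileInputFormat.setInputDirRecursive(job, true) convenience method that sets the same flag.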