I'm having trouble understanding Oozie. I've got it running but the documentation and examples I have found are not clear. Can anyone help me with an example?
I have 4 or 5 Hadoop streaming jobs. For each one I first want to delete any existing output directory and logs, e.g.
hadoop fs -rm -r /user/vm/video-output /tmp/logs/vm/logs/
then run the job itself, e.g.
hadoop jar ~/run/hadoop-*streaming*.jar -files videoapp \
    -cacheArchive hdfs://localhost:54310/user/vm/input/video/video.tar.gz#video \
    -cacheFile hdfs://localhost:54310/user/vm/vqatsAx#vqatsAx \
    -cacheFile hdfs://localhost:54310/user/vm/ffmpeg#ffmpeg \
    -input /user/vm/input/video -output /user/vm/video-output \
    -mapper videoapp/video.py -cmdenv VIDEO_DIR=video
Then, when that job is finished (how do I check this: is it when a part-r-0000 file is created?), I want to run the next one. These jobs will be reading from and writing to HBase. I'd just like a basic outline and a few pointers on this sort of thing. TIA!!
For deleting HDFS directories or logs you can use the Oozie FS (HDFS) action. The Oozie documentation and an example for this are here: oozie HDFS action. An example is also given below. You can configure as many actions as you need in the workflow.xml.
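Here is a minimal sketch of a workflow.xml that first deletes the old output directory and logs with an FS action and then runs the streaming job as a map-reduce action. The paths are taken from your question; the workflow and action names are made up for illustration, and ${jobTracker} and ${nameNode} are assumed to be defined in the accompanying job.properties file. The hadoop-streaming jar also needs to be available in the workflow's lib/ directory on HDFS.

<workflow-app xmlns="uri:oozie:workflow:0.2" name="video-wf">
    <start to="clean-output"/>

    <!-- FS action: remove any existing output directory and logs -->
    <action name="clean-output">
        <fs>
            <delete path="${nameNode}/user/vm/video-output"/>
            <delete path="${nameNode}/tmp/logs/vm/logs"/>
        </fs>
        <ok to="video-streaming"/>
        <error to="fail"/>
    </action>

    <!-- Streaming job, equivalent to the hadoop jar command above -->
    <action name="video-streaming">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <streaming>
                <mapper>videoapp/video.py</mapper>
                <env>VIDEO_DIR=video</env>
            </streaming>
            <configuration>
                <property>
                    <name>mapred.input.dir</name>
                    <value>/user/vm/input/video</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>/user/vm/video-output</value>
                </property>
            </configuration>
            <!-- distributed-cache files/archives, with #symlink names -->
            <file>hdfs://localhost:54310/user/vm/vqatsAx#vqatsAx</file>
            <file>hdfs://localhost:54310/user/vm/ffmpeg#ffmpeg</file>
            <archive>hdfs://localhost:54310/user/vm/input/video/video.tar.gz#video</archive>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Job failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>

This also answers your "how do I check this" question: Oozie only follows an action's <ok to="..."/> transition once that action has completed successfully, so you never need to poll for part-r-0000 files yourself. To chain your 4 or 5 jobs, point each action's <ok> at the next action, with the last one going to the end node. Alternatively, instead of a separate FS action you can put a <prepare><delete path="..."/></prepare> block inside each map-reduce action, which deletes the output directory right before the job runs (and on retries).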