Following the "Big Data" Lambda Architecture book, I have an incoming directory full of typed Thrift Data objects, with a pail.meta file that defines a DataPailStructure.
I take a snapshot of this data:
Pail snapshotPail = newDataPail.snapshot(PailFactory.snapshot);
The incoming files and metadata files are duplicated, and the snapshot's pail.meta file also contains
structure: DataPailStructure
Now I want to shred this data, splitting it into vertical partitions. As in the book, I create two PailTap objects: one for the snapshot, and one, with the SplitDataPailStructure, for the new Shredded folder.
PailTap source = dataTap(PailFactory.snapshot);
PailTap sink = splitDataTap(PailFactory.shredded);
The /Shredded folder has a pail.meta file with structure: SplitDataPailStructure
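For context, the vertical partitioning that a SplitDataPailStructure produces can be sketched in plain Java. This is only an illustration of the layout, not Pail's actual code: the `typeId`/`fieldId` names and the routing rule are my assumptions, mirroring the "1/1" directory structure a shredded pail ends up with.

```java
import java.util.Arrays;
import java.util.List;

public class ShredTargetSketch {
    // Hypothetical routing rule: a shredded record is filed under a
    // directory per Thrift union type id, then per field id, which is
    // what yields paths like 1/1/part-000000 inside the pail root.
    static List<String> targetDirs(int typeId, int fieldId) {
        return Arrays.asList(String.valueOf(typeId), String.valueOf(fieldId));
    }

    public static void main(String[] args) {
        // A record whose union ids are both 1 lands under 1/1/
        System.out.println(targetDirs(1, 1)); // prints [1, 1]
    }
}
```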
Following the book's instructions, I execute the JCascalog query that forces a reducer:
Api.execute(sink, new Subquery("?data").predicate(reduced, "_", "?data"));
Now, in local mode, this works fine. A "temporary" subfolder is created under /Shredded, and it is vertically partitioned with the expected "1/1" structure. In local mode its contents are then moved up into the /Shredded folder, and I can consolidate and merge to master without problems.
But running inside Hadoop, it fails at this point, with an error:
cascading.tuple.TupleException: unable to sink into output identifier: /tmp/swa/shredded
...
Caused by: java.lang.IllegalArgumentException: 1/1/part-000000 is not valid with the pail structure {structure=com.hibu.pail.SplitDataPailStructure, args={}, format=SequenceFile} --> [1, _temporary, attempt_1393854491571_12900_r_000000_1, 1, 1] at com.backtype.hadoop.pail.Pail.checkValidStructure(Pail.java:563)
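The shape of this failure can be reproduced with a self-contained sketch. This is my guess at what the check amounts to, not Pail's actual implementation: checkValidStructure asks the pail structure whether each file's relative directory path is a valid target, and in Hadoop mode the output committer's `_temporary/attempt_*` work directories inject extra, non-numeric path components that a SplitDataPailStructure-style check (numeric type/field ids only) rejects. Local mode writes the final "1/1" paths directly, so the same check passes there.

```java
import java.util.Arrays;
import java.util.List;

public class PailCheckSketch {
    // Hypothetical stand-in for the structure's target validation:
    // a valid target is at most two levels deep and every component
    // is numeric (type id, then field id), matching 1/1/part-000000.
    static boolean isValidTarget(List<String> dirs) {
        if (dirs.size() > 2) return false;
        for (String d : dirs) {
            if (!d.matches("\\d+")) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Local mode: the file lands at 1/1/part-000000 -> accepted.
        System.out.println(isValidTarget(Arrays.asList("1", "1")));

        // Hadoop mode: the task-attempt work path adds _temporary and
        // attempt_* components, so the same file is rejected.
        System.out.println(isValidTarget(Arrays.asList(
            "1", "_temporary",
            "attempt_1393854491571_12900_r_000000_1", "1", "1")));
    }
}
```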
Needless to say, if I change the Shredded sink's structure type to DataPailStructure, it works fine, but that's a fairly pointless operation, since everything ends up exactly as it was in the Incoming folder. It's okay for now, as I'm only working with one data type, but that will change soon and I'll need the partitioning.
Any ideas? I didn't want to post all my source code here initially, but I'm almost certainly missing something.