Hadoop SequenceFile size


I'm creating a HashMap from the key/value pairs of a Hadoop Vector that is stored inside a SequenceFile. For efficiency I want to know the number of key/value pairs in advance, so that I can initialise the HashMap with the proper capacity.
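For the sizing part, once the count is known: a `HashMap` rehashes when its size exceeds capacity × load factor (0.75 by default), so a capacity derived from the expected count avoids all rehashing. A minimal sketch in plain Java (class and method names are my own, not from any Hadoop API):

```java
import java.util.HashMap;
import java.util.Map;

public class PresizedMap {
    // Smallest initial capacity that holds expectedEntries entries
    // without triggering a rehash, given the default load factor 0.75.
    static int capacityFor(int expectedEntries) {
        return (int) (expectedEntries / 0.75f) + 1;
    }

    public static void main(String[] args) {
        int count = 46599; // the count seqdumper reports for the file below
        Map<Integer, String> keyToUri = new HashMap<>(capacityFor(count));
        System.out.println("initial size: " + keyToUri.size());
    }
}
```

(Java 8+ also offers no built-in helper for this until `HashMap.newHashMap(int)` in JDK 19, so the small arithmetic above is the usual idiom.)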

I have used Mahout's seqdumper, which appends a Count at the end of each dumped Vector. I looked into its code, but it uses a simple iterative counter (counter++ for each row), so it isn't what I'm looking for.

SequenceFile.Metadata also looked promising, so I looked into it, but the debugger shows that it contains no entries.
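For what it's worth, SequenceFile metadata is only present if whoever wrote the file put it there, which would explain the empty entries. A hedged sketch of stashing the count in the metadata at write time and reading it back without scanning the records (the output path and the "count" key are my own invention, and Mahout's own writers do not do this):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.VectorWritable;

public class CountInMetadata {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("luceneVectorsWithCount"); // hypothetical path

        // Record the count in the file header when writing.
        SequenceFile.Metadata meta = new SequenceFile.Metadata();
        meta.set(new Text("count"), new Text("46599"));
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(VectorWritable.class),
                SequenceFile.Writer.metadata(meta))) {
            // ... append the key/value pairs here ...
        }

        // Later: read the count back from the header, no record scan needed.
        try (SequenceFile.Reader reader =
                 new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))) {
            Text count = reader.getMetadata().get(new Text("count"));
            System.out.println("count = " + count);
        }
    }
}
```

This only helps if you control the step that writes the file, of course.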

Is there some other way to quickly get something like a .size() method for a Hadoop Vector inside a SequenceFile?

Edit: Here is the seqdumper output of what I'm turning into a Map. Each key/value pair is an IntWritable / NamedVector pair, and I want to create a mapping from the key number to the URI String. There are 46599 key/value pairs in total, as appended by seqdumper at the end of the file.

Input Path: luceneVectors
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.math.VectorWritable
Key: 0: Value: http://data.artsholland.com/production/73adae07-78c6-4180-93a4-34802090b5f1:{22118:0.18376858424635545,20381:0.40144184831236357,53753:0.2605347739121081,51569:0.2578896608715637,21930:0.2277873354603338,63035:0.27765920678967304,36979:0.2709104089668357,68351:0.15788776111071648,19436:0.2988119565549418,17991:0.12435264873296237,10356:0.3276902508762499,3410:0.27239123806574506,62942:0.18961849195965186,32527:0.24827631823639457,69909:0.11723303910369048,19832:0.2138117449778048}
Key: 1: Value: http://data.artsholland.com/production/c9fcc92b-18bb-4bfb-af52-380707f8d0d7:{41167:0.07191351238480857,61391:0.07496730342220936,[...]
[...],19156:0.0687215948604245}
Count: 46599

There is 1 answer below.


Not sure that my answer will be useful; nevertheless, if you need to know how many keys are in a seq file, you can use a MapFile instead of a SequenceFile. Knowing the indexInterval, you can estimate the number of keys by reading the index file. If you set indexInterval relatively large, you can keep the index file small and still estimate the number of keys. As an additional bonus, you get a sampling of your keys, which can help you optimize further.
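A sketch of the estimate this answer describes, assuming a hypothetical MapFile directory `vectors.map` written with MapFile's default indexInterval of 128. The `index` file inside a MapFile directory is itself a small SequenceFile holding every indexInterval-th key together with its position, so counting its entries bounds the total key count without touching the data file:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;

public class MapFileSizeEstimate {
    // With n index entries and interval i, the data file holds between
    // (n - 1) * i + 1 and n * i keys; we return the upper bound.
    static long estimateKeyCount(long indexEntries, int indexInterval) {
        return indexEntries * indexInterval;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical MapFile directory; its index is a SequenceFile of
        // (key, file position) pairs, one per indexInterval data entries.
        Path index = new Path("vectors.map/index");
        long indexEntries = 0;
        try (SequenceFile.Reader reader =
                 new SequenceFile.Reader(conf, SequenceFile.Reader.file(index))) {
            IntWritable key = new IntWritable();
            LongWritable position = new LongWritable();
            while (reader.next(key, position)) {
                indexEntries++;
            }
        }
        System.out.println("~" + estimateKeyCount(indexEntries, 128) + " keys");
    }
}
```

The estimate gets coarser as indexInterval grows, which is the trade-off against index file size mentioned above.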

More details about the different SequenceFile variants can be found here: http://www.cloudera.com/blog/2011/01/hadoop-io-sequence-map-set-array-bloommap-files/