HBase bulk load spawns a high number of reducer tasks - any workaround?

1.5k Views Asked by StackUnderflow At 14 February 2011 at 16:16

HBase bulk load (using the configureIncrementalLoad helper method) configures the job to create as many reducer tasks as there are regions in the HBase table. So if there are a few hundred regions, the job spawns a few hundred reducer tasks. This can get very slow on a small cluster.

Is there any workaround possible by using MultipleOutputFormat or something else?

Thanks


There are 2 best solutions below

Nicolas Spiegelberg On 15 March 2011 at 16:10

Sharding the reduce stage by region gives you a lot of long-term benefit. You get data locality once the imported data is online. You can also determine when a region has been load balanced to another server. I wouldn't be so quick to go to a coarser granularity.
Since the reduce stage is doing a single file write, you should be able to setNumReduceTasks(# of hard drives). That might speed it up more.

It's very easy to get bottlenecked on the network. Make sure you're compressing your HFiles and your intermediate MR data.

  // Compress the intermediate map output to reduce shuffle traffic.
  job.getConfiguration().setBoolean("mapred.compress.map.output", true);
  job.getConfiguration().setClass("mapred.map.output.compression.codec",
      org.apache.hadoop.io.compress.GzipCodec.class,
      org.apache.hadoop.io.compress.CompressionCodec.class);
  // Compress the HFiles written by the reducers.
  job.getConfiguration().set("hfile.compression",
      Compression.Algorithm.LZO.getName());

Your data import might be small enough that you should look at using a Put-based format. This calls the normal HTable.put API and skips the reducer phase. See TableMapReduceUtil.initTableReducerJob(table, null, job).

Prasad D On 02 December 2013 at 15:48

When you use HFileOutputFormat, it overrides whatever number of reducers you set. The number of reducers is equal to the number of regions in the HBase table. So decrease the number of regions if you want to control the number of reducers.
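
To illustrate why the reducer count follows the region count: the bulk load job partitions rows by the table's region start keys, so each reducer writes the data for exactly one region. Below is a rough plain-Java sketch of that routing idea only. It does not use the HBase API; the class name RegionPartitionSketch, the method partitionFor, and the sample start keys are made up for illustration, and it compares Strings where HBase compares raw bytes.

```java
import java.util.Arrays;
import java.util.List;

public class RegionPartitionSketch {

    /**
     * Returns the index of the region (and so the reducer) that owns rowKey.
     * Assumes startKeys is sorted and that its first entry is the empty
     * string, mirroring the empty start key of a table's first region.
     */
    public static int partitionFor(String rowKey, List<String> startKeys) {
        int lo = 0;
        int hi = startKeys.size() - 1;
        // Binary search for the last start key that is <= rowKey.
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (startKeys.get(mid).compareTo(rowKey) <= 0) {
                lo = mid;
            } else {
                hi = mid - 1;
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        // Hypothetical table with 4 regions, hence 4 reducers.
        List<String> startKeys = Arrays.asList("", "g", "n", "t");
        for (String row : new String[] {"apple", "grape", "melon", "zebra"}) {
            System.out.println(row + " -> reducer " + partitionFor(row, startKeys));
        }
    }
}
```

With these sample keys, "apple" goes to reducer 0, "grape" and "melon" to reducer 1, and "zebra" to reducer 3. One partition per region is what lets each output file line up with a single region, which is why fewer regions is the only way to get fewer reducers with this helper.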

You will find sample code here:

Hope this will be useful :)
