Stopping scanner timeouts when processing a large number of cells


I have a Crunch job where a single row can contain hundreds of thousands of cells (the data is split into rows keyed by location + time; for certain locations and times there are a lot of cells). The job processes each cell, but I get a scanner timeout when the number of cells in a row is very large.

I can increase the timeouts, e.g. hbase.client.scanner.timeout.period, but they would have to be huge (hours, since processing a single cell can take 200 ms), which doesn't seem ideal.
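
For reference, this is roughly how I'd raise them, a minimal sketch assuming an MRPipeline (the class name and values are illustrative; ~500k cells at 200 ms each is over a day per row, which is why the numbers get silly):

    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class TimeoutSetup {
      public static Pipeline createPipeline() {
        Configuration conf = HBaseConfiguration.create();
        // ~500,000 cells * 200 ms is over a day for one row, so the
        // timeout has to be enormous to be safe (values illustrative)
        conf.setInt("hbase.client.scanner.timeout.period", 100_000_000);
        conf.setInt("hbase.rpc.timeout", 100_000_000);
        return new MRPipeline(TimeoutSetup.class, conf);
      }
    }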

I thought I could use scan.setAllowPartialResults() and scan.setMaxResultSize(), but those only work when scan.getFilter().hasFilterRow() returns false, which in my case it doesn't. I also saw scanner.setMaxNumRows(), but I can't see any way to get at the underlying scanner from Crunch.
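
For context, this is how the Scan gets wired into the pipeline, a sketch assuming crunch-hbase's HBaseSourceTarget (newer versions take a TableName rather than a String; table name and batch size are made up). As I understand it, scan.setBatch() would chop a wide row into several Results, but like partial results it refuses row-level filters, so it doesn't help me as-is:

    import org.apache.crunch.PTable;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.io.hbase.HBaseSourceTarget;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;

    public static PTable<ImmutableBytesWritable, Result> readCells(Pipeline pipeline) {
      Scan scan = new Scan();
      scan.setCaching(1);   // rows are very wide, so fetch one row per RPC
      // setBatch() caps the cells per Result, but it throws
      // IncompatibleFilterException when the attached filter's
      // hasFilterRow() is true -- which is exactly my situation
      scan.setBatch(10_000);
      return pipeline.read(new HBaseSourceTarget("my_table", scan));
    }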

I could write all the data out to a temp location and then process it in the reduce, but that seems wrong. I feel I must be missing something about how this should be done.
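
To make that idea concrete: the map side could just explode each row into per-cell records (cheap, so the scanner lease is never held long) and the 200 ms/cell work could run after a shuffle. A rough sketch, assuming the HBase 0.96+ Cell API (splitIntoCells and the String/ByteBuffer key-value shape are made up):

    import java.nio.ByteBuffer;
    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;
    import org.apache.crunch.PTable;
    import org.apache.crunch.Pair;
    import org.apache.crunch.types.writable.Writables;
    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.CellUtil;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.util.Bytes;

    public static PTable<String, ByteBuffer> splitIntoCells(
        PTable<ImmutableBytesWritable, Result> rows) {
      return rows.parallelDo(
          new DoFn<Pair<ImmutableBytesWritable, Result>, Pair<String, ByteBuffer>>() {
            @Override
            public void process(Pair<ImmutableBytesWritable, Result> row,
                                Emitter<Pair<String, ByteBuffer>> emitter) {
              // the map side only unpacks cells -- cheap, so next() gets
              // called again quickly and the scanner lease never expires
              for (Cell cell : row.second().rawCells()) {
                emitter.emit(Pair.of(
                    Bytes.toString(CellUtil.cloneRow(cell)),
                    ByteBuffer.wrap(CellUtil.cloneValue(cell))));
              }
            }
          },
          Writables.tableOf(Writables.strings(), Writables.bytes()));
    }

    // the expensive ~200 ms/cell work then runs after the shuffle,
    // where no scanner is open:
    //   splitIntoCells(cells).groupByKey().parallelDo(...)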

How should a Crunch job handle processing a huge number of cells without the scanner timing out? Thanks.
