How to Improve Cross Join Performance in Hive TEZ?

874 Views Asked by At

I have a hive table with 5 billion records. I want each of these 5 billion records to be joined with a hardcoded 52 records.

For achieving this I am doing a cross join like

select * 
from table1 join table 2
ON 1 = 1;

This is taking 5 hours to run with the highest possible memory parameters.

Is there any other short or easier way to achieve this in less time ?

2

There are 2 best solutions below

1
On

Your query is slow because a cross-join(Cartesian product) is processed by ONE single reducer. The cure is to enforce higher parallelism. One way is to turn the query into an inner-join, so as to utilize map-side join optimization.

with t1 as (
  selct col1, col2,..., 0 as k from table1
)
,t2 as (
  selct col3, col4,..., 0 as k from table2 
)
selct 
  *
from t1 join t2 
    on t1.k = t2.k 

Now each table (CTE) has a fake column called k with identical value 0. So it works just like a cross-join while only a map-side join operation takes place.

0
On

Turn on map-join:

set hive.auto.convert.join=true;

select * 
 from table1 cross join table2;

The table is small (52 records) and should fit into memory. Map-join operator will load small table into the distributed cache and each reducer container will use it to process data in memory, much faster than common-join.