How to Improve Cross Join Performance in Hive TEZ?

883 Views Asked by Mani At 26 June 2025 at 06:48

I have a Hive table with 5 billion records. I want each of these 5 billion records to be joined with 52 hardcoded records.

To achieve this I am doing a cross join like this:

select * 
from table1 join table2
ON 1 = 1;

This takes 5 hours to run, even with the highest possible memory parameters.

Is there a shorter or easier way to achieve this in less time?

There are 2 best solutions below

damientseng On 26 September 2020 at 06:19

Your query is slow because a cross join (Cartesian product) is processed by a single reducer. The cure is to enforce higher parallelism. One way is to turn the query into an inner join, so that it can use the map-side join optimization.

with t1 as (
  select col1, col2,..., 0 as k from table1
)
,t2 as (
  select col3, col4,..., 0 as k from table2 
)
select 
  *
from t1 join t2 
    on t1.k = t2.k

Now each table (CTE) has a fake column called k with the identical value 0. Every row of t1 therefore matches every row of t2, so the query returns the same rows as a cross join, but only a map-side join operation takes place.
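To check that the rewritten query really runs as a map-side join, one option is to prefix it with EXPLAIN and look at the plan. The following is only a sketch: it uses `t.*` and `s.*` in place of the elided column lists above, and the aliases `a` and `b` are made up for illustration.

```sql
-- Sketch only: inspect the plan instead of running the 5-billion-row join.
set hive.auto.convert.join=true;

explain
select a.*, b.*
from (select t.*, 0 as k from table1 t) a   -- large table plus constant key
join (select s.*, 0 as k from table2 s) b   -- 52-row table plus constant key
  on a.k = b.k;
```

If the conversion happened, the plan should contain a map join operator rather than a shuffle join. If it does not, the small-table size thresholds are worth checking.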

leftjoin On 26 September 2020 at 12:30

Turn on map-join:

set hive.auto.convert.join=true;

select * 
 from table1 cross join table2;

The second table is small (52 records) and should fit into memory. The map-join operator loads the small table into the distributed cache, and each mapper container uses it to process its share of the large table in memory, which is much faster than a common join.
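Whether Hive converts the join automatically also depends on its estimate of the small table's size. A hedged sketch of the related settings follows; the threshold value shown is illustrative rather than a recommendation, and the explicit hint is only honored when `hive.ignore.mapjoin.hint` is false.

```sql
-- Sketch only: settings that influence automatic map-join conversion.
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
-- Upper bound, in bytes, on the small-table size for direct conversion.
-- 10000000 is an illustrative value; tune it for your cluster.
set hive.auto.convert.join.noconditionaltask.size=10000000;

-- Alternative: request the map join explicitly with a hint.
set hive.ignore.mapjoin.hint=false;
select /*+ MAPJOIN(table2) */ *
from table1 cross join table2;
```

With only 52 rows, table2 should normally fall well under any reasonable threshold, so the hint is mainly useful when table statistics are missing or misleading.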
