Hive Query is going for full table scan when filtering on the partitions from the results of subquery/joins

566 Views Asked by At

Why query1 runs for longer time compared to Query2.

Hive Source Table Details

Columns - 166
Partitions columns - 2 columns (all are int datatypes)
Number of Partitions - 3211 partitions
Total Records - 19374461064
File Format - ORC

-- Query1

create table temp1 as 
with temp_table1 as 
(select col1 from temp_table where col1 between <start_date> and <end_date>), -- these query would get me the dates and there are derived dynamically in the script

select f1.* from sourcetable f1 join temp_table1 f2 on f1.col1=f2.col1;

-- Above query runs for longer time, one of the steps takes ~1300 mappers and ~1000 reducers. -- Execution Time - ~120 Mins

-- Query2

-- if am passing the results of temp table1 in where clause as values, then query retreives results in under ~10mins.

select f1.* from source_table where f1.col1 in (value1, value2, value3, value3... value30);
0

There are 0 best solutions below