Filter condition in WHERE clause in HiveQL query does not work properly

288 Views Asked by At

When I run the following query through beeline, one of the filter conditions in WHERE clause (array_item in (20, 30)) does not seem to work correctly.

  • Hive version: 2.1.1
with sample as (
              select 1 as col, array(10)         as array_col
    union all select 2 as col, array(10, 20, 30) as array_col
    union all select 3 as col, array(20, 30)     as array_col
    union all select 4 as col, array(30)         as array_col
)

, targets as (
    select 2 as col_target
)

select
    col
    , array_col
    , array_item
from
    sample
    lateral view explode(array_col) a as array_item
where
    true
    and array_item in (20, 30)
    and sample.col in (select col_target from targets)

Although I was expecting the following result to be displayed,

col array_col   array_item
2   [10,20,30]  20
2   [10,20,30]  30

the actual resuls returned by the query was as follows, showing that the query failed to apply the condition array_item in (20, 30)

col array_col   array_item
2   [10,20,30]  10
2   [10,20,30]  20
2   [10,20,30]  30

For verification, I also ran the following query.
This seemed to be executed correctly and returned the result same as what I expected.

with sample as (
              select 1 as col, array(10)         as array_col
    union all select 2 as col, array(10, 20, 30) as array_col
    union all select 3 as col, array(20, 30)     as array_col
    union all select 4 as col, array(30)         as array_col
)

, targets as (
    select 2 as col_target
)

select
    col
    , array_col
    , array_item
from
    sample
    lateral view explode(array_col) a as array_item
where
    true
    and array_item in (20, 30)
    and sample.col in (2)  -- changed here: "2" is same as the result of subquery

The difference between the first and second query is whether I write the subquery or the result returned by the subquery in the WHERE clause .

So I guess both queries are logically same, but they return different results.

I don't know why the first query does not consider array_item in (20, 30).

1

There are 1 best solutions below

2
Raid On

I did some testing in Hive4 Alpha2 using hadoop 3.3.1 and tez 0.10.2

Looks like some bug when hive.optimize.ppd is true for first query.

By setting hive.optimize.ppd=false first query also work same as 2nd version.