AnalysisException: Using PythonUDF in join condition of join type LeftSemi is not supported

1.6k Views Asked by At

I am not doing LeftSemi join anywhere, neither am I using a python UDF. Still I am getting this error when joining two dataframes.

df1 - one column, is primary key of the table, say "customerHash". It may be empty(In fact in my current case, it is empty).

df2 - a table which also has customerHash column, but it's primary key column is different.

result = df1\
.select("customerHash")\
.distinct()\
.join(df2, ["customerHash"], 'inner')

The code runs successfully, but when I try to display/collect/persist the result table, it throws the mentioned error. I have absolutely no idea why it's happening - My guess will be because the df1 is empty. But joins don't throw errors when tables are empty, right?

My main goal is to get only those rows of df2 whose customerHash is in df1. I could use

df2.filter(F.col("customerHash").isin(df1.select("customerHash").distinct().collect()....))

but I don't want to use it as it is very slow.

Please help!

2

There are 2 best solutions below

1
On

I met this error,I divided it into two joins to settle. for example A leftjoin B:

step 1. A innerjoin B get C (use udf)

step 2. A leftjoin C for final result

1
On

Are your dataframes sourced from the same table? I have seen this exact error when I do a join using two dataframes that are filtered view/slices of the same source df.

Although in my case I WAS applying a UDF - I did not have any leftSemi join in the code. I suspect the inner join I did have defaulted to a left semi when joining against the same dataframe source.

I worked around my issue by writing df1 and df2 out to two tables, and recreating them by reading from those tables. The inner join proceeded without errors after that.