How to do a complex conditional outer join on pandas DataFrames using pandasql

456 Views Asked by At

This is a problem that took me a long time to solve, and I wanted to share my solution. Here's the problem.

We have 2 pandas DataFrames that need to be outer joined on a very complex condition. Here was mine:

condition_statement = """
    ON (
        A.var0 = B.var0
        OR (
            A.var1 = B.var1
            AND (
                A.var2 = B.var2
                OR A.var3 = B.var3
                OR A.var4 = B.var4
                OR A.var5 = B.var5
                OR A.var6 = B.var6
                OR A.var7 = B.var7
                OR (
                    A.var8 = B.var8
                    AND A.var9 = B.var9
                )
            )
        )
    )
"""

Doing this in pandas would be a nightmare.

I like to do most of my DataFrame massaging with the pandasql package. It lets you run SQL queries on top of the DataFrames in your local environment.

The problem with pandasql is it runs on a SQLite engine, so you can't do RIGHT or FULL OUTER joins.

So how do you approach this problem?

1

There are 1 best solutions below

0
On

Well you can achieve a FULL OUTER join with two LEFT joins, a condition, and a UNION.

First, declare a snippet with the columns you want to retrieve:

select_statement = """
    SELECT 
        A.var0
        , B.var1
        , COALESCE(A.var2, B.var2) as var2
"""

Next, build a condition that represents all values in A being NULL. I built mine using the columns in my DataFrame:

where_a_is_null_statement = f"""
    WHERE 
    {" AND ".join(["A." + col + " is NULL" for col in A.columns])}
"""

Now, do the 2-LEFT-JOIN-with-a-UNION trick using all of these snippets:

sqldf(f"""
    {select_statement}
    FROM A
    LEFT JOIN B
    {condition_statement}
    UNION
    {select_statement}
    FROM B
    LEFT JOIN A
    {condition_statement}
    {where_a_is_null_statement}
""")