Snowflake "Exploding Join" issue while doing left join for multiple tables

3.3k Views Asked by At

I am trying to do some left joins on multiple tables and facing the following issue.

Row Counts of tables

Table 1: 1.6M
Table 2: 1.7M
Table 3: 1.5M

When I am doing left Join using Table 1 and 2 and following query, I get data count as 1.8 M (acceptable):

SELECT Table1.ID1, Table1.ID2, Table2.Name, Table2.City
FROM Table1
LEFT JOIN Table2
    ON Table1.ID1 = Table2.ID1
    AND Table1.ID2 = Table2.ID2
    AND Table1.Source_System = Table2.Source_System
;

Similarly when I am doing left Join using Table 1 and 3 and following query, I get data count as 1.9 M (acceptable):

SELECT Table1.ID1, Table1.ID2, Table3.Name, Table3.City 
FROM Table1
LEFT JOIN Table3
    ON Table1.ID1 = Table3.ID1 
    AND Table1.ID2 = Table3.ID2
    AND Table1.Source_System = Table3.Source_System
;

But when I am doing left Join using Table 1, 2 and 3 and following query, I get data count as 11.9 G (ISSUE):

SELECT 
    Table1.ID1, Table1.ID2,
    Table2.Name, Table2.City,
    Table3.Name as Name1, Table3.City as City1
FROM Table1
LEFT JOIN Table2
    ON Table1.ID1 = Table2.ID1
    AND Table1.ID2 = Table2.ID2
    AND Table1.Source_System = Table2.Source_System
LEFT JOIN Table3
    ON Table1.ID1 = Table3.ID1
    AND Table1.ID2 = Table3.ID2
    AND Table1.Source_System = Table3.Source_System
;
4

There are 4 best solutions below

4
On

When your left join is producing more records than the referenced table it should not be acceptable! that should signal warning in your join condition and data. Either you investigate those records in the table to avoid it in the first place or you would need to keep tweaking your SQL to satisfy clean join that produces exact reference table row count. otherwise, it is very common that left joining to another table with a small duplicate records will produce exponential row count as you are facing here.

Try reading these questions here to help here and here

Just to add about investigating and finding those rows, use following SQL to find in each table what rows that have same ID1, ID2 and Source_System columns

i.e. :-

Select ID1, ID2 ,Source_System, COUNT(*) AS NUM_RECORDS_DUPS
FROM TABLE1
GROUP BY ID1, ID2 , Source_System 
HAVING COUNT(*)>1  -- Filtering on duplicate rows that has more than a row satisfying the join condition

Use the same for each of the tables to find those records and either add another unique condition/ aggregate the table on the joining keys or ask for data cleansing ! for those records

0
On

Have you tried adding a DISTINCT clause?

SELECT DISTINCT columns, of, choice FROM Table1 LEFT JOIN Table2 on ... LEFT JOIN Table3 on ...

I think what's happening is you have dups that left join on another giant set of dups.

0
On

So it seems you have assumed the data in table1 and table2 join in a 1:1 ratio, and also assumed the table1 and table3 are also a 1:1 ratio, so assumed when those three tables joined, that ration should be in the order again of 1:1

But if half you entries in table1 are not in table2 to get the 1.8M result, the the common rows would have to be duplicated > 2.0 times that increase. If we change that from half not matching to a tenth not matching there would need to be > 10.0 duplicates. Thus to get the 4 magnitude growth you have, it seems like you have only 100th match, but greater than 100.0 duplicates, which when cross joined give the 10,000 growth in rows.

this could be seen via:

SELECT Table1.ID1, Table1.ID2, Table1.Source_System, counnt(*) as counts
FROM Table1
LEFT JOIN Table2
    ON Table1.ID1 = Table2.ID1
        AND Table1.ID2 = Table2.ID2
        AND Table1.Source_System = Table2.Source_System
GROUP BY 1,2,3 
ORDER BY counts DESC
;

this will show the total distinct pairs, and which are the worst contributors to the combination explosion

0
On

Use the proper keys to join the two tables, it solves the issue.