SQL Indexing when full joining two tables on columns that might be NULL

45 Views Asked by At

I have two tables table1 and table2, both with columns group, element_id1, element_id2, date. For every date and every group, there are multiple elements in the group. Each element has 2 ids (every element_id1 is mapped to exactly one element_id2 as vice versa, but sometimes one of them can be NULL [exists but not recorded in our table for that row]). I want to find elements that are in one table per each group and date but not in the other. For checking if two elements are equal, I match them either on element_id1 or element_id2 (depending on which one is non-NULL). For example let the tables be:

  • table1:
group element_id1 element_id2 date
G1 1 NULL 1/31/2024
G1 2 B 1/31/2024
G1 3 C 1/31/2024
G1 NULL A 2/2/2024
G1 4 D 2/2/2024
G2 1 A 1/31/2024
G2 5 NULL 2/2/2024
  • table2:
group element_id1 element_id2 date
G1 1 A 1/31/2024
G1 NULL B 1/31/2024
G1 NULL A 2/2/2024
G1 4 NULL 2/2/2024
G2 1 A 1/31/2024
G2 NULL F 2/2/2024

I want the outcome to be

group element_id1 element_id2 date Status
G1 3 C 1/31/2024 Missing from table2
G2 5 NULL 2/2/2024 Missing from table2
G2 NULL F 2/2/2024 Missing from table1

I want to set INDEX in the tables. What are the good indices here? Should I index on the combination of (group, date, element_id1, element_id2)? But the problem is sometimes one of the elemet_ids is NULL. Sould I index only on group and date in both tables?

The SQL that I use is as follows but is VERY slow:

SELECT COALESCE(t1.group, t2.group) as [Group],
       COALESCE(t1.element_id1, t2.element_id1) as element_id1,
       COALESCE(t1.element_id2, t2.element_id2) as element_id2,
       COALESCE(t1.[date], t2.[date]) as [date],
       CASE WHEN t1.[date] IS NULL THEN 'Missing from table1'
            ELSE 'Missing from table2'
       END AS [Status]
FROM
       table1 t1
FULL JOIN
       table2 t2
ON     t1.[group] = t2.[group]
       AND t1.[date] = t2.[date]
       AND (t1.element_id1 = t2.element_id1 OR t1.element_id2 = t2.element_id2)
WHERE t1.[date] IS NULL OR t2.[date] IS NULL
1

There are 1 best solutions below

0
MatBailie On

The OR condition is what's going to be the limitation.

Optimisations for that can be to split the query and UNION ALL the results, or to invert the logic and use AND instead of OR.

For example;

SELECT
  *, 'Missing from T2'
FROM
  T1
WHERE
  NOT EXISTS (
    SELECT *
      FROM T2
     WHERE T2.group       = T1.group
       AND T2.[date]      = T1.[date]
       AND T2.element_id1 = T1.element_id1
  )
  AND NOT EXISTS (
    SELECT *
      FROM T2
     WHERE T2.group       = T1.group
       AND T2.[date]      = T1.[date]
       AND T2.element_id2 = T1.element_id2
  )

UNION ALL

SELECT
  *, 'Missing from T1'
FROM
  T2
WHERE
  NOT EXISTS (
    SELECT *
      FROM T1
     WHERE T1.group       = T2.group
       AND T1.[date]      = T2.[date]
       AND T1.element_id1 = T2.element_id1
  )
  AND NOT EXISTS (
    SELECT *
      FROM T1
     WHERE T1.group       = T2.group
       AND T1.[date]      = T2.[date]
       AND T1.element_id2 = T2.element_id2
  )

Then both tables should have two indexes each:

  • (group, [date], element_id1)
  • (group, [date], element_id2)

That's a lot of code repetition, but it's also an unusual table setup and requirement.