I have two tables table1 and table2, both with columns group, element_id1, element_id2, date.
For every date and every group, there are multiple elements in the group. Each element has 2 ids (every element_id1 is mapped to exactly one element_id2 as vice versa, but sometimes one of them can be NULL [exists but not recorded in our table for that row]).
I want to find elements that are in one table per each group and date but not in the other. For checking if two elements are equal, I match them either on element_id1 or element_id2 (depending on which one is non-NULL).
For example let the tables be:
- table1:
| group | element_id1 | element_id2 | date |
|---|---|---|---|
| G1 | 1 | NULL | 1/31/2024 |
| G1 | 2 | B | 1/31/2024 |
| G1 | 3 | C | 1/31/2024 |
| G1 | NULL | A | 2/2/2024 |
| G1 | 4 | D | 2/2/2024 |
| G2 | 1 | A | 1/31/2024 |
| G2 | 5 | NULL | 2/2/2024 |
- table2:
| group | element_id1 | element_id2 | date |
|---|---|---|---|
| G1 | 1 | A | 1/31/2024 |
| G1 | NULL | B | 1/31/2024 |
| G1 | NULL | A | 2/2/2024 |
| G1 | 4 | NULL | 2/2/2024 |
| G2 | 1 | A | 1/31/2024 |
| G2 | NULL | F | 2/2/2024 |
I want the outcome to be
| group | element_id1 | element_id2 | date | Status |
|---|---|---|---|---|
| G1 | 3 | C | 1/31/2024 | Missing from table2 |
| G2 | 5 | NULL | 2/2/2024 | Missing from table2 |
| G2 | NULL | F | 2/2/2024 | Missing from table1 |
I want to set INDEX in the tables. What are the good indices here? Should I index on the combination of (group, date, element_id1, element_id2)? But the problem is sometimes one of the elemet_ids is NULL. Sould I index only on group and date in both tables?
The SQL that I use is as follows but is VERY slow:
SELECT COALESCE(t1.group, t2.group) as [Group],
COALESCE(t1.element_id1, t2.element_id1) as element_id1,
COALESCE(t1.element_id2, t2.element_id2) as element_id2,
COALESCE(t1.[date], t2.[date]) as [date],
CASE WHEN t1.[date] IS NULL THEN 'Missing from table1'
ELSE 'Missing from table2'
END AS [Status]
FROM
table1 t1
FULL JOIN
table2 t2
ON t1.[group] = t2.[group]
AND t1.[date] = t2.[date]
AND (t1.element_id1 = t2.element_id1 OR t1.element_id2 = t2.element_id2)
WHERE t1.[date] IS NULL OR t2.[date] IS NULL
The OR condition is what's going to be the limitation.
Optimisations for that can be to split the query and UNION ALL the results, or to invert the logic and use AND instead of OR.
For example;
Then both tables should have two indexes each:
(group, [date], element_id1)(group, [date], element_id2)That's a lot of code repetition, but it's also an unusual table setup and requirement.