Is there any objective reason to prefer a particular form of left anti-semi join?

223 Views Asked by At

I've seen a great many different forms of left anti-semi join. Allow me to list and name every one that comes to mind. The following queries are intended to return every ROSTER_ID that's not used by any employee and who is the owner of that roster.

--1) NOT EXISTS, with a particular column selected in the subquery
SELECT ROSTER_ID, ROSTER_OWNER
FROM ROSTERS
WHERE NOT EXISTS (SELECT EMP_ID FROM EMPLOYEES WHERE EMP_ID = ROSTERS.EMP_ID)
--2) NOT EXISTS, with a particular column selected in the subquery and TOP (1) used
SELECT ROSTER_ID, ROSTER_OWNER
FROM ROSTERS
WHERE NOT EXISTS (SELECT TOP (1) EMP_ID FROM EMPLOYEES WHERE EMP_ID = ROSTERS.EMP_ID)
--3) NOT EXISTS, with all data selected in the subquery
SELECT ROSTER_ID, ROSTER_OWNER
FROM ROSTERS
WHERE NOT EXISTS (SELECT * FROM EMPLOYEES WHERE EMP_ID = ROSTERS.EMP_ID)
--4) NOT EXISTS, with all columns selected in the subquery and TOP (1) used
SELECT ROSTER_ID, ROSTER_OWNER
FROM ROSTERS
WHERE NOT EXISTS (SELECT TOP (1) * FROM EMPLOYEES WHERE EMP_ID = ROSTERS.EMP_ID)
--5) NOT EXISTS, but just use SELECT 1
SELECT ROSTER_ID, ROSTER_OWNER
FROM ROSTERS
WHERE NOT EXISTS (SELECT 1 FROM EMPLOYEES WHERE EMP_ID = ROSTERS.EMP_ID)
--6) NOT IN
SELECT ROSTER_ID, ROSTER_OWNER
FROM ROSTERS
WHERE EMP_ID NOT IN (SELECT EMP_ID FROM EMPLOYEES)
--7) LEFT JOIN
SELECT ROSTER_ID, ROSTER_OWNER
FROM ROSTERS
LEFT OUTER JOIN EMPLOYEES
ON EMPLOYEES.EMP_ID = ROSTERS.EMP_ID
WHERE EMPLOYEES.EMP_ID IS NULL

My question is this: Is there any objective reason - e.g. performance, backwards compatibility, portability, NULL-handling, ease of testing, extensibility, etc - to prefer any particular way of doing a left anti-semi join? I'm also interested to hear subjective reasons, e.g. style concerns or clarity, but only including them would be a non-answer.

My own research only points to the below, but it's all weak and probably subjective:

  • Microsoft's U-SQL documentation hints that they prefer to use the NOT IN version (#6 of mine) in T-SQL.
  • NULL handling with IN is always cause for concern, which is a mild reason to prefer EXISTS over it.
  • If you're seriously concerned about backwards compatibility, then I think my LEFT JOIN syntax didn't work in the 1980's.
  • Some people like to use EXCEPT, but I don't think that it generalises to cases where a column appears in only the outermost SELECT.
3

There are 3 best solutions below

0
On BEST ANSWER

I don't think that any of the answers have covered everything, so here's my attempt to put all of the answers and comments together:

  • Examples 1 through 5, i.e. the EXISTS examples, are all the same. This is not easy to prove (you might find KumarHarsh's example convincing) and relies on you trusting the optimiser.
  • Don't use example #6, i.e. IN. You just can't trust it with NULLs.
  • Example #7, i.e. LEFT JOIN, can duplicate results if you use it for a normal (i.e. not anti) semijoin. There is also suspicion around how the optimiser will treat it (see the link below). Subjectively, it takes what appears to be more typing and isn't all that good at communicating your intent. Overall, it's best avoided.

In conclusion, use EXISTS. Examples #2 and #4, which use TOP, pointlessly do so, so you might as well avoid them. As for if you should use examples #1, #3, or #5, that's just a matter of taste. I think that #5 most clearly communicates intent, but that's completely a matter of opinion.

This very helpful article also suggests OUTER APPLY, but I don't think that anyone takes that seriously.

0
On

No, except for NOT IN sucking

Aaron's article already is full of juicy information, and you have already spotted the danger of using NOT IN in combination with NULLS.

The only thing I can add that I didn't notice discussed is how, when using (not) exists, the things before FROM like select column or select top 1 * are essentially junk. I even posted a recommendation to sql server to have an alternative syntax of (anti) semi join table2 on <join condition>.

8
On

EXISTS, NOT EXISTS return booolean TRUE/FALSE depending.The Select columns mention in EXISTS subquery do not matter,it only check whether it exists or not depending upon where cluase

For example,

SELECT ROSTER_ID, ROSTER_OWNER
FROM ROSTERS
WHERE NOT EXISTS (SELECT 1/0 FROM EMPLOYEES WHERE EMP_ID = ROSTERS.EMP_ID)

In above notice 1/0 wont throw error ,this mean what ?

So your 1),2),3),4),5) are all equal in performance and result.

LEFT JOIN :It is use when you require column from LEFT JOIN table in resultset.