Why there is a sort operator in execution plan when EXCEPTed queries are clustered tables?

746 Views Asked by At

I'm building a Data Warehouse and I found a problem in two tables data comparison statement. I use EXCEPT operator to compare the tables which have clustered indexes (normal int field as key). My problem is that in query execution plan there are sort operators after both clustered indexes scan. Here's is a code example:

create table temp.table_a
(
    key_a int identity,
    some_field_a int,
    some_field2_a varchar(10)
);

insert into temp.table_a
(
    some_field_a,
    some_field2_a
)
select
    n,
    'abcd'
from meta.GENERATE_SEQUENCE(1,1000);

create clustered index cix_table_a_key_a on temp.table_a (key_a);


create table temp.table_b
(
    key_b int identity,
    some_field_b int,
    some_field2_b varchar(10)
);

insert into temp.table_b
(
    some_field_b,
    some_field2_b
)
select 
    n,
    'abcd'
from meta.GENERATE_SEQUENCE(1,1000);

create clustered index cix_table_b_key_b on temp.table_b (key_b);

(GENERATE_SEQUENCE is a row generator)

Now the EXCEPT query:

select 
    key_a,
    some_field_a,
    some_field2_a
from temp.table_a

except

select 
    key_b,
    some_field_b,
    some_field2_b
from temp.table_b

Here's an image of the execution plan:

[Execution plan]

I'm aware that Merge Join needs sorted input, but isn't it already sorted enough? By this I mean that the only sorted columns we need is key_a/key_b. And this is already done because of clustered indexes. Sorts of other columns are not needed because inside every value of key_a/key_b there is only one row - nothing to sort.

So, my question are:

  1. Why there are sort operators after clustered index scans in this situation?
  2. How can I avoid these sorts when I want to use EXCEPT operator?
  3. What are the better ways (if there is any) of doing table comparison?

Thanks in advance for your answers :)

2

There are 2 best solutions below

0
dominjas On BEST ANSWER

Thank you all guys for your help. Below is full answer for my question gathered from comments and answers:

  1. Why there are sort operators after clustered index scans in this situation?

Sort operators are there because index key columns (key_a, key_b) are not unique for optimizer.

  1. How can I avoid these sorts when I want to use EXCEPT operator?

Be sure that your index key columns are unique - use UNIQUE CLUSTERED INDEX or set constraint on these fields.

  1. What are the better ways (if there is any) of doing table comparison?

Alternate solutions considering adding more columns to index key or using NOT EXISTS instead of EXCEPT were given in answer of Steve Ford (@steve-ford).

3
Steve Ford On

To answer your question isn't it already sorted enough? - No. The comparison is performed on all columns in the SELECT so that all columns need to be included in the sort.

I have included 2 possible solutions, one is to add all columns to the index, the other is to use NOT EXISTS - note this could return duplicate rows if there are duplicates in table_a.

1.) include these columns in the indexes, you will see in the SQLFiddle linked queryplan that no sort is used in this scenario.

SQL Fiddle

MS SQL Server 2017 Schema Setup:

create table table_a
(
    key_a int identity,
    some_field_a int,
    some_field2_a varchar(10)
);

WITH Tally (n) AS
(
    -- 1000 rows
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c(n)
)
insert into table_a
(
    some_field_a,
    some_field2_a
)
select
    n,
    'abcd'
from Tally;

create clustered index cix_table_a_key_a on table_a (key_a,some_field_a,
    some_field2_a);


create table table_b
(
    key_b int identity,
    some_field_b int,
    some_field2_b varchar(10)
);

WITH Tally (n) AS
(
    -- 1000 rows
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c(n)
)
insert into table_b
(
    some_field_b,
    some_field2_b
)
select 
    n,
    'abcd'
from Tally;

create clustered index cix_table_b_key_b on table_b (key_b, some_field_b, some_field2_b);

Query 1:

select 
    key_a,
    some_field_a,
    some_field2_a
from table_a

except

select 
    key_b,
    some_field_b,
    some_field2_b
from table_b

Results:

2.) Another alternative is to use NOT EXISTS instead of EXCEPT - it should be noted that these are not exactly the same as an EXCEPT effectively performs a DISTINCT too. If you add a DISTINCT to a NOT EXISTS you will get a SORT on the smaller results table.

See SQLFIDLLE here:

SQL Fiddle

MS SQL Server 2017 Schema Setup:

create table table_a
(
    key_a int identity,
    some_field_a int,
    some_field2_a varchar(10)
);

WITH Tally (n) AS
(
    -- 1000 rows
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c(n)
)
insert into table_a
(
    some_field_a,
    some_field2_a
)
select
    n,
    'abcd'
from Tally;

create clustered index cix_table_a_key_a on table_a (key_a);


create table table_b
(
    key_b int identity,
    some_field_b int,
    some_field2_b varchar(10)
);

WITH Tally (n) AS
(
    -- 1000 rows
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c(n)
)
insert into table_b
(
    some_field_b,
    some_field2_b
)
select 
    n,
    'abcd'
from Tally;

create clustered index cix_table_b_key_b on table_b (key_b);

Query 1:

select 
    key_a,
    some_field_a,
    some_field2_a
from table_a
WHERE NOT EXISTS
(
  select NULL 
  from table_b
  WHERE
    key_b = key_a AND
    some_field_b = some_field_a AND
    some_field2_b = some_field2_a
)

Results: