Adding column to primary key removes uniqueness

186 Views Asked by At

I have around 34 million rows, each with 23 columns in the store_sales table in the tpc-ds dataset .

I have a composite primary key of columns ss_item_sk and ss_ticket_number.

Upon running the query SELECT count(DISTINCT <primary key>) .. I'm able to see that it outputs the total number of rows present in the table.

Now I add another column along with the primary key, namely ss_sold_date_sk.

After this, if I run the countquery, I get fewer number of rows being printed than before. Can someone explain to me, via example, why this could happen?

TL;DR

When does adding a column to a composite primary key ever stop making it unique?

1

There are 1 best solutions below

4
On

Demo

create table mytable (c1 string,c2 string);
insert into mytable values ('A','A'),('B',null),('C',null);

select count(distinct c1) as count_distinct from mytable;

+----------------+
| count_distinct |
+----------------+
|              3 |
+----------------+

As Expected - 3 distinct values - 'A','B' and 'C'


select count(distinct concat(c1,c2)) as count_distinct from mytable;

+----------------+
| count_distinct |
+----------------+
|              1 |
+----------------+
    

As expected. Why? - see next query


select c1,c2,concat(c1,c2) as concat_c1_c2 from mytable;

+----+------+--------------+
| c1 |  c2  | concat_c1_c2 |
+----+------+--------------+
| A  | A    | AA           |
| B  | NULL | NULL         |
| C  | NULL | NULL         |
+----+------+--------------+

Concatenation with NULL yields NULL


select count(distinct c1,c2) as count_distinct from mytable;

+----------------+
| count_distinct |
+----------------+
|              1 |
+----------------+

BUG!!


Here is a work-around the bug:

select count(distinct struct(c1,c2)) as count_distinct from mytable;

+----------------+
| count_distinct |
+----------------+
|              3 |
+----------------+