Keep one occurrence of each pair by year if a condition is met

39 Views Asked by At

I have this table ("colors") in SQL:

CREATE TABLE colors (
    color1 VARCHAR(50),
    color2 VARCHAR(50),
    year INT,
    var1 INT,
    var2 INT,
    var3 INT,
    var4 INT
);


INSERT INTO colors (color1, color2, year, var1, var2, var3, var4) VALUES
    ('red', 'blue', 2010, 1, 2, 1, 2),
    ('blue', 'red', 2010, 1, 2, 1, 2),
    ('red', 'blue', 2011, 1, 2, 5, 3),
    ('blue', 'red', 2011, 5, 3, 1, 2),
    ('orange', NULL, 2010, 5, 9, NULL, NULL)
('green', 'white', 2010, 5, 9, 6, 3);

The table looks like this:

 color1 color2 year var1 var2 var3 var4
    red   blue 2010    1    2    1    2
   blue    red 2010    1    2    1    2
    red   blue 2011    1    2    5    3
   blue    red 2011    5    3    1    2
 orange   NULL 2010    5    9 NULL NULL
green    white 2010    5    9    6    3

I am trying to do the following:

  • For pairs of colors in the same year (e.g. red/blue/2010 and blue/red/2010) - if var1=var3 and var2=var4 : then keep only one pair
  • For pairs of colors in the same year - if var1!=var3 OR var2!=var4 : then keep both pairs
  • For colors that do not have pairs in the same year : keep those rows as well

The final result should look like this:

 color1 color2 year var1 var2 var3 var4
    red   blue 2010    1    2    1    2
    red   blue 2011    1    2    5    3
   blue    red 2011    5    3    1    2
 orange   NULL 2010    5    9 NULL NULL
green    white 2010    5    9    6    3

Here is my attempt to write the SQL code for this:

First I write CTEs to identify pairs - then I verify the OR conditions:

WITH pairs AS (
    SELECT *,
    CASE 
        WHEN color1 < color2 THEN color1 || color2 || CAST(year AS VARCHAR(4))
        ELSE color2 || color1 || CAST(year AS VARCHAR(4))
    END AS pair_id
    FROM colors
),
ranked_pairs AS (
    SELECT *,
    ROW_NUMBER() OVER(PARTITION BY pair_id ORDER BY color1, color2) as row_num
    FROM pairs
)
SELECT color1, color2, year, var1, var2, var3, var4
FROM ranked_pairs
WHERE row_num = 1 OR var1 != var3 OR var2 != var4;

The output looks like this:

 color1 color2 year var1 var2 var3 var4
 orange   <NA> 2010    5    9   NA   NA
   blue    red 2010    1    2    1    2
   blue    red 2011    5    3    1    2
    red   blue 2011    1    2    5    3
  green  white 2010    5    9    6    3

Am I doing this correctly? The final result looks correct but I am not confident, e. this code might not work on some fringe cases.

Thanks!

1

There are 1 best solutions below

4
Ricardo Gellman On

Your ordering colors in pair_id seems to be wrong in case the same pair represents a different order. Also you are handling null values as equal.

Please check version below:

WITH pairs AS (
    SELECT
        color1,
        color2,
        year,
        var1,
        var2,
        var3,
        var4,
        CASE 
            WHEN color1 < color2 THEN color1 || color2 || CAST(year AS VARCHAR(4))
            ELSE color2 || color1 || CAST(year AS VARCHAR(4))
        END AS pair_id
    FROM colors
),
ranked_pairs AS (
    SELECT
        color1,
        color2,
        year,
        var1,
        var2,
        var3,
        var4,
        ROW_NUMBER() OVER(PARTITION BY pair_id ORDER BY LEAST(color1, color2), GREATEST(color1, color2)) as row_num
    FROM pairs
)
SELECT color1, color2, year, var1, var2, var3, var4
FROM ranked_pairs
WHERE row_num = 1 OR (var1 = var3 AND var2 = var4) OR (var1 IS NULL AND var3 IS NULL) OR (var2 IS NULL AND var4 IS NULL);