How do I return identical values between two string aggregated columns in SQL?

89 Views Asked by At

I have 2 columns A & B in a SQL Server table that contain a string aggregated list of codes.

The codes in the lists are separated by a semicolon.

The string aggregated lists can take any length and the codes can be in any order.

What I would like to do is to create a new column C which contains the codes that appear somewhere in the same row of the lists of Column A and Column B.

Example:

Column_A Column_B Column_C
a;b;c;d;e c;a;e;i;k c;a;e
d;e;f;g e;h;i;j;d e;d

The example above returns "c", "a" and "e" for the first row of column C, because these codes are present in the same row of both Column A and Column B.

The same for the second row, here "e" and "d" are overlapping in Column A and B and thus returned in Column C.

I have tried something that works, but it does not seem like the best solution in terms of efficiency and performance. Especially because I have many (1m+) rows to check this for and the length of code lists to compare can be very long.

SELECT 
    STRING_AGG(CAST([value] AS NVARCHAR(MAX)),'; ') AS Overlapping_Code 
FROM
    (SELECT a.value 
     FROM MyTable t 
     CROSS APPLY STRING_SPLIT(t.Column_A, ';') a

     INTERSECT

     SELECT b.value 
     FROM MyTable t 
     CROSS APPLY STRING_SPLIT(t.Column_B, ';') b 
) ab

I am looking for a better solution in terms of performance and elegance to compare the string aggregated lists for two columns across (many) rows.

4

There are 4 best solutions below

0
nvogel On BEST ANSWER

Here is another possible solution:

CREATE TABLE MyTable (a VARCHAR(100) NOT NULL, b VARCHAR(100) NOT NULL /* PRIMARY KEY ?? */);

INSERT INTO MyTable VALUES ('a;b;c;d;e','c;a;e;i;k'),('d;e;f;g','e;h;i;j;d');

SELECT *,
    (SELECT STRING_AGG(x.value,';')
    FROM STRING_SPLIT(a,';') AS x
    , STRING_SPLIT(b,';') AS z
    WHERE x.value=z.value)
FROM mytable AS t;



a          b          c          
---------- ---------- -------
a;b;c;d;e  c;a;e;i;k  c;a;e
d;e;f;g    e;h;i;j;d  e;d
0
geek45 On

Try to use CHARINDEX to check which characters are in A and B.

SELECT
    Column_A,
    Column_B,
    (
        SELECT
            STRING_AGG(code, ';') WITHIN GROUP (ORDER BY code) AS common_codes
        FROM (
            SELECT DISTINCT value AS code
            FROM STRING_SPLIT(Column_A, ';')
            WHERE CHARINDEX(';' + value + ';', ';' + Column_B + ';') > 0
        ) AS common_codes
    ) AS Column_C
FROM yourdata;
1
Thom A On

One method would be to split your strings, check if the value is in the other column's dataset and then reaggregate. This differs from your attempt as this doesn't reference the same table twice, just once. This isn't ideal and the order will be arbitrary, as you are on SQL Server 2019 (not 2022):

SELECT *
FROM (VALUES('a;b;c;d;e','c;a;e;i;k','c;a;e'),
            ('d;e;f;g','e;h;i;j;d','e;d'))V(A,B,C)
     CROSS APPLY (SELECT STRING_AGG(BB.value,';') AS C
                  FROM STRING_SPLIT(V.B,';') BB
                  WHERE EXISTS (SELECT 1
                                FROM STRING_SPLIT(V.a,';') AA
                                WHERE BB.value = AA.value))E;

Ideally, as I mentioned, you should be treating your data in a normalised form. I normalise your data using the sample you've given us, and then you can use an EXISTS to get the values that are in both sets:

SELECT *
INTO #YourBadData
FROM (VALUES(1,'a;b;c;d;e','c;a;e;i;k','c;a;e'),
            (2,'d;e;f;g','e;h;i;j;d','e;d'))V(ID,A,B,C);
GO
--Create normalised tables
CREATE TABLE dbo.TableA (ID int, Code char(1));
CREATE TABLE dbo.TableB (ID int, Code char(1));
CREATE TABLE dbo.RelatedIDs (A int, B int);
GO
INSERT INTO dbo.TableA (ID, Code)
SELECT YBD.ID,
       SS.Value
FROM #YourBadData YBD
     CROSS APPLY STRING_SPLIT(YBD.A,';') SS;
GO
INSERT INTO dbo.TableB (ID, Code)
SELECT YBD.ID,
       SS.Value
FROM #YourBadData YBD
     CROSS APPLY STRING_SPLIT(YBD.B,';') SS;

GO
INSERT INTO dbo.RelatedIDs (A,B) --This is overly simplified, as I assume A and B would likely
                                 --have different IDs and come from different sources
SELECT ID, ID
FROM dbo.#YourBadData; 
GO

SELECT B.ID,
       B.Code
FROM dbo.TableB B
WHERE EXISTS (SELECT 1
              FROM dbo.RelatedIDs RI
                   JOIN dbo.TableA A ON RI.A = A.ID
              WHERE RI.B = B.ID
                AND A.Code = B.Code);


GO
--Clean up
DROP TABLE #YourBadData;
DROP TABLE dbo.TableA;
DROP TABLE dbo.TableB;
DROP TABLE RelatedIDs

db<>fiddle

0
Yitzhak Khabinsky On

Please try the following solution based on the set operator INTERSECT

SQL

-- DDL and sample data population, start
DECLARE @tbl TABLE (ID INT IDENTITY PRIMARY KEY, colA VARCHAR(20), colB VARCHAR(20));
INSERT @tbl (colA, colB) VALUES
('a;b;c;d;e', 'c;a;e;i;k'),
('d;e;f;g', 'e;h;i;j;d');
-- DDL and sample data population, end

SELECT t.*,
    (
        SELECT STRING_AGG(value, ';') WITHIN GROUP (ORDER BY value) AS common_codes
        FROM (
            SELECT value
            FROM STRING_SPLIT(ColA, ';')
            INTERSECT
            SELECT value
            FROM STRING_SPLIT(ColB, ';')
        ) AS common_codes
    ) AS ColC
FROM @tbl as t;

Output

ID colA colB ColC
1 a;b;c;d;e c;a;e;i;k a;c;e
2 d;e;f;g e;h;i;j;d d;e