Hi I'm currently trying to create a query to group customers together.
I'm currently wanting to group by "Phone Number", "Email" and maybe some other fields in the future.
My problem is that I may want to group rows which aren't directly linked but have mutual rows.
For example, in this diagram we can see that Customer 2 and Customer 5 are in no way related at all, however they both share connections with Customer 1 which would then allow them to group together.
Venn Diagram Describing my ideal Data Set
Here is another View: CustomerId | FullName | PhoneNumber | Email --|--|--|--------------------------------------------------------------------------- 1 | Bill Smith | 01612345678 | [email protected] 2 | Kelly Smith | 01612345678 | [email protected] 3 | Kevin Roach | 07111111111 | [email protected] 4 | Chris Ronald | 07222222222 | [email protected] 5 | Bill Smith | 07987654321 | [email protected]
We can see that Customer 2 is connected to Customer 1 through a mutual Phone Number, and Customer 5 is connected to 1 through a mutual Email Address. However Customer 2 and Customer 5 share no information which would've originally grouped them.
My reasoning for needing this functionality is that one person / household could have multiple customer accounts all with differing information therefore I am trying to group them up as best as I can with as much mutual information as possible.
I have already tried to use DENSE_RANK()
and GROUP BY
, however these seem the additionally partition the groups when adding more than 1 column, I want a group to be created solely on the fact that a data field matches another.
I've adopted the excellent answer from How can I combine Group Identifiers into Single Group? by GMB to your issue:
As a sample, i've added a fax field, and the customers are matched by either phone, email or fax.
It's a bit complicated to explain how it works.
First i create some sample data and put it into #data table.
Then we do two CTEs:
For example if we have following edge data:
it will take 1 => 2, and then look for 2's match in the edge data, and found 2 => 3. Then it will look for 3's matches and find 3 => 1.
So for customer 1, it will create three rows, 1 => 2, 1 => 3, 3 => 1, but then it will check so it doesn't loop back to same row already done, so 3 => 1 will be discarded because 1 it was already in 1 => 2 path.
This check is performed here:
WHERE ',' + path + ',' NOT LIKE '%,' + cast(e.custidto AS nvarchar(max))
If we don't do this check, the loop will continue forever which isn't great.
Finally, we have a lot of rows with customer and the path to each node. To get the matching node, we do a string_split and get the minimum "custIdTo" value for each customerId
EDIT: The performance of these things ain't great because of the sheer amount of potential loops and joins. An alternative could be something like this: https://stackoverflow.com/a/75560065/13061224 which is a bit more trivial but means loops