I am looking for some assistance in using Soundex() and Difference() in SQL Server with the goal of comparing one observation in a column to another observation in that same column.
Here is some context of how the data looks. All of these street addresses are intended to represent the same place, but with human error changing a small portion so it isn't exactly the same:
| Street Address | zip | state |
|---|---|---|
| 1440 thisisastreetaddress ave | 12345 | OH |
| 1440 thisismystreetaddress st | 12345 | OH |
| 1440 thisisthestreetaddress rd | 12345 | OH |
The goal would be to use Soundex() and Difference() to group addresses that are similar to each other. Problem is, the table has over 2 million customers so the solution cannot operate on massive amounts of CPU. The end product doesn't have to be pretty, I don't mind sorting manually (which I've realized is what may need to happen to an extent), but I would appreciate the script grouping them for me.
Let me know if its possible with soundex() and difference() and, if its not with those two, any other functions or ways you may go about solving this. Also may be important to include that it can be a multiple software solution as the team and myself know python, R, power BI, etc.
Thanks!