Suppose that I have a list of strings like this (the real dataset is far larger and contains other data too):
List<string> modelNames =
[
"XC60 Momentum Standard T6",
"XC60 Inscription Standard T6",
"XC60 R designStandard T6",
"XC60 T5 Powershift",
"XC60 D3 DRIVE MANUAL",
"XC60 D3 GEARTRONIC",
"XC60 D5 GEARTRONIC AWD",
"XC60 T6 AWD GEARTRONIC",
"XC60 T5 AWD R DESIGN",
"XC60 D5 GEARTRONIC AWD R DESIGN",
"XC60 T6 AWD GEARTRONIC R DESIGN",
];
And I'd like to get the closest match using strings like these:
"2.0 D4 Momentum Pro Auto AWD Euro 6 (s/s) 5dr"
"2.0 D4 Momentum Auto Euro 6 (s/s) 5dr"
"2.0 T5 SE Nav SUV 5dr Petrol Auto Euro 6 (s/s) (245 ps)"
"2.0h T8 Twin Engine 10.4kWh R-Design SUV 5dr Petrol Plug-in Hybrid Auto AWD Euro 6 (s/s) (390 ps)"
As you can see, the strings don't really match at all, but there are some aspects that match.
I'd like to produce some kind of confidence score. My thought was to break both sets of strings into words and see which candidate gets the highest number of word matches. I'm not sure whether this is the best way to do this kind of analysis, or what an optimal and performant way to do it in C# would be.
Perhaps there is a better way than trying to score the matches, like I described above?
I would be grateful for any thoughts, suggestions and pointers.
Thanks,
Kaine
I made a test with your example strings. The result is not great: at most one word matches, which I don't think is enough to make a reliable match. Also, my solution has O(n²) time complexity, which will not scale well if you have large sets.
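For illustration, here is a minimal sketch of the word-matching idea from the question. It is not the code I originally tested with. The helper names (`Tokenize`, `WordOverlap`, `BestMatch`) are made up for this example, and it assumes words are split on spaces only and compared case-insensitively. The score is the Jaccard similarity: the number of shared words divided by the number of distinct words across both strings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: split on spaces and compare words case-insensitively.
// Assumption: no further normalization (e.g. "R-Design" stays a single token).
HashSet<string> Tokenize(string text) =>
    new HashSet<string>(
        text.Split(' ', StringSplitOptions.RemoveEmptyEntries),
        StringComparer.OrdinalIgnoreCase);

// Jaccard similarity: shared words / distinct words across both strings.
// Returns a value between 0.0 (no shared words) and 1.0 (same word set).
double WordOverlap(string query, string candidate)
{
    var q = Tokenize(query);
    var c = Tokenize(candidate);
    if (q.Count == 0 || c.Count == 0) return 0.0;
    int shared = q.Count(word => c.Contains(word));
    return (double)shared / (q.Count + c.Count - shared);
}

// Scores every candidate against the query and returns the highest-scoring one.
// Ties keep the candidate that appears first; an empty candidate list throws.
(string Name, double Score) BestMatch(string query, IEnumerable<string> candidates) =>
    candidates
        .Select(name => (Name: name, Score: WordOverlap(query, name)))
        .OrderByDescending(m => m.Score)
        .First();

// A few of the model names and one query string from the question.
var modelNames = new List<string>
{
    "XC60 Momentum Standard T6",
    "XC60 Inscription Standard T6",
    "XC60 D3 GEARTRONIC",
    "XC60 D5 GEARTRONIC AWD",
};
var query = "2.0 D4 Momentum Auto Euro 6 (s/s) 5dr";

var best = BestMatch(query, modelNames);
Console.WriteLine($"{best.Name}: {best.Score:F3}");
```

With this query the best candidate shares only the word "Momentum", so its score is 1 shared word out of 11 distinct words. That is the weakness described above. Scoring one query is linear in the number of candidates, so matching every query against every candidate is where the quadratic cost comes from.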
Setup:
Test:
Prints: