Best way to match strings from different systems

Question

Asked by Kaine on 29 January 2024 at 12:59

Suppose that I have a list of strings like this (the real dataset is far larger and contains other data too):

List<string> modelNames =
    [
        "XC60 Momentum Standard T6",
        "XC60 Inscription Standard T6",
        "XC60 R designStandard T6",
        "XC60 T5 Powershift",
        "XC60 D3 DRIVE MANUAL",
        "XC60 D3 GEARTRONIC",
        "XC60 D5 GEARTRONIC AWD",
        "XC60 T6 AWD GEARTRONIC",
        "XC60 T5  AWD R DESIGN",
        "XC60 D5 GEARTRONIC AWD R DESIGN",
        "XC60 T6 AWD GEARTRONIC R DESIGN",
    ];

And I'd like to get the closest match using strings like these:

"2.0 D4 Momentum Pro Auto AWD Euro 6 (s/s) 5dr"
"2.0 D4 Momentum Auto Euro 6 (s/s) 5dr"
"2.0 T5 SE Nav SUV 5dr Petrol Auto Euro 6 (s/s) (245 ps)"
"2.0h T8 Twin Engine 10.4kWh R-Design SUV 5dr Petrol Plug-in Hybrid Auto AWD Euro 6 (s/s) (390 ps)"

As you can see, the strings don't really match at all, but there are some aspects that match.

I'd like to produce some kind of confidence score. My thought was to break both sets of strings into words and see which pairing gets the highest number of word matches. I'm not sure whether this is the best way of doing this kind of analysis, or what would be an optimal and performant way to do it in C#.

Perhaps there is a better way than trying to score the matches, like I described above?

I would be grateful for any thoughts, suggestions and pointers.

Thanks,

Kaine

**Olivier Jacot-Descombes** · Answer 1 · 2024-01-29T15:43:54.420000

I ran a test with your example strings, and the result is not great: at most one word matches per name, which I don't think is enough to make a reliable match. Also, my solution compares every name in the first list with every name in the second, so it has O(n²) time complexity and will not scale well to large sets.
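On the scaling point, one common way to avoid comparing every pair is an inverted index: map each word to the names that contain it, then look up only the words of the name being matched. The following is a minimal sketch of that idea, not part of the tested solution in this answer; `WordIndex`, `Build` and `CountShared` are hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordIndex
{
    // Maps each word (case-insensitive) to the indexes of the names containing it.
    public static Dictionary<string, List<int>> Build(IReadOnlyList<string> names)
    {
        var index = new Dictionary<string, List<int>>(StringComparer.OrdinalIgnoreCase);
        for (int i = 0; i < names.Count; i++)
        {
            var words = names[i]
                .Split(' ', StringSplitOptions.RemoveEmptyEntries)
                .Distinct(StringComparer.OrdinalIgnoreCase);
            foreach (var word in words)
            {
                if (!index.TryGetValue(word, out var list))
                {
                    list = new List<int>();
                    index[word] = list;
                }
                list.Add(i);
            }
        }
        return index;
    }

    // Returns (name index -> number of shared words), visiting only the
    // names that share at least one word with the query.
    public static Dictionary<int, int> CountShared(
        Dictionary<string, List<int>> index, string query)
    {
        var counts = new Dictionary<int, int>();
        var words = query
            .Split(' ', StringSplitOptions.RemoveEmptyEntries)
            .Distinct(StringComparer.OrdinalIgnoreCase);
        foreach (var word in words)
        {
            if (!index.TryGetValue(word, out var hits)) continue;
            foreach (int i in hits)
            {
                counts.TryGetValue(i, out int current); // current is 0 when absent
                counts[i] = current + 1;
            }
        }
        return counts;
    }
}
```

Build the index once over the larger list, then call `CountShared` for each name of the other list. Words that occur in almost every name (such as "Euro" in the sample data) still produce long hit lists, so the gain depends on how distinctive the words are.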

Setup:

static List<string> modelNames =
[
    "XC60 Momentum Standard T6",
    "XC60 Inscription Standard T6",
    "XC60 R designStandard T6",
    "XC60 T5 Powershift",
    "XC60 D3 DRIVE MANUAL",
    "XC60 D3 GEARTRONIC",
    "XC60 D5 GEARTRONIC AWD",
    "XC60 T6 AWD GEARTRONIC",
    "XC60 T5  AWD R DESIGN",
    "XC60 D5 GEARTRONIC AWD R DESIGN",
    "XC60 T6 AWD GEARTRONIC R DESIGN",
];
static List<string> modelNames2 =
[
    "2.0 D4 Momentum Pro Auto AWD Euro 6 (s/s) 5dr",
    "2.0 D4 Momentum Auto Euro 6 (s/s) 5dr",
    "2.0 T5 SE Nav SUV 5dr Petrol Auto Euro 6 (s/s) (245 ps)",
    "2.0h T8 Twin Engine 10.4kWh R-Design SUV 5dr Petrol Plug-in Hybrid Auto AWD Euro 6 (s/s) (390 ps)"
];

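// Pairs the original name with its whitespace-separated words.
static (string name, string[] words) GetWords(string sentence)
{
    // A null separator splits on any whitespace; RemoveEmptyEntries drops the
    // empty token that a double space (e.g. "T5  AWD") would otherwise produce.
    return (sentence, sentence.Split((char[]?)null, StringSplitOptions.RemoveEmptyEntries));
}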

Test:

var names1 = modelNames.Select(n => GetWords(n));
var names2 = modelNames2.Select(n => GetWords(n)).ToList();
foreach (var n1 in names1) {
    int bestCount = 0;
    List<string> bestMatches = [];
    foreach (var n2 in names2) {
        // Number of distinct words the two names share, ignoring case.
        int count = n1.words
            .Intersect(n2.words, StringComparer.InvariantCultureIgnoreCase)
            .Count();
        if (count > bestCount) {
            // New best: discard the earlier candidates.
            bestCount = count;
            bestMatches.Clear();
            bestMatches.Add(n2.name);
        } else if (count > 0 && count == bestCount) {
            // Tie with the current best: keep both.
            bestMatches.Add(n2.name);
        }
    }
    Console.WriteLine($"{n1.name}  (count={bestCount})");
    foreach (var match in bestMatches) {
        Console.WriteLine($"    {match}");
    }
}
Console.ReadKey();

Prints:

XC60 Momentum Standard T6  (count=1)
    2.0 D4 Momentum Pro Auto AWD Euro 6 (s/s) 5dr
    2.0 D4 Momentum Auto Euro 6 (s/s) 5dr
XC60 Inscription Standard T6  (count=0)
XC60 R designStandard T6  (count=0)
XC60 T5 Powershift  (count=1)
    2.0 T5 SE Nav SUV 5dr Petrol Auto Euro 6 (s/s) (245 ps)
XC60 D3 DRIVE MANUAL  (count=0)
XC60 D3 GEARTRONIC  (count=0)
XC60 D5 GEARTRONIC AWD  (count=1)
    2.0 D4 Momentum Pro Auto AWD Euro 6 (s/s) 5dr
    2.0h T8 Twin Engine 10.4kWh R-Design SUV 5dr Petrol Plug-in Hybrid Auto AWD Euro 6 (s/s) (390 ps)
XC60 T6 AWD GEARTRONIC  (count=1)
    2.0 D4 Momentum Pro Auto AWD Euro 6 (s/s) 5dr
    2.0h T8 Twin Engine 10.4kWh R-Design SUV 5dr Petrol Plug-in Hybrid Auto AWD Euro 6 (s/s) (390 ps)
XC60 T5  AWD R DESIGN  (count=1)
    2.0 D4 Momentum Pro Auto AWD Euro 6 (s/s) 5dr
    2.0 T5 SE Nav SUV 5dr Petrol Auto Euro 6 (s/s) (245 ps)
    2.0h T8 Twin Engine 10.4kWh R-Design SUV 5dr Petrol Plug-in Hybrid Auto AWD Euro 6 (s/s) (390 ps)
XC60 D5 GEARTRONIC AWD R DESIGN  (count=1)
    2.0 D4 Momentum Pro Auto AWD Euro 6 (s/s) 5dr
    2.0h T8 Twin Engine 10.4kWh R-Design SUV 5dr Petrol Plug-in Hybrid Auto AWD Euro 6 (s/s) (390 ps)
XC60 T6 AWD GEARTRONIC R DESIGN  (count=1)
    2.0 D4 Momentum Pro Auto AWD Euro 6 (s/s) 5dr
    2.0h T8 Twin Engine 10.4kWh R-Design SUV 5dr Petrol Plug-in Hybrid Auto AWD Euro 6 (s/s) (390 ps)
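Part of the reason so few words match is formatting: "R DESIGN" versus "R-Design", or the fused "designStandard". A possible refinement, sketched below under assumptions, is to normalize the tokens first and then turn the overlap into a score between 0 and 1. The score used here is Jaccard similarity (shared tokens divided by all distinct tokens), which is a substitute for the raw count above rather than something tested in this answer; `NameMatcher`, `Tokenize` and `Score` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class NameMatcher
{
    // Splits a name into lower-cased tokens. Hyphens and parentheses are
    // treated as separators, and a fused pair such as "designStandard"
    // is split where two lower-case letters meet a capitalized word.
    public static HashSet<string> Tokenize(string name)
    {
        string spaced = Regex.Replace(name, "(?<=[a-z]{2})(?=[A-Z][a-z])", " ");
        string[] tokens = spaced
            .ToLowerInvariant()
            .Split(new[] { ' ', '-', '(', ')' }, StringSplitOptions.RemoveEmptyEntries);
        return new HashSet<string>(tokens);
    }

    // Jaccard similarity: |A ∩ B| / |A ∪ B|, from 0 (nothing shared) to 1 (same tokens).
    public static double Score(string a, string b)
    {
        HashSet<string> setA = Tokenize(a);
        HashSet<string> setB = Tokenize(b);
        if (setA.Count == 0 || setB.Count == 0) return 0.0;
        int common = setA.Count(setB.Contains);
        int union = setA.Count + setB.Count - common;
        return (double)common / union;
    }
}
```

With this normalization, "XC60 T5  AWD R DESIGN" and the T8 R-Design entry share three tokens (awd, r, design) instead of one, giving a score of 3/22. Jaccard penalizes the much longer strings of the second system, so dividing by the size of the smaller token set instead may suit this data better; which variant works is something to check against known-good pairs.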
