I'm working on a PHP-based shopping application. I have lists of strings that I know represent the same product. Those strings are likely to contain the full product name or part of it (full product name usually being brand + model).
I wonder what the best approach is for extracting the product name from these strings.
For instance, here is a list of strings that represent the same product:
- Tkg BOUILLOIRE TKG - JK 1008 RWD
- Tkg Jk 1008 Rwd
- Tkg Kalorik - JK 1008 RWD - Bouilloire Électrique sans Fil 360°
- TKG Bouilloire électrique sans fil 1,7 litre 2000 watts Pois TKG Rouge et blanc
- Tkg Kalorik - JK 1008 RWD - Bouilloire Électrique sans Fil 360°
- Tkg JK 1008 RWD BOUILLOIRES
I expect to extract the product name "Tkg JK 1008 RWD". Please note that string 4 only contains partial information.
I've tried an approach where I counted the words repeated across all the strings, but it is difficult to go further from there.
Would you have any clue?
Cheers, Nicolas
You could analyse how much the strings overlap (generating a list of the words or substrings that appear in most of them) and then pick the most relevant words.
For example, words that appear in more than a certain percentage of the strings are the most likely candidates for the product name. This is similar to what you have done, but with a threshold added: if you see that 5 words appear in 88% of the strings and all the others in a much lower percentage, pick those 5 as the product name. I am afraid this is not exact and the threshold needs to be tweaked manually. It should capture the majority of the information but will never be perfect.
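As a rough sketch of the threshold idea, using the strings from the question: the snippet below counts in how many strings each word appears and keeps the words whose share reaches a cut-off. It is written in Python for brevity, but the same logic ports directly to PHP arrays. The function names and the 0.8 cut-off are illustrative assumptions, not tuned values.

```python
import re
from collections import Counter

TITLES = [
    "Tkg BOUILLOIRE TKG - JK 1008 RWD",
    "Tkg Jk 1008 Rwd",
    "Tkg Kalorik - JK 1008 RWD - Bouilloire Électrique sans Fil 360°",
    "TKG Bouilloire électrique sans fil 1,7 litre 2000 watts Pois TKG Rouge et blanc",
    "Tkg Kalorik - JK 1008 RWD - Bouilloire Électrique sans Fil 360°",
    "Tkg JK 1008 RWD BOUILLOIRES",
]

def tokenize(title):
    # Lowercase and split on anything that is not a letter, digit or underscore.
    return re.findall(r"\w+", title.lower())

def common_tokens(titles, min_share=0.8):
    # min_share is an assumed cut-off; it has to be tuned on real data.
    doc_freq = Counter()
    for title in titles:
        # dict.fromkeys de-duplicates while keeping word order, so a word
        # repeated inside one title is only counted once for that title.
        for token in dict.fromkeys(tokenize(title)):
            doc_freq[token] += 1
    # Counter keeps insertion order, so the result follows first appearance.
    return [t for t, n in doc_freq.items() if n / len(titles) >= min_share]

print(" ".join(common_tokens(TITLES)))
```

With the six strings above, "tkg" appears in all of them and "jk", "1008" and "rwd" in five out of six, so they pass the 0.8 cut-off, while "bouilloire" (four out of six) does not. Lowering the cut-off lets descriptive words in, which is the manual tweaking mentioned above.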
Additionally, you can keep a pre-defined list of brands and filter those words out. I would also account for partial matching of words: the strings may come from manual data entry, so there can always be typos. How much this matters depends on your data. If you already get a strong enough "signal" by simply discarding the misspelled words, there is no need to worry about it.
Going even further, you can add another filter that marks items for manual curation, but this may be very time-consuming.
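A minimal sketch of these two ideas, again in Python but easy to port to PHP (which has `similar_text()` and `levenshtein()` for the fuzzy part). The brand list, the function names and the 0.85 similarity cut-off are all assumptions made for illustration.

```python
import difflib

# Hypothetical brand list; in practice it would come from your own catalogue.
KNOWN_BRANDS = {"tkg", "kalorik"}

def canonical(token, vocabulary, cutoff=0.85):
    # Map a token to its closest known spelling, or keep it unchanged
    # when nothing in the vocabulary is similar enough.
    match = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
    return match[0] if match else token

def split_brand(tokens, brands=KNOWN_BRANDS):
    # Separate brand words from the remaining (model) words, keeping order.
    brand = [t for t in tokens if t in brands]
    rest = [t for t in tokens if t not in brands]
    return brand, rest

vocabulary = ["tkg", "kalorik", "jk", "1008", "rwd", "bouilloire"]
print(canonical("bouilloires", vocabulary))
print(split_brand(["tkg", "jk", "1008", "rwd"]))
```

Here "bouilloires" is folded into "bouilloire" before counting, so plural forms and small typos no longer split the frequency between two spellings. Short model codes deserve a strict cut-off, since codes that differ by a single character may well be different products.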
I am afraid there is no simple answer. What you are doing is essentially text mining. I have just thrown out a few ideas and starting points that can help you get started.
The above would work assuming you are building some automatic crawler that puts together data from multiple sources. If you would rather enable visitors to search your site and get the right product page for any query, then I would suggest diving into text searching (principal data analysis anyone?), or just using a ready-made solution.