How to extract product names from a set of strings? (PHP)


I'm working on a PHP-based shopping application. I have lists of strings that I know represent the same product. Those strings are likely to contain the full product name or part of it (full product name usually being brand + model).

I wonder what the best approach is for extracting the product name.

For instance, here is a list of strings that represent the same product:

  • Tkg BOUILLOIRE TKG - JK 1008 RWD
  • Tkg Jk 1008 Rwd
  • Tkg Kalorik - JK 1008 RWD - Bouilloire Électrique sans Fil 360°
  • TKG Bouilloire électrique sans fil 1,7 litre 2000 watts Pois TKG Rouge et blanc
  • Tkg Kalorik - JK 1008 RWD - Bouilloire Électrique sans Fil 360°
  • Tkg JK 1008 RWD BOUILLOIRES

I expect to extract the product name "Tkg JK 1008 RWD". Please note that string 4 contains only partial information.

I've tried an approach where I counted the words repeated across all the strings, but it is difficult to go further from there.

Would you have any suggestions?

Cheers, Nicolas

4

There are 4 best solutions below

0
On

A first stab at implementing some of the ideas you guys brought up.

class ProductNameExtraction {

    private $brandName = NULL;
    private $categoryName = NULL;

    private $modelName = NULL;

    /**
      * @param array $A Array of strings describing the same product
      */
    public function __construct($A, $brandName, $categoryName) {
        $this->brandName = $brandName;
        $this->categoryName = $categoryName;

        $res = array();     
        foreach ($A as $k => $title) {
            $res[] = $this->cleanTitle($title);
        }

        $this->modelName = $this->computeProductName($res);
    }

    public function getModelName() {
        return $this->modelName;
    }

    private function computeProductName($A) {
        $s = NULL;

        foreach ($A as $k => $title) {
            $s .= $title . ' ';
        }
        $s = trim($s);

        $data = explode(' ', $s);

        // count most popular words
        $count = array_count_values($data);

        // Remove brand & category names
        unset($count[$this->cleanTitle($this->brandName)]);
        unset($count[$this->cleanTitle($this->categoryName)]);

        $s = '';
        $totalnb = sizeof($A);          
        foreach ($count as $k => $val) {
            if ($val / $totalnb > 0.5) {
                $s .= $k . ' ';
            }
        }

        // trim the trailing space left by the loop above
        return trim($s);
    }

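    private function cleanTitle($title) {
        // Remove noise: line breaks and " - " separators
        $title = str_replace(array("\r\n", "\n", "\r"), ' ', $title);
        $title = str_replace(' - ', ' ', $title);

        // Remove extra spaces (done last, so the replacements above cannot leave any)
        $title = preg_replace('/\s+/', ' ', trim($title));

        // mb_strtoupper (mbstring extension) also upper-cases accented letters,
        // so "électrique" and "Électrique" end up as the same word
        return mb_strtoupper($title);
    }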

}
0
On

Having worked at a comparison shopping engine (though not on this problem specifically), I would guess that the problem as you describe it is extraordinarily hard. My suggestion is to give up and just pick the "best" of the strings rather than trying to synthesize or extract "the" product name (which is a nebulous concept anyway). Most ideas you could use to extract the product name would yield inconsistent and frustrating results. For example, looking at just the examples you gave, naive algorithms would probably produce either cryptic results like "Jk 1008 Rwd" or something extremely vague like "Bouilloire Électrique". Even Tomas' clever and nice-looking results will fail for a lot of products, or produce embarrassingly ungrammatical results. A lot of the ideas that come to my mind would tend to strip out category words like "Bouilloire Électrique", which would be suboptimal for user experience and SEO.

If I were in your position, I would probably model the solution like this: compute idf weights for each of the words in the title (viewing all your products or all products in this category as the space of documents). Then convert each product string to its idf weight vector, and compute the centroid of all the weight vectors for the product. Find the string closest to that centroid, and call that the "best". Use that string as the product name. It's not perfect, but it's likely to work well in most cases. There may be a plugin or query in Lucene (or whatever search database you are using) that could do a lot of this for you.

In the list of strings you give, this method would tend to move away from the fourth, incomplete, string because it wouldn't include the highly-weighted model number 1008 (presumably not common among electric kettles). That could be a problem if you got a lot of low-information, incomplete product names. Then the centroid might not be particularly close to names containing the model number. As I said, it's a hard problem.
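To make the idea concrete, here is a minimal sketch of the centroid approach in plain PHP. The function names (`tokenize`, `idfWeights`, `closestToCentroid`) are made up for this example, the vectors are simple "word present times IDF" weights, and "closest" is taken to mean smallest Euclidean distance, which is my assumption rather than the only reasonable choice. It also assumes UTF-8 input and the mbstring extension.

```php
<?php
// Hypothetical tokenizer: lower-case, split on anything that is not a letter or digit,
// and keep each word once per title.
function tokenize(string $title): array
{
    $words = preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($title), -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_unique($words));
}

// IDF per word over a corpus of titles: log(N / number of titles containing the word).
function idfWeights(array $corpus): array
{
    $docFreq = [];
    foreach ($corpus as $title) {
        foreach (tokenize($title) as $word) {
            $docFreq[$word] = ($docFreq[$word] ?? 0) + 1;
        }
    }
    $idf = [];
    foreach ($docFreq as $word => $n) {
        $idf[$word] = log(count($corpus) / $n);
    }
    return $idf;
}

// Returns the title whose IDF vector has the smallest Euclidean distance to the centroid.
function closestToCentroid(array $titles, array $idf): ?string
{
    $vectors = [];
    $centroid = [];
    foreach ($titles as $i => $title) {
        $vectors[$i] = [];
        foreach (tokenize($title) as $word) {
            $weight = $idf[$word] ?? 0.0; // words unknown to the corpus get weight 0 here
            $vectors[$i][$word] = $weight;
            $centroid[$word] = ($centroid[$word] ?? 0.0) + $weight / count($titles);
        }
    }

    $best = null;
    $bestDistance = INF;
    foreach ($vectors as $i => $vector) {
        // Squared distance is enough for comparing candidates.
        $distance = 0.0;
        foreach ($centroid as $word => $mean) {
            $distance += (($vector[$word] ?? 0.0) - $mean) ** 2;
        }
        if ($distance < $bestDistance) {
            $bestDistance = $distance;
            $best = $titles[$i];
        }
    }
    return $best;
}
```

You would call `idfWeights()` once on all the titles of the category (or of the whole catalogue), then `closestToCentroid()` on the strings of each product.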

Other ideas:

  1. Thomas' heuristic of picking the first n most common words might work better than I'm guessing it does. Alternatively, there might be another heuristic for detecting when it would work poorly.
  2. Look for long substrings common to most of the strings, and pick the one with the highest IDF weight sum.
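
The second idea could be sketched roughly like this. It works on word sequences rather than raw character substrings (my simplification), `bestCommonPhrase` is a made-up name, and the `$idf` argument is assumed to be a precomputed map from lower-cased words to IDF weights. Words missing from that map simply count as 0.

```php
<?php
// Hypothetical helper: among the word sequences shared by at least $minShare of the
// titles, return the one with the highest sum of IDF weights.
// $idf maps lower-cased words to precomputed IDF weights (an assumption of this sketch).
function bestCommonPhrase(array $titles, array $idf, float $minShare = 0.5): ?string
{
    $seenIn = []; // phrase => number of titles containing it
    foreach ($titles as $title) {
        $words = preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($title), -1, PREG_SPLIT_NO_EMPTY);

        // Collect every contiguous word sequence of this title, once per title.
        $phrases = [];
        for ($start = 0; $start < count($words); $start++) {
            for ($len = 1; $start + $len <= count($words); $len++) {
                $phrases[implode(' ', array_slice($words, $start, $len))] = true;
            }
        }
        foreach (array_keys($phrases) as $phrase) {
            $seenIn[$phrase] = ($seenIn[$phrase] ?? 0) + 1;
        }
    }

    $best = null;
    $bestScore = -INF;
    foreach ($seenIn as $phrase => $n) {
        if ($n / count($titles) < $minShare) {
            continue; // not common enough
        }
        $score = 0.0;
        // The cast matters: a phrase like "1008" is stored as an integer array key.
        foreach (explode(' ', (string) $phrase) as $word) {
            $score += $idf[$word] ?? 0.0;
        }
        if ($score > $bestScore) {
            $bestScore = $score;
            $best = (string) $phrase;
        }
    }
    return $best;
}
```

Because the score is a sum, longer phrases win as long as they stay above the share threshold, so `$minShare` is the knob to tune. The number of phrases grows quadratically with the title length, which should be fine for short product titles.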

Further reading:

  • TF-IDF
  • Centroid
  • Vector Space Model

1
On

You could analyse how much the strings overlap (and generate list of words/substring which appear in most of them) and then pick the most relevant words.

For example, if a word appears in a certain percentage of the strings, you can treat it as a likely part of the product name. This is similar to what you have done, but with thresholds added: if you see that 5 words appear in 88% of the strings and the others in a much lower percentage, pick those top 5 as the product name. I am afraid this is not exact and needs to be tweaked manually. It should capture the majority of the information but will never be perfect.
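
A minimal sketch of that threshold idea, assuming UTF-8 input and the mbstring extension. `frequentWords` and its `$minShare` parameter are names I made up for the example, and 0.8 is only a starting value to tweak.

```php
<?php
// Hypothetical helper: keep the words that occur in at least $minShare of the titles.
function frequentWords(array $titles, float $minShare = 0.8): array
{
    $docFreq = []; // word => number of titles containing it
    foreach ($titles as $title) {
        // Lower-case, then split on anything that is not a letter or digit.
        $words = preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($title), -1, PREG_SPLIT_NO_EMPTY);
        // Count each word once per title, so repeats inside one title do not inflate it.
        foreach (array_unique($words) as $word) {
            $docFreq[$word] = ($docFreq[$word] ?? 0) + 1;
        }
    }

    $kept = [];
    foreach ($docFreq as $word => $n) {
        if ($n / count($titles) >= $minShare) {
            $kept[] = (string) $word; // numeric words such as "1008" are stored as int keys
        }
    }
    return $kept; // in order of first appearance
}
```

With the six strings from the question and the default threshold of 0.8, this keeps `tkg`, `jk`, `1008` and `rwd`. Lowering the threshold to 0.6 also lets `bouilloire` through.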

Additionally, you can keep a predefined list of brands and filter those words out. I would also account for partial matching of words, since the strings can come from manual data entry and there can always be typos. Check how relevant this is for your data: if you get a strong enough "signal" by simply discarding the misspelled words, there is no need to worry.
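
For the brand list with typo tolerance, something along these lines might be enough. `isBrandWord` and `$maxTypos` are hypothetical names, and the brand list is whatever you maintain yourself. Note that PHP's `levenshtein()` compares bytes, so accented characters count as more than one edit.

```php
<?php
// Hypothetical helper: true when $word is within $maxTypos edits of a known brand name.
function isBrandWord(string $word, array $brands, int $maxTypos = 1): bool
{
    $word = strtolower($word);
    foreach ($brands as $brand) {
        // levenshtein() returns the number of single-character edits between two strings.
        if (levenshtein($word, strtolower($brand)) <= $maxTypos) {
            return true;
        }
    }
    return false;
}

// Example: "Kalorick" is one edit away from "Kalorik" and is treated as a brand word.
var_dump(isBrandWord('Kalorick', ['Kalorik', 'TKG']));
```

Be careful with very short brand names: with one allowed typo, a two-letter word like "tk" would also match "TKG", so you may want to require an exact match below a certain length.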

Going even further, you can specify another filter to mark items for manual curation but this may be very time consuming.

I am afraid there is no simple answer. What you are doing is essentially text mining. I have just thrown out a few ideas and starting points to help you get started.

The above assumes you are building some kind of automatic crawler that puts together data from multiple sources. If you would rather let visitors search your site and get the right product page for all of their queries, then I would suggest diving into some text searching (principal data analysis anyone?). Or just use a ready-made solution.

1
On

Just some thoughts

<?php
// to lower case (mb_strtolower, from the mbstring extension, also handles "É")
$string = mb_strtolower(
'Tkg BOUILLOIRE TKG - JK 1008 RWD
Tkg Jk 1008 Rwd
Tkg Kalorik - JK 1008 RWD - Bouilloire Électrique sans Fil 360°
TKG Bouilloire électrique sans fil 1,7 litre 2000 watts Pois TKG Rouge et blanc
Tkg Kalorik - JK 1008 RWD - Bouilloire Électrique sans Fil 360°
Tkg JK 1008 RWD BOUILLOIRES'
);

// remove new lines and explode by spaces
$data = explode(' ', str_replace(array("\r\n", "\n", "\r"), ' ', $string));
// count most popular words
$count = array_count_values($data);
// sort by count, highest first
arsort($count);
// get first 6 most popular words; preserve_keys = true stops array_slice
// from renumbering the numeric key 1008 to 0
$product = array_slice($count, 0, 6, true);
// print product
var_dump(implode(' ', array_keys($product)));
?>

Output is (the order of equally frequent words can vary between PHP versions):

tkg rwd 1008 jk - bouilloire