CRF++ Template File and Sentence Syntax

I am trying to use CRF++ to parse product strings into various attribute classes so that I can perform product matching similar to this question.

The issue I am running into, however, is that the CRF does not predict tags accurately when the word order of the product string has not been seen in the training file.

As an example, I am simply using a bag-of-words template file:

#Unigrams
U00:%x[-1,0]
U00:%x[0,0]
U00:%x[1,0]

#Bigrams
B

And I run crf_learn including the following example training data:

panasonic  NOUN  B-BRAND
digital  ADJ  B-PRODUCT
monitor  NOUN  I-PRODUCT
17  #  B-SIZE
inch  #  I-SIZE

With this training data, the model correctly parses the test string "panasonic digital monitor 17 inch" into the correct output tags. When I use the model on a string such as "panasonic monitor digital 17 inch", however, it does not produce the tagging I want and instead assigns the tags for 'digital' and 'monitor' by position, something like the following:

panasonic  NOUN  B-BRAND
monitor  NOUN  B-PRODUCT
digital  ADJ  I-PRODUCT
17  #  B-SIZE
inch  #  I-SIZE

What I need, however, is the following:

panasonic  NOUN  B-BRAND
monitor  NOUN  I-PRODUCT
digital  ADJ  B-PRODUCT
17  #  B-SIZE
inch  #  I-SIZE

Is this an issue with my template file, or is CRF inherently syntax-restricted? Or can I somehow modify the template file or training data columns to capture/ignore ordering of words in the product string?

There is 1 best solution below

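First, the feature definitions in the template file are wrong.

All three feature templates use the same identifier, U00. CRF++ prefixes every generated feature with that identifier, so the previous, current, and next word all produce features of the same form, and the model cannot tell which relative position a word came from. In effect there is only 1 feature type, not 3.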

Second, I think you should try more feature templates, each with its own identifier, for example:

#context of 3 words
U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]

#for POS Tag
U03:%x[0,1]
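
To make the identifier point concrete, here is a minimal Python sketch that imitates how a `%x[row,col]` macro is expanded into a feature string. It is not CRF++ code: `expand` and `features` are hypothetical helpers, the `_B-1` / `_B+1` boundary strings are placeholders for out-of-range positions, and treating the features at a position as a set is an assumption about how identical strings are handled.

```python
import re

# Matches a CRF++-style macro such as %x[-1,0] (relative row, column).
MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")


def expand(template, rows, pos):
    """Expand one template line at token index `pos` (hypothetical helper)."""
    def substitute(match):
        row = pos + int(match.group(1))
        col = int(match.group(2))
        if row < 0:
            return "_B%d" % row  # placeholder: before the sentence start
        if row >= len(rows):
            return "_B+%d" % (row - len(rows) + 1)  # placeholder: past the end
        return rows[row][col]
    return MACRO.sub(substitute, template)


def features(templates, rows, pos):
    """Set of feature strings produced at `pos` by all template lines."""
    return {expand(t, rows, pos) for t in templates}


original = [("panasonic", "NOUN"), ("digital", "ADJ"), ("monitor", "NOUN"),
            ("17", "#"), ("inch", "#")]
# Same tokens with "digital" and "monitor" swapped.
reordered = [original[0], original[2], original[1], original[3], original[4]]

shared_id = ["U00:%x[-1,0]", "U00:%x[0,0]", "U00:%x[1,0]"]
distinct_ids = ["U00:%x[-1,0]", "U01:%x[0,0]", "U02:%x[1,0]"]

# Position 1 is "digital" in the original and "monitor" in the reordered string.
print(sorted(features(shared_id, original, 1)))
print(sorted(features(shared_id, reordered, 1)))
print(sorted(features(distinct_ids, original, 1)))
print(sorted(features(distinct_ids, reordered, 1)))
```

Under these assumptions, the shared-identifier template yields exactly the same feature set at position 1 for both word orders, which would be consistent with the model giving that position the same tag either way. With distinct identifiers the two sets differ, so the model at least has the information needed to tell the orderings apart.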

Hope this helps in improving the performance :)

PS: See https://youtu.be/GJHeTvDkIaE for an explanation of CRF++ template files.