I am trying to use CRF++ to parse product strings into various attribute classes so that I can perform product matching similar to this question.
Where I am running into an issue, however, is that CRF is not accurately predicting tags when the order of the words in the product string has not yet been seen in the training file.
As an example, I am simply using a bag-of-words template file:
#Unigrams
U00:%x[-1,0]
U00:%x[0,0]
U00:%x[1,0]
#Bigrams
B
And I run crf_learn on training data that includes the following example:
panasonic NOUN B-BRAND
digital ADJ B-PRODUCT
monitor NOUN I-PRODUCT
17 # B-SIZE
inch # I-SIZE
When using this training data, the model correctly parses a test string "panasonic digital monitor 17 inch" into its correct output tags. When I use the model on a string such as "panasonic monitor digital 17 inch", however, the model does not produce the tagging I want and instead assigns the tags for 'digital' and 'monitor' as follows:
panasonic NOUN B-BRAND
monitor NOUN B-PRODUCT
digital ADJ I-PRODUCT
17 # B-SIZE
inch # I-SIZE
What I need, however, is the following:
panasonic NOUN B-BRAND
monitor NOUN I-PRODUCT
digital ADJ B-PRODUCT
17 # B-SIZE
inch # I-SIZE
Is this an issue with my template file, or is CRF inherently syntax-restricted? Or can I somehow modify the template file or training data columns to capture/ignore ordering of words in the product string?
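First, the feature definitions in the template file are wrong. All three feature templates share the same identifier, U00. CRF++ uses the identifier to tell templates apart, so the previous, current, and next word all generate the same kind of feature. It means there's essentially only 1 feature, not 3.
Second, I think you should try more feature templates.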
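The example that originally followed here appears to be missing, so what follows is only a sketch of the kind of template that could be tried, modeled on the conventions in the CRF++ documentation. It assumes the training file layout from the question (column 0 is the word, column 1 is the POS tag, the last column is the label). The window sizes and the particular combinations are assumptions for illustration, not a tuned or recommended set.

```
# Unigram features on the word column; each template has its own ID
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
# Combinations of adjacent words
U05:%x[-1,0]/%x[0,0]
U06:%x[0,0]/%x[1,0]
# Unigram features on the POS column
U10:%x[-1,1]
U11:%x[0,1]
U12:%x[1,1]
# Current word combined with its POS tag
U13:%x[0,0]/%x[0,1]

# Bigram (combines the previous and current output tags)
B
```

Because every line now has a distinct identifier, a word seen at position -1 and the same word seen at position +1 become different features. Whether that helps or hurts on reordered product strings depends on the training data, so it is worth comparing a few variants on a held-out set.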
Hope this helps in improving the performance :)
PS: See https://youtu.be/GJHeTvDkIaE for an explanation of CRF++ template files.