What's the difference between a one-word token and a multi-word token in CRF++ for Chinese?

I use CRF++ for Chinese named entity recognition. The first column of the training file is the token, which represents the current word. I see some people put only one Chinese character in the first column, while others put several Chinese characters, such as 中国.

There is 1 best solution below

A Chinese word can consist of one Chinese character or multiple Chinese characters:
中 corresponds to the English word "middle".
国 corresponds to another English word, "country".
And 中国 corresponds to the English word "China".
They are the same thing, the current word. Just as 'CHINA' has 5 English letters, 中国 has 2 Chinese characters, and both count as the current word in CRF++.
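
To make the two layouts concrete, here is a minimal sketch in Python that writes the same sentence once with one word per row and once with one character per row. The sentence, its segmentation, the LOC label, the B-/I- tag prefixes, and the function names are all assumptions made for illustration; they do not come from CRF++ itself. The only thing relied on is the usual CRF++ layout of one token per line, whitespace-separated columns, and the answer tag in the last column.

```python
# Hypothetical pre-segmented sentence: (word, entity label) pairs.
# Assumption: each pair is one complete word, and "O" means "not an entity".
words = [("我", "O"), ("爱", "O"), ("中国", "LOC")]


def word_level_rows(tagged_words):
    """One row per word: the first column holds the whole word (e.g. 中国)."""
    rows = []
    for word, label in tagged_words:
        tag = "O" if label == "O" else "B-" + label
        rows.append(f"{word}\t{tag}")
    return rows


def char_level_rows(tagged_words):
    """One row per character: the word's label is spread over its characters."""
    rows = []
    for word, label in tagged_words:
        for i, ch in enumerate(word):
            if label == "O":
                tag = "O"
            else:
                # First character starts the entity, later ones continue it.
                tag = ("B-" if i == 0 else "I-") + label
            rows.append(f"{ch}\t{tag}")
    return rows


print("\n".join(word_level_rows(words)))
print()  # CRF++ separates sentences with an empty line
print("\n".join(char_level_rows(words)))
```

In both files the first column is still just "the current token" as far as CRF++ is concerned. The word-level file needs a word segmenter to run before tagging, while the character-level file does not.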