How does one determine what the left and right context IDs should be when building a MeCab 0.996 user dictionary with UniDic 2.3.0?

Asked by tyknkd on 21 February 2021 at 04:01

I'm trying to build a MeCab 0.996 user dictionary with UniDic CWJ 2.3.0 on Ubuntu 20.10 using the following terminal command:

    $ /usr/local/libexec/mecab/mecab-dict-index -d /usr/local/lib/unidic/unidic-cwj-2.3.0 -u ~/foo/bar/foo.dic -f utf8 -t utf8 ~/foo/bar/foo.csv

where foo.csv is:

    ダイバーシティ,,,-200,名詞,普通名詞,一般,*,*,*,ダイバーシティ,ダイバーシティ-diversity,ダイバーシティ,ダイバーシティ,ダイバーシティ,ダイバーシティ,外,*,*,*,*,*,*,体,ダイバーシティ,ダイバーシティ,ダイバーシティ,ダイバーシティ,,,,,

But I get this error:

    dictionary.cpp(355) [cid->left_size() == matrix.left_size() && cid->right_size() == matrix.right_size()] Context ID files(/usr/local/lib/unidic/unidic-cwj-2.3.0/left-id.def or /usr/local/lib/unidic/unidic-cwj-2.3.0/right-id.def may be broken

This unresolved GitHub issue post seems to be related but goes over my head: https://github.com/taku910/mecab/issues/42

I'm able to build a MeCab user dictionary with the older unidic-mecab-2.1.2:

    $ /usr/local/libexec/mecab/mecab-dict-index -d ~/mecab/unidic-mecab-2.1.2_src/ -u ~/foo/bar/foo.dic -f utf8 -t utf8 ~/foo/bar/foo.csv
    ./pos-id.def is not found. minimum setting is used
    emitting double-array: 100% |###########################################|
    done!

I'm also able to build a user dictionary using the reiwa.33.csv from the unidic-py documentation:

    $ /usr/local/libexec/mecab/mecab-dict-index -d /usr/local/lib/unidic/unidic-cwj-2.3.0 -u ~/foo/bar/reiwa33.dic -f utf8 -t utf8 ~/foo/bar/reiwa.33.csv
    /usr/local/lib/unidic/unidic-cwj-2.3.0/pos-id.def is not found. minimum setting is used
    reading /home/foo/bar/reiwa.33.csv ... 3
    emitting double-array: 100% |###########################################|
    done!

The reiwa.33.csv is:

    令和,4786,4786,8205,名詞,固有名詞,一般,*,*,*,レイワ,令和,令和,レーワ,令和,レーワ,固,*,*,*,*,*,*,*,レイワ,レイワ,レイワ,レイワ,"1,0",*,*,*,*
    ㋿,5969,5969,2588,補助記号,一般,*,*,*,*,,㋿,㋿,,㋿,,記号,*,*,*,*,*,*,*,,,,,*,*,*,*,999999
    ㋿,4786,4786,3992,名詞,固有名詞,一般,*,*,*,レイワ,令和,㋿,レーワ,㋿,レーワ,固,*,*,*,*,*,*,*,レイワ,レイワ,レイワ,レイワ,"1,0",*,*,*,*

Thus, the difference between the two CSV files is that reiwa.33.csv specifies the left and right context IDs for each surface form (and the aType and lemma_id for some but not all entries), whereas foo.csv leaves them empty.

According to the instructions for MeCab, mecab-dict-index will automatically assign the left and right IDs, and that seems to be the case with unidic-mecab-2.1.2, but not with UniDic 2.3.0.

So, I guess the question becomes: How does one determine what the left and right context IDs should be? Is there an explanation somewhere?

**tyknkd** · Accepted Answer · 2021-02-22T09:44:40.793000

I was able to find the answer in this Qiita post.

To determine the left and right context IDs:

Open the left-id.def and right-id.def files:

    $ gedit /usr/local/lib/unidic/unidic-cwj-2.3.0/left-id.def

    $ gedit /usr/local/lib/unidic/unidic-cwj-2.3.0/right-id.def

Find the line which matches the features of the word.

For general loanword nouns (e.g., ダイバーシティ) without a specified accent type (aType) or accent change type (aConType), the values are:

    left-id: 15917 名詞,普通名詞,一般,*,*,*,*,*,外,*,*,*,*,*,*

    right-id: 17160 名詞,普通名詞,一般,*,*,*,*,*,外,*,*,*,*,*,*
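
The manual lookup above can also be scripted. The following is a minimal sketch, assuming each line of left-id.def and right-id.def has the form `<id> <feature string>` as in the two lines quoted above. The function name `find_context_id` and the sample constants are hypothetical and only for illustration; the sample text contains just the one quoted line per file, whereas the real files are much longer.

```python
from typing import Optional


def find_context_id(def_text: str, features: str) -> Optional[int]:
    """Return the ID whose feature string equals `features`, or None if absent."""
    for line in def_text.splitlines():
        # Assumed line layout: "<id> <comma-separated features>"
        id_part, _, feature_part = line.strip().partition(" ")
        if feature_part == features:
            return int(id_part)
    return None


# Sample data: only the single line quoted in this answer for each file.
LEFT_SAMPLE = "15917 名詞,普通名詞,一般,*,*,*,*,*,外,*,*,*,*,*,*\n"
RIGHT_SAMPLE = "17160 名詞,普通名詞,一般,*,*,*,*,*,外,*,*,*,*,*,*\n"
FEATURES = "名詞,普通名詞,一般,*,*,*,*,*,外,*,*,*,*,*,*"

print(find_context_id(LEFT_SAMPLE, FEATURES))   # 15917
print(find_context_id(RIGHT_SAMPLE, FEATURES))  # 17160
```

To run it against the real files, read each one into a string first, for example with `open(path, encoding="utf-8").read()`. The UTF-8 encoding is an assumption based on the `-f utf8` option used in the question.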

Thus foo.csv should be:

    ダイバーシティ,15917,17160,-200,名詞,普通名詞,一般,*,*,*,ダイバーシティ,ダイバーシティ-diversity,ダイバーシティ,ダイバーシティ,ダイバーシティ,ダイバーシティ,外,*,*,*,*,*,*,体,ダイバーシティ,ダイバーシティ,ダイバーシティ,ダイバーシティ,*,*,*,*,*
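
If there are many entries, the two ID columns can be filled in programmatically. This is a rough sketch, assuming the second and third CSV columns hold the left and right context IDs as in the rows shown in this post. The helper name `fill_context_ids` is hypothetical, and the sample row is shortened for readability. Python's `csv` module is used so that quoted fields such as `"1,0"` stay intact.

```python
import csv
import io


def fill_context_ids(csv_line: str, left_id: int, right_id: int) -> str:
    """Return `csv_line` with columns 2 and 3 set to the given context IDs."""
    row = next(csv.reader([csv_line]))
    row[1] = str(left_id)   # left context ID
    row[2] = str(right_id)  # right context ID
    out = io.StringIO()
    # lineterminator="" so the result is a single line without a trailing newline
    csv.writer(out, lineterminator="").writerow(row)
    return out.getvalue()


# Shortened sample row (the real entry has many more columns).
print(fill_context_ids("ダイバーシティ,,,-200,名詞,普通名詞,一般", 15917, 17160))
```

The sketch only touches the ID columns. Any other changes, such as replacing trailing empty fields with `*` as in the row above, would still need to be made separately.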

Compiling the MeCab dictionary with UniDic CWJ 2.3.0 from foo.csv then works without the "left-id.def or right-id.def may be broken" error:

    $ /usr/local/libexec/mecab/mecab-dict-index -d /usr/local/lib/unidic/unidic-cwj-2.3.0/ -u ~/foo/bar/foo.dic -f utf8 -t utf8 ~/foo/bar/foo.csv
    /usr/local/lib/unidic/unidic-cwj-2.3.0/pos-id.def is not found. minimum setting is used
    reading /home/foo/bar/foo.csv ... 1
    emitting double-array: 100% |###########################################| 
    done!

Note: The values in the reiwa.33.csv appear to be for UniDic 2.1.2.

For a detailed explanation of why the left/right-id.def error occurs and how to swap all the left and right values in matrix.def, see this Japanese Stack Overflow post.
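
The condition quoted in the error message compares the left and right sizes from the context ID files with those from the connection matrix. A rough way to inspect the numbers involved is sketched below, under two assumptions: that the first line of matrix.def holds two integers, and that the sizes of the context ID files correspond to their number of non-empty lines. How MeCab derives these values internally is not confirmed here. The function name `context_sizes` is hypothetical, and the sample strings are toy data, not taken from UniDic.

```python
def context_sizes(left_def: str, right_def: str, matrix_def: str) -> dict:
    """Collect the line counts of the ID files and the two matrix.def header numbers."""
    left_count = sum(1 for line in left_def.splitlines() if line.strip())
    right_count = sum(1 for line in right_def.splitlines() if line.strip())
    # Assumed: the first line of matrix.def is "<int> <int>"
    first, second = matrix_def.splitlines()[0].split()[:2]
    return {
        "left_id_lines": left_count,
        "right_id_lines": right_count,
        "matrix_header": (int(first), int(second)),
    }


# Toy data for illustration only.
LEFT = "0 BOS/EOS\n1 noun\n"
RIGHT = "0 BOS/EOS\n1 noun\n2 verb\n"
MATRIX = "3 2\n0 0 0\n"

sizes = context_sizes(LEFT, RIGHT, MATRIX)
print(sizes)
# With this toy data the header order does not match the line counts.
print(sizes["matrix_header"] == (sizes["left_id_lines"], sizes["right_id_lines"]))
```

If the counts and the header disagree for the real files, that is consistent with the error above, but the linked post remains the place to look for the actual cause and fix.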
