I'm trying to build a MeCab 0.996 user dictionary with UniDic CWJ 2.3.0 on Ubuntu 20.10 using the following terminal command:
$ /usr/local/libexec/mecab/mecab-dict-index -d /usr/local/lib/unidic/unidic-cwj-2.3.0 -u ~/foo/bar/foo.dic -f utf8 -t utf8 ~/foo/bar/foo.csv
where foo.csv is:
ダイバーシティ,,,-200,名詞,普通名詞,一般,*,*,*,ダイバーシティ,ダイバーシティ-diversity,ダイバーシティ,ダイバーシティ,ダイバーシティ,ダイバーシティ,外,*,*,*,*,*,*,体,ダイバーシティ,ダイバーシティ,ダイバーシティ,ダイバーシティ,,,,,
But I get this error:
dictionary.cpp(355) [cid->left_size() == matrix.left_size() && cid->right_size() == matrix.right_size()] Context ID files(/usr/local/lib/unidic/unidic-cwj-2.3.0/left-id.def or /usr/local/lib/unidic/unidic-cwj-2.3.0/right-id.def may be broken
This unresolved GitHub issue seems to be related, but it goes over my head: https://github.com/taku910/mecab/issues/42
I'm able to build a MeCab user dictionary with the older unidic-mecab-2.1.2:
$ /usr/local/libexec/mecab/mecab-dict-index -d ~/mecab/unidic-mecab-2.1.2_src/ -u ~/foo/bar/foo.dic -f utf8 -t utf8 ~/foo/bar/foo.csv
./pos-id.def is not found. minimum setting is used
emitting double-array: 100% |###########################################|
done!
I'm also able to build a user dictionary using the reiwa.33.csv from the unidic-py documentation:
$ /usr/local/libexec/mecab/mecab-dict-index -d /usr/local/lib/unidic/unidic-cwj-2.3.0 -u ~/foo/bar/reiwa33.dic -f utf8 -t utf8 ~/foo/bar/reiwa.33.csv
/usr/local/lib/unidic/unidic-cwj-2.3.0/pos-id.def is not found. minimum setting is used
reading /home/foo/bar/reiwa.33.csv ... 3
emitting double-array: 100% |###########################################|
done!
The reiwa.33.csv is:
令和,4786,4786,8205,名詞,固有名詞,一般,*,*,*,レイワ,令和,令和,レーワ,令和,レーワ,固,*,*,*,*,*,*,*,レイワ,レイワ,レイワ,レイワ,"1,0",*,*,*,*
㋿,5969,5969,2588,補助記号,一般,*,*,*,*,,㋿,㋿,,㋿,,記号,*,*,*,*,*,*,*,,,,,*,*,*,*,999999
㋿,4786,4786,3992,名詞,固有名詞,一般,*,*,*,レイワ,令和,㋿,レーワ,㋿,レーワ,固,*,*,*,*,*,*,*,レイワ,レイワ,レイワ,レイワ,"1,0",*,*,*,*
Thus, the difference between the two CSV files is that reiwa.33.csv specifies the left and right context IDs for each surface form (and the aType and lemma_id for some but not all entries), whereas foo.csv leaves the context ID columns empty.
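A quick way to confirm this difference is a small Python sketch like the one below. It relies only on the MeCab CSV layout in which the first four columns are surface form, left context ID, right context ID, and cost. The function name rows_missing_context_ids is made up for illustration.

```python
import csv


def rows_missing_context_ids(lines):
    """Return the surface forms of rows whose left or right context ID is empty.

    `lines` is any iterable of CSV lines (an open file works too).
    Assumes the MeCab user-dictionary layout: surface, left ID, right ID, cost, features...
    """
    missing = []
    for row in csv.reader(lines):
        if len(row) < 4:
            # Too short to be a dictionary entry; skip it.
            continue
        surface, left_id, right_id = row[0], row[1], row[2]
        if not left_id.strip() or not right_id.strip():
            missing.append(surface)
    return missing
```

Run against an open foo.csv, this should report ダイバーシティ, while every row of reiwa.33.csv should pass because its ID columns are filled in.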
According to the instructions for MeCab, mecab-dict-index will automatically assign the left and right IDs, and that seems to be the case with unidic-mecab-2.1.2, but not with UniDic 2.3.0.
So, I guess the question becomes: How does one determine what the left and right context IDs should be? Is there an explanation somewhere?
I was able to find the answer in this Qiita post.
To determine the left and right context IDs:
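In left-id.def and right-id.def (in the UniDic dictionary directory), find the line which matches the features of the word. The number at the start of that line is the context ID.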
For general foreign loanword nouns (e.g., ダイバーシティ) without a specified accent type (aType) or accent change type (aConType) the values are:
Note: The values in the reiwa.33.csv appear to be for UniDic 2.1.2.
For a detailed explanation of why the left/right-id.def error occurs and how to swap all the left and right values in matrix.def, see this Japanese Stack Overflow post.
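The lookup described above can be narrowed down with a short Python sketch. It assumes each line of left-id.def and right-id.def has the form "ID, a space, then comma-separated features" and that the files are UTF-8; both are assumptions worth checking against your copy of the dictionary. The function names are hypothetical, and the sketch only lists candidate lines by their leading features; the final choice still has to be made by comparing the remaining fields by hand.

```python
from pathlib import Path


def parse_id_def(lines):
    """Parse lines of the assumed form '<id> <feature>,<feature>,...'."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        id_text, _, features = line.partition(" ")
        entries.append((int(id_text), features.split(",")))
    return entries


def load_id_def(path):
    """Read a left-id.def or right-id.def file (UTF-8 is an assumption)."""
    with Path(path).open(encoding="utf-8") as handle:
        return parse_id_def(handle)


def find_candidates(entries, leading_features):
    """Return (id, features) pairs whose first fields equal leading_features."""
    wanted = list(leading_features)
    n = len(wanted)
    return [(cid, feats) for cid, feats in entries if feats[:n] == wanted]
```

For example, find_candidates(load_id_def("/usr/local/lib/unidic/unidic-cwj-2.3.0/left-id.def"), ["名詞", "普通名詞", "一般"]) would list every candidate line for general common nouns. The same call on right-id.def gives the candidates for the right context ID.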