Use MeCab to separate Japanese sentences into words, not morphemes, in VB.NET

I am using the following code to split a Japanese sentence into its words:

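        ' MeCabParam and MeCabTagger look like the NMeCab library's types (Imports NMeCab).
        Dim parameter = New MeCabParam()
        Dim tagger = MeCabTagger.Create(parameter)

        ' sentence is the String to analyse.
        For Each node In tagger.ParseToNodes(sentence)

            If node.CharType > 0 Then
                Dim features = node.Feature.Split(","c)
                ' Unknown words may have fewer fields, so guard the lookup at index 7.
                Dim reading = If(features.Length > 7, features(7), "*")
                Console.Write(node.Surface)
                Console.WriteLine(" (" & reading & ") " & features(1))
            End If
        Next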

An input of それに応じて大きくになります。 outputs morphemes:

それ (それ) 代名詞
に (に) 格助詞
応じ (おうじ) 自立
て (て) 接続助詞
大きく (おおきく) 自立
に (に) 格助詞
なり (なり) 自立
ます (ます) *
。 (。) 句点

Rather than words like so:

それ
に
応じて
大きく
に
なります
。

Is there a way I can use a parameter to get MeCab to output the latter? I am very new to coding so would appreciate it if you explain simply. Thanks.

There is 1 best solution below

Ahmed Fasih

This is actually pretty hard to do. MeCab, Kuromoji, Sudachi, KyTea, and Rakuten-MA, along with the dictionaries they consume (IPADIC, UniDic, NEologd, etc.), all parse text into morphemes, the smallest units of meaning, rather than into what you call "words", which, as your example shows, often contain multiple morphemes.
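
To make the gap concrete, here is a minimal sketch of the naive fix: glue each auxiliary verb (助動詞) and conjunctive particle (接続助詞) onto the verb or adjective right before it. This is a hand-rolled heuristic, not a MeCab or NMeCab feature. `Morpheme` and `MergeIntoWords` are hypothetical names, and I am assuming IPADIC-style tags where the top-level part of speech is `features(0)` and the sub-category is `features(1)`. The sample list is typed in by hand (its top-level tags are my assumption about what the parser returns) so the sketch runs without the parser.

```vbnet
Imports System
Imports System.Collections.Generic

Module MergeSketch

    ' Hypothetical container for the fields this sketch needs from one MeCab node.
    Class Morpheme
        Public ReadOnly Surface As String
        Public ReadOnly Pos As String      ' assumed: features(0), e.g. 動詞
        Public ReadOnly SubPos As String   ' features(1), e.g. 接続助詞

        Public Sub New(surface As String, pos As String, subPos As String)
            Me.Surface = surface
            Me.Pos = pos
            Me.SubPos = subPos
        End Sub
    End Class

    ' Heuristic: append auxiliary verbs and conjunctive particles to the
    ' verb or adjective that precedes them; everything else stays separate.
    Function MergeIntoWords(morphemes As IEnumerable(Of Morpheme)) As List(Of String)
        Dim words As New List(Of String)()
        Dim previousIsInflected As Boolean = False

        For Each m As Morpheme In morphemes
            Dim attaches As Boolean =
                (m.Pos = "助動詞") OrElse (m.Pos = "助詞" AndAlso m.SubPos = "接続助詞")

            If attaches AndAlso previousIsInflected Then
                ' Extend the last word; it still counts as inflected afterwards.
                words(words.Count - 1) = words(words.Count - 1) & m.Surface
            Else
                words.Add(m.Surface)
                previousIsInflected = (m.Pos = "動詞" OrElse m.Pos = "形容詞")
            End If
        Next

        Return words
    End Function

    Sub Main()
        ' Hand-typed stand-in for the parser output shown in the question.
        Dim sentence As New List(Of Morpheme) From {
            New Morpheme("それ", "名詞", "代名詞"),
            New Morpheme("に", "助詞", "格助詞"),
            New Morpheme("応じ", "動詞", "自立"),
            New Morpheme("て", "助詞", "接続助詞"),
            New Morpheme("大きく", "形容詞", "自立"),
            New Morpheme("に", "助詞", "格助詞"),
            New Morpheme("なり", "動詞", "自立"),
            New Morpheme("ます", "助動詞", "*"),
            New Morpheme("。", "記号", "句点")
        }

        Console.WriteLine(String.Join(" | ", MergeIntoWords(sentence)))
    End Sub
End Module
```

For this one sentence it yields the grouping asked for in the question (応じて and なります come out as single items), but the rule is crude: it knows nothing about compound verbs, suffixes, or set phrases, and it breaks as soon as the dictionary uses different tags.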

There are a few strategies that people usually combine to improve on this.

  1. Experiment with different dictionaries. I've noticed that UniDic is sometimes more consistent than IPADIC.
  2. Use a bunsetsu chunker like J.DepP, which consumes the output of MeCab and chunks morphemes together into bunsetsu. Per this paper, "We use the notion of a bunsetsu which roughly corresponds to a minimum phrase in English and consists of a content words (basically nouns or verbs) and the functional words surrounding them." The bunsetsu output by J.DepP often correspond to "words". I personally don't think of, say, a noun + particle phrase as a "word", but you might; those two usually end up in a single bunsetsu. (J.DepP is also pretty fancy, in that it outputs a dependency tree between bunsetsu, so you can see which one modifies or is secondary to which other one. See my example.)
  3. A last technique that you shouldn't overlook is scanning the dictionary (JMdict) for runs of adjacent morphemes; this helps find idioms or set phrases. It can get complicated because the dictionary may have a deconjugated form of a phrase in your sentence, so you might have to search both the literal sentence form and the deconjugated (lemma) form of MeCab output.
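
The dictionary scan in item 3 can be sketched roughly as follows. `FindRuns`, its parameters, and the one-entry dictionary are hypothetical stand-ins; a real version would load JMdict and take the surface and lemma forms from the MeCab output. Trying the longest run first is my own simplifying assumption, not the only way to do it. For each run, the sketch checks both the literal text and the text with the last morpheme swapped for its lemma.

```vbnet
Imports System
Imports System.Collections.Generic

Module DictionaryRunSketch

    ' Hypothetical helper: returns dictionary entries that span two or more
    ' adjacent morphemes. surfaces(i) is the text as written and lemmas(i) is
    ' the dictionary (deconjugated) form of the same morpheme.
    Function FindRuns(surfaces As IList(Of String),
                      lemmas As IList(Of String),
                      knownPhrases As ISet(Of String),
                      maxLength As Integer) As List(Of String)
        Dim found As New List(Of String)()
        Dim start As Integer = 0

        While start < surfaces.Count
            Dim matchedLength As Integer = 0
            Dim longest As Integer = Math.Min(maxLength, surfaces.Count - start)

            ' Try the longest run first so that longer phrases win.
            For runLength As Integer = longest To 2 Step -1
                Dim last As Integer = start + runLength - 1
                Dim prefix As String = ""
                For k As Integer = start To last - 1
                    prefix &= surfaces(k)
                Next

                ' Check the sentence form, then the form ending in the lemma.
                Dim literal As String = prefix & surfaces(last)
                Dim deconjugated As String = prefix & lemmas(last)

                If knownPhrases.Contains(literal) Then
                    found.Add(literal)
                    matchedLength = runLength
                    Exit For
                ElseIf knownPhrases.Contains(deconjugated) Then
                    found.Add(deconjugated)
                    matchedLength = runLength
                    Exit For
                End If
            Next

            ' Skip past a matched run, otherwise move on by one morpheme.
            start += If(matchedLength > 0, matchedLength, 1)
        End While

        Return found
    End Function

    Sub Main()
        ' Hand-typed morphemes for 気をつけて; a real program would take these from MeCab.
        Dim surfaces As New List(Of String) From {"気", "を", "つけ", "て"}
        Dim lemmas As New List(Of String) From {"気", "を", "つける", "て"}
        ' One-entry stand-in for a real dictionary such as JMdict.
        Dim knownPhrases As New HashSet(Of String) From {"気をつける"}

        For Each entry As String In FindRuns(surfaces, lemmas, knownPhrases, 4)
            Console.WriteLine(entry)
        Next
    End Sub
End Module
```

Here the literal run 気をつけ is not in the dictionary, but swapping the last morpheme for its lemma gives 気をつける, which is, so the three morphemes are reported as one entry. Only the final morpheme is deconjugated in this sketch; phrases that inflect in the middle would need more work.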

I have an open-source package called Curtiz that combines all of the above: it runs text through MeCab, chunks the morphemes into bunsetsu with J.DepP to find groups that belong together, identifies vocabulary by looking it up in the dictionary, separates particles and conjugated phrases, etc. It is likely not going to be useful for you, since I use it to support my own Japanese study and to build Japanese learning tools, but it shows how the above pieces can be combined to get what you need in Japanese NLP.

Hopefully that's helpful. I'm happy to elaborate more on any of the above topics.