TreeTagger can do POS tagging as well as text chunking, which means extracting verbal and nominal clauses, as in this German example:
$ echo 'Das ist ein Test.' | cmd/tagger-chunker-german
reading parameters ...
tagging ...
finished.
<NC>
Das PDS die
</NC>
<VC>
ist VAFIN sein
</VC>
<NC>
ein ART eine
Test NN Test
</NC>
. $. .
I'm trying to do this with treetaggerwrapper in Python (since it's faster than calling TreeTagger directly), but I can't figure out how it's done. The documentation refers to chunking as preprocessing, so I tried using this:
tags = tagger.tag_text(u"Dieser Satz ist ein Satz.", prepronly=True)
But the output is just a list of the words with no information added. I'm starting to think that what the wrapper calls chunking is something different from what the actual tagger calls chunking, but maybe I'm just missing something? Any help would be appreciated.
The original poster is right in his assumptions.
treetaggerwrapper (as of version 2.2.4) defines chunking as merely "preprocessing of text" (the wording used in treetaggerwrapper.py), and does not fully wrap TreeTagger's capabilities in this sense.
But inspecting tagger-chunker-german, one can see that getting clauses and tags is a string of operations, actually calling TreeTagger 3 times:
$ echo 'Das ist ein Test.' | cmd/tree-tagger-german | perl -nae 'if ($#F==0){print} else {print "$F[0]-$F[1]\n"}' | bin/tree-tagger lib/german-chunker.par -token -sgml -eps 0.00000001 -hyphen-heuristics -quiet | cmd/filter-chunker-output-german.perl | bin/tree-tagger -quiet -token -lemma -sgml lib/german-utf8.par
whereas treetaggerwrapper's tagging command (shown in tagcmdlist) is actually a one-shot call (after its own preprocessing of the text) to:
bin/tree-tagger -token -lemma -sgml -quiet -no-unknown lib/german-utf8.par
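The perl one-liner in the middle of the chunker chain only glues each token to its tag before the second TreeTagger call, so that step is easy to reproduce in Python. Here is a minimal sketch of it, assuming the first tagger pass yields whitespace-separated token/tag/lemma lines; the function name join_token_and_tag is made up for illustration and is not part of treetaggerwrapper:

```python
def join_token_and_tag(lines):
    """Mimic the perl step: 'token tag lemma' becomes 'token-tag'.

    Lines with exactly one field (for example SGML tags) pass through unchanged,
    like the `$#F==0` branch of the one-liner.
    """
    out = []
    for line in lines:
        fields = line.split()
        if len(fields) == 1:
            out.append(line)
        else:
            # Pad with empty strings so an empty line yields "-", as perl
            # would with undefined $F[0] and $F[1].
            first, second = (fields + ["", ""])[:2]
            out.append(f"{first}-{second}")
    return out


# Illustrative input, shaped like the output of the first tagging pass.
tagged = ["Das\tPDS\tdie", "ist\tVAFIN\tsein", "ein\tART\teine", "Test\tNN\tTest"]
print(join_token_and_tag(tagged))  # ['Das-PDS', 'ist-VAFIN', 'ein-ART', 'Test-NN']
```

The resulting token-tag strings are what the chain feeds to the second TreeTagger call with german-chunker.par.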
The point of entry to extend it for chunking is the line
"tagparfile": "german-utf8.par",
where you would define something like
"chunkingparfile": "german-chunker.par",
and issue an additional call to TreeTagger with this other parfile, following the tagger-chunker-german operation chain. You'd then probably still have to copy some extra logic from cmd/filter-chunker-output-german.perl, though.
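Until the wrapper supports this, one workaround is to skip it for chunking and run the shell chain from Python, then group the output by chunk tag. The following is only a rough sketch under a few assumptions: TREETAGGER_HOME is a placeholder path for a TreeTagger installation with the cmd/, bin/ and lib/ layout used above, the names chunk_german and parse_chunks are invented here, and every SGML tag in the output is treated as a chunk tag.

```python
import subprocess

# Assumption: placeholder install location; adjust to your own setup.
TREETAGGER_HOME = "/opt/treetagger"

# The chain from tagger-chunker-german, copied from the shell pipeline above.
CHUNK_PIPELINE = (
    "cmd/tree-tagger-german"
    " | perl -nae 'if ($#F==0){print} else {print \"$F[0]-$F[1]\\n\"}'"
    " | bin/tree-tagger lib/german-chunker.par -token -sgml"
    " -eps 0.00000001 -hyphen-heuristics -quiet"
    " | cmd/filter-chunker-output-german.perl"
    " | bin/tree-tagger -quiet -token -lemma -sgml lib/german-utf8.par"
)


def chunk_german(text, home=TREETAGGER_HOME):
    """Run the chunker chain in a shell and return its raw stdout."""
    result = subprocess.run(
        CHUNK_PIPELINE,
        shell=True,          # needed because the command is a shell pipeline
        cwd=home,            # so the relative cmd/, bin/, lib/ paths resolve
        input=text,
        capture_output=True,
        encoding="utf-8",
        check=True,          # only reflects the last command in the pipe
    )
    return result.stdout


def parse_chunks(output):
    """Group token lines by their enclosing tag (<NC>, <VC>, ...).

    Returns a list of (label, [(token, tag, lemma), ...]) pairs. Tokens
    outside any chunk get the label None.
    """
    chunks = []
    open_chunks = []  # token lists of currently open chunks, innermost last
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("</") and line.endswith(">"):
            if open_chunks:
                open_chunks.pop()
        elif line.startswith("<") and line.endswith(">"):
            tokens = []
            chunks.append((line[1:-1], tokens))
            open_chunks.append(tokens)
        else:
            token = tuple(line.split(None, 2))
            if open_chunks:
                open_chunks[-1].append(token)
            else:
                chunks.append((None, [token]))
    return chunks


# Parsing the output shown in the question, without needing TreeTagger:
sample = "<NC>\nDas\tPDS\tdie\n</NC>\n<VC>\nist\tVAFIN\tsein\n</VC>\n.\t$.\t.\n"
for label, tokens in parse_chunks(sample):
    print(label, tokens)
```

With TreeTagger installed, parse_chunks(chunk_german("Das ist ein Test.")) would combine both steps. This spawns a shell pipeline on every call, so it gives up the speed advantage of the wrapper; it is meant as a stopgap, not a replacement.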