I'm trying to train a specific chunker (let's say a noun chunker for simplicity) using NLTK's brill module. I'd like to use three features, i.e. word, POS tag, and IOB tag.
Ramshaw and Marcus (1995: 7) showed 100 templates generated from combinations of those three features, for example:
W0, P0, T0      # current word, POS tag, IOB tag
W-1, P0, T-1    # previous word, current POS tag, previous IOB tag
...
I want to incorporate them into nltk.tbl.feature, but there are only two kinds of feature objects, i.e. brill.Word and brill.Pos. Limited by that design, I could only put the word and POS features together as (word, pos), and thus used ((word, pos), iob) as the features for training. For example:
from nltk.tbl import Template
from nltk.tag import brill, brill_trainer, untag
from nltk.corpus import treebank_chunk
from nltk.chunk.util import tree2conlltags, conlltags2tree

# Code from (Perkins, 2013)
def train_brill_tagger(initial_tagger, train_sents, **kwargs):
    templates = [
        brill.Template(brill.Word([0])),
        brill.Template(brill.Pos([-1])),
        brill.Template(brill.Word([-1])),
        brill.Template(brill.Word([0]), brill.Pos([-1])),
        ]
    trainer = brill_trainer.BrillTaggerTrainer(initial_tagger, templates, trace=3)
    return trainer.train(train_sents, **kwargs)

# Generate ((word, pos), iob) pairs as features.
def chunk_trees2train_chunks(chunk_sents):
    tag_sents = [tree2conlltags(sent) for sent in chunk_sents]
    return [[((w, t), c) for (w, t, c) in sent] for sent in tag_sents]
>>> from nltk.tag import DefaultTagger
>>> tagger = DefaultTagger('NN')
>>> train = treebank_chunk.chunked_sents()[:2]
>>> t = chunk_trees2train_chunks(train)
>>> bt = train_brill_tagger(tagger, t)
TBL train (fast) (seqs: 2; tokens: 31; tpls: 4; min score: 2; min acc: None)
Finding initial useful rules...
Found 79 useful rules.
           B      |
   S   F   r   O  |        Score = Fixed - Broken
   c   i   o   t  |  R     Fixed = num tags changed incorrect -> correct
   o   x   k   h  |  u     Broken = num tags changed correct -> incorrect
   r   e   e   e  |  l     Other = num tags changed incorrect -> incorrect
   e   d   n   r  |  e
------------------+-------------------------------------------------------
  12  12   0  17  | NN->I-NP if Pos:NN@[-1]
   3   3   0   0  | I-NP->O if Word:(',', ',')@[0]
   2   2   0   0  | I-NP->B-NP if Word:('the', 'DT')@[0]
   2   2   0   0  | I-NP->O if Word:('.', '.')@[0]
As shown above, (word, pos) is treated as a single feature, which does not capture the three features (word, POS tag, IOB tag) separately.
- Is there any other way to implement the word, POS, and IOB features separately in nltk.tbl.feature?
- If that is impossible in NLTK, are there other implementations in Python? I was only able to find C++ and Java implementations on the internet.
The nltk3 brill trainer api (I wrote it) does handle training on sequences of tokens described with multidimensional features, as your data is an example of. However, the practical limits may be severe. The number of possible templates in multidimensional learning increases drastically, and the current nltk implementation of the brill trainer trades memory for speed, similar to Ramshaw and Marcus 1994, "Exploring the statistical derivation of transformation-rule sequences...". Memory consumption may be HUGE and it is very easy to give the system more data and/or templates than it can handle. A useful strategy is to rank templates according to how often they produce good rules (see print_template_statistics() in the example below). Usually, you can discard the lowest-scoring fraction (say 50-90%) with little or no loss in performance and a major decrease in training time.
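For example, a minimal sketch of that workflow on a plain POS corpus (the corpus slice, template set, and max_rules are arbitrary illustrations, not recommendations):

from nltk.corpus import treebank
from nltk.tbl import Template
from nltk.tag.brill import Word, Pos
from nltk.tag import UnigramTagger, DefaultTagger, BrillTaggerTrainer

train_sents = treebank.tagged_sents()[:500]
initial = UnigramTagger(train_sents, backoff=DefaultTagger('NN'))
templates = [
    Template(Word([0])),
    Template(Pos([-1])),
    Template(Pos([-1]), Word([0])),
]
trainer = BrillTaggerTrainer(initial, templates, trace=1)
tagger = trainer.train(train_sents, max_rules=100)

# Rank templates by how often they produced good rules; the lowest
# scorers are candidates for pruning before a larger training run.
tagger.print_template_statistics(printunused=False)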
Another or additional possibility is to use the nltk implementation of Brill's original algorithm, which has very different memory-speed tradeoffs; it does no indexing and so will use much less memory. It uses some optimizations and is actually rather quick in finding the very best rules, but is generally extremely slow towards the end of training, when there are many competing, low-scoring candidates. Sometimes you don't need those, anyway. For some reason this implementation seems to have been omitted from newer nltks, but here is the source (I just tested it): http://www.nltk.org/_modules/nltk/tag/brill_trainer_orig.html
There are other algorithms with other tradeoffs; in particular, the fast memory-efficient indexing algorithms of Florian and Ngai 2000 (http://www.aclweb.org/anthology/N/N01/N01-1006.pdf) and the probabilistic rule sampling of Samuel 1998 (https://www.aaai.org/Papers/FLAIRS/1998/FLAIRS98-045.pdf) would be useful additions. Also, as you noticed, the documentation is incomplete and too focused on part-of-speech tagging, and it is not clear how to generalize from it. Fixing the docs is (also) on the todo list.
However, interest in generalized (non-POS-tagging) tbl in nltk has been rather limited (the totally unsuited api of nltk2 was untouched for 10 years), so don't hold your breath. If you get impatient, you may wish to check out more dedicated alternatives, in particular mutbl and fntbl (google them, I only have reputation for two links).
Anyway, here is a quick sketch for nltk:
First, a hardcoded convention in nltk is that tagged sequences ('tags' meaning any label you would like to assign to your data, not necessarily part-of-speech) are represented as sequences of pairs, [(token1, tag1), (token2, tag2), ...]. The tags are strings; in many basic applications, so are the tokens. For instance, the tokens may be words and the tags their POS, as in:
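# an illustrative POS-tagged sentence
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
 ('completely', 'RB'), ('different', 'JJ')]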
(As an aside, this sequence-of-token-tag-pairs convention is pervasive in nltk and its documentation, but it would arguably be better expressed as named tuples rather than pairs, so that instead of saying
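[token for (token, _tag) in tagged_sequence]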
you could say for instance
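[x.token for x in tagged_sequence]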
The first case fails on non-pairs, but the second exploits duck typing so that tagged_sequence could be any sequence of user-defined instances, as long as they have an attribute "token".)
Now, you could well have a richer representation of what a token is at your disposal. An existing tagger interface (nltk.tag.api.FeaturesetTaggerI) expects each token to be a featureset rather than a string, i.e. a dictionary that maps feature names to feature values for each item in the sequence.
A tagged sequence may then look like
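# feature names and values here are made up for illustration;
# the tag position holds whatever attribute you are predicting
[({'word': 'Pierre', 'suffix': 're'}, 'NNP'),
 ({'word': 'Vinken', 'suffix': 'en'}, 'NNP'),
 ...]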
There are other possibilities (though with less support in the rest of nltk). For instance, you could have a named tuple for each token, or a user-defined class which allows you to add any amount of dynamic calculation to attribute access (perhaps using @property to offer a consistent interface).
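For example (a sketch; the class and attribute names are invented):

from collections import namedtuple

Token = namedtuple('Token', ['word', 'pos'])

# or, with dynamic computation on attribute access:
class LazyToken:
    def __init__(self, word, pos):
        self.word = word
        self.pos = pos

    @property
    def suffix(self):
        # computed on demand rather than stored
        return self.word[-3:]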
The brill tagger doesn't need to know what view you currently provide on your tokens. However, it does require you to provide an initial tagger which can map sequences of tokens-in-your-representation to sequences of tags. You cannot use the existing taggers in nltk.tag.sequential directly, since they expect [(word, tag), ...]. But you may still be able to exploit them. The example below uses this strategy (in MyInitialTagger), and the token-as-featureset-dictionary view.
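A minimal sketch along those lines (the Suffix feature, corpus slice, and keyword arguments are illustrative choices):

from nltk.corpus import treebank
from nltk.tbl import Template
from nltk.tbl.feature import Feature
from nltk.tag import UnigramTagger, DefaultTagger, brill_trainer

# Features that read named attributes off the token-as-featureset.
# Tokens reach the feature as (token, tag) pairs, hence the [0].
class Word(Feature):
    @staticmethod
    def extract_property(tokens, index):
        return tokens[index][0]['word']

class Suffix(Feature):
    @staticmethod
    def extract_property(tokens, index):
        return tokens[index][0]['suffix']

def featureset_corpus(tagged_sents):
    # [(word, tag), ...] -> [({'word': ..., 'suffix': ...}, tag), ...]
    return [[({'word': w, 'suffix': w[-2:]}, t) for (w, t) in sent]
            for sent in tagged_sents]

class MyInitialTagger(UnigramTagger):
    # let an ordinary sequential tagger see plain words, not featuresets
    def choose_tag(self, tokens, index, history):
        tokens_ = [t['word'] for t in tokens]
        return super().choose_tag(tokens_, index, history)

plain_sents = treebank.tagged_sents()[:500]
train_sents = featureset_corpus(plain_sents)

initial_tagger = MyInitialTagger(plain_sents, backoff=DefaultTagger('NN'))
templates = [
    Template(Word([0])),
    Template(Suffix([0])),
    Template(Word([-1]), Word([0])),
]
trainer = brill_trainer.BrillTaggerTrainer(initial_tagger, templates, trace=3)
tagger = trainer.train(train_sents, max_rules=50)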
The setup above builds a POS tagger. If you instead wish to target another attribute, say to build an IOB tagger, you need a couple of small changes so that the target attribute (which you can think of as read-write) is accessed from the 'tag' position in your corpus [(token, tag), ...] and any other attributes (which you can think of as read-only) are accessed from the 'token' position. For instance:
1) construct your corpus [(token, tag), (token, tag), ...] for IOB tagging, with the IOB tag in the tag position
2) change the initial tagger accordingly
3) modify the feature-extracting class definitions, as in the sketch below
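A minimal sketch under the same conventions (the initial tagger here simply guesses the most frequent IOB tag for each POS; names and parameters are again illustrative):

from nltk.corpus import treebank_chunk
from nltk.chunk.util import tree2conlltags
from nltk.tbl import Template
from nltk.tbl.feature import Feature
from nltk.tag import UnigramTagger, DefaultTagger, brill_trainer

# Read-only attributes live in the token featureset...
class Word(Feature):
    @staticmethod
    def extract_property(tokens, index):
        return tokens[index][0]['word']

class Pos(Feature):
    @staticmethod
    def extract_property(tokens, index):
        return tokens[index][0]['pos']

# ...while the read-write IOB tag sits in the tag position.
def iob_corpus(chunk_sents):
    conll = [tree2conlltags(sent) for sent in chunk_sents]
    return [[({'word': w, 'pos': p}, iob) for (w, p, iob) in sent]
            for sent in conll]

class MyInitialTagger(UnigramTagger):
    # initial guess: the most frequent IOB tag for each POS
    def choose_tag(self, tokens, index, history):
        tokens_ = [t['pos'] for t in tokens]
        return super().choose_tag(tokens_, index, history)

train_sents = iob_corpus(treebank_chunk.chunked_sents()[:500])
pos2iob = [[(tok['pos'], iob) for (tok, iob) in sent] for sent in train_sents]
initial_tagger = MyInitialTagger(pos2iob, backoff=DefaultTagger('O'))

templates = [
    Template(Word([0])),
    Template(Pos([-1])),
    Template(Word([0]), Pos([-1])),
]
trainer = brill_trainer.BrillTaggerTrainer(initial_tagger, templates, trace=3)
iob_tagger = trainer.train(train_sents, max_rules=100)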