I am using an open-source jar (Mate Parser) that outputs the CoNLL 2009 format after dependency parsing. I want to use the dependency parsing results for information extraction, but I only understand part of the output in the CoNLL data format.
Can someone explain the CoNLL data format?
There are many different CoNLL formats, since CoNLL runs a different shared task each year. The format for CoNLL 2009 is described here. Each line represents a single word as a series of tab-separated fields.
Underscores (_) indicate empty values. Mate-Parser's manual says that it uses the first 12 columns of CoNLL 2009:

ID FORM LEMMA PLEMMA POS PPOS FEAT PFEAT HEAD PHEAD DEPREL PDEPREL

The definitions of some of these columns come from earlier shared tasks (the CoNLL-X format used in 2006 and 2007):
- ID (index in sentence, starting at 1)
- FORM (word form itself)
- LEMMA (word's lemma or stem)
- POS (part of speech)
- FEAT (list of morphological features separated by |)
- HEAD (index of syntactic parent, 0 for ROOT)
- DEPREL (syntactic relationship between HEAD and this word)

There are also variants of those columns that start with P (e.g., PPOS, but not POS itself); the prefix indicates that the value was automatically predicted rather than a gold-standard value.

Update: There is now a CoNLL-U data format as well, which extends the CoNLL-X format.
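Since the goal is information extraction, here is a minimal Python sketch of reading the 12-column layout described above and turning each sentence into (head word, relation, dependent word) triples. The function names `parse_conll09` and `dependency_triples` and the sample sentence are made up for illustration and are not part of Mate Parser. It also assumes sentences are separated by blank lines, and it lets you choose between the gold columns (HEAD/DEPREL) and the predicted ones (PHEAD/PDEPREL), since you should check in your own output which pair the parser actually fills in.

```python
# First 12 CoNLL 2009 column names, in file order.
COLUMNS = ["ID", "FORM", "LEMMA", "PLEMMA", "POS", "PPOS",
           "FEAT", "PFEAT", "HEAD", "PHEAD", "DEPREL", "PDEPREL"]


def parse_conll09(text):
    """Split CoNLL 2009 text into sentences, each a list of token dicts.

    Assumes one token per line, tab-separated fields, and a blank line
    between sentences. Columns beyond the first 12 are ignored.
    """
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        current.append(dict(zip(COLUMNS, fields)))
    if current:
        sentences.append(current)
    return sentences


def dependency_triples(sentence, predicted=True):
    """Return (head form, relation, dependent form) triples for one sentence.

    predicted=True reads PHEAD/PDEPREL, predicted=False reads HEAD/DEPREL.
    Tokens whose head is empty ("_") are skipped.
    """
    head_col = "PHEAD" if predicted else "HEAD"
    rel_col = "PDEPREL" if predicted else "DEPREL"
    forms = {tok["ID"]: tok["FORM"] for tok in sentence}
    triples = []
    for tok in sentence:
        head = tok.get(head_col, "_")
        if head == "_":
            continue
        # Head index 0 means the token attaches to the artificial ROOT.
        head_form = "ROOT" if head == "0" else forms.get(head, head)
        triples.append((head_form, tok.get(rel_col, "_"), tok["FORM"]))
    return triples


# Hand-written sample (not real parser output), built with explicit tabs.
SAMPLE = "\n".join("\t".join(row) for row in [
    ["1", "The", "the", "the", "DT", "DT", "_", "_", "2", "2", "NMOD", "NMOD"],
    ["2", "cat", "cat", "cat", "NN", "NN", "_", "_", "3", "3", "SBJ", "SBJ"],
    ["3", "sleeps", "sleep", "sleep", "VBZ", "VBZ", "_", "_", "0", "0", "ROOT", "ROOT"],
])

for sentence in parse_conll09(SAMPLE):
    for triple in dependency_triples(sentence):
        print(triple)
```

Keeping each token as a dict keyed by column name means the rest of your code can refer to `tok["PPOS"]` or `tok["LEMMA"]` instead of positional indexes, which makes it easier to adapt if you later switch to CoNLL-U or another variant with a different column order.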