Decrypting SENNA Chunk, SRL and Parser Output

SENNA is an NLP tool built using neural nets, and it can do:

  • POS tagging
  • NER tagging
  • Chunk tagging
  • Semantic Role Label tagging and
  • Parsing

After downloading the pre-compiled package from http://ml.nec-labs.com/senna/download.html, I ran it with --help to see what options are available:

alvas@ubi:~/senna$ ./senna-linux64 --help
invalid argument: --help

SENNA Tagger (POS - CHK - NER - SRL)
(c) Ronan Collobert 2009

Usage: ./senna-linux64 [options]

 Takes sentence (one line per sentence) on stdin
 Outputs tags on stdout
 Typical usage: ./senna-linux64 [options] < inputfile.txt > outputfile.txt

Display options:
  -h             Display this help
  -verbose       Display model informations on stderr
  -notokentags   Do not output tokens
  -offsettags    Output start/end offset of each token
  -iobtags       Output IOB tags instead of IOBES
  -brackettags   Output 'bracket' tags instead of IOBES

Data options:
  -path <path>   Path to the SENNA data/ and hash/ directories [default: ./]

Input options:
  -usrtokens     Use user's tokens (space separated) instead of SENNA tokenizer

SRL options:
  -posvbs        Use POS verbs instead of SRL style verbs for SRL task
  -usrvbs <file> Use user's verbs (given in <file>) instead of SENNA verbs for SRL task

Tagging options:
  -pos           Output POS
  -chk           Output CHK
  -ner           Output NER
  -srl           Output SRL
  -psg           Output PSG

The command-line interface is straightforward, and the outputs for POS and NER tags are also easy to interpret.

Given this input:

alvas@ubi:~/senna$ cat test.in
Foo went to eat bar at the Foobar.

The POS output uses the standard Penn Treebank tagset:

alvas@ubi:~/senna$ ./senna-linux64 -pos < test.in
            Foo        NNP
           went        VBD
             to         TO
            eat         VB
            bar         NN
             at         IN
            the         DT
         Foobar        NNP
              .          .

And the NER output uses a BIO-style tagset (IOBES by default, per the help text above):

alvas@ubi:~/senna$ ./senna-linux64 -ner < test.in
            Foo      S-PER
           went          O
             to          O
            eat          O
            bar          O
             at          O
            the          O
         Foobar      S-LOC
              .          O

The chunking output also uses a variant of the BIOE tagset we're used to:

alvas@ubi:~/senna$ ./senna-linux64 -chk < test.in
            Foo       S-NP
           went       B-VP
             to       I-VP
            eat       E-VP
            bar       S-NP
             at       S-PP
            the       B-NP
         Foobar       E-NP
              .          O

But what do the S- tags mean? They seem to be attached only to tokens that form single-token chunks; is that true?

The SRL tags are a little odd, with multiple annotations per token:

alvas@ubi:~/senna$ ./senna-linux64 -srl < test.in
            Foo               -       S-A1        S-A0
           went            went        S-V           O
             to               -   B-AM-PNC           O
            eat             eat   I-AM-PNC         S-V
            bar               -   I-AM-PNC        S-A1
             at               -   I-AM-PNC    B-AM-LOC
            the               -   I-AM-PNC    I-AM-LOC
         Foobar               -   E-AM-PNC    E-AM-LOC
              .               -          O           O

They look like the "tuple-like" outputs we get from semantic frames, but I don't understand the conventions, e.g. what is -AM-? What is -PNC?

What does the output mean and how should we interpret it?

And for the Parser output:

alvas@ubi:~/senna$ ./senna-linux64 -psg < test.in
            Foo (S1(S(NP*)
           went (VP*
             to (S(VP*
            eat (VP*
            bar (ADVP*)
             at (PP*
            the (NP*
         Foobar *))))))
              . *))

It looks like the usual bracketed parse notation, but what does the * mean?

Best answer

SENNA uses the CoNLL format: one token per line, with whitespace-separated columns for the annotations. You can read about the CoNLL-U variant of it here: http://universaldependencies.github.io/docs/format.html

It's rather common and there are plenty of converters around.
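If you would rather not pull in a converter, the column layout is simple enough to read by hand. The following is only a minimal sketch: `read_columns` is a hypothetical helper name, and it assumes that sentences in the output are separated by blank lines and that tokens contain no whitespace.

```python
def read_columns(text):
    """Split column-formatted tagger output into sentences of rows.

    Assumption: one token per line, whitespace-separated columns,
    and a blank line between sentences.
    """
    sentences, current = [], []
    for line in text.splitlines():
        fields = line.split()
        if fields:
            current.append(fields)
        elif current:
            # A blank line closes the current sentence.
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


# A shortened copy of the -pos output from the question.
sample = """\
            Foo        NNP
           went        VBD
              .          .
"""

for token, tag in read_columns(sample)[0]:
    print(token, tag)
```

Each row is a list whose first element is the token and whose remaining elements are the tag columns, so the same helper should also work for the -chk, -srl and -psg outputs.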

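As for the prefixes: S- marks a single-token (singleton) expression, while B-, I- and E- mark the beginning, an intermediate token and the end of a multi-word expression. O marks a token outside any expression. So yes, S- is only attached to single-token chunks.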
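To make the prefix scheme concrete, here is a small sketch that groups IOBES-tagged tokens back into chunks. The function name `iobes_to_spans` is made up for illustration, and the tokens and tags are copied from the -chk output in the question.

```python
def iobes_to_spans(tokens, tags):
    """Group IOBES-tagged tokens into (label, [tokens]) spans."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        if tag == "O":
            start, label = None, None
            continue
        # "B-AM-LOC" splits into prefix "B" and name "AM-LOC".
        prefix, _, name = tag.partition("-")
        if prefix == "S":
            spans.append((name, tokens[i:i + 1]))
            start, label = None, None
        elif prefix == "B":
            start, label = i, name
        elif prefix == "E" and start is not None and name == label:
            spans.append((name, tokens[start:i + 1]))
            start, label = None, None
        # An I- tag simply continues the span opened by B-.
    return spans


tokens = "Foo went to eat bar at the Foobar .".split()
tags = ["S-NP", "B-VP", "I-VP", "E-VP", "S-NP", "S-PP", "B-NP", "E-NP", "O"]

for name, words in iobes_to_spans(tokens, tags):
    print(name, " ".join(words))
# NP Foo
# VP went to eat
# NP bar
# PP at
# NP the Foobar
```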

Then there is the output of the semantic role labeling. Look for more information on SRL, as this gets a little more complex. Notice there are two tag columns, one per predicate: the first for went (the verb go) and the second for eat. Usually A0 is the subject and A1 the direct object (again, oversimplified). AM marks an argument modifier (an adjunct), and the suffix gives its type: -LOC is a location, and -PNC is PropBank's "purpose, not cause" label, so "to eat bar at the Foobar" is tagged as the purpose of going. Examples here: verbs.colorado.edu/propbank/framesets-english/go-v.html

As for the parse tree, it's bracketed, a common notation loosely inspired by Lisp. The * is a placeholder for the current token: substitute the token for the * on each line and concatenate the lines to get the usual bracketed tree. I found this useful: https://math.stackexchange.com/questions/588230/how-to-convert-parentheses-notation-for-trees-into-an-actual-tree-drawing
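Both readings can be checked with a few lines of code. This is a rough sketch, not part of SENNA: `decode_iobes`, `srl_frames` and `to_bracketed` are hypothetical helper names, and the rows are typed in by hand from the -srl and -psg outputs in the question. It assumes the k-th tag column belongs to the k-th verb listed in the second column.

```python
def decode_iobes(tokens, tags):
    """Group IOBES-tagged tokens into (label, text) spans."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B":
            start = i
        elif prefix == "S":
            spans.append((label, tokens[i]))
        elif prefix == "E" and start is not None:
            spans.append((label, " ".join(tokens[start:i + 1])))
            start = None
        elif prefix == "O":
            start = None
    return spans


def srl_frames(rows):
    """Return one (verb, spans) pair per predicate column."""
    tokens = [r[0] for r in rows]
    verbs = [r[1] for r in rows if r[1] != "-"]
    # Assumption: tag column k (0-based) belongs to the k-th verb.
    return [(verb, decode_iobes(tokens, [r[2 + k] for r in rows]))
            for k, verb in enumerate(verbs)]


def to_bracketed(rows):
    """Substitute each token for the * in its parse fragment."""
    return "".join(frag.replace("*", " " + tok) for tok, frag in rows)


SRL_ROWS = [
    ["Foo", "-", "S-A1", "S-A0"],
    ["went", "went", "S-V", "O"],
    ["to", "-", "B-AM-PNC", "O"],
    ["eat", "eat", "I-AM-PNC", "S-V"],
    ["bar", "-", "I-AM-PNC", "S-A1"],
    ["at", "-", "I-AM-PNC", "B-AM-LOC"],
    ["the", "-", "I-AM-PNC", "I-AM-LOC"],
    ["Foobar", "-", "E-AM-PNC", "E-AM-LOC"],
    [".", "-", "O", "O"],
]
PSG_ROWS = [
    ["Foo", "(S1(S(NP*)"], ["went", "(VP*"], ["to", "(S(VP*"],
    ["eat", "(VP*"], ["bar", "(ADVP*)"], ["at", "(PP*"],
    ["the", "(NP*"], ["Foobar", "*))))))"], [".", "*))"],
]

for verb, spans in srl_frames(SRL_ROWS):
    print(verb, spans)
# went [('A1', 'Foo'), ('V', 'went'), ('AM-PNC', 'to eat bar at the Foobar')]
# eat [('A0', 'Foo'), ('V', 'eat'), ('A1', 'bar'), ('AM-LOC', 'at the Foobar')]

print(to_bracketed(PSG_ROWS))
# (S1(S(NP Foo)(VP went(S(VP to(VP eat(ADVP bar)(PP at(NP the Foobar)))))) .))
```

The bracketed string can then be handed to any tool that reads Penn-style trees, keeping in mind that the leaves here carry no POS tags.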