Cannot access terminal labels of Berkeley Neural Parser


I'm having a very simple problem using the Berkeley Neural Parser. I would like to retrieve the category label of each constituent of a sentence, using the ._.labels property of benepar:


import spacy, benepar

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe('benepar', config={'model': 'benepar_en3'})
doc = nlp('This red apple is tasty')

sent = list(doc.sents)[0]
tree = sent._.parse_string  # bracketed parse string for the sentence

# iterate over all constituents and print each span with its label(s)
for child in sent._.constituents:
    print(child)
    print(child._.labels)

For many nodes, the category label is empty. This is the output I get:

This red apple is tasty
('S',)
This red apple
('NP',)
This
()
red
()
apple
()
is tasty
('VP',)
is
()
tasty
('ADJP',)

What is missing, in particular, are the terminal labels (i.e., the lowest, most deeply embedded ones). Benepar has a demo website that lets you test-parse sentences. (Note that this website gets flagged as a security risk by my browser for some reason. I assume it is safe, but there's no need to visit it if you'd rather not take the chance.) According to the website, this is the parse benepar should generate:

[Image: a parse of the sentence above as generated by benepar]

As you can see, the tree does contain terminal labels (DT, JJ, NN, VBZ), but apparently they cannot be accessed through ._.labels.
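
For what it's worth, spaCy's own fine-grained tags do come out as the same PTB-style labels for this sentence (quick check below, using token.tag_), but those come from spaCy's pipeline rather than from the benepar tree, so I don't know whether the two are guaranteed to agree:

for token in sent:
    # token.tag_ is spaCy's fine-grained (PTB-style) POS tag
    print(token.text, token.tag_)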

Hence, my question: is there something wrong with the ._.labels property or the way I use it? Or is there some other way to access the information I'm looking for?
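
The closest I've come to a workaround is a sketch like the following: re-parse ._.parse_string with NLTK's Tree, since the string itself does contain the terminal labels (this assumes nltk is installed; Tree.fromstring and pos() are standard NLTK API). Still, I'd prefer to read the labels from benepar directly:

import nltk

# rebuild the tree from benepar's bracketed parse string
nltk_tree = nltk.Tree.fromstring(sent._.parse_string)

# pos() returns (word, tag) pairs for the terminals, e.g. ('This', 'DT')
for word, tag in nltk_tree.pos():
    print(word, tag)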

Note that I'm getting some warning messages when running this code. They seem harmless and unrelated, but I'm adding them here just in case:

You are using the legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.

/home/mllenouvelle/.local/lib/python3.10/site-packages/torch/distributions/distribution.py:51: UserWarning: <class 'torch_struct.distributions.TreeCRF'> does not define arg_constraints. Please set arg_constraints = {} or initialize the distribution with validate_args=False to turn off validation.
warnings.warn(f'{self.__class__} does not define arg_constraints. ' +

Thank you!
