The target language is Spanish.
The English pipeline has support for typed dependencies whereas the Spanish pipeline, to my knowledge, does not.
The goal is to produce a dependency tree from a TreeAnnotation, where the end result is a list of directed edges. Is this possible with CoreNLP 3.4.1 and the Spanish models? If so, how?
Background
I'm using Stanford CoreNLP 3.4.1 with the 3.5.0 Spanish models for POS tagging (for compatibility reasons, Java 8 cannot be used yet), with the following configuration:
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, ner, parse");
props.setProperty("tokenize.options", "invertible=true,ptb3Escaping=true");
props.setProperty("tokenize.language", "es");
props.setProperty("pos.model", "edu/stanford/nlp/models/pos-tagger/spanish/spanish-distsim.tagger");
props.setProperty("ner.model", "edu/stanford/nlp/models/ner/spanish.ancora.distsim.s512.crf.ser.gz");
props.setProperty("parse.model", "edu/stanford/nlp/models/srparser/spanishSR.ser.gz"); //Stanford Parser 3.4.1 shift-reduce models for Spanish.
props.setProperty("ner.applyNumericClassifiers", "false");
props.setProperty("ner.useSUTime", "false");
These properties are then used to create the pipeline and annotate a document:
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
pipeline.annotate(document);
List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
for (CoreMap sentence : sentences) {
    // ... extract start, end position of sentence ...
    for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
        // ... extract POS tags, NER annotations, id ...
    }
    // This works, and I have a tree that is not empty.
    Tree tree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
}
Using a debugger, I examined both the sentences and the tokens and found that they carry the following annotation keys:
Sentence (keys)
From edu.stanford.nlp.ling.CoreAnnotations:
- TextAnnotation
- CharacterOffsetBeginAnnotation
- CharacterOffsetEndAnnotation
- TokensAnnotation
- TokenBeginAnnotation
- TokenEndAnnotation
- SentenceIndexAnnotation
From edu.stanford.nlp.trees.TreeCoreAnnotations:
- TreeAnnotation
Tokens (keys)
From edu.stanford.nlp.ling.CoreAnnotations:
- TextAnnotation
- OriginalTextAnnotation
- CharacterOffsetBeginAnnotation
- CharacterOffsetEndAnnotation
- BeforeAnnotation
- AfterAnnotation
- IndexAnnotation
- SentenceIndexAnnotation
- PartOfSpeechAnnotation
- NamedEntityTagAnnotation
From edu.stanford.nlp.trees.TreeCoreAnnotations:
- HeadWordAnnotation (in my experiments, this always points to itself, i.e. the token from which the annotation is retrieved)
- HeadTagAnnotation
Thanks in advance!
There is no support for Spanish dependency parsing in CoreNLP at the moment. This includes typed dependency conversion from constituency parses.
A head finder is implemented (but not fully tested). You could hack together an untyped dependency converter using this head finder, but we cannot guarantee that it will yield a sensible parse.
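To illustrate what such an untyped converter might look like, here is a minimal sketch of the usual head-percolation idea: at every constituent, pick a head child, and attach the head token of each remaining child to the head token of that head child. Everything in it is an assumption made for illustration. Node, HeadRule, ToyRule and Edge are hypothetical stand-ins, not CoreNLP classes, and the toy rule and the simplified tags in the sample tree are not the real Spanish head rules. With CoreNLP, the Tree class would take the place of Node, and the head finder's determineHead(Tree) would presumably take the place of HeadRule.headChild. The sketch sticks to Java 7 syntax.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class UntypedDependencySketch {

    /** Hypothetical stand-in for a constituency node; a leaf carries a 1-based token index. */
    static final class Node {
        final String label;
        final int index; // token position for leaves, 0 for internal nodes
        final List<Node> children;

        Node(String label, int index, Node... children) {
            this.label = label;
            this.index = index;
            this.children = Arrays.asList(children);
        }

        boolean isLeaf() { return children.isEmpty(); }
    }

    /** Hypothetical stand-in for a head finder: picks the head child of an internal node. */
    interface HeadRule {
        Node headChild(Node parent);
    }

    /** Toy rule for illustration only: prefer a verbal child, otherwise take the last child. */
    static final class ToyRule implements HeadRule {
        public Node headChild(Node parent) {
            for (Node child : parent.children) {
                if (child.label.startsWith("grup.verb")) return child;
            }
            return parent.children.get(parent.children.size() - 1);
        }
    }

    /** A directed, unlabeled edge between token indices; governor 0 is an artificial root. */
    static final class Edge {
        final int governor;
        final int dependent;

        Edge(int governor, int dependent) {
            this.governor = governor;
            this.dependent = dependent;
        }

        @Override public String toString() { return governor + " -> " + dependent; }
    }

    /** Returns the head token index of the subtree and appends the edges found inside it. */
    static int collect(Node node, HeadRule rule, List<Edge> edges) {
        if (node.isLeaf()) return node.index;
        Node headChild = rule.headChild(node);
        int head = collect(headChild, rule, edges);
        for (Node child : node.children) {
            if (child != headChild) {
                // The head token of every non-head child depends on the constituent's head token.
                edges.add(new Edge(head, collect(child, rule, edges)));
            }
        }
        return head;
    }

    /** Converts a whole tree into a list of directed edges, including one edge from the root. */
    static List<Edge> toEdges(Node root, HeadRule rule) {
        List<Edge> edges = new ArrayList<Edge>();
        int rootHead = collect(root, rule, edges);
        edges.add(new Edge(0, rootHead));
        return edges;
    }

    /** Sample tree with simplified tags: (S (sn (da El) (nc perro)) (grup.verb (vmi ladra))) */
    static Node sampleTree() {
        return new Node("S", 0,
            new Node("sn", 0,
                new Node("da", 0, new Node("El", 1)),
                new Node("nc", 0, new Node("perro", 2))),
            new Node("grup.verb", 0,
                new Node("vmi", 0, new Node("ladra", 3))));
    }

    public static void main(String[] args) {
        for (Edge e : toEdges(sampleTree(), new ToyRule())) {
            System.out.println(e);
        }
    }
}
```

For the toy sentence "El perro ladra" this prints the edges 2 -> 1, 3 -> 2 and 0 -> 3, that is, "El" attaches to "perro", "perro" attaches to "ladra", and "ladra" is the root. The quality of the result depends entirely on the head rule, which is why an untested head finder gives no guarantee of a sensible parse.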