CoreNLP API equivalent to command line?


For one of our projects, we currently use the syntax analysis component from the command line. We want to move to the CoreNLP server instead (for better performance).

Our command line options are as follows:

java -mx4g -cp "$scriptdir/*:" edu.stanford.nlp.parser.lexparser.LexicalizedParser -tokenized -escaper edu.stanford.nlp.process.PTBEscapingProcessor  -sentences newline -tokenized -tagSeparator / -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerMethod newCoreLabelTokenizerFactory  -outputFormat "wordsAndTags,typedDependenciesCollapsed"

I've tried a few things, but I couldn't find the right options when using the CoreNLP API (from Python).

For instance, how do I specify that the text is already tokenised?

I would really appreciate any help.

1 Answer

In general, the server calls into CoreNLP rather than the individual NLP components, so the documentation on CoreNLP may be useful. The body of the text being annotated is sent to the server as the POST body; the properties are passed in as URL params. For example, for your case, I believe the following curl command should do the trick (and should be easy to adapt to the language of your choice):

curl -X POST -d "it's split on whitespace" \
  'http://localhost:9000/?annotators=tokenize,ssplit,pos,parse&tokenize.whitespace=true&ssplit.eolonly=true'

Note that we're just passing the following properties into the server:

  • annotators = tokenize,ssplit,pos,parse (specifies that we want the parser, and all its prerequisites).
  • tokenize.whitespace = true will use the whitespace tokenizer (i.e., the text is treated as already tokenised).
  • ssplit.eolonly = true will split sentences on and only on newlines.
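Since the question mentions Python, the curl command above can be reproduced with the standard library alone. This is a minimal sketch, assuming a CoreNLP server is already running on localhost:9000; the helper names (`build_url`, `annotate`) are mine, not part of any CoreNLP client library:

```python
import json
import urllib.parse
import urllib.request

def build_url(host="http://localhost:9000"):
    # Same properties as the curl example, passed as URL params.
    params = {
        "annotators": "tokenize,ssplit,pos,parse",
        "tokenize.whitespace": "true",  # text is pre-tokenized on whitespace
        "ssplit.eolonly": "true",       # one sentence per line
        "outputFormat": "json",         # easier to consume from Python
    }
    return host + "/?" + urllib.parse.urlencode(params)

def annotate(text, host="http://localhost:9000"):
    # The text to annotate goes in the POST body, UTF-8 encoded.
    req = urllib.request.Request(
        build_url(host), data=text.encode("utf-8"), method="POST"
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

if __name__ == "__main__":
    doc = annotate("it's split on whitespace")
    for sentence in doc["sentences"]:
        print(sentence["parse"])
```

With `outputFormat=json`, each sentence object carries the constituency parse under `parse` and the dependency structures under keys such as `basicDependencies`.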

Other potentially useful options are documented on the parser annotator page.
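For completeness, the server itself can be launched from the same classpath as the original command line, mirroring its 4 GB heap setting (the `$scriptdir` variable is from the question's own script):

```shell
# Start the CoreNLP server on port 9000 with a 4 GB heap.
java -mx4g -cp "$scriptdir/*:" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
```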