Sentence detection with Apache OpenNLP - removing headers, unterminated sentences etc


I am using Apache OpenNLP v2.2.0 and am attempting to detect sentences from the following sort of text:

April 2023
This is header one

The quick brown fox jumps.  Over the lazy dog a fox is seen to jump!  Does the dog notice?

I would like any text that ends at an EOL character (\r, \n, or \r\n) without terminating punctuation to be rejected, so the first two lines of the example above would not be included. I've tried training OpenNLP on properly terminated sentences, but it made no difference.

The first sentence returned is:

April 2023
This is header one

The quick brown fox jumps.

I can, of course, scan every returned sentence for EOLs with no preceding punctuation and trim the offending lines. But this cannot be a rare requirement, and since OpenNLP has already parsed the text, running over it a second time seems inefficient.
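For reference, the post-processing fallback I have in mind looks roughly like the sketch below. It is plain Java and does not call OpenNLP at all; it assumes the input is the String[] returned by the sentence detector's sentDetect call, and that "properly terminated" means ending in '.', '!' or '?'. The class and method names (SentenceCleaner, stripUnterminatedLines, clean) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: trims header-like lines
// out of sentences that a sentence detector has already returned.
class SentenceCleaner {

    // Assumption: a line is "properly terminated" if it ends in '.', '!' or '?'.
    static boolean isTerminated(String line) {
        if (line.isEmpty()) {
            return false;
        }
        char last = line.charAt(line.length() - 1);
        return last == '.' || last == '!' || last == '?';
    }

    // Splits one detected sentence on any line break (\r, \n or \r\n) and
    // discards everything up to and including the last line that reaches
    // an EOL without terminating punctuation.
    static String stripUnterminatedLines(String sentence) {
        StringBuilder kept = new StringBuilder();
        for (String line : sentence.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines between blocks
            }
            if (kept.length() > 0) {
                kept.append(' ');
            }
            kept.append(trimmed);
            if (!isTerminated(trimmed)) {
                kept.setLength(0); // unterminated line: drop it and whatever preceded it
            }
        }
        return kept.toString();
    }

    // Cleans every detected sentence and drops those left empty.
    static List<String> clean(String[] detected) {
        List<String> result = new ArrayList<>();
        for (String sentence : detected) {
            String cleaned = stripUnterminatedLines(sentence);
            if (!cleaned.isEmpty()) {
                result.add(cleaned);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for the detector output shown in the question.
        String[] detected = {
            "April 2023\nThis is header one\n\nThe quick brown fox jumps.",
            "Over the lazy dog a fox is seen to jump!",
            "Does the dog notice?"
        };
        clean(detected).forEach(System.out::println);
    }
}
```

Note that this rule also discards the first part of a genuine sentence that is hard-wrapped across lines, so it only suits text where a line break without punctuation always marks a header or fragment.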

Is there any way to configure OpenNLP to get this done?
