Resolving an Edge Case While Using Java's BreakIterator

I'm working on a side project that applies NLP to clinical data, and I'm using Java's BreakIterator to split text into sentences for further analysis. The problem is that BreakIterator doesn't detect a sentence boundary when the next sentence starts with a number.

Example:

String text = "1) No acute osseous abnormality. 2) Mild to moderate disc space narrowing at the L4-5 level. This is another sentence.";

Expected Output:

1) No acute osseous abnormality.
2) Mild to moderate disc space narrowing at the L4-5 level.
This is another sentence.

Actual Output:

1) No acute osseous abnormality. 2) Mild to moderate disc space narrowing at the L4-5 level.
This is another sentence.

Code:

import java.text.BreakIterator;
import java.util.Locale;

public class Test {
   public static void main(String[] args) {
      String text = "1) No acute osseous abnormality. 2) Mild to moderate disc space narrowing at the L4-5 level. This is another sentence.";
      Locale locale = Locale.US;
      BreakIterator splitIntoSentences = BreakIterator.getSentenceInstance(locale);
      splitIntoSentences.setText(text);
      // Start of the sentence currently being extracted
      int index = 0;
      // Each call to next() advances to the end of the following sentence
      while (splitIntoSentences.next() != BreakIterator.DONE) {
         String sentence = text.substring(index, splitIntoSentences.current());
         System.out.println(sentence);
         // The next sentence starts where this one ended
         index = splitIntoSentences.current();
      }
   }
}

Any help would be appreciated. I've searched online for an answer, but to no avail.

Best Answer:

Instead of using BreakIterator, I'm now using Apache OpenNLP and it works great!
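
For anyone who would rather stay with the JDK, here is a minimal sketch of a different workaround. It is not the OpenNLP approach from the answer. It assumes that list markers look like "2) " or "2. " and that they follow sentence-ending punctuation. The class name EnumeratedSentenceSplitter, the split method and the regex are illustrative choices of mine, not part of BreakIterator. The idea is to let BreakIterator do its normal pass and then re-split each segment wherever such a marker appears.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

class EnumeratedSentenceSplitter {
    // Assumed heuristic: whitespace that follows '.', '!' or '?' and precedes
    // a list marker such as "2) " or "2. " is treated as a sentence boundary.
    private static final Pattern BEFORE_LIST_MARKER =
            Pattern.compile("(?<=[.!?])\\s+(?=\\d+[.)]\\s)");

    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Re-split each BreakIterator segment at list markers it may have missed.
            for (String part : BEFORE_LIST_MARKER.split(text.substring(start, end))) {
                String trimmed = part.trim();
                if (!trimmed.isEmpty()) {
                    sentences.add(trimmed);
                }
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "1) No acute osseous abnormality. 2) Mild to moderate disc space "
                + "narrowing at the L4-5 level. This is another sentence.";
        split(text).forEach(System.out::println);
    }
}
```

With the sample text from the question, this should print the three lines listed under Expected Output. The regex is only a heuristic: it will also split text such as "see item no. 3) below", so it is worth checking against real reports before relying on it.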