Android's BreakIterator considers line breaks as sentence delimiters

I have a Unix text file that I want to read in my Android app and split into sentences. However, I noticed that BreakIterator treats some line break characters as sentence delimiters. I use the following code to read the file and split it into sentences (only the first sentence is printed, for brevity):

    File file = new File...
    String text = "";
    BreakIterator sentenceIterator = BreakIterator.getSentenceInstance(Locale.US);

    try {
        FileInputStream inputStream = new FileInputStream(file);

        InputStreamReader inputStreamReader = new InputStreamReader(inputStream);
        BufferedReader bufferedReader = new BufferedReader(inputStreamReader);
        String line;
        StringBuilder stringBuilder = new StringBuilder();

        while ((line = bufferedReader.readLine()) != null) {
            stringBuilder.append(line);
            stringBuilder.append('\n');
        }

        inputStream.close();
        text = stringBuilder.toString();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }

    sentenceIterator.setText(text);
    int end = sentenceIterator.next();
    System.out.println(end);
    System.out.println(text.substring(0, end));

But if I compile and run the code from Eclipse as a desktop app, the text is split correctly. I don't understand why it doesn't behave the same way in the Android app.

I tried converting the text file to DOS format, and I even tried reading the file while preserving the original line breaks:

    Pattern pat = Pattern.compile(".*\\R|.+\\z");
    StringBuilder stringBuilder = new StringBuilder();
    try (Scanner in = new Scanner(file, "UTF-8")) {
        String line;
        while ((line = in.findWithinHorizon(pat, 0)) != null) {
            stringBuilder.append(line);
        }
        text = stringBuilder.toString();
        sentenceIterator.setText(text);
        int end = sentenceIterator.next();
        System.out.println(end);
        System.out.println(text.substring(0, end));
    }

but without success. Any ideas? You can download an excerpt from the file (unix format) here: http://dropmefiles.com/TZgBp

I've just noticed that it can be reproduced without downloading this file. Just create a string that has line breaks inside sentences (e.g. "Hello, \nworld!") and run an instrumented test. If BreakIterator is used in a regular (non-instrumented) test, it splits correctly.
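A minimal sketch of that reproduction, in case it helps anyone compare results. The class name `SentenceRepro` and the helper `splitSentences` are made up for illustration; only `java.text.BreakIterator` comes from the code above. It prints every chunk the sentence iterator reports, with line breaks made visible, so the same code can be run on the desktop and in an instrumented test.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceRepro {

    // Hypothetical helper: collects every chunk the sentence iterator reports.
    static List<String> splitSentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            sentences.add(text.substring(start, end));
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Print each chunk in brackets, showing the line break as a literal \n.
        for (String s : splitSentences("Hello, \nworld!")) {
            System.out.println("[" + s.replace("\n", "\\n") + "]");
        }
    }
}
```

If the behaviour described above holds, the number of bracketed chunks should differ between the two environments.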

I expect 2 sentences:

sentence 1:

Foreword

IF a colleague were to say to you, Spouse of me this night today manufactures the unusual meal in a home.

sentence 2:

You will join?

Yes, they don't look great, but at least it is clear why the text is split this way (the sentence delimiters are ?, . and so on). But when the code runs on Android, it creates a sentence even from

Foreword

for some reason...

I'm not sure whether this is a bug or whether there is a workaround. But in my eyes it makes the Android version of BreakIterator useless as a sentence splitter, since it is normal for sentences in books to span multiple lines.

In all the experiments I used the same import, java.text.BreakIterator.

There is 1 answer below

This is not really an answer but it might give you some insights.

It is not a file encoding issue. I tried it this way and got the same faulty behaviour:

    BreakIterator sentenceIterator = BreakIterator.getSentenceInstance(Locale.US);
    String text = "Foreword\nIf a colleague were to say to you, Spouse of me this night today manufactures the unusual meal in a home. You will join?";
    sentenceIterator.setText(text);

Android does not use the same Java class library implementation as your computer

I noticed this when I printed out the class of the sentenceIterator object:

    sentenceIterator.getClass()

I have different classes when running with IntelliJ and when running on Android:

Running with IntelliJ:

    sun.util.locale.provider.RuleBasedBreakIterator

Running on Android:

    java.text.RuleBasedBreakIterator

sun.util.locale.provider.RuleBasedBreakIterator has the behaviour you want.

I don't know how to get Android to use the good RuleBasedBreakIterator class. I don't even know if it is possible.
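One possible workaround, offered only as a sketch: if the line breaks are what trigger the extra boundaries, you could hand the iterator a copy of the text in which every CR and LF is replaced by a space, and then cut the sentences out of the original string. The class `LineBreakWorkaround` and the helper `flattenLineBreaks` are hypothetical names, and I have not verified this on an Android device. Because each line break character is swapped for exactly one space, the offsets returned by the iterator are still valid for the original text.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class LineBreakWorkaround {

    // Hypothetical helper: replaces each CR or LF with one space, character for
    // character, so the copy has the same length and offsets as the original.
    static String flattenLineBreaks(String text) {
        return text.replace('\r', ' ').replace('\n', ' ');
    }

    // Finds boundaries on the flattened copy, then slices the original text.
    static List<String> splitSentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(flattenLineBreaks(text));
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            sentences.add(text.substring(start, end));
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Foreword\nIf a colleague were to say to you, Spouse of me this night "
                + "today manufactures the unusual meal in a home. You will join?";
        for (String s : splitSentences(text)) {
            System.out.println("[" + s.replace("\n", "\\n") + "]");
        }
    }
}
```

The trade-off is that genuine paragraph breaks are flattened too, so a heading such as "Foreword" ends up attached to the sentence that follows it. That matches the two sentences expected in the question, but it may not suit every text.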