Error with getText().replaceAll() in Java


I'm extracting text with the WordExtractor class (Apache POI), but I get an error for some .doc files. While debugging, I saw that the line causing the problem is the last one here:

HWPFDocument docx = new HWPFDocument(new FileInputStream(file));
WordExtractor we = new WordExtractor(docx);
String T = we.getText().replaceAll("\\n", " ").replaceAll("\\r", " ");

For most .docx and .doc files it works fine.

The error message is:

Exception in thread "main" java.lang.RuntimeException: 
java.lang.IllegalArgumentException: The end (4958) must not be before the start (4990)

How can I fix it?

There is 1 best solution below

XWPFWordExtractor from docs:

Helper class to extract text from an OOXML Word file

So this is your problem :) Each extractor handles only one file format. The solution, from the same docs:

For .doc files from Word 97 - Word 2003, in scratchpad there is org.apache.poi.hwpf.extractor.WordExtractor, which will return text for your document.

Those using POI 3.7 can also extract simple textual content from older Word 6 and Word 95 files, using the scratchpad class org.apache.poi.hwpf.extractor.Word6Extractor.

For .docx files, the relevant class is org.apache.poi.xwpf.extractor.XWPFWordExtractor
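
Since the right extractor depends on the real file format, and a file's extension can lie, one option is to look at the first bytes of the file before choosing a class. Below is a minimal sketch of that idea using only the JDK. The names WordFormatSniffer, Format and sniff are made up for illustration and are not part of POI. It assumes the usual signatures: binary .doc files are OLE2 compound files, and .docx files are ZIP archives. It is only a rough check, because other formats share those containers (for example .xls is also OLE2, and any ZIP starts with the same bytes).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper (not a POI class): guesses the Word container format
// from the leading "magic" bytes of a file.
final class WordFormatSniffer {

    enum Format { DOC, DOCX, UNKNOWN }

    // First 8 bytes of an OLE2 compound file (binary .doc, Word 97-2003).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // First 4 bytes of a ZIP archive, the container used by .docx (OOXML).
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    // Classifies a header that was already read from the start of the file.
    static Format sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return Format.DOC;
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return Format.DOCX;
        }
        return Format.UNKNOWN;
    }

    // Convenience overload: reads the first 8 bytes of the file (Java 11+).
    static Format sniff(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return sniff(in.readNBytes(8));
        }
    }

    // True if data begins with every byte of prefix, in order.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

With something like this, a result of DOC would go to HWPFDocument and WordExtractor, a result of DOCX to XWPFWordExtractor, and anything else could be skipped or logged instead of being passed to an extractor that cannot read it.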