How to use Apache HWPF to extract text and images out of a DOC file

I downloaded Apache HWPF. I want to use it to read a DOC file and write its text to a plain text file. I don't know HWPF very well.

My very simple program is at the end of this question.

I have 3 problems now:

  1. Some of the packages have errors (they can't find Apache HDF). How can I fix them?

  2. How can I use the methods of HWPF to find and extract the images?

  3. Some pieces of my program are incomplete and incorrect, so please help me complete it.

I have to complete this program in 2 days.

Once again, please help me complete this.

Thank you all very much for your help!

This is my elementary code:

public class test {
  public void m1 (){
    String filesname = "Hello.doc";
    POIFSFileSystem fs = null;
    fs = new POIFSFileSystem(new FileInputStream(filesname ); 
    HWPFDocument doc = new HWPFDocument(fs);
    WordExtractor we = new WordExtractor(doc);
    String str = we.getText() ;
    String[] paragraphs = we.getParagraphText();
    Picture pic = new Picture(. . .) ;
    pic.writeImageContent( . . . ) ;
    PicturesTable picTable = new PicturesTable( . . . ) ;
    if ( picTable.hasPicture( . . . ) ){
      picTable.extractPicture(..., ...);
      picTable.getAllPictures() ;
    }
}

There are 4 best solutions below

I know this is long after the fact, but I've found TextMining on Google Code to be more accurate and very easy to use. It is, however, pretty much abandoned code.

Apache Tika will do this for you. It handles talking to POI to do the HWPF work, and gives you the contents of the file as either XHTML or plain text. If you register a recursing parser, you'll also get all the embedded images.

If you just want to do this, and you don't care about the coding, you can just use Antiword.

$ antiword file.doc > out.txt

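    // HWPF ships in the poi-scratchpad jar, which must be on the classpath along with POI itself.
    import java.awt.image.BufferedImage;
    import java.io.ByteArrayInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.List;
    import javax.imageio.ImageIO;
    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.hwpf.extractor.WordExtractor;
    import org.apache.poi.hwpf.usermodel.Picture;

    public class DocExtractor {
        public static void main(String[] args) throws IOException {
            String fileName = "example.doc";
            // try-with-resources closes the stream even if parsing fails
            try (FileInputStream in = new FileInputStream(fileName)) {
                HWPFDocument wordDoc = new HWPFDocument(in);

                // Use org.apache.poi.hwpf.extractor.WordExtractor to get the text
                WordExtractor extractor = new WordExtractor(wordDoc);
                // Collects the text of the Word document. String.concat returns a new
                // String, so calling it without using the result would store nothing.
                StringBuilder articleStr = new StringBuilder();
                for (String paragraph : extractor.getParagraphText()) {
                    String paragraphStr = paragraph.replace("\r\n", "").replace("\n", "").trim();
                    if (!paragraphStr.isEmpty()) {
                        articleStr.append(paragraphStr);
                    }
                }

                // Use org.apache.poi.hwpf.usermodel.Picture to get the images
                List<Picture> picturesList = wordDoc.getPicturesTable().getAllPictures();
                for (int i = 0; i < picturesList.size(); ++i) {
                    Picture pic = picturesList.get(i);
                    // ImageIO.read returns null when no registered reader can decode the data
                    BufferedImage image = ImageIO.read(new ByteArrayInputStream(pic.getContent()));
                    if (image != null) {
                        System.out.println("Image[" + i + "] ImageWidth:" + image.getWidth()
                                + " ImageHeight:" + image.getHeight()
                                + " Suggested Image Format:" + pic.suggestFileExtension());
                    }
                }
            }
        }
    }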
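
The snippet above only collects the text in memory and prints the image sizes, while the question asks for a plain text file and for the images to be extracted. Below is a minimal sketch of that last step using only the JDK. The class and method names (`DocOutputWriter`, `joinParagraphs`, `writeText`, `writeImage`) and the `image0.png` naming scheme are made up for illustration, and joining paragraphs with `\n` is my own assumption about the desired output. In a real program, the paragraph array would come from `extractor.getParagraphText()`, the bytes from `pic.getContent()` and the extension from `pic.suggestFileExtension()`.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class DocOutputWriter {

    /** Strips line-break characters, drops empty paragraphs and joins the rest with '\n'. */
    static String joinParagraphs(String[] paragraphs) {
        List<String> kept = new ArrayList<>();
        for (String paragraph : paragraphs) {
            String cleaned = paragraph.replace("\r", "").replace("\n", "").trim();
            if (!cleaned.isEmpty()) {
                kept.add(cleaned);
            }
        }
        return String.join("\n", kept);
    }

    /** Writes the text to a UTF-8 plain text file, replacing any existing file. */
    static void writeText(Path target, String text) throws IOException {
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    /** Saves raw image bytes as e.g. image0.png inside dir and returns the created path. */
    static Path writeImage(Path dir, int index, byte[] content, String extension) throws IOException {
        Files.createDirectories(dir);
        String suffix = (extension == null || extension.isEmpty()) ? "" : "." + extension;
        Path target = dir.resolve("image" + index + suffix);
        return Files.write(target, content);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in data; in the real program these values come from HWPF.
        String[] paragraphs = {"First paragraph\r\n", "\r\n", "  Second paragraph \n"};
        byte[] fakeImage = {1, 2, 3};

        Path dir = Files.createTempDirectory("doc-extract");
        writeText(dir.resolve("out.txt"), joinParagraphs(paragraphs));
        Path image = writeImage(dir, 0, fakeImage, "png");
        System.out.println("Wrote out.txt and " + image.getFileName() + " to " + dir);
    }
}
```

Writing the raw bytes rather than going through `ImageIO` means that pictures `ImageIO` cannot decode still end up on disk, in whatever format Word stored them.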