How to convert docx to xhtml

I am trying to find a solution to convert a docx file to XHTML.

I found XDocReport, which looks good, but I am new to it and have run into some issues.

According to their documentation on GitHub (here and here), I should be able to convert with this code:

    String source = args[0];
    String dest = args[1];

    // 1) Create DOCX-to-XHTML options to select the right converter from the registry
    Options options = Options.getFrom(DocumentKind.DOCX).to(ConverterTypeTo.XHTML);

    // 2) Get the converter from the registry
    IConverter converter = ConverterRegistry.getRegistry().getConverter(options);

    // 3) Convert DOCX to (x)html
    // try-with-resources closes both streams even if the conversion fails
    try (InputStream in = new FileInputStream(new File(source));
         OutputStream out = new FileOutputStream(new File(dest))) {
        converter.convert(in, out, options);
    } catch (XDocConverterException | IOException e) {
        e.printStackTrace();
    }

I am using these dependencies (I have tried several versions, such as 2.0.2, 2.0.0, and 1.0.6):

    <dependency>
        <groupId>fr.opensagres.xdocreport</groupId>
        <artifactId>fr.opensagres.xdocreport.document.docx</artifactId>
        <version>2.0.2</version>
    </dependency>

    <dependency>
        <groupId>fr.opensagres.xdocreport</groupId>
        <artifactId>fr.opensagres.xdocreport.template.freemarker</artifactId>
        <version>2.0.2</version>
    </dependency>

    <dependency>
        <groupId>fr.opensagres.xdocreport</groupId>
        <artifactId>fr.opensagres.xdocreport.converter.docx.xwpf</artifactId>
        <version>2.0.2</version>
    </dependency>

My issues:

  • The images are missing
  • The background color is missing (every page has a non-white background color, which I need to carry over as well)

How can I handle these issues? (Or how can I convert docx to XHTML using Docx4j while keeping formatting, numbering, and images?)

Best answer

To convert *.docx to XHTML using XDocReport with Apache POI's XWPFDocument as the source, you need XHTMLOptions. These options can carry an ImageManager, which sets the path where images extracted from the XWPFDocument are written. XHTMLConverter then performs the conversion.

Complete example:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;

    // needed jars: xdocreport-2.0.2.jar
    import fr.opensagres.poi.xwpf.converter.core.ImageManager;
    import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLConverter;
    import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLOptions;

    // needed jars: all Apache POI dependencies
    import org.apache.poi.xwpf.usermodel.XWPFDocument;

    public class DOCXToXHTMLXDocReport {

        public static void main(String[] args) throws Exception {

            String docPath = "./WordDocument.docx";

            String root = "./";
            String htmlPath = root + "WordDocument.html";

            // The ImageManager writes extracted images to the "images" directory below root
            XHTMLOptions options = XHTMLOptions.create()
                    .setImageManager(new ImageManager(new File(root), "images"));

            // try-with-resources closes the streams and the document even if conversion fails
            try (FileInputStream in = new FileInputStream(docPath);
                 XWPFDocument document = new XWPFDocument(in);
                 FileOutputStream out = new FileOutputStream(htmlPath)) {
                XHTMLConverter.getInstance().convert(document, out, options);
            }
        }
    }

This handles images properly.
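
To check the result, here is a rough sketch that uses JDK classes only: it scans the generated file for the `src` attributes of `<img>` tags and reports every reference that does not resolve to an existing file relative to the HTML file's directory. The class name `ImageRefCheck` and its methods are hypothetical, not part of XDocReport. The regular expression assumes double-quoted `src` values and UTF-8 output, and URL-encoded paths are not decoded.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of XDocReport.
final class ImageRefCheck {

    // Captures the double-quoted src value of each <img> tag.
    private static final Pattern IMG_SRC =
            Pattern.compile("<img\\b[^>]*?\\bsrc=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /** Returns all img src values in document order. */
    static List<String> imageSources(String xhtml) {
        List<String> sources = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(xhtml);
        while (m.find()) {
            sources.add(m.group(1));
        }
        return sources;
    }

    /** Returns the relative src values that do not point to an existing file. */
    static List<String> missing(Path htmlFile) throws IOException {
        String xhtml = new String(Files.readAllBytes(htmlFile), StandardCharsets.UTF_8);
        Path dir = htmlFile.toAbsolutePath().getParent();
        List<String> missing = new ArrayList<>();
        for (String src : imageSources(xhtml)) {
            // Skip embedded data URIs and absolute URLs; only local files are checked.
            if (src.startsWith("data:") || src.contains("://")) {
                continue;
            }
            if (!Files.isRegularFile(dir.resolve(src))) {
                missing.add(src);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: ImageRefCheck <file.html>");
            return;
        }
        for (String src : missing(Paths.get(args[0]))) {
            System.out.println("missing image: " + src);
        }
    }
}
```

Run against `./WordDocument.html` from the example above, it should print nothing when every referenced image was extracted.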

However, XDocReport cannot yet handle the page background color of an XWPFDocument. It extracts and applies paragraph background colors, but not page background colors.
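
One possible workaround, sketched here under assumptions and not taken from XDocReport: a .docx file is a ZIP archive, and WordprocessingML stores the page background as a `w:background` element with a `w:color` attribute in `word/document.xml`. The class below reads that attribute using only the JDK and adds a matching `background-color` style to the `<body>` tag of the generated XHTML. The names `PageBackground`, `readColor`, and `applyToBody` are hypothetical. The sketch assumes the converter emits a `<body>` tag without an existing `style` attribute, that the output is UTF-8, and that the color is a plain six-digit hex value (theme colors and `auto` are ignored).

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of XDocReport or Apache POI.
final class PageBackground {

    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the six-digit hex page color of a .docx stream, or null if none is set. */
    static String readColor(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry entry; (entry = zip.getNextEntry()) != null; ) {
                if (!"word/document.xml".equals(entry.getName())) {
                    continue;
                }
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                // Buffer the entry so the XML parser never touches the ZIP stream itself.
                Document xml = factory.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(zip.readAllBytes()));
                NodeList nodes = xml.getElementsByTagNameNS(W_NS, "background");
                if (nodes.getLength() == 0) {
                    return null;
                }
                String color = ((Element) nodes.item(0)).getAttributeNS(W_NS, "color");
                return color.matches("[0-9A-Fa-f]{6}") ? color : null;
            }
        }
        return null;
    }

    /** Adds a background-color style to the first body tag; expects a hex color or null. */
    static String applyToBody(String xhtml, String color) {
        if (color == null) {
            return xhtml;
        }
        return xhtml.replaceFirst("<body(?=[\\s>])",
                "<body style=\"background-color:#" + color + "\"");
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: PageBackground <file.docx> <file.html>");
            return;
        }
        Path html = Paths.get(args[1]);
        String color;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            color = readColor(in);
        }
        String patched = applyToBody(
                new String(Files.readAllBytes(html), StandardCharsets.UTF_8), color);
        Files.write(html, patched.getBytes(StandardCharsets.UTF_8));
    }
}
```

It would be run after the conversion, for example with `./WordDocument.docx` and `./WordDocument.html` as arguments. If the document defines no page background, the HTML file is rewritten unchanged.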