How to properly convert DOCM to PDF with open-source Java libraries?

I started looking into how to convert .docm files to PDF. As far as I could find, the open-source libraries only support converting .docx to PDF. My approach was therefore to first convert .docm to .docx without losing any information. I could not find a proper open-source solution for this, but I found a commit for Apache POI (link). Using the code from that commit, I managed to create .docx files that contain all the information from my .docm files.

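    String dir = "<directory>";
    for (int i = 1; i < 41; i++) {
        // Input files are named 1.docm .. 40.docm inside dir
        File f = new File(dir, i + ".docm");
        // Build the output path with File(parent, child) so a missing
        // trailing separator in dir cannot end up in the file name
        File target = new File(dir, "output" + i + ".docx");
        try {
            // DocumentConverter is the class copied from the linked commit
            new DocumentConverter(f).toDocx(target);
        } catch (IOException e1) {
            e1.printStackTrace();
        }
    }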

I copied the code from the link and used it as shown above.
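For what it's worth, one library-independent way to check that the converted files really are plain .docx packages: both .docm and .docx are ZIP archives, and the content type declared for the main document part in [Content_Types].xml differs between the two. Below is a minimal sketch using only java.util.zip. The class name PackageTypeCheck and its methods are made up for illustration. It assumes the manifest is UTF-8 encoded, and the part names word/vbaProject.bin and word/comments.xml are the names Word conventionally uses, not something the format guarantees.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of POI, docx4j or documents4j.
class PackageTypeCheck {
    // Content type of the main part in a macro-enabled document (.docm)
    static final String DOCM_MAIN =
            "application/vnd.ms-word.document.macroEnabled.main+xml";
    // Content type of the main part in a regular document (.docx)
    static final String DOCX_MAIN =
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml";

    // Reads the package manifest; assumes it is stored as UTF-8.
    static String readContentTypes(Path file) throws IOException {
        try (ZipFile zip = new ZipFile(file.toFile())) {
            ZipEntry entry = zip.getEntry("[Content_Types].xml");
            if (entry == null) {
                throw new IOException("No [Content_Types].xml in " + file);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }

    // True if the manifest still declares the macro-enabled main part.
    static boolean isMacroEnabled(Path file) throws IOException {
        return readContentTypes(file).contains(DOCM_MAIN);
    }

    // True if the manifest declares the regular main part and no macro-enabled one.
    static boolean isPlainDocx(Path file) throws IOException {
        String types = readContentTypes(file);
        return types.contains(DOCX_MAIN) && !types.contains(DOCM_MAIN);
    }

    // True if the archive contains an entry with the given name,
    // e.g. "word/vbaProject.bin" or "word/comments.xml" (conventional names).
    static boolean hasPart(Path file, String partName) throws IOException {
        try (ZipFile zip = new ZipFile(file.toFile())) {
            return zip.getEntry(partName) != null;
        }
    }
}
```

Calling PackageTypeCheck.isPlainDocx(target.toPath()) after toDocx(target), together with hasPart(target.toPath(), "word/comments.xml"), would show whether the macro content type is gone and whether the comments part survived the conversion. This only inspects the package; it does not validate the document itself.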

Once I had my .docx files with all the information, I started converting them to PDF. For this I found two possible open-source libraries: docx4j and documents4j.

Docx4j convert to pdf code:

    // try-with-resources closes the PDF output stream even if conversion fails
    try (OutputStream os = new FileOutputStream(new File(dir, "out" + i + ".pdf"))) {
        Docx4J.toPDF(WordprocessingMLPackage.load(target), os);
    } catch (IOException | Docx4JException e1) {
        e1.printStackTrace();
    }

This produces a PDF file that has all the information except MS Word's comments.

Documents4j convert to pdf code:

    try (ByteArrayOutputStream bo = new ByteArrayOutputStream();
         InputStream in = new BufferedInputStream(new FileInputStream(target))) {
        IConverter converter = LocalConverter.builder()
                .baseFolder(new File(dir))
                .workerPool(20, 25, 2, TimeUnit.SECONDS)
                .processTimeout(5, TimeUnit.SECONDS)
                .build();
        try {
            Future<Boolean> conversion = converter
                    .convert(in).as(DocumentType.DOC)
                    .to(bo).as(DocumentType.PDF)
                    .prioritizeWith(1000) // optional
                    .schedule();
            conversion.get(); // block until the conversion has finished
            try (OutputStream outputStream = new FileOutputStream("out" + i + ".pdf")) {
                bo.writeTo(outputStream);
            }
        } finally {
            converter.shutDown(); // release the converter even if conversion fails
        }
    } catch (IOException e) {
        e.printStackTrace();
    } catch (InterruptedException | ExecutionException e) {
        e.printStackTrace();
    }

This produces a PDF file that looks good at first glance and includes MS Word's comments.

Further testing showed that the docx4j PDFs were accurate in text, but the layout had changed (for example, paragraphs were merged or split in two). The PDFs from documents4j were more accurate in layout, but some information was missing. My tests used form documents that were all created in the same fashion, and the missing information was always in the same place.

My questions are the following:

  1. Is there a certified way to properly convert a .docm file into a .docx file with open-source libraries?
  2. What is going wrong when I use documents4j to create PDFs?
  3. How can I include MS Word's comments when using docx4j?
  4. Are there any alternatives to the libraries I chose? (Open-source only)

EDIT: I forgot to mention that I am using the latest version of each library.

There is 1 best solution below

documents4j delegates the actual work to MS Word via a VBScript file, so any differences in the result come from how that script is configured. You can experiment with it to see whether you can make Word include the content you are missing: https://github.com/documents4j/documents4j/blob/master/documents4j-transformer-msoffice/documents4j-transformer-msoffice-word/src/main/resources/word_convert.vbs

Simply build the project and see how the changes affect the output.