Convert malformed HTML to PDF using Flying Saucer PDF Rendering


In a project (on GitHub) I'm trying to convert an arbitrary HTML string into a PDF. By "convert" I mean parsing the HTML and rendering it into a PDF file.

To achieve that I'm using Flying Saucer PDF Rendering like this:

Main.java

import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// Assumes the flying-saucer-pdf artifact (iText 2.x / OpenPDF);
// with flying-saucer-pdf-itext5 the exception is com.itextpdf.text.DocumentException.
import com.lowagie.text.DocumentException;
import org.xhtmlrenderer.pdf.ITextRenderer;

public class Main {

    public static void main(String[] args) {
        final String ok = "<valid html here>: see github rep for real html markup here";
        final String html = "<invalid html here>: see github rep for real html markup here";
        try {
            // final byte[] bytes = generatePDFFrom(ok); // works!
            final byte[] bytes = generatePDFFrom(html); // does NOT work :(
            try (FileOutputStream fos = new FileOutputStream("sample-file.pdf")) {
                fos.write(bytes);
            }
        } catch (IOException | DocumentException e) {
            e.printStackTrace();
        }
    }

    // Parses the markup, lays it out, and returns the rendered PDF as bytes.
    private static byte[] generatePDFFrom(String html) throws IOException, DocumentException {
        final ITextRenderer renderer = new ITextRenderer();
        renderer.setDocumentFromString(html);
        renderer.layout();
        try (ByteArrayOutputStream baos = new ByteArrayOutputStream(html.length())) {
            renderer.createPDF(baos);
            return baos.toByteArray();
        }
    }
}

In the code above, if I use the HTML string stored in the ok variable (which is "valid" HTML), the PDF is created correctly: running the GitHub project with the ok variable produces a file sample-file.pdf in the project folder containing the rendered HTML.

Now, if I use the value of the html variable (HTML with invalid tags, tags that are not closed properly, etc.), it throws the following error (the exact error varies with the invalid input):

ERROR:  'The markup in the document following the root element must be well-formed.'
Exception in thread "main" org.xhtmlrenderer.util.XRRuntimeException: Can't load the XML resource (using TrAX transformer). org.xml.sax.SAXParseException; lineNumber: 22; columnNumber: 9; The markup in the document following the root element must be well-formed.
    at org.xhtmlrenderer.resource.XMLResource$XMLResourceBuilder.transform(XMLResource.java:222)
    at org.xhtmlrenderer.resource.XMLResource$XMLResourceBuilder.createXMLResource(XMLResource.java:181)
    at org.xhtmlrenderer.resource.XMLResource.load(XMLResource.java:84)
    at org.xhtmlrenderer.pdf.ITextRenderer.setDocumentFromString(ITextRenderer.java:171)
    at org.xhtmlrenderer.pdf.ITextRenderer.setDocumentFromString(ITextRenderer.java:166)
    at Main.generatePDFFrom(Main.java:84)
    at Main.main(Main.java:72)
Caused by: javax.xml.transform.TransformerException: org.xml.sax.SAXParseException; lineNumber: 22; columnNumber: 9; The markup in the document following the root element must be well-formed.
    at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:740)
    at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:343)
    at org.xhtmlrenderer.resource.XMLResource$XMLResourceBuilder.transform(XMLResource.java:220)
    ... 6 more
Caused by: org.xml.sax.SAXParseException; lineNumber: 22; columnNumber: 9; The markup in the document following the root element must be well-formed.
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1239)
    at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transformIdentity(TransformerImpl.java:659)
    at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:728)
    ... 8 more

As far as I understand, this is caused by the "invalid" parts of the HTML string: the stack trace shows the input being parsed as XML, which requires well-formed markup.
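To make the failure easier to reason about, here is a minimal sketch that runs a string through the JDK's own XML parser (the same family of parser that appears in the stack trace) and reports the first well-formedness error. The class name WellFormedCheck and the method firstError are hypothetical names chosen for illustration; this only detects the problem, it does not repair the markup.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper, not part of Flying Saucer.
public class WellFormedCheck {

    /**
     * Returns null when the markup is well-formed XML,
     * otherwise a short description of the first parse error.
     */
    public static String firstError(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Rethrow errors instead of letting the default handler print to stderr.
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { /* ignore warnings */ }
                @Override public void error(SAXParseException e) throws SAXException { throw e; }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(markup)));
            return null;
        } catch (SAXParseException e) {
            return "line " + e.getLineNumber() + ", column " + e.getColumnNumber()
                    + ": " + e.getMessage();
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException("Could not run the XML parser", e);
        }
    }
}
```

Calling WellFormedCheck.firstError(html) before generatePDFFrom(html) would let the application show the user where the markup breaks instead of failing with a stack trace.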

Important notes:

  • The values assigned to the ok and html variables here are just placeholders for the question. The real ones are here.
  • In the real project, the HTML string is user input. Yes, the user should know what to put there but can, of course, make mistakes in the HTML, so I have to handle this.

Question(s)

  • Is there any way to tell Flying Saucer PDF Rendering to ignore, auto-complete, or clean up those "invalid" parts itself (or something similar) and carry on creating the PDF file? (preferred)
  • Is there a better approach I can use to overcome this?

There are 2 best solutions below

Kasun On BEST ANSWER

Since I had the same issue while using Flying Saucer to generate a PDF from HTML, I used the HtmlCleaner library (see maven link) to clean the HTML before passing it to the Flying Saucer library.

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.PrettyXmlSerializer;
import org.htmlcleaner.TagNode;
import org.htmlcleaner.XmlSerializer;
import org.xhtmlrenderer.pdf.ITextRenderer;

// Clean the HTML before handing it to Flying Saucer:
// parse the (possibly malformed) input into a tag tree
HtmlCleaner cleaner = new HtmlCleaner();
TagNode rootTagNode = cleaner.clean(html);
// reuse the cleaner's properties for the serializer (optional tweaks: see the online docs)
CleanerProperties cleanerProperties = cleaner.getProperties();
// serialize the tree back into a well-formed XML string
XmlSerializer xmlSerializer = new PrettyXmlSerializer(cleanerProperties);
String cleanedHtml = xmlSerializer.getAsString(rootTagNode);

// use https://github.com/flyingsaucerproject/flyingsaucer to convert the cleaned HTML to PDF
ITextRenderer renderer = new ITextRenderer();
renderer.setDocumentFromString(cleanedHtml);
// ....
VioletGil On

An initial thought would be to run your input through another library that handles HTML more leniently, and then pass that library's string output to the PDF renderer.

https://jsoup.org/

Five minutes of Googling turned this up as a pretty reasonable library to use. There's even a test utility you can throw your malformed input into:

https://try.jsoup.org/
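
As a rough sketch of that idea, assuming jsoup is on the classpath: jsoup parses the input leniently, and switching its output settings to XML syntax makes it serialize the repaired tree as well-formed markup that an XML-based renderer can load. The class name JsoupCleaner and the method toXhtml are hypothetical names used only for illustration; I have not tried this against the markup in the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities;

// Hypothetical helper, not part of jsoup or Flying Saucer.
public class JsoupCleaner {

    /** Parses lenient HTML with jsoup and serializes it back as XML-style markup. */
    public static String toXhtml(String html) {
        // jsoup repairs unclosed or misnested tags while building its tree
        Document document = Jsoup.parse(html);
        document.outputSettings()
                // close void elements such as <br> the XML way
                .syntax(Document.OutputSettings.Syntax.xml)
                // avoid HTML-only named entities such as &nbsp;, which an XML parser rejects
                .escapeMode(Entities.EscapeMode.xhtml);
        return document.html();
    }
}
```

In the question's code this would be used as renderer.setDocumentFromString(JsoupCleaner.toXhtml(html)). The repaired document reflects jsoup's best guess at what the user meant, so the rendered PDF may differ from what a browser shows for the same input.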