Unicode characters do not display correctly in the PDF file converted with the xhtml2pdf library


I am using the xhtml2pdf library to convert my webpage into a PDF document. The conversion itself works fine, but because the webpage content is Khmer Unicode text, the content of the produced PDF document is not displayed correctly.

Here is the result pdf file:

Result of produced pdf file

The correct content should be

ភាសាខ្មែរ

នេះគឺជាអត្ថបទសាមញ្ញជាភាសាខ្មែរដែលបានសរសេរជាអក្សរខ្មែរយូនីកូដ។

Here is my python code to convert to pdf file:

from xhtml2pdf import pisa
import io

def convert_html_to_pdf(source_html, output_filename):
    with open(output_filename, "wb") as output_file:
        pdf_status = pisa.CreatePDF(source_html, dest=output_file)

    return not pdf_status.err

if __name__ == "__main__":
    with io.open("khmer_unicode.html", "r", encoding="utf-8") as html_file:
        html_content = html_file.read()

    pdf_output = "khmer_unicode.pdf"
    if convert_html_to_pdf(html_content, pdf_output):
        print("PDF file has been created at " + pdf_output)
    else:
        print("Error generating PDF")

and my html file:

<!DOCTYPE html>
<html>
<head>
    <title>sample file</title>
    <meta charset="UTF-8">
    <style>
        @font-face {
            font-family: 'KhmerOS';
            src: url('KhmerOS.ttf');
        }
        body {
            font-family: 'KhmerOS', Arial, sans-serif;
        }
    </style>
</head>
<body>
    <h1>ភាសាខ្មែរ</h1>
    <p>នេះគឺជាអត្ថបទសាមញ្ញជាភាសាខ្មែរដែលបានសរសេរជាអក្សរខ្មែរយូនីកូដ។</p>
</body>
</html>

The font file 'KhmerOS.ttf' is in the same directory.

Do I have to encode anything else to make the Khmer Unicode characters display correctly in the converted file? Please kindly help, thanks.

There is 1 best solution below

VonC On

Reading "Fonts and encodings", you might have to use the pdfmetrics module from the reportlab package to register the font first, before using it with xhtml2pdf:

from xhtml2pdf import pisa
import io
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont

def convert_html_to_pdf(source_html, output_filename):
    pdfmetrics.registerFont(TTFont('KhmerOS', 'KhmerOS.ttf'))
    with open(output_filename, "wb") as output_file:
        pdf_status = pisa.CreatePDF(source_html, dest=output_file)

    return not pdf_status.err

if __name__ == "__main__":
    with io.open("khmer_unicode.html", "r", encoding="utf-8") as html_file:
        html_content = html_file.read()

    pdf_output = "khmer_unicode.pdf"
    if convert_html_to_pdf(html_content, pdf_output):
        print("PDF file has been created at %s" % pdf_output)
    else:
        print("Error generating PDF")

That code registers the KhmerOS.ttf font with reportlab, which should then make it available to xhtml2pdf when creating the PDF.
If the Khmer Unicode characters are still not displayed correctly, the problem could be the font file itself.


I have already tested with the font registration function and tried a different font, but the result was still the same.

That means your current processing pipeline needs to change.

You can try converting the text to a format that is known to work well with xhtml2pdf. For example, you could try converting the Khmer Unicode text to HTML entities (numeric character references).
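A minimal sketch of that conversion, using only the Python 3 standard library. The helper name to_html_entities is hypothetical, not part of xhtml2pdf. Note that this only changes how the characters are written in the HTML source; whether it changes the rendered output depends on how the PDF library handles Khmer text, so treat it as an experiment rather than a guaranteed fix.

```python
import html


def to_html_entities(text):
    """Replace every non-ASCII character with a numeric character reference."""
    # 'xmlcharrefreplace' writes each unencodable character as &#NNNN;
    # (decimal code point), leaving ASCII markup such as <h1> untouched.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


sample = "<h1>ភាសាខ្មែរ</h1>"
converted = to_html_entities(sample)

# Round trip: decoding the references gives back the original text,
# so no information is lost by the conversion.
restored = html.unescape(converted)

print(converted)
```

The converted string could then be passed to pisa.CreatePDF in place of the original HTML content.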

Or you might consider using a PDF creation library (like reportlab) directly, without going through HTML. You could create a PDF template and then insert your text into that template.

But you can also try a different HTML-to-PDF library, like pdfkit or weasyprint. These libraries may handle Unicode characters more reliably.

Example using the pdfkit library:

import pdfkit

def convert_html_to_pdf(source_path, output_filename):
    # from_file takes the path of an HTML file, not the HTML text itself
    pdfkit.from_file(source_path, output_filename)

if __name__ == "__main__":
    html_path = "khmer_unicode.html"
    pdf_output = "khmer_unicode.pdf"
    convert_html_to_pdf(html_path, pdf_output)

To use pdfkit, you will need to install wkhtmltopdf.
Then install the pdfkit Python package with pip install pdfkit.
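
Since pdfkit is only a wrapper around the wkhtmltopdf binary, it can help to check that the binary is installed before converting. A minimal sketch, assuming wkhtmltopdf is expected on the PATH; the helper names find_wkhtmltopdf and convert_with_pdfkit are hypothetical, while pdfkit.configuration and pdfkit.from_file come from pdfkit itself:

```python
import shutil


def find_wkhtmltopdf():
    """Return the full path of the wkhtmltopdf binary, or None if it is not on PATH."""
    return shutil.which("wkhtmltopdf")


def convert_with_pdfkit(html_path, pdf_path):
    """Convert an HTML file to PDF, failing early if wkhtmltopdf is missing."""
    binary = find_wkhtmltopdf()
    if binary is None:
        raise RuntimeError("wkhtmltopdf not found on PATH; install it first")

    # Imported here so the check above also works when pdfkit is not installed.
    import pdfkit

    # Pass the binary location explicitly instead of relying on pdfkit's lookup.
    config = pdfkit.configuration(wkhtmltopdf=binary)
    pdfkit.from_file(html_path, pdf_path, configuration=config)
    return True
```

Calling convert_with_pdfkit("khmer_unicode.html", "khmer_unicode.pdf") would then either produce the PDF or raise a clear error about the missing binary.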