Reportlab and pdfrw with matplotlib imshow() error in python3


I recently ported some code from Python 2 to Python 3 and now get an error when using reportlab together with pdfrw and matplotlib's imshow().

Can someone reproduce this error in Python 3? I am also unsure whether this is a reportlab issue or a pdfrw issue.

import numpy as np
import matplotlib.pyplot as plt
from pdfrw import PdfReader
from pdfrw.buildxobj import pagexobj
from pdfrw.toreportlab import makerl
from reportlab.lib.pagesizes import A4
from reportlab.pdfgen import canvas

fig = plt.figure(figsize=(5,5))
plt.imshow(np.random.rand(10,10))
plt.savefig('Imshow.pdf')

MyReport = canvas.Canvas('foo.pdf', pagesize=A4)
pages = PdfReader('Imshow.pdf').pages
page = pagexobj(pages[0])
MyReport.saveState()
MyReport.doForm(makerl(MyReport, page))
MyReport.restoreState()
MyReport.save()

The error reads

UnicodeEncodeError: 'charmap' codec can't encode character '\x1f' in position 6: character maps to <undefined>

System: Windows 10, Python 3.9, pdfrw 0.4, reportlab 3.6.8


There are 2 best solutions below


The problem is the way pdfrw handles str versus bytes.

pdfrw.PdfReader loads the entire source PDF using the Latin-1 encoding. Latin-1 assigns a character to each of the 256 possible byte values, so all the binary data in the matplotlib image loads without error. But the result is a str full of junk characters, which reportlab then fails to re-encode (because it doesn't use Latin-1).

The solution is to find the data that really should be binary and pass it to reportlab as correctly encoded bytes instead of str. To do that, patch the function _makestr on line 108 of pdfrw/toreportlab.py.
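The Latin-1 behaviour described above can be checked with the standard library alone. This is a minimal sketch: it shows that every byte value survives a Latin-1 round trip, and that a different charmap-based codec rejects some of the resulting characters. The helper name `encodable` is made up for illustration, and cp1252 is only a stand-in; which codec reportlab actually applies is not confirmed here.

```python
# Every possible byte value decodes under Latin-1, and the round trip is lossless.
raw = bytes(range(256))
text = raw.decode("latin-1")
assert text.encode("latin-1") == raw


def encodable(chars, codec):
    """Return True if `chars` can be encoded with `codec` (hypothetical helper)."""
    try:
        chars.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# cp1252 is a stand-in for "some charmap codec other than Latin-1";
# it is NOT claimed to be the codec reportlab uses.
unencodable = [ch for ch in text if not encodable(ch, "cp1252")]
print([hex(ord(ch)) for ch in unencodable])
```

Any character in the printed list would raise the same kind of "character maps to <undefined>" error if a str containing it were encoded with that codec.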

Old, original code (including the TODO tag!):

def _makestr(rldoc, pdfobj):
    assert isinstance(pdfobj, (float, int, str)), repr(pdfobj)
    # TODO: Add fix for float like in pdfwriter
    return str(getattr(pdfobj, 'encoded', None) or pdfobj)

New:

def _makestr(rldoc, pdfobj):
    assert isinstance(pdfobj, (float, int, str)), repr(pdfobj)
    # TODO: Add fix for float like in pdfwriter
    value = str(getattr(pdfobj, 'encoded', None) or pdfobj)
    try:
        value.encode("ascii")  # Don't return this, it is just a test.
    except UnicodeEncodeError:
        value = value.encode("Latin-1")
    return value

Anything which can't be represented as ASCII is encoded using the original Latin-1 encoding and sent to reportlab as bytes.
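The decision made by the patched function can be tried in isolation, without pdfrw or reportlab. The sketch below uses a made-up helper name, `ascii_str_or_latin1_bytes`, and a made-up sample byte string; it only mirrors the try/except logic shown above.

```python
def ascii_str_or_latin1_bytes(value):
    """Return `value` unchanged if it is pure ASCII, else its Latin-1 bytes.

    Hypothetical helper mirroring the try/except in the patched _makestr.
    """
    try:
        value.encode("ascii")  # Test only; the result is discarded.
    except UnicodeEncodeError:
        return value.encode("latin-1")
    return value


# Made-up binary data, decoded the way the answer says pdfrw reads a file.
original = b"\x89\x1f\xff\x00binary"
decoded = original.decode("latin-1")

# Non-ASCII content comes back as the original bytes.
assert ascii_str_or_latin1_bytes(decoded) == original
# Pure-ASCII content stays a str.
assert ascii_str_or_latin1_bytes("/DeviceRGB") == "/DeviceRGB"
```

Because the str was produced by a Latin-1 decode, encoding it with Latin-1 again restores the exact bytes from the source file.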

Based on my testing, this doesn't appear to affect non-ASCII strings in the plot (e.g. axis labels). I assume they take a different path through the pdfrw code, but I haven't verified that.

pdfrw as a project seems to be dead, with no release since 2017. If anyone sees this and knows how to contribute a patch to the project, feel free (or let me know).
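With no upstream release to wait for, one alternative to editing the installed file is to apply the same change at runtime. The sketch below wraps the existing function instead of replacing its body. The name `install_makestr_patch` is made up, and the approach assumes that pdfrw looks up `_makestr` by name in `pdfrw.toreportlab` each time it is called; if it does not, the patch has no effect and the file has to be edited directly.

```python
def install_makestr_patch():
    """Wrap pdfrw.toreportlab._makestr so non-ASCII results become Latin-1 bytes.

    Hypothetical helper. Returns True if the wrapper was installed,
    False if pdfrw (or its _makestr function) is not available.
    """
    try:
        from pdfrw import toreportlab
    except ImportError:
        return False

    original = getattr(toreportlab, "_makestr", None)
    if original is None:
        return False

    def patched(rldoc, pdfobj):
        value = original(rldoc, pdfobj)
        # Same rule as the manual fix: ASCII stays str, anything else
        # goes back to the bytes it was decoded from.
        if isinstance(value, str) and not value.isascii():
            return value.encode("latin-1")
        return value

    # Assumption: callers inside pdfrw resolve _makestr via this module attribute.
    toreportlab._makestr = patched
    return True


# Call once, before any makerl(...) call.
patch_installed = install_makestr_patch()
```

This keeps the change in your own code, so it survives a reinstall of pdfrw, at the cost of depending on a private function name.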


Thanks, Aron Lockey. It works!

def _makestr(rldoc, pdfobj):
    # --------- Original Code --------------------
    # assert isinstance(pdfobj, (float, int, str)), repr(pdfobj)
    # # TODO: Add fix for float like in pdfwriter
    # return str(getattr(pdfobj, 'encoded', None) or pdfobj)

    # --------- New Code --------------------
    assert isinstance(pdfobj, (float, int, str)), repr(pdfobj)
    # TODO: Add fix for float like in pdfwriter
    value = str(getattr(pdfobj, 'encoded', None) or pdfobj)
    try:
        value.encode("ascii")  # Don't return this, it is just a test.
    except UnicodeEncodeError:
        value = value.encode("Latin-1")
    return value