open(..., encoding="") vs str.encode(encoding="")


Question:
What is the difference between open(<name>, "w", encoding=<encoding>) and open(<name>, "wb") + str.encode(<encoding>)? They seem to (sometimes) produce different outputs.

Context:
While using PyFPDF (version 1.7.2), I subclassed the FPDF class and, among other things, added my own output method (taking pathlib.Path objects). While looking at the source of the original FPDF.output() method, I noticed that almost all of it is argument parsing. The only relevant bits are:

#Finish document if necessary
if(self.state < 3):
    self.close()
[...]
f=open(name,'wb')
if(not f):
    self.error('Unable to create output file: '+name)
if PY3K:
    # manage binary data as latin1 until PEP461 or similar is implemented
    f.write(self.buffer.encode("latin1"))
else:
    f.write(self.buffer)
f.close()

Seeing that, my own implementation looked like this:

def write_file(self, file: Path) -> None:
    if self.state < 3:
        # See FPDF.output()
        self.close()
    file.write_text(self.buffer, "latin1", "strict")

This seemed to work: a .pdf file was created at the specified path, and Chrome opened it. But it was completely blank, even though I had added images and text. After hours of experimenting, I finally found a version that worked (produced a non-empty PDF file):

def write_file(self, file: Path) -> None:
    if self.state < 3:
        # See FPDF.output()
        self.close()
    # using .write_text(self.buffer, "latin1", "strict") DOES NOT WORK AND I DON'T KNOW WHY
    file.write_bytes(self.buffer.encode("latin1", "strict"))

Looking at the pathlib.Path source, Path.write_text() uses io.open. Since all of this is Python 3.8, io.open and the built-in open() are the same.

Note: FPDF.buffer is of type str but holds binary data (a PDF file), probably because the library was originally written for Python 2.
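As a side note on why latin1 is the encoding used for this str-as-bytes buffer, here is a minimal sketch (not taken from PyFPDF) showing that latin1 maps each byte value 0..255 to exactly one code point, so a round trip through str loses nothing:

```python
# Every byte value 0..255 corresponds to exactly one latin1 code point,
# so decoding to str and encoding back to bytes is lossless.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin1")
round_tripped = as_text.encode("latin1")

print(len(as_text))                # 256
print(round_tripped == all_bytes)  # True
```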

Accepted answer:

Aaaand found it: Path.write_bytes() saves the bytes object as is, and str.encode() doesn't touch the line endings.

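Path.write_text() will encode the str just like str.encode(), BUT: because the file is opened in text mode, newline translation is applied on write, so every \n is written as os.linesep. In my case that means \n becomes \r\n, because I'm on Windows. A PDF is binary data, and the extra \r bytes corrupt it: they end up inside binary streams and shift the byte offsets the file records internally. The buffer has to be written byte for byte, on all platforms.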
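The effect can be reproduced on any platform with a small sketch. Passing newline="\r\n" to open() forces the same translation that text mode applies by default on Windows, while newline="" switches translation off. The sample string and file names are made up for illustration; they are not taken from PyFPDF.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for FPDF.buffer: a str that holds binary data.
buffer = "%PDF-1.3\nstream\n\x00\xff\nendstream\n"
expected = buffer.encode("latin1")

with tempfile.TemporaryDirectory() as tmp:
    text_path = Path(tmp) / "text_mode.pdf"
    raw_path = Path(tmp) / "binary_mode.pdf"
    fixed_path = Path(tmp) / "text_mode_no_translation.pdf"

    # Text mode with Windows-style translation: each "\n" is written as "\r\n".
    with open(text_path, "w", encoding="latin1", newline="\r\n") as f:
        f.write(buffer)

    # Binary mode: the encoded bytes are written unchanged.
    raw_path.write_bytes(expected)

    # Text mode with newline="": no translation takes place.
    with open(fixed_path, "w", encoding="latin1", newline="") as f:
        f.write(buffer)

    text_bytes = text_path.read_bytes()
    raw_bytes = raw_path.read_bytes()
    fixed_bytes = fixed_path.read_bytes()

print(text_bytes == expected)   # False: "\r" bytes were inserted
print(raw_bytes == expected)    # True
print(fixed_bytes == expected)  # True
```

So if text mode is wanted for some reason, open(..., newline="") avoids the problem. On Python 3.10 and later, Path.write_text() accepts a newline argument as well; on 3.8 it does not.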

Second answer:

Both should be the same (with minor differences).

I prefer the open() way because it is explicit and shorter. OTOH, if you want to handle encoding errors yourself (e.g. to give the user a much better error message), use decode/encode directly (maybe after s.split('\n'), so you can keep track of line numbers).

Note: if you use the first method (open() with encoding), you should open the file with plain r or w, i.e. without b. Judging from your question's title you did this correctly, but note that the original example keeps the b, which is probably why it calls .encode() itself. OTOH the code looks old, and I think the explicit .encode() was used simply because it feels more natural from a Python 2 mindset.

Note: I would also replace strict with backslashreplace for debugging. You may also want to print the first few characters of self.buffer (maybe just their ord values) in both methods, to see whether there are substantial differences before the file is written.
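A rough sketch of that debugging step, assuming a made-up sample string in place of self.buffer; preview() is a hypothetical helper, not part of PyFPDF:

```python
def preview(buffer: str, count: int = 16) -> list:
    """Return the code points of the first `count` characters."""
    return [ord(ch) for ch in buffer[:count]]


# Hypothetical buffer containing one character that latin1 cannot encode.
sample = "%PDF\n\u20ac"

code_points = preview(sample)
# backslashreplace writes an escape sequence where strict would raise
# UnicodeEncodeError, so the offending character stays visible.
encoded = sample.encode("latin1", "backslashreplace")

print(code_points)  # [37, 80, 68, 70, 10, 8364]
print(encoded)      # b'%PDF\n\\u20ac'
```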

I would also add a flush() in both functions and make sure the file gets closed. This is one of the differences: buffering differs between the two methods. Python will eventually close the file for you, but when debugging it is important to see the content of the file as soon as possible (and also after an exception), and the garbage collector cannot guarantee that. Maybe you were reading a file that had not been flushed yet.
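One way to get deterministic flushing and closing is a with block, sketched below. write_buffer() and the sample data are hypothetical. As far as I can tell from the pathlib source, Path.write_bytes() and Path.write_text() already open the file in a with block internally, so this mainly matters when calling open() directly.

```python
import tempfile
from pathlib import Path


def write_buffer(path: Path, buffer: str) -> None:
    """Hypothetical helper: write a latin1 str buffer as raw bytes."""
    with open(path, "wb") as f:
        f.write(buffer.encode("latin1"))
        f.flush()  # optional: leaving the with block closes (and flushes) the file
    # Here the file is closed, even if write() raised an exception.


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "out.pdf"
    write_buffer(target, "a\nb")
    written = target.read_bytes()

print(written)  # b'a\nb'
```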