I'm trying to read lines from a number of files. Some are gzipped, and others are plain text files. In Python 2.7, I have been using the following code and it worked:
    for line in fileinput.input(filenames, openhook=fileinput.hook_compressed):
        match = REGEX.match(line)
        if (match):
            # do things with line...
Now I moved to Python 3.8, and it still works ok with plain text files, but when it encounters gzipped files I get the following error:
TypeError: cannot use a string pattern on a bytes-like object
What's the best way to fix this? I know I can check if line is a bytes object and decode it into a string, but I would rather do it with some flag to automatically always iterate lines as string, if possible; and, I would prefer to write code that works with both Python 2 and 3.
fileinput.input does fundamentally different things depending on whether it gets a gzipped file or not. For text files, it opens with regular open, which effectively opens in text mode by default. For gzip.open, the default mode is binary, which is sensible for compressed files of unknown content.
The binary-only restriction is artificially imposed by fileinput.FileInput: its __init__ method validates the mode argument and rejects text modes such as 'rt', so the hook never gets a chance to open the gzipped file as text. This gives you two options for a workaround.
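The difference in default modes is easy to reproduce outside fileinput. Here is a small sketch for Python 3, using a throwaway temporary directory (the file names are made up for the demonstration):

```python
import gzip
import os
import tempfile

# Create one plain file and one gzipped file with the same content.
tmpdir = tempfile.mkdtemp()
plain_path = os.path.join(tmpdir, "sample.txt")
gz_path = os.path.join(tmpdir, "sample.txt.gz")

with open(plain_path, "w") as f:
    f.write("hello\n")
with gzip.open(gz_path, "wb") as f:
    f.write(b"hello\n")

# Built-in open defaults to text mode, so the line is a str.
with open(plain_path) as f:
    plain_line = f.readline()

# gzip.open defaults to binary mode ("rb"), so the line is bytes.
with gzip.open(gz_path) as f:
    gz_line = f.readline()

# Asking for "rt" explicitly makes gzip.open decode to str.
with gzip.open(gz_path, "rt") as f:
    gz_text_line = f.readline()

print(type(plain_line), type(gz_line), type(gz_text_line))
```

The bytes line in the middle case is what triggers the TypeError when a string pattern is matched against it.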
Option 1
Set the _mode attribute after __init__. To avoid adding extra lines of code to your usage, you can subclass fileinput.FileInput and use the class directly.
Option 2
Messing with undocumented leading-underscore attributes is pretty hacky, so instead, you can create a custom hook for compressed files. This is actually pretty easy, since you can use the code for fileinput.hook_compressed as a template.
Option 3
Finally, you can always decode your bytes to unicode strings. This is clearly not the preferable option.
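For reference, the custom hook from Option 2 could look roughly like the sketch below. The name hook_compressed_text and the UTF-8 default are my own assumptions, not part of the standard library. It handles only .gz files, and I have only exercised it on Python 3; on Python 2.7 the gzip stream may need an extra io.BufferedReader wrapper before io.TextIOWrapper accepts it.

```python
import gzip
import io
import os


def hook_compressed_text(filename, mode, encoding="utf-8"):
    """Open plain or .gz files so that iteration always yields text lines.

    Hypothetical helper modelled on fileinput.hook_compressed. The mode
    argument is accepted because fileinput passes it, but it is ignored:
    files are always opened for reading as text.
    """
    ext = os.path.splitext(filename)[1]
    if ext == ".gz":
        # Open the compressed stream as bytes, then decode it to text.
        return io.TextIOWrapper(gzip.open(filename, "rb"), encoding=encoding)
    # io.open takes an encoding argument on both Python 2.7 and 3.
    return io.open(filename, "r", encoding=encoding)
```

You would then pass openhook=hook_compressed_text to fileinput.input instead of fileinput.hook_compressed, and every line reaches REGEX.match as a string regardless of whether the file was gzipped.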