I'm trying to read lines from a number of files. Some are gzipped, and others are plain text files. In Python 2.7, I have been using the following code and it worked:
for line in fileinput.input(filenames, openhook=fileinput.hook_compressed):
    match = REGEX.match(line)
    if (match):
        # do things with line...
Now I moved to Python 3.8, and it still works ok with plain text files, but when it encounters gzipped files I get the following error:
TypeError: cannot use a string pattern on a bytes-like object
What's the best way to fix this? I know I can check whether line is a bytes object and decode it into a string, but I would rather use some flag that always iterates lines as strings, if possible. I would also prefer to write code that works with both Python 2 and 3.
fileinput.input does fundamentally different things depending on whether it gets a gzipped file or not. Plain text files are opened with the built-in open(), which defaults to text mode. Gzipped files are opened with gzip.open, whose default mode is binary, which is sensible for compressed files of unknown content.
The binary-only restriction is artificially imposed by fileinput.FileInput. Its __init__ method accepts only a small set of reading modes (such as 'r' and 'rb'), so a text mode like 'rt' cannot be passed through to gzip.open.
This gives you two options for a workaround.
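As a quick check of the mode restriction described above, the following sketch probes which opening modes fileinput.FileInput accepts. The helper name mode_is_accepted and the file name unused.txt are made up for illustration; no file is opened, because the mode is validated when the object is constructed.

```python
import fileinput


def mode_is_accepted(mode):
    """Return True if fileinput.FileInput accepts the given opening mode.

    Hypothetical helper: 'unused.txt' is a placeholder name and is never
    opened, since the mode check happens in the constructor.
    """
    try:
        fileinput.FileInput(files=["unused.txt"], mode=mode)
    except ValueError:
        return False
    return True


print(mode_is_accepted("r"))   # True
print(mode_is_accepted("rb"))  # True
print(mode_is_accepted("rt"))  # False: explicit text mode is rejected
```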
Option 1
Set the _mode attribute after __init__. To avoid adding extra lines of code to your usage, you can subclass fileinput.FileInput and use the class directly.
Option 2
Messing with undocumented leading-underscore attributes is pretty hacky, so instead, you can create a custom hook for compressed files. This is actually pretty easy, since you can use the code for fileinput.hook_compressed as a template.
Option 3
Finally, you can always decode your bytes to unicode strings. This is clearly not the preferable option.
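For reference, a custom hook along the lines of Option 2 might look like the sketch below. It is an illustration rather than the standard library's code. The names hook_compressed_text and matching_lines are hypothetical, the UTF-8 default is an assumption about the input files, and the sketch assumes Python 3 (it relies on text mode in gzip.open and bz2.open).

```python
import bz2
import fileinput
import gzip
import os


def hook_compressed_text(filename, mode, encoding="utf-8"):
    """Open plain, .gz, or .bz2 files so that iteration yields str lines.

    Hypothetical helper modeled on fileinput.hook_compressed. The mode
    argument is accepted because fileinput passes it, but it is ignored:
    files are always opened for text reading. UTF-8 is an assumption.
    """
    ext = os.path.splitext(filename)[1]
    if ext == ".gz":
        return gzip.open(filename, "rt", encoding=encoding)
    if ext == ".bz2":
        return bz2.open(filename, "rt", encoding=encoding)
    return open(filename, "r", encoding=encoding)


def matching_lines(filenames, pattern):
    """Yield every line (as str) for which pattern.match() succeeds."""
    with fileinput.input(files=filenames, openhook=hook_compressed_text) as lines:
        for line in lines:
            if pattern.match(line):
                yield line
```

With this hook the loop from the question no longer needs to care whether a file is compressed: every line arrives as a str, so a string regex can be applied directly. The trade-off is that the encoding has to be chosen up front instead of being left to the platform default.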