UnicodeDecodeError in `re.Pattern.search()`


I've received a bug report for a library I wrote. The symptom is that the `search` method of a compiled `Pattern` raises `UnicodeDecodeError`. The Python `re` documentation does not mention `UnicodeDecodeError`.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa1 in position 32: invalid start byte

The argument to `search` is a Windows file path. I suspect that the path in question is ill-formed UTF-16, that the (third-party, compiled, closed-source) code interacting with the filesystem is creating a malformed Python string object from it, and that `search` is then failing in an unusual manner as a result. Unfortunately, I don't have any specific information about what the broken file path might be, so I can't test it directly.

My questions:

  1. Is this a plausible thing that could happen?
  2. Is there any way within pure Python to construct a malformed Python string, or to check whether a given string is malformed?
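To make question 2 concrete, here is a minimal sketch of one kind of "malformed" string that pure Python can build: a `str` containing a lone surrogate code point, which is what an unpaired UTF-16 code unit in a Windows path would map to. The helper name `is_well_formed` and both sample paths are hypothetical, chosen only for illustration. The check assumes that "malformed" means "cannot be encoded as strict UTF-8". It does not reproduce a string object whose internal buffer has been corrupted by native code, which is the case suspected above.

```python
import re


def is_well_formed(s: str) -> bool:
    """Return True if s can be encoded as strict UTF-8 (no lone surrogates)."""
    try:
        s.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


# Hypothetical paths for illustration only; the real failing path is unknown.
good_path = "C:\\data\\file.txt"
# "\ud800" is a lone high surrogate with no low surrogate following it.
bad_path = "C:\\data\\" + "\ud800" + ".txt"

pattern = re.compile(r"\.txt$")

print(is_well_formed(good_path))
print(is_well_formed(bad_path))
# Searching the surrogate-containing string is attempted here to see
# whether it raises; a match object is expected if it does not.
print(pattern.search(bad_path) is not None)
```

In CPython, `search` on a string like `bad_path` appears to complete normally, so a lone surrogate by itself would not seem to account for the `UnicodeDecodeError` shown above.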