I'm working on some Python 3 code to grab NNTP messages, parse the headers, and process the data. My code works fine for the first several hundred messages, then it throws an exception.
The exception is:
sys.exc_info()
(<class 'UnicodeDecodeError'>, UnicodeDecodeError('utf-8', b"Ana\xefs's", 3, 4, 'invalid continuation byte'), <traceback object at 0x7fe325261c08>)
The problem is coming from trying to parse out the subject. The raw content of the message is:
{'subject': 'Re: Mme. =?UTF-8?B?QW5h73Mncw==?= Computer Died', 'from': 'Fred Williams <[email protected]>', 'date': 'Sun, 05 Aug 2007 18:55:22 -0400', 'message-id': '<[email protected]>', 'references': '<[email protected]>', ':bytes': '1353', ':lines': '14', 'xref': 'number1.nntp.dca.giganews.com rec.pets.cats.community:171958'}
That ?UTF-8? is what I don't know how to handle. The code fragment that is puking on itself is:
    for msgId, msg in overviews:
        print(msgId)
        hdrs = {}
        if msgId == 171958:
            print(msg)
        try:
            for k in msg.keys():
                hdrs[k] = nntplib.decode_header(msg[k])
        except:
            print('Unicode error!')
            continue
The problem here is that the input you have is actually invalid.
This string, the subject header, is the problem:
    'Re: Mme. =?UTF-8?B?QW5h73Mncw==?= Computer Died'
You can do this to decode it:
The result is:
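As a minimal sketch of one way to do this, the standard library's `email.header.decode_header` splits the header into raw byte chunks without decoding them to text. Using that function here is an assumption; any RFC 2047 decoder that exposes the raw bytes would show the same thing. The variable names are only illustrative.

```python
from email.header import decode_header

subject = 'Re: Mme. =?UTF-8?B?QW5h73Mncw==?= Computer Died'

# decode_header returns a list of (chunk, charset) pairs.
# The base64 part is decoded to bytes but NOT yet decoded to text.
parts = decode_header(subject)
for raw, charset in parts:
    print(repr(raw), charset)

# Pick out the chunk that claims to be UTF-8.
bad = next(raw for raw, charset in parts if charset == 'utf-8')
print(bad)  # b"Ana\xefs's"
```

Calling `bad.decode('utf-8')` on that chunk raises the same `UnicodeDecodeError` as in the question.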
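So, the ugly part, =?UTF-8?B?QW5h73Mncw==?=, is b"Ana\xefs's". It is supposed to be a UTF-8 string, but it is not valid UTF-8: in UTF-8 the byte \xef starts a multi-byte sequence and must be followed by continuation bytes, but here it is followed by a plain "s". This is the error you are seeing.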
Now it's up to you to decide what to do. For example...
Ignore the error:
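A small sketch of this option, assuming you already have the raw bytes of the encoded word in hand (the name `raw` is just for illustration). The `errors='ignore'` handler silently drops the bytes that cannot be decoded:

```python
raw = b"Ana\xefs's"

# Drop any byte that is not valid UTF-8.
ignored = raw.decode('utf-8', errors='ignore')
print(ignored)  # Anas's
```

The invalid character is lost without any trace, so this is only suitable when a slightly wrong subject is acceptable.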
Mark it as an error:
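A small sketch of this option, again assuming the raw bytes are available as `raw` (an illustrative name). The `errors='replace'` handler substitutes U+FFFD, the Unicode replacement character, for each undecodable byte:

```python
raw = b"Ana\xefs's"

# Replace any invalid byte with U+FFFD so the damage stays visible.
marked = raw.decode('utf-8', errors='replace')
print(marked)  # Ana\ufffds's, shown as "Ana" + replacement character + "s's"
```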
Make a wild guess of the correct encoding:
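A small sketch of this option. The guess that the sender really used Latin-1 (ISO-8859-1) is an assumption; it happens to produce a plausible name here, but nothing in the message confirms it. The helper name `decode_with_fallback` is hypothetical.

```python
def decode_with_fallback(raw, declared='utf-8', fallback='latin-1'):
    """Try the declared charset first, then fall back to a guessed one."""
    try:
        return raw.decode(declared)
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so this never fails,
        # but the result is only as good as the guess.
        return raw.decode(fallback)

guessed = decode_with_fallback(b"Ana\xefs's")
print(guessed)  # Anaïs's
```

Because Latin-1 accepts any byte sequence, a wrong guess produces garbage silently rather than raising an error.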