Decoding NNTP headers from (UTF-8?)

I'm working on some Python 3 code to grab NNTP messages, parse the headers, and process the data. My code works fine for the first several hundred messages, then it throws an exception.

The exception is:

sys.exc_info()
(<class 'UnicodeDecodeError'>, UnicodeDecodeError('utf-8', b"Ana\xefs's", 3, 4, 'invalid continuation byte'), <traceback object at 0x7fe325261c08>)

The problem is coming from trying to parse out the subject. The raw content of the message is:

{'subject': 'Re: Mme. =?UTF-8?B?QW5h73Mncw==?= Computer Died', 'from': 'Fred Williams <[email protected]>', 'date': 'Sun, 05 Aug 2007 18:55:22 -0400', 'message-id': '<[email protected]>', 'references': '<[email protected]>', ':bytes': '1353', ':lines': '14', 'xref': 'number1.nntp.dca.giganews.com rec.pets.cats.community:171958'}

That ?UTF-8? is what I don't know how to handle. The code fragment that is puking on itself is:

for msgId, msg in overviews:
    print(msgId)
    hdrs = {}
    if msgId == 171958:
        print(msg)
    try:
        for k in msg.keys():
            hdrs[k] = nntplib.decode_header(msg[k])
    except:
        print('Unicode error!')
        continue

There is 1 best solution below

The problem here is that the input you have is actually invalid: the header declares UTF-8, but the bytes it carries are not valid UTF-8.

This string is the problem:

'Re: Mme. =?UTF-8?B?QW5h73Mncw==?= Computer Died'

You can split it into its raw parts, without decoding them to text, like this:

import email.header
email.header.decode_header('Re: Mme. =?UTF-8?B?QW5h73Mncw==?= Computer Died')

The result is:

[(b'Re: Mme. ', None), (b"Ana\xefs's", 'utf-8'), (b' Computer Died', None)]

So the ugly part, =?UTF-8?B?QW5h73Mncw==?=, is the byte string b"Ana\xefs's". It is supposed to be a UTF-8 string, but it is not valid UTF-8.

>>> b"Ana\xefs's".decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xef in position 3: invalid continuation byte

This is the error you are seeing. nntplib.decode_header decodes each part with its declared charset, so it fails in the same way.
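To see where those bytes come from: an encoded word has the form =?charset?encoding?payload?= (RFC 2047), and the B means the payload is Base64. Here is a minimal sketch that decodes the payload by hand and captures the resulting error; the variable names are only for illustration.

```python
import base64

# Payload copied from the encoded word =?UTF-8?B?QW5h73Mncw==?= in the subject.
payload = "QW5h73Mncw=="
raw = base64.b64decode(payload)

# Try the charset the header claims and keep the exception for inspection.
try:
    raw.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc
```

A lone 0xef byte is ï in Latin-1 and windows-1252, which suggests the sending client labelled Latin-1 text as UTF-8.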

Now it's up to you to decide what to do. For example...

Ignore the error:

>>> b"Ana\xefs's".decode('utf-8', errors='ignore')
"Anas's"

Mark the bad byte with the replacement character:

>>> b"Ana\xefs's".decode('utf-8', errors='replace')
"Ana�s's"

Make a wild guess at the correct encoding:

>>> b"Ana\xefs's".decode('windows-1252')
"Anaïs's"
>>> b"Ana\xefs's".decode('iso-8859-1')
"Anaïs's"
>>> b"Ana\xefs's".decode('iso-8859-2')
"Anaďs's"
>>> b"Ana\xefs's".decode('iso-8859-4')
"Anaīs's"
>>> b"Ana\xefs's".decode('iso-8859-5')
"Anaяs's"
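
If you want to combine these options in the loop from the question, one approach is a small wrapper around email.header.decode_header. This is only a sketch: decode_header_lenient and its fallback parameter are hypothetical names, not part of nntplib or email, and windows-1252 is a guess that happens to fit this particular message.

```python
import email.header


def decode_header_lenient(value, fallback="windows-1252"):
    """Decode an RFC 2047 header, tolerating mislabelled charsets.

    Hypothetical helper; the fallback charset is an assumption.
    """
    parts = []
    for chunk, charset in email.header.decode_header(value):
        if isinstance(chunk, str):
            # decode_header returns str when the header has no encoded words.
            parts.append(chunk)
            continue
        try:
            # First trust the declared charset (ASCII for unencoded parts).
            parts.append(chunk.decode(charset or "ascii"))
        except (UnicodeDecodeError, LookupError):
            # Declared charset is wrong or unknown: try the guess, and if
            # that also fails, mark the undecodable bytes instead.
            try:
                parts.append(chunk.decode(fallback))
            except UnicodeDecodeError:
                parts.append(chunk.decode("utf-8", errors="replace"))
    return "".join(parts)


subject = "Re: Mme. =?UTF-8?B?QW5h73Mncw==?= Computer Died"
decoded = decode_header_lenient(subject)
```

In the question's loop you would call decode_header_lenient(msg[k]) in place of nntplib.decode_header(msg[k]). Correctly labelled headers still decode with their declared charset; the guess is only used when that fails.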