Access all fields in mbox using mailbox


I am attempting to perform some processing on email messages in mbox format.

After some searching and a bit of trial and error, I tried the mailbox module: https://docs.python.org/3/library/mailbox.html#mbox

I have got this to do most of what I want (even though I had to write code to decode subjects) using the test code listed below.

I found this somewhat hit and miss. In particular, finding the key needed to look up a field (e.g. 'subject') seems to be trial and error, and I can't find any way to list the candidate keys for a message. (I understand that the fields may differ from email to email.)

Can anyone help me to list the possible values?

I have another issue; the email may contain a number of "Received:" fields e.g.

Received: from awcp066.server-cpanel.com
Received: from mail116-213.us2.msgfocus.com ([185.187.116.213]:60917)
    by awcp066.server-cpanel.com with esmtps (TLSv1.2:ECDHE-RSA-AES256-GCM-SHA384:256)

I am interested in accessing the FIRST chronologically - I would be happy to search, but can't seem to find any way to access any but the first in the file.

#! /usr/bin/env python3
#import locale
#2020-08-31

"""
Extract Subject from MBOX file
"""

import os, time
import mailbox
import base64, quopri

def isbqencoded(s):
    """
    Test if Base64 or Quoted Printable strings
    """
    return s.upper().startswith('=?UTF-8?')

def bqdecode(s):
    """
    Convert UTF-8 Base64 or Quoted Printable string to str
    """
    nd = s.find('?=', 10)
    if s.upper().startswith('=?UTF-8?B?'):   # Base64
        bbb = base64.b64decode(s[10:nd])
    elif s.upper().startswith('=?UTF-8?Q?'): # Quoted Printable ('_' means space in headers)
        bbb = quopri.decodestring(s[10:nd], header=True)
    else:                                    # unrecognised encoding: return unchanged
        return s
    return bbb.decode("utf-8")

def sdecode(s):
    """
    Convert possibly multiline Base64 or Quoted Printable strings to str
    """
    outstr = ""
    if s is None:
        return outstr
    for ss in str(s).splitlines():   # split multiline strings
        sss = ss.strip()
        for sssp in sss.split(' '):   # split multiple strings
            if isbqencoded(sssp):
                outstr += bqdecode(sssp)
            else:
                outstr += sssp
            outstr+=' '
        outstr = outstr.strip()
    return outstr

INBOX = '~/temp/2020227_mbox'

print('Messages in ', INBOX)
mymail = mailbox.mbox(INBOX)
print('Values = ', mymail.values())
print('Keys = ', mymail.keys())
# print(mymail.items)
# for message in mailbox.mbox(INBOX):
for message in mymail:

#     print(message)
    subject = message['subject']
    to = message['to']
    id = message['id']
    received = message['Received']
    sender = message['from']
    ddate = message['Delivery-date']
    envelope = message['Envelope-to']


    print(sdecode(subject))
    print('To ', to)
    print('Envelope ', envelope)
    print('Received ', received)
    print('Sender ', sender)
    print('Delivery-date ', ddate)
#     print('Received ', received[1])

Based on this answer I simplified the Subject decoding, and got similar results.

I am still looking for suggestions to access the remainder of the Header - in particular how to access multiple "Received:" fields.

#! /usr/bin/env python3
#import locale
#2020-09-02

"""
Extract Subject from MBOX file
"""

import os, time
import mailbox
from email.parser import BytesParser
from email.policy import default

INBOX = '~/temp/2020227_mbox'
print('Messages in ', INBOX)

mymail = mailbox.mbox(INBOX, factory=BytesParser(policy=default).parse)

for message in mymail:
    print("date:  :", message['date'])
    print("to:    :", message['to'])
    print("from   :", message['from'])
    print("subject:", message['subject'])
    print('Received: ', message['received'])

    print("**************************************")

There are 2 best solutions below

Best answer

Based on a comment by snakecharmerb (now edited into the question), I simplified the process.
In the end I did not need to parse the Received fields, because the Message-ID header already gave me the identifier I had been trying to extract from them.

I list the code I finally used, in case this is of use to others. This code just extracts header fields of interest and prints them, but the full code performs analysis on the messages.

#! /usr/bin/env python3
#import locale
#2020-09-05

"""
Extract Message Header details from MBOX file
"""

import os, time
import mailbox
from email.parser import BytesParser
from email.policy import default

INBOX = '~/temp/Gmail'

print('Messages in ', INBOX)

mymail = mailbox.mbox(INBOX, factory=BytesParser(policy=default).parse)

for message in mymail:
    date = message['date']
    to = message['to']
    sender = message['from']
    subject = message['subject']
    messageID = message['Message-ID']
    received = message['received']
    deliveredTo = message['Delivered-To']
    if messageID is None:   # skip entries without a Message-ID
        continue

    print("Date        :", date)
    print("From        :", sender)
    print("To          :", to)
    print('Delivered-To:', deliveredTo)
    print("Subject     :", subject)
    print("Message-ID  :", messageID)
#     print('Received    :', received)

    print("**************************************")
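
The fields above were picked by hand. For the other part of the question (listing which header names a message actually carries), a minimal sketch using keys() and items() on the message object follows. The sample message and its hostnames are made up for illustration; with an mbox you would use each message from the loop instead.

```python
from email import message_from_string
from email.policy import default

# Hypothetical sample message, invented for this sketch.
RAW = (
    "Received: from b.example.org by c.example.org; Tue, 1 Sep 2020 10:00:05 +0000\n"
    "Received: from a.example.org by b.example.org; Tue, 1 Sep 2020 10:00:01 +0000\n"
    "From: Alice <alice@example.org>\n"
    "To: Bob <bob@example.org>\n"
    "Subject: Hello\n"
    "Message-ID: <1234@example.org>\n"
    "\n"
    "Body text.\n"
)

message = message_from_string(RAW, policy=default)

# keys() returns one entry per header line, so repeated headers appear repeatedly.
header_names = message.keys()
print(header_names)

# items() pairs each header name with its value, in message order.
for name, value in message.items():
    print(f"{name}: {value}")
```

Lookups such as message['subject'] are case-insensitive, so any name printed here can be used as a key regardless of capitalisation.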
Another answer

The email message object provides a get_all method which returns all instances of a header, so we can use this to obtain all the values of the received header.

# get_all() returns None when the header is absent, so pass an empty list as the fallback
for header in message.get_all('received', []):
    print('Received', header)

Each header is an instance of UnstructuredHeader (when the message is parsed with policy=default, as in the question). This isn't very helpful for identifying the earliest Received header, as the headers need to be parsed to extract the dates so that they can be sorted.
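
If you do want to sort by date, here is a rough sketch of that parsing step. It assumes the timestamp is the text after the last ';' in each Received value, which is what the Received syntax prescribes but which real-world mail does not always honour. The helper name received_time and the sample message are hypothetical.

```python
from email import message_from_string
from email.policy import default
from email.utils import parsedate_to_datetime

# Hypothetical sample message, invented for this sketch.
RAW = (
    "Received: from b.example.org by c.example.org; Tue, 1 Sep 2020 10:00:05 +0000\n"
    "Received: from a.example.org by b.example.org; Tue, 1 Sep 2020 10:00:01 +0000\n"
    "Subject: Hello\n"
    "\n"
    "Body text.\n"
)


def received_time(header):
    """Return the datetime after the last ';' of a Received value, or None."""
    _, sep, stamp = str(header).rpartition(";")
    if not sep:
        return None  # no ';' found, so no timestamp to parse
    try:
        return parsedate_to_datetime(stamp.strip())
    except (TypeError, ValueError):
        return None  # timestamp present but not parseable


message = message_from_string(RAW, policy=default)

# Pair each Received value with its parsed time and drop the unparseable ones.
stamped = [(received_time(h), str(h)) for h in message.get_all("received", [])]
stamped = [(t, h) for t, h in stamped if t is not None]

# The chronologically first hop is the one with the smallest timestamp.
earliest_time, earliest_header = min(stamped, key=lambda pair: pair[0])
print(earliest_time.isoformat(), "|", earliest_header)
```

Two caveats: min() raises ValueError if no Received header could be parsed, and a timestamp with a -0000 offset parses to a naive datetime, which cannot be compared with timezone-aware ones.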

However, according to this answer, which quotes the RFC, received headers are always inserted at the beginning of the message. The docstring for EmailMessage.get_all() states:

Return a list of all the values for the named field. These will be sorted in the order they appeared in the original message, and may contain duplicates.

So the earliest received header should be the last header in the list returned by EmailMessage.get_all().
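
As a minimal sketch of that conclusion, assuming every relay really did prepend its Received header: take the last element of the list. The sample message is made up, and first_hop is a hypothetical helper name.

```python
from email import message_from_string
from email.policy import default

# Hypothetical sample message, invented for this sketch.
RAW = (
    "Received: from b.example.org by c.example.org; Tue, 1 Sep 2020 10:00:05 +0000\n"
    "Received: from a.example.org by b.example.org; Tue, 1 Sep 2020 10:00:01 +0000\n"
    "Subject: Hello\n"
    "\n"
    "Body text.\n"
)


def first_hop(message):
    """Return the last Received header in file order, or None if there is none."""
    received = message.get_all("received", [])
    return received[-1] if received else None


message = message_from_string(RAW, policy=default)
earliest = first_hop(message)
print("Earliest Received:", earliest)
```

The same function can be called on each message yielded by the mailbox.mbox loop in the question.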