Separate emails in an email thread based on References or In-Reply-To headers using imap_tools


I am working on a CRM that receives hundreds of emails with offers/requirements per day. I am building an API that will process each email and insert entries into the CRM.

I am using imap_tools to fetch the mails in my API, but I am stuck when a message is part of a thread/conversation. I read some articles about using the References or In-Reply-To header of the mail, but have had no luck so far. I have also tried using the Message-ID, but it gave me the same email thread instead of multiple emails.

I am getting an email thread/conversation as a single email, and I want to get the separate emails so I can process them easily.

Here's what I have done so far:

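from imap_tools import MailBox

# Log in and select the INBOX folder; the connection is closed when the block exits.
with MailBox('mail.mail.com').login('[email protected]', 'password', 'INBOX') as mailbox:
    for msg in mailbox.fetch():
        # msg.headers maps lowercase header names to tuples of values.
        From = msg.headers['from'][0]
        To = msg.headers['to'][0]
        subject = msg.headers['subject'][0]
        received_date = msg.headers['date'][0]
        raw_email = msg.text  # plain-text body, including any quoted replies
        process_email(raw_email)  # processing the email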

There is 1 best solution below


The issue you are facing is not related to the References or In-Reply-To headers. Most email clients append the previous email as quoted text to the new mail when you reply. Hence, in a thread, each mail carries the bodies of all previous mails as quoted text.

In most cases (and I say most because email conventions vary a lot from client to client), the client quotes the previous mail by prepending > to every quoted line:

new message

> old message
>> very old message

As a hacky solution, you can drop all lines that start with >.

In Python, you can use splitlines() and filter:

lines = email.splitlines()
new_lines = [i for i in lines if not i.startswith('>')]

or

new_lines = list(filter(lambda i: not i.startswith('>'), lines))

You may use regular expressions or other techniques too.
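For instance, a regular-expression variant of the same filter might look like the sketch below. The function name strip_quoted_lines and the sample text are only illustrative, and the pattern also treats lines with leading whitespace before > as quoted, which is an assumption rather than a rule every client follows.

```python
import re

# A line counts as quoted if it starts with optional whitespace followed by '>'.
QUOTED_LINE = re.compile(r"^\s*>")


def strip_quoted_lines(body: str) -> str:
    """Return the body without lines that look like '>'-quoted text."""
    kept = [line for line in body.splitlines() if not QUOTED_LINE.match(line)]
    # Trim the blank lines left behind at the start or end of the message.
    return "\n".join(kept).strip()


sample = "new message\n\n> old message\n>> very old message\n"
print(strip_quoted_lines(sample))
```

This has the same weakness as the startswith version: any genuine line that begins with > is dropped as well.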

The issue with this solution is obvious: if an email has other lines that start with >, dropping them loses information. Hence a more complicated approach is to select the lines that start with > and compare them with the previous emails in the thread (found using the References header), removing only those that match.
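A minimal sketch of that comparison is shown below. It assumes you have already collected the plain-text bodies of the earlier messages in the thread (for example by looking up the Message-ID values listed in the References header); the function name strip_known_quotes and its parameters are hypothetical. It also assumes quoted lines were not re-wrapped by the replying client, which is not always true.

```python
from typing import Iterable


def strip_known_quotes(body: str, earlier_bodies: Iterable[str]) -> str:
    """Drop '>'-quoted lines only when their text appears in an earlier message."""
    # Every non-empty line from the earlier messages of the same thread.
    known = {
        line.strip()
        for earlier in earlier_bodies
        for line in earlier.splitlines()
        if line.strip()
    }
    kept = []
    for line in body.splitlines():
        stripped = line.lstrip()
        if stripped.startswith(">"):
            # Remove the quote markers (possibly nested, e.g. '>> ') before comparing.
            unquoted = stripped.lstrip("> ").strip()
            # Assumption: an empty quoted line is quote padding and can be dropped.
            if unquoted == "" or unquoted in known:
                continue
        kept.append(line)
    return "\n".join(kept).strip()


earlier = ["old message", "very old message"]
reply = "new message\n> 3 is the answer\n\n> old message\n>> very old message"
print(strip_known_quotes(reply, earlier))
```

Here the line "> 3 is the answer" survives because it does not match anything in the earlier messages, while the two real quotes are removed.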

Google has its patented implementation here: https://patents.google.com/patent/US7222299

Source: How to remove the quoted text from an email and only show the new text


Edit

I realized Gmail follows the > quoting and other clients may follow other methods. There's a Wikipedia article on it: https://en.wikipedia.org/wiki/Posting_style

Conceptually, the approach needed will be similar, but different types of clients will need to be handled.
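As a rough illustration of handling more than one client, the sketch below cuts the body at the first line that looks like a reply separator. The two patterns (an "On ... wrote:" attribution line and an "Original Message" divider) are assumptions about common client output, not an exhaustive or authoritative list, and the approach only works for top-posted replies where the new text comes first. An attribution line that a client wraps across two lines would not be caught.

```python
import re

# Hypothetical, non-exhaustive reply separators; extend for the clients you see.
REPLY_MARKERS = [
    re.compile(r"^On .+ wrote:\s*$"),  # attribution line before a quoted reply
    re.compile(r"^-+\s*Original Message\s*-+\s*$", re.IGNORECASE),  # divider line
]


def cut_at_reply_marker(body: str) -> str:
    """Keep only the text above the first line that matches a known marker."""
    kept = []
    for line in body.splitlines():
        if any(pattern.match(line.strip()) for pattern in REPLY_MARKERS):
            break  # everything from here on is treated as the quoted thread
        kept.append(line)
    return "\n".join(kept).strip()


divider_style = "Thanks, see below.\n\n-----Original Message-----\nFrom: a\nold text"
attribution_style = "Sounds good\n\nOn Mon, 1 Jan 2024 at 10:00, Bob wrote:\n> old"
print(cut_at_reply_marker(divider_style))
print(cut_at_reply_marker(attribution_style))
```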