I am working on a CRM, where I am receiving hundreds of emails for offers/requirements per day. I am building an API that will process the email and will insert entries in the CRM.
I am using imap_tools to get the mails in my API, but I am stuck at the point where there's a thread/conversation. I read some articles about using the References or In-Reply-To header from the mail, but no luck so far. I have also tried using the Message-ID, but it gave me the same email thread instead of multiple emails.
I am getting an email thread/conversation as a single email and I want to get separated emails so I can process them easily.
Here's what I have done so far:

from imap_tools import MailBox

with MailBox('mail.mail.com').login('[email protected]', 'password', 'INBOX') as mailbox:
    for msg in mailbox.fetch():
        # msg.headers maps lowercase header names to tuples of values
        From = msg.headers['from'][0]
        To = msg.headers['to'][0]
        subject = msg.headers['subject'][0]
        received_date = msg.headers['date'][0]
        raw_email = msg.text  # plain-text body, including any quoted replies
        process_email(raw_email)  # processing the email
The issue you are facing is not related to the References or In-Reply-To headers. Most email clients append the previous email as quoted text to the new mail when you reply. Hence, in a thread, a mail will have the body of all previous mails as quoted text.
In most cases (and I say most since email standards vary a lot from client to client), the client quotes the previous mail by prepending > to all quoted lines.
As a hacky solution, you can drop all lines that start with >.
In Python, you can use splitlines() and filter, or you may use regular expressions or other techniques too.
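A minimal sketch of the splitlines()-and-filter idea, assuming the body is the plain-text string you already get from msg.text. The function name strip_quoted_lines and the sample reply are made up for illustration.

```python
def strip_quoted_lines(body: str) -> str:
    """Return the body without lines that begin with the '>' quote marker."""
    kept = [
        line
        for line in body.splitlines()
        if not line.lstrip().startswith(">")  # drop quoted lines
    ]
    return "\n".join(kept).strip()


# Hypothetical reply that quotes the previous mail below the new text
reply = "Sounds good, send the offer.\n\n> Can you do 50 units?\n> Regards, Bob"
print(strip_quoted_lines(reply))
```

In the loop from the question, this would sit between reading msg.text and calling process_email.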
The issue with this solution is obvious: if an email contains > elsewhere, it will cause loss of information. Hence a more complicated approach is to select the lines starting with >, compare them with the previous emails in the thread (found using the References header), and remove those which match.
Google has a patented implementation here: https://patents.google.com/patent/US7222299
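A rough sketch of that comparison, assuming you have already collected the plain-text bodies of the earlier mails in the thread (for example by following the References header; that lookup is not shown). The name strip_known_quotes and the sample data are hypothetical, and this is not the patented Google approach, just a simple line-by-line match.

```python
def strip_known_quotes(body: str, earlier_bodies: list[str]) -> str:
    """Drop '>' lines only when their text appears in an earlier message."""
    # Collect every non-empty line of the earlier mails, without quote markers
    known = set()
    for earlier in earlier_bodies:
        for line in earlier.splitlines():
            text = line.lstrip("> \t").strip()
            if text:
                known.add(text)

    kept = []
    for line in body.splitlines():
        stripped = line.lstrip()
        if stripped.startswith(">"):
            quoted_text = stripped.lstrip("> \t").strip()
            # A bare '>' line carries no text, so it is dropped as well
            if not quoted_text or quoted_text in known:
                continue
        kept.append(line)
    return "\n".join(kept).strip()


# Hypothetical thread: one earlier mail and a reply that quotes it
earlier = ["Can you do 50 units?\nRegards, Bob"]
reply = (
    "Yes, we can.\n"
    "> Can you do 50 units?\n"
    "> Regards, Bob\n"
    "> 100 is our limit per order"
)
print(strip_known_quotes(reply, earlier))
```

The last line of the sample starts with > but does not occur in the earlier mail, so it survives. Exact matching will miss quotes that the client re-wrapped or otherwise altered.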
Source: How to remove the quoted text from an email and only show the new text
Edit
I realized Gmail follows the > quoting and other clients may follow other methods. There's a Wikipedia article on it: https://en.wikipedia.org/wiki/Posting_style
Conceptually the approach needed will be similar, but different types of clients will need to be handled.
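One way to handle other styles, sketched under assumptions: some clients introduce the quoted part with a line such as "On <date>, <name> wrote:" or "-----Original Message-----" instead of (or in addition to) > markers. The two patterns below are illustrative guesses, not an exhaustive list, and cut_at_reply_marker is a hypothetical helper name.

```python
import re

# Illustrative, non-exhaustive markers that often introduce quoted text
REPLY_MARKERS = [
    re.compile(r"^On .+ wrote:\s*$"),
    re.compile(r"^-+\s*Original Message\s*-+\s*$", re.IGNORECASE),
]


def cut_at_reply_marker(body: str) -> str:
    """Keep only the text above the first line that looks like a reply marker."""
    kept = []
    for line in body.splitlines():
        if any(pattern.match(line.strip()) for pattern in REPLY_MARKERS):
            break  # everything from here on is assumed to be quoted
        kept.append(line)
    return "\n".join(kept).strip()


# Hypothetical reply using the "Original Message" style
sample = "Thanks, confirmed.\n\n-----Original Message-----\nFrom: Bob\nCan you confirm?"
print(cut_at_reply_marker(sample))
```

This only works for top-posted replies and will miss a marker that the client wrapped across two lines, so it is best combined with the > filtering rather than used alone.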