I've been trying to thread emails from Gmail for the past few days and have gotten quite frustrated. I want to group all the emails in my Gmail account into threads, as the Gmail web client does, concatenate the text of each thread into its own string, and save those strings in a document. A solution using imaplib would be great, but imapclient is higher-level, so I would prefer that.
So far I have figured out how to extract text from emails, but I can't find any resources on how to use "X-GM-THRID" to thread the emails in my inbox. I have looked everywhere and tried pretty much everything I can. Any help would be appreciated.
Edit: Here is the code I have so far to extract text from a single email.
#Import necessary packages
!pip install imaplib2
!pip install html2text
import imaplib2 as imap
import email
from email.header import decode_header, make_header
import html2text
import re
#Login
con = imap.IMAP4_SSL(imap_server)
con.login(username, password)
#Get list of emails in inbox
Mailbox = 'INBOX'
con.select(Mailbox, readonly = True)
_, email_ids = con.search(None, "ALL")
email_ids = email_ids[0].split()
print("Number of emails in ", Mailbox, " is: ", len(email_ids))
#Define function to extract text from an email given its message sequence number
#(con.search returns sequence numbers, not UIDs; use con.uid for real UIDs)
def extract_text_from_email(msg_num):
    _, data = con.fetch(msg_num, "(RFC822)")
    email_message = email.message_from_bytes(data[0][1])
    text = ""
    #Add each header that is present; a missing header would make decode_header fail
    for name in ("To", "From", "Date", "BCC", "Subject"):
        if email_message[name]:
            text = text + name + ": " + str(make_header(decode_header(email_message[name]))) + ". "
    for part in email_message.walk():
        if part.get_content_type() == "text/plain":
            #Extract the payload if the part is plain text; fall back to UTF-8 when no charset is declared
            charset = part.get_content_charset() or "utf-8"
            result = part.get_payload(decode = True).decode(charset, errors = "replace")
            #Replace urls with "url", replace "\n" with a space and remove "\r"
            result = re.sub(r"http\S+", "url", result).replace("\n", " ").replace("\r","")
            text = text + result
    return text
extract_text_from_email(email_ids[-1])
X-GM-THRID is dead simple: if two messages have the same thrid, they're in the same thread. To use it, just fetch your messages in the same way as before. You can fetch everything, or just e.g. BODYSTRUCTURE X-GM-THRID. Create a dictionary with the integer thrid as key and a list of messages as value. For each message, look up its thrid in the dictionary and add the message to that list. Now each value in the dictionary is a thread. The thread/list is probably roughly sorted already, but you can sort it by date easily enough.
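Here is a minimal sketch of the dictionary approach, assuming plain imaplib-style FETCH responses where each item is a bytes line such as b'1 (X-GM-THRID 1266894439832287888)'. The names THRID_RE, group_by_thread, fetch_threads and concat_thread_text are made up for illustration, and extract_text stands in for whatever function turns one message number into text (for example the extract_text_from_email function from the question).

```python
import re
from collections import defaultdict

# Matches one FETCH response line, e.g. b'12 (X-GM-THRID 1278455344230334865)'.
# Assumption: the server returns the message number first, then the attributes.
THRID_RE = re.compile(rb"^(\d+) \(.*?X-GM-THRID (\d+)")


def group_by_thread(fetch_lines):
    """Map each thread ID (int) to the list of message numbers in that thread."""
    threads = defaultdict(list)
    for line in fetch_lines:
        if not isinstance(line, bytes):
            continue  # skip tuples/None that appear when literals are returned
        match = THRID_RE.match(line)
        if match:
            msg_num, thrid = match.groups()
            threads[int(thrid)].append(msg_num)
    return dict(threads)


def fetch_threads(con):
    """Fetch only X-GM-THRID for every message in the selected mailbox.

    con is a logged-in imaplib connection with a non-empty mailbox selected.
    """
    _, data = con.fetch("1:*", "(X-GM-THRID)")
    return group_by_thread(data)


def concat_thread_text(threads, extract_text):
    """Join the text of all messages in each thread into one string per thread."""
    return {
        thrid: " ".join(extract_text(msg_num) for msg_num in msg_nums)
        for thrid, msg_nums in threads.items()
    }
```

With a live connection this would be used roughly as concat_thread_text(fetch_threads(con), extract_text_from_email). Messages stay in the order the server returned them, which is usually close to chronological; sorting by the Date header would be an extra step on top of this.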