I'm trying to extract all the emails from my gmail account to do some analysis. The end goal is a dataframe of emails. I'm using the gmailR package.
So far I've extracted all the email threads and "expanded" them by mapping all the thread IDs to gm_thread(). Here's the code for that:
threads <- gm_threads(num_results = 5)
thread_ids <- gm_id(threads)
#extract all the thread ids
threads_expanded <- map(thread_ids, gm_thread)
This returns a list of all the threads. The structure of this is a list of gmail_thread objects. When you drill down one level into the list of thread objects, str(threads_expanded[[1]], max.level = 1)
, you get a single thread object which looks like:
List of 3
$ id : chr "xxxx"
$ historyId: chr "yyyy"
$ messages :List of 3
- attr(*, "class")= chr "gmail_thread"
Then, if you drill down further into the messages composing the threads, you start to get the useful info. str(threads_expanded[[1]]$messages, max.level = 1)
gets you a list of the gmail_message objects for that thread:
List of 3
$ :List of 8
..- attr(*, "class")= chr "gmail_message"
$ :List of 8
..- attr(*, "class")= chr "gmail_message"
$ :List of 8
..- attr(*, "class")= chr "gmail_message"
Where I'm stuck is actually extracting all the useful information from each email within all the threads. The end goal is a dataframe with a column for the message_id, thread_id, to, from, etc. I'm imagining something like this:
message_id | thread_id | to | from | ... |
-------------------------------------------------------------------------
1234 | abcd | [email protected] | [email protected] | ... |
1235 | abcd | [email protected] | [email protected] | ... |
1236 | abcf | [email protected] | [email protected] | ... |
It's not the prettiest answer, but it works. I'm going to work on vectorizing it later: