Extracting All Emails Using gmailr


I'm trying to extract all the emails from my Gmail account to do some analysis. The end goal is a data frame of emails. I'm using the gmailr package.

So far I've extracted the email threads and "expanded" them by mapping gm_thread() over the thread IDs. Here's the code for that:

threads <- gm_threads(num_results = 5)

thread_ids <- gm_id(threads)
#extract all the thread ids

threads_expanded <- map(thread_ids, gm_thread)

This returns a list of gmail_thread objects. Drilling down one level with str(threads_expanded[[1]], max.level = 1) shows a single thread object, which looks like:

List of 3
 $ id       : chr "xxxx"
 $ historyId: chr "yyyy"
 $ messages :List of 3
 - attr(*, "class")= chr "gmail_thread"

Then, if you drill down further into the messages composing the threads, you start to get the useful info. str(threads_expanded[[1]]$messages, max.level = 1) gets you a list of the gmail_message objects for that thread:

List of 3
 $ :List of 8
  ..- attr(*, "class")= chr "gmail_message"
 $ :List of 8
  ..- attr(*, "class")= chr "gmail_message"
 $ :List of 8
  ..- attr(*, "class")= chr "gmail_message"

Where I'm stuck is actually extracting all the useful information from each email within all the threads. The end goal is a dataframe with a column for the message_id, thread_id, to, from, etc. I'm imagining something like this:

    message_id    |  thread_id    |  to            |  from            | ... |
    -------------------------------------------------------------------------
    1234          |  abcd         |  [email protected]  | [email protected]    | ... |
    1235          |  abcd         |  [email protected] | [email protected]     | ... |
    1236          |  abcf         |  [email protected]  | [email protected]    | ... |

There are 1 best solutions below


It's not the prettiest answer, but it works. I'm going to work on vectorizing it later:

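library(gmailr)
library(purrr)
library(lubridate)
library(tibble)
#load the packages used below; assumes gmailr authentication is already set up

threads <- gm_threads(num_results = 5)

thread_ids <- gm_id(threads)
#extract all the thread ids

threads_expanded <- map(thread_ids, gm_thread)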

msgs <- list()
for(i in seq_along(threads_expanded)){
  msgs <- append(msgs, values = threads_expanded[[i]]$messages)
}
#extract all the individual messages from each thread
#(seq_along avoids the 1:0 problem when no threads are returned)

msg_ids <- unlist(map(msgs, gm_id))
#get the message id for each message

msg_body <- vector()
#get message body, store in vector
for(msg in msgs){
  body <- gm_body(msg)
  attchmnt <- nrow(gm_attachments(msg))
  if(length(body) != 0 && attchmnt == 0){
    #gm_body() does not return a null value for an empty body, rather a
    #list of length 0, so if the body is not empty (there is something
    #there) and there are no attachments, add it to the vector
    msg_body <- append(msg_body, body)
  }
  else{
    #otherwise (the body is empty or the message has attachments), add ""
    #so that msg_body stays aligned with msgs
    msg_body <- append(msg_body, "")
  }
}
msg_body <- unlist(msg_body)

msg_datetime <- msgs %>%
  map(gm_date) %>%
  unlist() %>%
  dmy_hms()
#get datetime info, store in vector

message_df <- tibble(msg_ids, msg_datetime, msg_body)
#all the other possible categories, e.g., to, from, cc, subject, etc.,
#either use a similar for loop or a map call
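
The last comment suggests a map call for the remaining fields. Here is a minimal sketch of that idea, written in base R so it runs without a Gmail connection. extract_chr is a hypothetical helper (not part of gmailr), and the mock messages are plain lists standing in for gmail_message objects. With real messages, the accessors passed in would presumably be gmailr functions such as gm_id, gm_to, or gm_subject. The threadId field mirrors the Gmail API message resource, but treat it as an assumption for gmailr objects, as is the guess that a missing header comes back as NULL or a zero-length value.

```r
# Hypothetical helper (not part of gmailr): apply `accessor` to every message
# and return exactly one string per message, using NA when the accessor
# returns NULL or a zero-length value, so every column has the same length.
extract_chr <- function(msgs, accessor) {
  vapply(msgs, function(m) {
    value <- accessor(m)
    if (length(value) == 0) NA_character_ else as.character(value[[1]])
  }, character(1), USE.NAMES = FALSE)
}

# Mock messages: plain lists standing in for gmail_message objects.
mock_msgs <- list(
  list(id = "1234", threadId = "abcd", to = "a@example.com", subject = "Hello"),
  list(id = "1235", threadId = "abcd", to = NULL, subject = "Re: Hello")
)

# With real gmail_message objects, the anonymous functions below would be
# replaced by accessors such as gm_id, gm_to, or gm_subject (assumption).
mock_df <- data.frame(
  message_id = extract_chr(mock_msgs, function(m) m$id),
  thread_id  = extract_chr(mock_msgs, function(m) m$threadId),
  to         = extract_chr(mock_msgs, function(m) m$to),
  subject    = extract_chr(mock_msgs, function(m) m$subject),
  stringsAsFactors = FALSE
)
```

Returning NA for a missing field, rather than dropping it as unlist() would, keeps every column the same length as the list of messages, which is what a data frame or tibble needs.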