Create tm corpus including text (tweet) attributes from dataframe


I have a data frame of tweets that includes the creation date, tweet id, favorite count and retweet count. I want to create a corpus that stores the favorite and retweet counts as variables for each document. I also want to identify the documents by tweet id, not by the default ids such as doc 0001.

I start with the data below; the rest of the code follows it.

                   id
1: 737243856144629760
2: 737242308261842945
3: 737242189055594496
4: 737242018687164416
5: 737241411465170944
6: 737239685295181824
                                                                                                                                    text
1:                                                    Have a great Memorial Day and remember that we will soon MAKE AMERICA GREAT AGAIN!
2:                 "@NBCDFW: Trump rallies veterans at annual Rolling Thunder Gathering https://twitter.com/b08FcMlgkr https://twitter.com/RCDeLvHQqD"
3:                "@FrankyLamouche: how many of donald's rolling thunder brigade will sign up and go to war for him in the middle east."
4:    "@MariaErnandez3b: Trump Supports Rolling Thunder Rally #TRUMP STRONG https://twitter.com/pfVXQ8NdZu" So true, and remember the M.I.A.'s!
5:     "@ScottWRasmussen: Donald Trump and Bikers Share Affection at Rolling Thunder Rally https://twitter.com/ZZl2sc29dn" A great day in D.C.!
6: "@TeaPartyNevada: #Trump2016 "Illegals are taken care of better than our veterans."  https://twitter.com/KKIgM4rNma https://twitter.com/1cEZ8wG7Cy"
   favorited favoriteCount replyToSN             created truncated replyToSID replyToUID
1:     FALSE         25944        NA 2016-05-30 11:26:47     FALSE         NA         NA
2:     FALSE          9268        NA 2016-05-30 11:20:38     FALSE         NA         NA
3:     FALSE          6739        NA 2016-05-30 11:20:09     FALSE         NA         NA
4:     FALSE         15417        NA 2016-05-30 11:19:29     FALSE         NA         NA
5:     FALSE          7192        NA 2016-05-30 11:17:04     FALSE         NA         NA
6:     FALSE          9834        NA 2016-05-30 11:10:12     FALSE         NA         NA
                                                                           statusSource      screenName retweetCount
1: <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a> realDonaldTrump         9455
2: <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a> realDonaldTrump         2744
3: <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a> realDonaldTrump         1604
4: <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a> realDonaldTrump         4237
5: <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a> realDonaldTrump         2148
6: <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a> realDonaldTrump         3545
   isRetweet retweeted longitude latitude
1:     FALSE     FALSE        NA       NA
2:     FALSE     FALSE        NA       NA
3:     FALSE     FALSE        NA       NA
4:     FALSE     FALSE        NA       NA
5:     FALSE     FALSE        NA       NA
6:     FALSE     FALSE        NA       NA
                                                                                                                                cleantxt
1:                                                    have a great memorial day and remember that we will soon make america great again!
2:                 "@nbcdfw: trump rallies veterans at annual rolling thunder gathering https://twitter.com/b08fcmlgkr https://twitter.com/rcdelvhqqd"
3:                "@frankylamouche: how many of donald's rolling thunder brigade will sign up and go to war for him in the middle east."
4:    "@mariaernandez3b: trump supports rolling thunder rally #trump strong https://twitter.com/pfvxq8ndzu" so true, and remember the m.i.a.'s!
5:     "@scottwrasmussen: donald trump and bikers share affection at rolling thunder rally https://twitter.com/zzl2sc29dn" a great day in d.c.!
6: "@teapartynevada: #trump2016 "illegals are taken care of better than our veterans."  https://twitter.com/kkigm4rnma https://twitter.com/1cez8wg7cy"

I try to convert it to a corpus with:

myReader <- readTabular(mapping=list(content="cleantxt", id="id", created="created", retweet="retweetCount", fav="favoriteCount"))
trumptweetsenhanced <- VCorpus(DataframeSource(trumptweets.df), readerControl=list(reader=myReader))

However, when I convert the corpus back to a data frame, the added variables are missing:

> head(trumptweetsenhanced_dataframe.df)
      docs                                                                            text
1 doc 0001                            great memori day rememb will soon make america great
2 doc 0002                           nbcdfw trump ralli veteran annual roll thunder gather
3 doc 0003       frankylamouch mani donald roll thunder brigad will sign go war middl east
4 doc 0004     mariaernandezb trump support roll thunder ralli trump strong true rememb ms
5 doc 0005 scottwrasmussen donald trump biker share affect roll thunder ralli great day dc
6 doc 0006                            teapartynevada trump illeg taken care better veteran

There is 1 best solution below

You can add metadata to your tweet corpus with the tm::meta() function. For a quick tour, run library(tm); example(meta).

Metadata can be attached at the corpus level: there you might store information common to all tweets, such as the date the tweets in this corpus were harvested, the search query string, or API call details.

Metadata can also be attached at the document level (in this case, per tweet): you can store tweet attributes from your trumptweets.df data frame, such as retweet count, favorite count or language, inside the corpus.

This takes careful housekeeping, because the metadata has to stay aligned with the documents it describes. Typically you write a few small helper functions and call meta() for reading and writing through the *apply family of functions (or purrr::walk* / purrr::map*).
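
A minimal sketch of both levels of annotation. It assumes a small stand-in data frame called tweets (hypothetical; ids and counts are copied from the first three rows in the question, the text is abbreviated) rather than the full trumptweets.df. The tag names source_account, retweet and fav are illustrative choices, not something tm prescribes.

```r
library(tm)

# Stand-in for trumptweets.df (assumption: three rows, texts shortened).
# Tweet ids are kept as character: they are larger than 2^53 and would
# lose digits if stored as ordinary doubles.
tweets <- data.frame(
  id            = c("737243856144629760", "737242308261842945", "737242189055594496"),
  cleantxt      = c("have a great memorial day and remember that we will soon make america great again!",
                    "@nbcdfw: trump rallies veterans at annual rolling thunder gathering",
                    "@frankylamouche: how many of donald's rolling thunder brigade will sign up and go to war for him in the middle east."),
  retweetCount  = c(9455, 2744, 1604),
  favoriteCount = c(25944, 9268, 6739),
  stringsAsFactors = FALSE
)

# Build the corpus from the text column only.
corp <- VCorpus(VectorSource(tweets$cleantxt))

# Corpus-level metadata: one value describing the whole collection.
meta(corp, "source_account", type = "corpus") <- "realDonaldTrump"

# Document-level metadata: one value per document, in document order.
meta(corp, "retweet") <- tweets$retweetCount
meta(corp, "fav")     <- tweets$favoriteCount

# Replace the default ids with the tweet ids, one document at a time.
for (i in seq_along(corp)) {
  meta(corp[[i]], "id") <- tweets$id[i]
}

meta(corp)                    # data frame with the retweet and fav columns
meta(corp, type = "corpus")   # corpus-level tags
meta(corp[[1]], "id")         # id of the first document
```

The loop over seq_along(corp) is the simplest form of the housekeeping mentioned above. It relies on the rows of tweets being in the same order as the documents in the corpus, so any filtering or reordering should happen before the corpus is built.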

I'm writing this from memory; it has been a while since I worked with meta(). There may be a completely different way, such as nested data frames or another text-mining package.
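
One alternative that stays within tm, offered as a sketch rather than a tested recipe: as far as I recall, recent tm releases (the 0.7 series onward) no longer ship readTabular(), and DataframeSource() instead expects a data frame whose first two columns are named doc_id and text, treating any further columns as document-level metadata. The data frames tweets and tm_input below are hypothetical stand-ins for trumptweets.df, and the column names retweet and fav are illustrative.

```r
library(tm)

# Stand-in for trumptweets.df (assumption: three rows, texts shortened).
tweets <- data.frame(
  id            = c("737243856144629760", "737242308261842945", "737242189055594496"),
  cleantxt      = c("have a great memorial day and remember that we will soon make america great again!",
                    "@nbcdfw: trump rallies veterans at annual rolling thunder gathering",
                    "@frankylamouche: how many of donald's rolling thunder brigade will sign up and go to war for him in the middle east."),
  retweetCount  = c(9455, 2744, 1604),
  favoriteCount = c(25944, 9268, 6739),
  stringsAsFactors = FALSE
)

# Rename and reorder: doc_id first, text second, metadata columns after.
tm_input <- data.frame(
  doc_id  = tweets$id,
  text    = tweets$cleantxt,
  retweet = tweets$retweetCount,
  fav     = tweets$favoriteCount,
  stringsAsFactors = FALSE
)

corp2 <- VCorpus(DataframeSource(tm_input))

meta(corp2)              # document-level metadata taken from the extra columns
meta(corp2[[1]], "id")   # document id taken from doc_id
```

If this matches your installed version, it removes most of the manual bookkeeping, because the ids and counts travel with the text from the start.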