I am trying to plot a weighted graph of terms used in tweets. Basically, I made a term-document matrix, removed sparse terms, built an adjacency matrix of the remaining words, and would like to plot them. I can't figure out where the problem is. I tried to do it exactly as shown at: http://www.rdatamining.com/examples/text-mining
Here's my code:
tweet_corpus = Corpus(VectorSource(df$CONTENT))
tdm = TermDocumentMatrix(
tweet_corpus,
control = list(
removePunctuation = TRUE,
stopwords = c("hehe", "haha", stopwords_phil, stopwords("english"), stopwords("spanish")),
removeNumbers = TRUE, tolower = TRUE)
)
m = as.matrix(tdm)
termDocMatrix <- m
termDocMatrix[5:10,1:20]
Docs
Terms 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
aabutin 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
aad 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
aaf 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
aali 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
aannacm 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
aantukin 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
myTdm2 <- removeSparseTerms(tdm, sparse =0.98)
m2 <- as.matrix(myTdm2)
m2[5:10,1:20]
Docs
Terms 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
filipino 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
give 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0
god 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
good 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
guy 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0
haiyan 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
myTdm2
<<TermDocumentMatrix (terms: 34, documents: 27395)>>
Non-/sparse entries: 39769/891661
Sparsity : 96%
Maximal term length: 9
Weighting : term frequency (tf)
termDocMatrix2 <- m2
termDocMatrix2[termDocMatrix2>=1] <- 1
termMatrix2 <- termDocMatrix2 %*% t(termDocMatrix2)
termMatrix2[5:10,5:10]
Terms
Terms disaster give god good guy test
disaster 623 6 53 11 4 19
give 6 592 98 16 8 6
god 53 98 2679 135 38 29
good 11 16 135 816 21 5
guy 4 8 38 21 637 5
test 19 6 29 5 5 610
g2 <- graph.adjacency(termMatrix2, weighted=TRUE, mode="undirected")
g2 <- simplify(g2)
V(g2)$label <- V(g2)$name
V(g2)$degree <- degree(g2)
set.seed(3952)
layout1 <- layout.fruchterman.reingold(g2)
plot(g2, layout=layout1)
plot(g2, layout=layout.kamada.kawai)
V(g2)$label.cex <- 2.2 * V(g2)$degree / max(V(g2)$degree)+ .2
V(g2)$label.color <- rgb(0, 0, .2, .8)
V(g2)$frame.color <- NA
egam <- (log(E(g2)$weight)+.4) / max(log(E(g2)$weight)+.4)
E(g2)$color <- rgb(.5, .5, 0, egam)
E(g2)$width <- egam
plot(g2, layout=layout1)
This then looks like:
But I would like to have something like this:
Apparently the weighting doesn't work, but why?
Thank you guys in advance!
Even though your graph is weighted, the layout algorithm does not use the weights unless you explicitly tell it to do so. Try this:
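A minimal sketch of passing the weights to the layout, assuming the igraph package. The small matrix below is made-up toy data standing in for termMatrix2, and the names `termMatrix`, `g` and `layout_w` are only illustrative. It uses `layout_with_fr`, the current name for `layout.fruchterman.reingold`, whose `weights` argument multiplies the attraction along each edge, so strongly co-occurring terms end up closer together. Recent igraph releases document that the `weight` edge attribute is picked up by default when present, so on a new installation the explicit argument mainly makes the intent visible.

```r
library(igraph)

# Toy co-occurrence matrix (hypothetical values) standing in for termMatrix2
terms <- c("god", "good", "give", "guy")
termMatrix <- matrix(c(  0, 135, 98, 38,
                       135,   0, 16, 21,
                        98,  16,  0,  8,
                        38,  21,  8,  0),
                     nrow = 4, dimnames = list(terms, terms))

# Build the weighted, undirected graph; diag = FALSE drops self-loops
g <- graph_from_adjacency_matrix(termMatrix, weighted = TRUE,
                                 mode = "undirected", diag = FALSE)

# Hand the edge weights to the layout explicitly
set.seed(3952)
layout_w <- layout_with_fr(g, weights = E(g)$weight)

plot(g, layout = layout_w)
```

With the graph from the question, the equivalent call would be `layout1 <- layout_with_fr(g2, weights = E(g2)$weight)`, followed by `plot(g2, layout = layout1)`.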
However, if your weights are wildly varying in terms of magnitude, it is usually better to use the logarithm of the weights (plus some constant to make all of them strictly positive) as the input of the layout algorithm.
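The log-scaled variant could look roughly like this. It is a sketch on the same kind of made-up toy matrix, not the data from the question. The constant 0.4 is borrowed from the edge-styling code in the question; any constant that keeps every weight strictly positive would do, and the names `w` and `layout_log` are illustrative.

```r
library(igraph)

# Toy co-occurrence matrix (hypothetical values) standing in for termMatrix2
terms <- c("god", "good", "give", "guy")
termMatrix <- matrix(c(  0, 135, 98, 38,
                       135,   0, 16, 21,
                        98,  16,  0,  8,
                        38,  21,  8,  0),
                     nrow = 4, dimnames = list(terms, terms))

g <- graph_from_adjacency_matrix(termMatrix, weighted = TRUE,
                                 mode = "undirected", diag = FALSE)

# Compress the range of the weights; co-occurrence counts are >= 1,
# so log() is >= 0 and the added constant makes every value positive
w <- log(E(g)$weight) + 0.4
stopifnot(all(w > 0))

set.seed(3952)
layout_log <- layout_with_fr(g, weights = w)

# Reuse the scaled weights for the edge widths as well
plot(g, layout = layout_log, edge.width = 3 * w / max(w))
```

Taking the logarithm keeps a few very frequent pairs from dominating the layout while preserving the ordering of the weights.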