I am interested in creating a network graph similar to the first one on this person's page: http://minimaxir.com/2016/12/interactive-network/
I would like the nodes of this graph to be the words in a .txt document (after removing stopwords and other pre-processing), and the edges to be the correlations between words in the document (e.g. the word "word" frequently occurs next to the word "up"), keeping only the stronger correlations. I was thinking "size of node" = "frequency of word" in the document overall, and "distance between nodes" = "strength/weakness of relationship" between words.
I am currently using a combination of R, quanteda and ggplot2 as well as some other dependencies.
If anyone has any advice on how I can generate word correlations in R (preferably with quanteda) and then plot them as a graph, I would be forever grateful!
Of course, if there are any improvements I can make to this question, please let me know. Here is where I'm at so far with my attempt:
library(quanteda)
library(readtext)
library(ggplot2)
library(stringi)
## Load the .txt doc
## (keep the readtext data frame so that document$text is available below)
document <- readtext("file1.txt")
## Make everything lowercase... store in a separate variable
documentlower <- char_tolower(document$text)
## Tokenize the lower-case document
documenttokens <- as.character(tokens(documentlower, remove_punct = TRUE))
(total_length <- length(documenttokens))
## Create the document-feature matrix (dfm) - here we can also remove stopwords and stem
docudfm <- dfm(documentlower, remove_punct = TRUE, remove = stopwords("english"), stem = TRUE)
## Inspect the top 10 Words by Count
textstat_frequency(docudfm, n = 10)
## Create a sorted list of tokens by frequency count
sorted_document <- topfeatures(docudfm, n = nfeat(docudfm))
## Normalize the data points to find their percentage of occurrence in the documents
sorted_document <- sorted_document / sum(sorted_document) * 100
## Also normalize the data points in the DFM
docudfm_pct <- dfm_weight(docudfm, scheme = "prop") * 100
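To make the target concrete, here is a rough base-R sketch of the structure I am after, on a made-up token vector rather than my real document. It is only an illustration under my own assumptions: the names `toks`, `node_size`, `edge_weight` and `strong_edges` are hypothetical, "correlation" is approximated by simply counting directly adjacent word pairs, and the cut-off of 2 is arbitrary. It does not use quanteda and does not do any plotting.

```r
# Toy input standing in for the cleaned tokens (hypothetical data)
toks <- c("word", "up", "graph", "word", "up", "node", "graph", "word", "up")

# Node size: overall frequency of each word
node_size <- table(toks)

# Edge weight: how often two words appear directly next to each other.
# pmin/pmax put each pair in a fixed order, so "word up" and "up word"
# are counted as the same undirected edge.
left  <- head(toks, -1)
right <- tail(toks, -1)
pair_key <- paste(pmin(left, right), pmax(left, right))
edge_weight <- sort(table(pair_key), decreasing = TRUE)

# Keep only the stronger links (the threshold is an arbitrary assumption)
strong_edges <- edge_weight[edge_weight >= 2]

print(node_size)
print(strong_edges)
```

What I am missing is the step from tables like these to the actual network plot, ideally computed from the quanteda objects above instead of by hand.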