Keeping IDs with corpus and stemming


Good afternoon, everyone.

I ran into a problem while trying to perform some text mining operations. I have a dataset of 3000 observations with several columns: most of them are categorical variables, and one column contains text. For example:

| id    | header | cate1 | cate2 | 
| 75641 | <text> |   1   |   0   |
| 71245 | <text> |   0   |   0   |

When I apply the text mining steps while keeping the ids from the original data in the corpus metadata, stemming has no effect at all (the result still contains different forms of the same word), although the other functions work fine. I have tried many of the approaches proposed in other questions, but none of them worked.

Here is the relevant part of the code:

dung<-read.csv("dung.csv") 
library(RTextTools)
library(fpc)   
library(cluster)
library(tm)
library(stringi)
library(stringr)
library(proxy)
library(wordcloud)
library(SnowballC) 

library(ggplot2)
library(slam)


##################################
#######  PREPROCESS HEADER #######
##################################

# Create a new dataset
datah <- dung[,1:2] # Keep only the first two columns of the original data: id and text
remove(dung)
myReader <- readTabular(mapping=list(id="id", 
                                     content="header"))
mycorpus <- VCorpus(DataframeSource(datah), readerControl=list(reader=myReader))

##### Preprocessing #####
toSpace <- content_transformer(function(x, pattern) {return (gsub(pattern, " " , x))}) # Helper function: replace a pattern with a space

a <- tm_map(mycorpus, toSpace, "-")
a <- tm_map(mycorpus, toSpace, "/")
a <- tm_map(mycorpus, PlainTextDocument)
a <- tm_map(mycorpus, stemDocument, language = "russian")
skipWords <- function(x) removeWords(x, stopwords("russian"))
funcs <- list(content_transformer(tolower), removePunctuation, removeNumbers, stripWhitespace, skipWords)
a <- tm_map(mycorpus, FUN = tm_reduce, tmFuns = funcs)

mydtm <- DocumentTermMatrix(a, control = list(wordLengths = c(3,10),
                                              weighting = function(x) weightTfIdf(x, normalize = FALSE)))
inspect(mydtm)
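
One likely cause, looking at the code above: every tm_map() call takes mycorpus as its input rather than the previous result a, so each step starts again from the raw corpus and only the final tm_reduce() step ends up in a. The stemmed corpus is overwritten before the matrix is built. Below is a minimal sketch of a chained version, not a verified fix for the original data. Assumptions: a recent tm release (0.7 or later, where, as far as I know, DataframeSource expects columns named doc_id and text and readTabular is no longer available), a made-up two-row English data frame in place of dung.csv, and "english" in place of "russian" so the effect is easy to check by eye.

```r
library(tm)
library(SnowballC)

# Hypothetical toy data standing in for dung.csv (assumption).
# With tm >= 0.7 the id and text columns must be named doc_id and text.
datah <- data.frame(
  doc_id = c("75641", "71245"),
  text   = c("Running runners ran quickly", "Connected connections connecting"),
  stringsAsFactors = FALSE
)

mycorpus <- VCorpus(DataframeSource(datah))

# Helper: replace a pattern with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

# Each step takes the previous result (a), not the raw corpus.
a <- tm_map(mycorpus, toSpace, "-")
a <- tm_map(a, toSpace, "/")
a <- tm_map(a, content_transformer(tolower))
a <- tm_map(a, removePunctuation)
a <- tm_map(a, removeNumbers)
a <- tm_map(a, removeWords, stopwords("english"))   # "russian" for the real data
a <- tm_map(a, stemDocument, language = "english")  # "russian" for the real data
a <- tm_map(a, stripWhitespace)

# The ids travel with the documents in their metadata.
ids <- vapply(a, function(d) as.character(meta(d, "id")), character(1))

mydtm <- DocumentTermMatrix(a, control = list(wordLengths = c(3, 10)))
inspect(mydtm)
```

Stemming is placed after lowercasing and stop word removal so that the stop word list is matched against unstemmed, lowercased words. The PlainTextDocument step is left out on purpose: documents built by VCorpus are already plain text documents, and converting them again may discard the id metadata.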