Consider the following string:
txt <- ("Viele Dank für das Feedback + die Verbesserungsvorschläge! :) http://testurl.com/5lhk5p #Greenwashing #PR #Vattenfal")
I create a document-feature matrix (dfm) and pre-process the string as follows:
txt_corp <- quanteda::corpus(txt)
txt_dfm <- quanteda::dfm(txt_corp, remove_punct = TRUE, remove_symbols = TRUE, remove_url = TRUE)
topfeatures(txt_dfm)
The output then looks as follows:
topfeatures(txt_dfm)
viele dank für das feedback
1 1 1 1 1
die verbesserungsvorschläge #greenwashing #pr #vattenfal
1 1 1 1 1
This is not bad, but I would like the output without the hashtag symbol (#). I've tried some combinations, such as:
txt_dfm <- quanteda::dfm(txt_corp, remove_punct = TRUE, remove_symbols = TRUE, remove_url = TRUE, what = "word1")
topfeatures(txt_dfm)
viele dank für das feedback
1 1 1 1 1
die verbesserungsvorschläge http testurl.com 5lhk5p
1 1 1 1 1
This gives the output above. On the one hand, the hashtags are removed, but on the other hand, the link is split into pieces instead of being removed. Can somebody help me obtain the following output using quanteda?
viele dank für das feedback
1 1 1 1 1
die verbesserungsvorschläge greenwashing pr vattenfal
1 1 1 1 1
Could you remove the hashtags from your string before building the dfm?
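One way to do that is to strip the leading # in base R before tokenising, so the URL is still intact when quanteda gets to remove it. This is only a minimal sketch: it assumes a recent quanteda where the remove_* options are passed to tokens() rather than dfm(), and txt_clean is just an illustrative name, not something from the question.

```r
txt <- "Viele Dank für das Feedback + die Verbesserungsvorschläge! :) http://testurl.com/5lhk5p #Greenwashing #PR #Vattenfal"

# Drop a '#' only when it starts a word (start of string or after whitespace),
# so a '#' inside a URL fragment would be left alone.
txt_clean <- gsub("(^|\\s)#(?=\\w)", "\\1", txt, perl = TRUE)

# Assumption: recent quanteda, where pre-processing happens in tokens()
# and dfm() is then built from the tokens object.
toks <- quanteda::tokens(txt_clean,
                         remove_punct = TRUE,
                         remove_symbols = TRUE,
                         remove_url = TRUE)
txt_dfm <- quanteda::dfm(toks)

quanteda::topfeatures(txt_dfm)
```

Because the hashes are gone before tokenisation, the default tokenizer can be kept, which is the one that removed the link as a whole in the first attempt.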