quanteda: remove tags (#, @) and URLs in one string


Consider the following string:

txt <- ("Viele Dank für das Feedback + die Verbesserungsvorschläge! :) http://testurl.com/5lhk5p #Greenwashing #PR #Vattenfal")

I create a dfm (document-feature matrix) and pre-process the string as follows:

txt_corp <- quanteda::corpus(txt)
txt_dfm <- quanteda::dfm(txt_corp, remove_punct = TRUE, remove_symbols = TRUE, remove_url = TRUE)
quanteda::topfeatures(txt_dfm)

The output then looks as follows:

topfeatures(txt_dfm)
              viele                    dank                     für                     das                feedback 
                  1                       1                       1                       1                       1 
                die verbesserungsvorschläge           #greenwashing                     #pr              #vattenfal 
                  1                       1                       1                       1                       1 

This is not bad, but I would like the output without the hashtag character (#). I've tried some combinations, such as:

txt_dfm <- quanteda::dfm(txt_corp, remove_punct = TRUE, remove_symbols = TRUE, remove_url = TRUE, what = "word1")

topfeatures(txt_dfm)
              viele                    dank                     für                     das                feedback 
                  1                       1                       1                       1                       1 
                die verbesserungsvorschläge                    http             testurl.com                  5lhk5p 
                  1                       1                       1                       1                       1 

This gives the output above. On the one hand the hashtags are removed, but on the other hand the URL is split into pieces rather than removed. Can somebody help me obtain the following output using quanteda?

              viele                    dank                     für                     das                feedback 
                  1                       1                       1                       1                       1 
                die verbesserungsvorschläge            greenwashing                      pr               vattenfal 
                  1                       1                       1                       1                       1 

There are 2 best solutions below


You could remove the hashtag character from the string before building the corpus and the dfm:

txt <- gsub("#","",txt)

> txt_dfm
Document-feature matrix of: 1 document, 10 features (0.0% sparse).
       features
docs    viele dank für das feedback die verbesserungsvorschläge greenwashing pr vattenfal
  text1     1    1   1   1        1   1                       1            1  1         1
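
Note that a blanket gsub("#", "", txt) also strips any "#" inside a URL fragment and leaves "@" mentions untouched. A slightly narrower variant, sketched here in base R, drops "#" or "@" only where they start a whitespace-delimited token. The helper name strip_tag_markers is hypothetical (it is not part of quanteda), and the sample string is adapted from the question for illustration:

```r
# Hypothetical helper (not part of quanteda): drop leading '#' or '@'
# from whitespace-delimited tokens, leaving the rest of the string alone.
strip_tag_markers <- function(x) {
  gsub("(^|\\s)[#@]+", "\\1", x)
}

# Sample string adapted from the question (one tag switched to '@')
txt <- "Viele Dank für das Feedback! http://testurl.com/5lhk5p #Greenwashing #PR @Vattenfal"
txt_clean <- strip_tag_markers(txt)
print(txt_clean)
```

The cleaned string can then be passed to corpus() and dfm() as in the question, so that remove_url = TRUE still sees the intact URL.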

quanteda_options() holds a regex pattern that matches hashtags. If you set it to NULL, the tokenizer stops preserving them and splits off the "#" as a separate token.

require(quanteda)
quanteda_options(reset = TRUE)
quanteda_options("pattern_hashtag")     
# [1] "#\\w+#?"
tokens("#aaaa bbbb")
# Tokens consisting of 1 document.
# text1 :
# [1] "#aaaa" "bbbb" 

quanteda_options("pattern_hashtag" = NULL)
tokens("#aaaa bbbb")
# Tokens consisting of 1 document.
# text1 :
# [1] "#"    "aaaa" "bbbb"
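
Applied to the string from the question, this might look as follows. This is only a sketch, assuming a quanteda version in which tokens() honours the pattern_hashtag option as shown above. The tokens are built first and then passed to dfm(), and the options are reset afterwards so that later calls keep the default behaviour:

```r
library(quanteda)

txt <- "Viele Dank für das Feedback + die Verbesserungsvorschläge! :) http://testurl.com/5lhk5p #Greenwashing #PR #Vattenfal"

# Stop the tokenizer from preserving hashtags, so "#" becomes its own token
quanteda_options("pattern_hashtag" = NULL)

# The stand-alone "#" is punctuation, so remove_punct should drop it;
# remove_url drops the URL token as in the original attempt
toks <- tokens(txt, remove_punct = TRUE, remove_symbols = TRUE, remove_url = TRUE)
txt_dfm <- dfm(toks)
print(featnames(txt_dfm))

# Restore the default options
quanteda_options(reset = TRUE)
```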