I was trying to create ngrams using the hash_vectorizer function in text2vec when I noticed that it doesn't change the dimensions of my dtm when I change the ngram values.
h_vectorizer = hash_vectorizer(hash_size = 2 ^ 14, ngram = c(2L, 10L))
dtm_train = create_dtm(it_train, h_vectorizer)
dim(dtm_train)
In the above code, the dimensions don't change whether the ngram window is 2-10 or 9-10.
vocab = create_vocabulary(it_train, ngram = c(1L, 4L))
ngram_vectorizer = vocab_vectorizer(vocab)
dtm_train = create_dtm(it_train, ngram_vectorizer)
In the above code, the dimensions do change, but I want to use the hash_vectorizer as well, since it saves on space. How do I go about using that?
When using hashing you set the size of your output matrix in advance. You did so by setting hash_size = 2 ^ 14. This stays the same independently of the ngram window specified in the model; however, the counts within the output matrix change.

(In response to the comments below:) Below you find a minimum example with two very simple strings to demonstrate the different outputs for two different ngram windows used in a hash_vectorizer. For the bigram case I have added the output matrix of a vocab_vectorizer for comparison.
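A minimal sketch along those lines (the two strings, the deliberately small hash_size = 2 ^ 4, and the object names are just illustrative choices, not anything fixed by the package):

library(text2vec)

# two toy documents (illustrative)
txt = c(d1 = "the quick brown fox", d2 = "the quick brown fox jumps")
it = itoken(txt, tokenizer = word_tokenizer, progressbar = FALSE)

# hashed dtm, unigrams only: 2^4 = 16 columns, fixed in advance
h_vec_uni = hash_vectorizer(hash_size = 2 ^ 4, ngram = c(1L, 1L))
dtm_uni = create_dtm(it, h_vec_uni)
dim(dtm_uni)  # 2 16

# hashed dtm, uni- and bigrams: still 2 x 16 columns,
# but the counts inside the matrix change because bigrams are hashed in, too
h_vec_bi = hash_vectorizer(hash_size = 2 ^ 4, ngram = c(1L, 2L))
dtm_bi = create_dtm(it, h_vec_bi)
dim(dtm_bi)   # 2 16
sum(dtm_uni)  # 9  (unigram tokens only)
sum(dtm_bi)   # 16 (9 unigrams + 7 bigrams)

# vocabulary-based dtm for the bigram case, for comparison:
# one column per distinct term, so the dimension grows with the ngram window
vocab = create_vocabulary(it, ngram = c(1L, 2L))
dtm_vocab = create_dtm(it, vocab_vectorizer(vocab))
dim(dtm_vocab)       # 2 9
colnames(dtm_vocab)  # interpretable terms, unlike in the hashed dtm

The two hashed dtms have identical dimensions for both ngram windows; only the counts inside differ. The vocabulary-based dtm, in contrast, grows with the ngram window and keeps interpretable column names.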
You will notice that you have to set a hash size sufficiently large to account for all terms. If it is too small, the hash values of individual terms may collide, as the sketch below demonstrates.
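To see a collision directly, shrink the hash size below the number of distinct terms (again a minimal sketch; 2 ^ 2 is just an illustratively small value):

library(text2vec)

txt = c(d1 = "the quick brown fox", d2 = "the quick brown fox jumps")
it = itoken(txt, tokenizer = word_tokenizer, progressbar = FALSE)

# only 2^2 = 4 hash buckets for five distinct terms: by the pigeonhole
# principle at least two terms must land in the same column
h_vec_small = hash_vectorizer(hash_size = 2 ^ 2, ngram = c(1L, 1L))
dtm_small = create_dtm(it, h_vec_small)
dim(dtm_small)        # 2 4, fewer columns than distinct terms
as.matrix(dtm_small)  # some cells mix the counts of colliding terms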
Your comment that you would always have to compare the outputs of a vocab_vectorizer approach and a hash_vectorizer approach leads in the wrong direction, because you would then lose the efficiency/memory advantage of hashing, which avoids generating a vocabulary in the first place. Depending on your data and desired output, hashing trades accuracy (and interpretability of the terms in the dtm) for efficiency. Hence, it depends on your use case whether hashing is reasonable, which it is especially for classification tasks at the document level for large collections.

I hope this gives you a rough idea about hashing and what you can or cannot expect from it. You might also check some posts on hashing at Quora or Wikipedia, or refer to the detailed original sources listed on text2vec.org.