I am trying to build character-level n-grams using sklearn's CountVectorizer.
With analyzer='char_wb', the vocabulary contains features padded with whitespace at word boundaries. I want to exclude every feature that contains whitespace.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(binary=True, analyzer='char_wb', ngram_range=(4, 5))
vectorizer.fit(['this is a plural'])
vectorizer.vocabulary_
The vocabulary produced by the code above is:
[' thi', 'this', 'his ', ' this', 'this ', ' is ', ' a ', ' plu', 'plur', 'lura', 'ural', 'ral ', ' plur', 'plura', 'lural', 'ural ']
I have tried the other analyzers, i.e. 'word' and 'char'. Neither of them gives the kind of features I need.
I hope you get a better answer, because I'm fairly sure this one is a bit of a hack. I'm not certain it does exactly what you want, and it is not very efficient. It does (probably) produce your vocabulary, though!
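One way to sketch this kind of workaround is to pass a callable as the analyzer, so the n-grams are cut inside each whitespace-separated word and never include the padding. The helper name inner_char_ngrams and its min_n/max_n parameters are made up for illustration; they are not part of sklearn. Note that with a callable analyzer, CountVectorizer does not apply its own lowercasing or ngram_range, so the helper handles both itself.

```python
from sklearn.feature_extraction.text import CountVectorizer


def inner_char_ngrams(text, min_n=4, max_n=5):
    """Return character n-grams taken strictly inside each word.

    Hypothetical helper: splits on whitespace, so no n-gram can
    contain a space. Words shorter than min_n contribute nothing.
    """
    ngrams = []
    for word in text.lower().split():
        for n in range(min_n, max_n + 1):
            for start in range(len(word) - n + 1):
                ngrams.append(word[start:start + n])
    return ngrams


# A callable analyzer replaces sklearn's preprocessing, tokenizing
# and n-gram steps, so ngram_range is not passed here.
vectorizer = CountVectorizer(binary=True, analyzer=inner_char_ngrams)
vectorizer.fit(['this is a plural'])

vocab = sorted(vectorizer.vocabulary_)
print(vocab)
# ['lura', 'lural', 'plur', 'plura', 'this', 'ural']
```

Short words such as 'is' and 'a' drop out entirely here, because they have no 4- or 5-character n-gram without padding. If you would rather keep them as whole-word features, the helper would need an extra branch for words shorter than min_n.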