Here is my code:
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def ngrams(string, n=4):
    # Strip punctuation and the " BD" marker before building character n-grams.
    string = re.sub(r'[,-./]|\sBD', r'', string)
    ngrams = zip(*[string[i:] for i in range(n)])
    R = [''.join(ngram) for ngram in ngrams]
    if len(R) == 0:
        # Input shorter than n: return the string itself.
        return string
    else:
        return R

L = ['a', 'aa', 'aaa', 'a', 'aa', 'aaa']
vectorizer = TfidfVectorizer(min_df=0, token_pattern='(?u)\\b\\w+\\b', analyzer=ngrams)
tf_idf_matrix = vectorizer.fit_transform(L)
print(vectorizer.vocabulary_)
The vocabulary output is {'a': 0}.
I am confused about where "aa" and "aaa" went. As you can see in my ngrams function, I return the string itself if its length is less than the n parameter (which is 4 in the code above).
The token regex is also written to accept a single character.
This is a theory.
I believe TfidfVectorizer expects the analyzer function to return a sequence. Notice the inputs vs outputs of your ngrams function: a string is itself a sequence, so in the first 3 cases you are returning a sequence that consists of repeats of the single letter 'a'.
If my theory is correct, you need to replace
with
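A minimal sketch of what that change might look like, assuming the intent is to return the short string wrapped in a list rather than bare. The name ngrams_fixed is hypothetical and only there so both versions can be compared side by side; it uses plain Python so the effect is visible without running the vectorizer.

```python
import re

def ngrams(string, n=4):
    """Original behaviour: returns the bare string when it is shorter than n."""
    string = re.sub(r'[,-./]|\sBD', r'', string)
    grams = [''.join(g) for g in zip(*[string[i:] for i in range(n)])]
    return grams if grams else string

def ngrams_fixed(string, n=4):
    """Assumed fix: wrap the short string in a list so it stays one token."""
    string = re.sub(r'[,-./]|\sBD', r'', string)
    grams = [''.join(g) for g in zip(*[string[i:] for i in range(n)])]
    return grams if grams else [string]

for doc in ['a', 'aa', 'aaa', 'aaaa']:
    # list(...) mimics a consumer that iterates over the analyzer's result.
    print(repr(doc), list(ngrams(doc)), list(ngrams_fixed(doc)))

# 'a'    ['a']            ['a']
# 'aa'   ['a', 'a']       ['aa']
# 'aaa'  ['a', 'a', 'a']  ['aaa']
# 'aaaa' ['aaaa']         ['aaaa']
```

Iterating over the bare string 'aa' yields 'a' twice, which would explain why only 'a' ends up in the vocabulary. With the list-wrapped version each short input is kept as a single token.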