When executing this code I get 11937, but shouldn't I get 10,000? If not, I have a few follow-up questions:
- What is the point of num_words?
- What does the number 11937 I got represent?
- How do I limit the size of my vocabulary?
from tensorflow.keras.preprocessing.text import Tokenizer

MAX_WORDS_COUNT = 10000
WIN_SIZE = 1000
WIN_HOP = 100

tokenizer = Tokenizer(num_words=MAX_WORDS_COUNT,
                      filters='!"#$%&()*+,-–—./…:;<=>?@[\\]^_`{|}~«»\t\n\xa0\ufeff',
                      lower=True, split=' ', oov_token='unkown_word', char_level=False)
tokenizer.fit_on_texts(x_data)  # x_data is the training corpus

items = list(tokenizer.word_index.items())
print(len(items))  # prints 11937, not 10000
I expected 10,000 as the output because I believe num_words limits the size of the vocabulary.
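To illustrate what I mean, here is a minimal reproduction with a made-up toy corpus (assuming tensorflow.keras; the sentences are just placeholders) that shows the same pattern on a small scale:

from tensorflow.keras.preprocessing.text import Tokenizer

toy_corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a bird flew over the log",
]

# num_words=5 should, in my understanding, cap the vocabulary at 5
tok = Tokenizer(num_words=5, oov_token='unkown_word')
tok.fit_on_texts(toy_corpus)

# word_index holds the full fitted vocabulary regardless of num_words:
# 11 distinct words plus the OOV token -> 12 entries, not 5
print(len(tok.word_index))

# num_words only seems to take effect here: words outside the
# top (num_words - 1) are mapped to the OOV token's index
print(tok.texts_to_sequences(["the cat chased the dog"]))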
If needed, I can provide the full code from my Colab notebook.