How to train a keras tokenizer on a large corpus that doesn't fit in memory?


I am trying to train a language model that, given a 2-word input, predicts a 1-word output. This is the model definition:

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential()
# Map each word index to a 2-dimensional embedding; inputs are seq_length (= 2) words long
model.add(Embedding(vocab_size, 2, input_length=seq_length))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(100))
model.add(Dense(100, activation='relu'))
# One output unit per vocabulary word
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())

The problem is that my dataset has 87 million lines of 3-word samples (2 words for input, 1 for output), and it does not fit into my memory. I understand that keras.preprocessing.text.Tokenizer builds its vocabulary from word frequencies in the text. I am currently fitting the tokenizer like this:

from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(oov_token='<OOV>')         # out-of-vocabulary words map to this token
tokenizer.fit_on_texts(lines)                    # requires all lines to be in memory
sequences = tokenizer.texts_to_sequences(lines)

How am I supposed to fit my tokenizer on all texts if they don't fit into memory?
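The only idea I have so far (untested) is to stream the file with a plain Python generator, since fit_on_texts only iterates over whatever it is given, and the keras_preprocessing docs say a generator of strings is accepted for memory efficiency. Something like the following sketch, where corpus.txt is just a placeholder for my actual data file:

from keras.preprocessing.text import Tokenizer

def stream_lines(path):
    # Yield the corpus one line at a time so it is never fully loaded into memory
    with open(path, encoding='utf-8') as f:
        for line in f:
            yield line.strip()

# 'corpus.txt' stands in for the real 87-million-line file
tokenizer = Tokenizer(oov_token='<OOV>')
tokenizer.fit_on_texts(stream_lines('corpus.txt'))

# texts_to_sequences_generator yields one encoded line at a time, which could
# then be batched into (2-word input, 1-word target) pairs for training
encoded_lines = tokenizer.texts_to_sequences_generator(stream_lines('corpus.txt'))

The second pass over the file is needed because the tokenizer has to see the whole corpus once before any line can be encoded.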
