How to reduce overfitting in a spelling correction model with a large vocab size?


I am developing a spelling correction system. My dataset contains 1,700,000 sentences and the vocabulary size is 326,000. I have tried stacking GRU and LSTM layers, but increasing the number of units in a single-layer architecture improves train set accuracy more than stacking layers does. However, test set accuracy never exceeds 14%. I have tried regularization techniques to reduce overfitting, but I have not found the right method. Here is my code:

from tensorflow import keras
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, GRU, BatchNormalization, Dense, Activation
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.optimizers import Adam

def create_model(layers, learning_rate):
    vocab_size = len(stemmed_dict)
    model = Sequential()
    # 200-dimensional trainable embeddings for the ~326k-word vocabulary
    model.add(Embedding(vocab_size, output_dim=200, input_length=max_sequence_len - 1, trainable=True))
    # The Embedding layer already defines the input shape, so GRU needs no input_shape
    model.add(GRU(1024, recurrent_dropout=0.2,
                  kernel_regularizer=keras.regularizers.l1(0.0001), return_sequences=False))
    model.add(BatchNormalization())
    # `layers` is the number of output classes (should equal vocab_size)
    model.add(Dense(layers))
    model.add(Activation("softmax"))
    model.compile(loss=SparseCategoricalCrossentropy(), optimizer=Adam(learning_rate), metrics=['accuracy'])
    model.summary()
    return model
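
For context, here is a minimal sketch of how I train it, with a held-out validation split and early stopping to limit overfitting (the `labels` array name and the patience/epoch values are just assumptions for illustration, not my exact setup):

from tensorflow.keras.callbacks import EarlyStopping

# Assumed setup: `input_sequences` are padded index sequences and `labels`
# are the integer word indices to predict (names match the code above).
model = create_model(layers=len(stemmed_dict), learning_rate=0.001)

# Stop training once validation accuracy stops improving.
early_stop = EarlyStopping(monitor='val_accuracy', patience=3, restore_best_weights=True)

model.fit(input_sequences, labels,
          validation_split=0.1,   # hold out 10% of sentences for validation
          epochs=50,
          batch_size=128,
          callbacks=[early_stop])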

1 Answer


I would first reproduce an online tutorial or Kaggle competition, and then substitute the provided model with yours or a similar one (adapted to the specific task). That way it is easier to pinpoint whether the problem comes from your model architecture or from your data and preprocessing, which is often the culprit and is my guess based on your description.
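
For example, a rough sketch of that comparison, assuming you keep your exact data pipeline (the small layer widths below are tutorial-style placeholders, not tuned values): train a deliberately small baseline on the same `input_sequences`; if it also plateaus around 14% test accuracy, the problem is more likely in the data or preprocessing than in the large GRU model.

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, GRU, Dense
from tensorflow.keras.losses import SparseCategoricalCrossentropy

# Small baseline reusing the same preprocessing; vocab size, sequence length
# and data arrays are assumed to come from the question's pipeline.
baseline = Sequential([
    Embedding(len(stemmed_dict), 64, input_length=max_sequence_len - 1),
    GRU(128),
    Dense(len(stemmed_dict), activation='softmax'),
])
baseline.compile(loss=SparseCategoricalCrossentropy(),
                 optimizer='adam', metrics=['accuracy'])
history = baseline.fit(input_sequences, labels, validation_split=0.1, epochs=10)
# If this small model hits the same ~14% ceiling, inspect the labels and
# preprocessing rather than the regularization of the larger model.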