I have a fairly complex LSTM-based neural network model which I'm training on the Quora Duplicate Question Pairs dataset. There are approximately 400,000 sentence pairs in the original dataset, and training on the entire dataset (or even 80% of it) would take a lot of processing power and computation time. Would it be unwise to choose a random subset of the dataset (say 8,000 pairs for training and 2,000 for testing)? Would it have a severe impact on performance? Is "the more data, the better the model" always true?
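To make the setup concrete, here is a minimal sketch of the subsampling I have in mind, using pandas and scikit-learn. The file name and column layout (`question1`, `question2`, `is_duplicate`) are assumptions about how the Quora pairs were exported, so adjust them to your copy of the data.

```python
# Minimal sketch of drawing a random 8,000/2,000 subset from the full
# Quora pairs CSV. File name and column names are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("quora_duplicate_questions.csv")   # ~400,000 pairs
subset = df.sample(n=10_000, random_state=42)       # random 10,000-pair subset

# 80/20 split -> 8,000 training pairs and 2,000 test pairs,
# stratified so the duplicate/non-duplicate ratio is preserved
train_df, test_df = train_test_split(
    subset, test_size=0.2, random_state=42, stratify=subset["is_duplicate"]
)
print(len(train_df), len(test_df))                  # 8000 2000
```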


As a rule of thumb, deep neural networks usually benefit from more data.

If your model is well specified and your inputs are properly engineered, you will most likely lose performance by training on a smaller subset of your dataset.

However, you can always evaluate this empirically: plot a learning curve by training on increasingly large subsets, starting from your 8,000 pairs, and check how the validation loss improves at each sample size (see the sketch below).
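A minimal sketch of that learning-curve check, assuming a two-input Keras LSTM: `build_model`, the `q1_train`/`q2_train`/`y_train` arrays, and the fixed `*_val` validation arrays are placeholders for your own model and preprocessed data, and the subset sizes are only illustrative.

```python
# Hedged sketch: retrain the same architecture on growing random subsets
# and record the best validation loss at each size. build_model() and the
# data arrays below are placeholders for your own pipeline.
import numpy as np

sizes = [8_000, 16_000, 32_000, 64_000]
results = []
rng = np.random.default_rng(42)

for n in sizes:
    # draw a fresh random subset of n training pairs
    idx = rng.choice(len(y_train), size=n, replace=False)
    model = build_model()  # re-initialise the model for each subset size
    history = model.fit(
        [q1_train[idx], q2_train[idx]], y_train[idx],
        validation_data=([q1_val, q2_val], y_val),
        epochs=5, batch_size=256, verbose=0,
    )
    results.append((n, min(history.history["val_loss"])))

# If validation loss is still falling steeply at the largest size, more data
# will likely keep helping; if the curve has flattened, a subset may be enough.
for n, loss in results:
    print(f"{n:>6} pairs -> best val_loss {loss:.4f}")
```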

For large problems, keep in mind that computation time usually grows along with the data, so there is a real trade-off between accuracy and training cost that is worth measuring rather than guessing.