I have a very complex LSTM-based neural network model which I'm training on the Quora Duplicate Question Pairs dataset. There are approximately 400,000 sentence pairs in the original dataset, so training on the entire dataset (or even 80% of it) would take a lot of processing power and computation time. Would it be unwise to choose a random subset of the dataset, say 8,000 pairs for training and 2,000 for testing? Would that severely impact performance? Is "the more data, the better the model" always true?
There is 1 best solution below
As a rule of thumb, deep neural networks usually benefit from more data.
If your model is well specified and your inputs are properly engineered, you will generally lose performance by training on a smaller subset of your dataset.
However, you can always evaluate this empirically with metrics: train on increasing sample sizes, starting from your 8,000 pairs, and check how the validation loss decreases at each size (see the sketch below).
For big problems, keep in mind that computation time is usually also big, so such a learning curve also tells you whether the additional data is worth the extra training cost.
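Here is a minimal learning-curve sketch, assuming a two-input Keras model compiled with `metrics=['accuracy']`; `build_model`, `X1`, `X2`, and `y` are hypothetical placeholders for your own model factory and the padded question-pair / label arrays:

```python
import numpy as np

def learning_curve(X1, X2, y, build_model,
                   sizes=(2000, 4000, 8000, 16000),
                   val_fraction=0.2, seed=42):
    """Train fresh models on growing random subsets and report validation metrics."""
    rng = np.random.default_rng(seed)
    results = {}
    for n in sizes:
        # Draw a random subset of n pairs without replacement
        idx = rng.choice(len(y), size=n, replace=False)
        split = int(n * (1 - val_fraction))
        train_idx, val_idx = idx[:split], idx[split:]
        # Rebuild the model each time so runs at different sizes are comparable
        model = build_model()  # hypothetical: returns a freshly compiled Keras model
        model.fit([X1[train_idx], X2[train_idx]], y[train_idx],
                  validation_data=([X1[val_idx], X2[val_idx]], y[val_idx]),
                  epochs=5, batch_size=64, verbose=0)
        # evaluate() returns [loss, accuracy] when compiled with metrics=['accuracy']
        loss, acc = model.evaluate([X1[val_idx], X2[val_idx]], y[val_idx], verbose=0)
        results[n] = (loss, acc)
        print(f"n={n}: val_loss={loss:.4f}  val_acc={acc:.4f}")
    return results
```

If the validation loss is still dropping noticeably between the two largest sizes, more data will likely keep paying off; if the curve has flattened, your 8,000-pair subset may already be close to what this model can extract.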