I have a Neural Network designed to play Connect 4, it gauges the value of a game state toward Player 1 or Player 2.
In order to train it, I am having it play against itself for n
number of games.
What I've found is that 1,000 games results in better game-play than 100,000, even though the mean squared error over each block of 100 games steadily improves across the 100,000-game run.
(I determine this by challenging the top-ranked player at http://riddles.io)
I've therefore reached the conclusion that over-fitting has occurred.
With self-play in mind, how do you successfully measure/determine/estimate that over-fitting has occurred? I.e., how do I determine when to stop the self-play?
I'm not super familiar with reinforcement learning; I'm much more of a supervised learning person. With that said, I feel like your options are nevertheless going to be the same as for supervised learning.
You need to find the point at which performance on inputs (and I use that term loosely) outside of the training space (again, loosely) starts to decrease. When that happens, you terminate training. You need Early Stopping.
For supervised learning, this would be done by having a held-out dev-set, as an imitation of having a test-set.
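That dev-set-based early stopping can be sketched as a patience loop. Everything here is a placeholder for your setup: `train_one_round`, `evaluate_on_dev`, and the model's `copy` method are assumed names, and the dev score would be something like a win rate against a fixed opponent.

```python
# Hypothetical sketch of patience-based early stopping.
# `train_one_round` and `evaluate_on_dev` are stand-ins for your own
# self-play training step and held-out evaluation.
def train_with_early_stopping(model, train_one_round, evaluate_on_dev,
                              patience=5, max_rounds=1000):
    """Stop once the dev score has not improved for `patience` rounds."""
    best_score = float("-inf")
    best_state = None
    rounds_since_best = 0
    for _ in range(max_rounds):
        train_one_round(model)
        score = evaluate_on_dev(model)   # e.g. win rate vs. a fixed opponent
        if score > best_score:
            best_score = score
            best_state = model.copy()    # snapshot the best weights seen
            rounds_since_best = 0
        else:
            rounds_since_best += 1
            if rounds_since_best >= patience:
                break                    # dev performance stopped improving
    return best_state, best_score
```

Note that you return the *best* snapshot, not the final one: by the time the patience counter runs out, the live model has already over-fitted by `patience` rounds.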
In your case, it seems clear that this would be making your bot play a bunch of real people -- which is a perfect imitation of the test set.
Which is exactly what you have done.
The downside is sufficient play against real people is slow.
What you can do to partially offset this is, rather than pausing training to run this test, take a snapshot of your network (say every 500 iterations), start it up as a bot in a separate process, and test it and record the score while the network is still training. However, this won't really help in this case, as I imagine the time taken for even one trial game is much longer than the time taken to run 500 iterations of training. Still, this is applicable if you were not converging so fast.
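A minimal sketch of that snapshot-and-evaluate-in-the-background idea, using the standard library's `multiprocessing`. The `play_game` function and the dict-of-weights format are assumptions standing in for however you run games and store parameters:

```python
# Sketch: evaluate a frozen snapshot off the training loop.
# `play_game(snapshot)` is a hypothetical function returning 1 for a win,
# 0 otherwise; weights are assumed to live in a plain dict.
import multiprocessing as mp

def evaluate_snapshot(play_game, snapshot, n_games, results):
    """Worker: play n_games with the frozen snapshot, report the win rate."""
    wins = sum(play_game(snapshot) for _ in range(n_games))
    results.put(wins / n_games)

def launch_eval(play_game, weights, n_games=20):
    """Fork an evaluation process so training can continue meanwhile."""
    results = mp.Queue()
    snapshot = dict(weights)          # copy, so training can keep mutating
    proc = mp.Process(target=evaluate_snapshot,
                      args=(play_game, snapshot, n_games, results))
    proc.start()
    return proc, results              # later: results.get(); proc.join()
```

The training loop calls `launch_eval` every 500 iterations, keeps going, and collects the scores from the queues afterwards to plot the learning curve.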
I assume, since this problem is so simple, this is for learning purposes.
On that basis, you could fake real people.
Connect4 is a game with a small enough play space, that classic gameplaying AI should be able to do nearly perfectly.
So you can set up a bot for it to play against (as its dev-set equivalent) that uses minimax with alpha-beta pruning.
Run a game against that every 100 iterations or so, and if your relative score starts decreasing, you know you've over-fitted.
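For concreteness, here is a bare-bones depth-limited negamax with alpha-beta pruning for Connect 4. It uses a trivial evaluation (everything short of a win counts as a draw), so it only "sees" wins and losses inside the search horizon; a real dev-set opponent would want a proper heuristic and deeper search, but this shape is the whole idea:

```python
# Minimal alpha-beta negamax for Connect 4, usable as a fixed opponent.
# Board representation: list of 7 columns, each a list of pieces
# (1 or -1), bottom-most piece first.
ROWS, COLS = 6, 7

def legal_moves(board):
    return [c for c in range(COLS) if len(board[c]) < ROWS]

def wins(board, col):
    """Did the last piece dropped in `col` complete four in a row?"""
    row = len(board[col]) - 1
    player = board[col][row]
    def piece(c, r):
        if 0 <= c < COLS and 0 <= r < len(board[c]):
            return board[c][r]
        return 0
    for dc, dr in ((1, 0), (0, 1), (1, 1), (1, -1)):
        count = 1
        for sign in (1, -1):                     # walk both directions
            c, r = col + sign * dc, row + sign * dr
            while piece(c, r) == player:
                count += 1
                c, r = c + sign * dc, r + sign * dr
        if count >= 4:
            return True
    return False

def negamax(board, player, depth, alpha, beta):
    """Return (score, move) from `player`'s point of view."""
    moves = legal_moves(board)
    if not moves:
        return 0.0, None                         # board full: draw
    best_score, best_move = float("-inf"), moves[0]
    for col in moves:
        board[col].append(player)
        if wins(board, col):
            score = 1.0                          # immediate win
        elif depth == 0:
            score = 0.0                          # horizon: call it a draw
        else:
            score = -negamax(board, -player, depth - 1, -beta, -alpha)[0]
        board[col].pop()
        if score > best_score:
            best_score, best_move = score, col
        alpha = max(alpha, score)
        if alpha >= beta:
            break                                # alpha-beta cutoff
    return best_score, best_move
```

Pit your network against `negamax(board, player, depth, float("-inf"), float("inf"))` at increasing depths and track the win rate over training.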
The other thing you could do is make it less likely to overfit in the first place. That wouldn't help you detect it, but if you make it hard enough for the network to overfit, you can to an extent assume that it hasn't. So: L1/L2 weight penalties, dropout, smaller hidden-layer sizes.
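Two of those regularizers are small enough to show inline. This is a framework-agnostic NumPy sketch, not anyone's actual training code: an L2 penalty folded into a gradient step, and inverted dropout on a hidden activation.

```python
import numpy as np

def sgd_step_with_l2(weights, grad, lr=0.01, l2=1e-4):
    """SGD step where the L2 penalty contributes l2 * w to the gradient."""
    return weights - lr * (grad + l2 * weights)

def dropout(activations, p=0.5, rng=None):
    """Inverted dropout: zero units with probability p, rescale survivors
    so the expected activation is unchanged. Disable at test time."""
    rng = rng or np.random.default_rng()
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)
```

In most frameworks you would reach for the built-in equivalents (a `weight_decay` option on the optimizer and a dropout layer) rather than writing these by hand.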
You could also increase the training-set equivalent. Rather than pure self-play, you could use play against other bots, potentially even other versions of itself set up with different hyper-parameters.
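One cheap way to get that variety is an opponent pool: keep periodic snapshots of past versions and sample one per game, so the learner doesn't only ever face its current self. This sketch assumes weights live in a plain dict; the class and its names are hypothetical.

```python
import random

class OpponentPool:
    """Rolling pool of frozen past snapshots to sample opponents from."""
    def __init__(self, max_size=10):
        self.snapshots = []
        self.max_size = max_size

    def add(self, weights):
        self.snapshots.append(dict(weights))   # freeze a copy
        if len(self.snapshots) > self.max_size:
            self.snapshots.pop(0)              # drop the oldest

    def sample(self, rng=random):
        return rng.choice(self.snapshots)
```

During training you would `add` a snapshot every few hundred iterations and `sample` an opponent at the start of each game.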