I have a Neural Network designed to play Connect 4, it gauges the value of a game state toward Player 1 or Player 2.
In order to train it, I am having it play against itself for n
number of games.
What I've found is that 1,000 games results in better game-play than 100,000, even though the mean squared error over each block of 100 games steadily improves across the 100,000-game run.
(I determine this by challenging the top-ranked player at http://riddles.io)
I've therefore reached the conclusion that over-fitting has occurred.
With self-play in mind, how do you successfully measure/determine/estimate that over-fitting has occurred? I.e., how do I determine when to stop the self-play?
I'm not super familiar with reinforcement learning; I'm much more of a supervised learning person. With that said, I feel like your options are nevertheless going to be the same as for supervised learning.
You need to find the point at which performance on inputs (and I use that term loosely) outside of the training space (again, loosely) starts to decrease. When that happens, you terminate training. You need Early Stopping.
For supervised learning, this would be done by having a held-out dev-set, as an imitation of having a test-set.
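That dev-set-based early stopping can be sketched as a patience loop. Everything here is a placeholder for your setup: `train_one_round`, `evaluate_on_dev`, and the model's `copy` method are assumed names, and the dev score would be something like a win rate against a fixed opponent.

```python
# Hypothetical sketch of patience-based early stopping.
# `train_one_round` and `evaluate_on_dev` are stand-ins for your own
# self-play training step and held-out evaluation.
def train_with_early_stopping(model, train_one_round, evaluate_on_dev,
                              patience=5, max_rounds=1000):
    """Stop once the dev score has not improved for `patience` rounds."""
    best_score = float("-inf")
    best_state = None
    rounds_since_best = 0
    for _ in range(max_rounds):
        train_one_round(model)
        score = evaluate_on_dev(model)   # e.g. win rate vs. a fixed opponent
        if score > best_score:
            best_score = score
            best_state = model.copy()    # snapshot the best weights seen
            rounds_since_best = 0
        else:
            rounds_since_best += 1
            if rounds_since_best >= patience:
                break                    # dev performance stopped improving
    return best_state, best_score
```

Note that you return the *best* snapshot, not the final one: by the time the patience counter runs out, the live model has already over-fitted by `patience` rounds.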
In your case, it seems clear that this would be making your bot play a bunch of real people -- which is a perfect imitation of the test set.
Which is exactly what you have done.
The downside is sufficient play against real people is slow.
What you can do to partially offset this is, rather than pausing training to run this test, take a snapshot of your network (say every 500 iterations), start it up as a bot in a separate process, and test it and record the score while the network is still training. However, this won't really help in this case, as I imagine the time taken for even one trial game is much longer than the time taken to run 500 iterations of training. Still, this is applicable if you were not converging so fast.
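A minimal sketch of that snapshot-and-evaluate-in-the-background idea, using the standard library's `multiprocessing`. The `play_game` function and the dict-of-weights format are assumptions standing in for however you run games and store parameters:

```python
# Sketch: evaluate a frozen snapshot off the training loop.
# `play_game(snapshot)` is a hypothetical function returning 1 for a win,
# 0 otherwise; weights are assumed to live in a plain dict.
import multiprocessing as mp

def evaluate_snapshot(play_game, snapshot, n_games, results):
    """Worker: play n_games with the frozen snapshot, report the win rate."""
    wins = sum(play_game(snapshot) for _ in range(n_games))
    results.put(wins / n_games)

def launch_eval(play_game, weights, n_games=20):
    """Fork an evaluation process so training can continue meanwhile."""
    results = mp.Queue()
    snapshot = dict(weights)          # copy, so training can keep mutating
    proc = mp.Process(target=evaluate_snapshot,
                      args=(play_game, snapshot, n_games, results))
    proc.start()
    return proc, results              # later: results.get(); proc.join()
```

The training loop calls `launch_eval` every 500 iterations, keeps going, and collects the scores from the queues afterwards to plot the learning curve.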
I assume, since this problem is so simple, this is for learning purposes.
On that basis, you could fake real people.
Connect4 is a game with a small enough play space, that classic gameplaying AI should be able to do nearly perfectly.
So you can set up a bot for it to play against (as its dev-set equivalent) that uses minimax with alpha-beta pruning.
Run a game against that every 100 iterations or so, and if your relative score starts decreasing, you know you've over-fitted.
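For concreteness, here is a bare-bones depth-limited negamax with alpha-beta pruning for Connect 4. It uses a trivial evaluation (everything short of a win counts as a draw), so it only "sees" wins and losses inside the search horizon; a real dev-set opponent would want a proper heuristic and deeper search, but this shape is the whole idea:

```python
# Minimal alpha-beta negamax for Connect 4, usable as a fixed opponent.
# Board representation: list of 7 columns, each a list of pieces
# (1 or -1), bottom-most piece first.
ROWS, COLS = 6, 7

def legal_moves(board):
    return [c for c in range(COLS) if len(board[c]) < ROWS]

def wins(board, col):
    """Did the last piece dropped in `col` complete four in a row?"""
    row = len(board[col]) - 1
    player = board[col][row]
    def piece(c, r):
        if 0 <= c < COLS and 0 <= r < len(board[c]):
            return board[c][r]
        return 0
    for dc, dr in ((1, 0), (0, 1), (1, 1), (1, -1)):
        count = 1
        for sign in (1, -1):                     # walk both directions
            c, r = col + sign * dc, row + sign * dr
            while piece(c, r) == player:
                count += 1
                c, r = c + sign * dc, r + sign * dr
        if count >= 4:
            return True
    return False

def negamax(board, player, depth, alpha, beta):
    """Return (score, move) from `player`'s point of view."""
    moves = legal_moves(board)
    if not moves:
        return 0.0, None                         # board full: draw
    best_score, best_move = float("-inf"), moves[0]
    for col in moves:
        board[col].append(player)
        if wins(board, col):
            score = 1.0                          # immediate win
        elif depth == 0:
            score = 0.0                          # horizon: call it a draw
        else:
            score = -negamax(board, -player, depth - 1, -beta, -alpha)[0]
        board[col].pop()
        if score > best_score:
            best_score, best_move = score, col
        alpha = max(alpha, score)
        if alpha >= beta:
            break                                # alpha-beta cutoff
    return best_score, best_move
```

Pit your network against `negamax(board, player, depth, float("-inf"), float("inf"))` at increasing depths and track the win rate over training.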
The other thing you could do is make it less likely to overfit in the first place. That wouldn't help you detect it, but if you make it hard enough for the network to overfit, you can to an extent assume that it hasn't. So: L1/L2 weight penalties, dropout, smaller hidden-layer sizes.
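Two of those regularizers are small enough to show inline. This is a framework-agnostic NumPy sketch, not anyone's actual training code: an L2 penalty folded into a gradient step, and inverted dropout on a hidden activation.

```python
import numpy as np

def sgd_step_with_l2(weights, grad, lr=0.01, l2=1e-4):
    """SGD step where the L2 penalty contributes l2 * w to the gradient."""
    return weights - lr * (grad + l2 * weights)

def dropout(activations, p=0.5, rng=None):
    """Inverted dropout: zero units with probability p, rescale survivors
    so the expected activation is unchanged. Disable at test time."""
    rng = rng or np.random.default_rng()
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)
```

In most frameworks you would reach for the built-in equivalents (a `weight_decay` option on the optimizer and a dropout layer) rather than writing these by hand.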
You could also increase the training-set equivalent. Rather than pure self-play, you could use play against other bots, potentially even other versions of itself set up with different hyper-parameters.
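One cheap way to get that variety is an opponent pool: keep periodic snapshots of past versions and sample one per game, so the learner doesn't only ever face its current self. This sketch assumes weights live in a plain dict; the class and its names are hypothetical.

```python
import random

class OpponentPool:
    """Rolling pool of frozen past snapshots to sample opponents from."""
    def __init__(self, max_size=10):
        self.snapshots = []
        self.max_size = max_size

    def add(self, weights):
        self.snapshots.append(dict(weights))   # freeze a copy
        if len(self.snapshots) > self.max_size:
            self.snapshots.pop(0)              # drop the oldest

    def sample(self, rng=random):
        return rng.choice(self.snapshots)
```

During training you would `add` a snapshot every few hundred iterations and `sample` an opponent at the start of each game.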