I am currently trying to implement my own version of a Connect Four environment, based on the one available in the PettingZoo library's GitHub repository (https://github.com/Farama-Foundation/PettingZoo/blob/master/pettingzoo/classic/connect_four/connect_four.py).
From their documentation, on the page for the classic environments (https://pettingzoo.farama.org/environments/classic/), the following is stated:

"Most [classic] environments only give rewards at the end of the games once an agent wins or losses, with a reward of 1 for winning and -1 for losing."
It is not clear to me how to model learning for non-terminal states if the reward signal (on which, I guess, the agents' whole learning is based) occurs only at terminal states.
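To make sure I understand the convention, here is a minimal sketch of how I read their terminal-only reward scheme (the function name and signature are my own, not from their code): every non-terminal step pays 0, and only at the end does the winner get +1 and the loser -1.

```python
def reward_for(winner, agent, done):
    """Terminal-only reward, as I understand the PettingZoo classic convention.

    winner: identifier of the winning agent, or None for a draw.
    agent:  the agent whose reward we are computing.
    done:   whether the episode has ended.
    """
    if not done:
        return 0          # non-terminal step: no learning signal at all
    if winner is None:
        return 0          # draw: neither agent is rewarded
    return 1 if agent == winner else -1
```

So, as far as I can tell, the reward dict is all zeros on every turn until the very last one.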
I thought of modifying the setup by allowing the environment to emit a reward at every turn, something like:

+1 for each (non-terminal) step of the game
+100 for a winning state
0 for a draw
-100 for illegal moves (which also end the current game/episode)

However, this setup would require a very high exploration rate for an $\epsilon$-greedy agent, given my current setup. For each newly observed state, the agent takes a random move and, if the resulting state is not terminal, it assigns a state-action value of 1 to the action it just took and zero to all the others. From then on, the agent picks that same action with very high probability, which prevents any actual learning...
I am not so sure how to solve this problem, as allowing a very high exploration rate doesn't seem like a good choice to me... My code is available at https://github.com/FMGS666/RLProject
Probably I should use the same setup as in their GitHub repo, but I didn't quite understand how to do that given the aforementioned problem.
I'm probably missing something important, but thank you very much for the help anyway!