I have a large project, a GUI Tic Tac Toe game with 6 AI players and a customizable UI. I have completed what I originally intended and posted it on Code Review, but so far there are no answers. It also has a GitHub repository.
This question is entirely self-contained; you don't need to click the links above to understand what is going on here.
So I have decided to upgrade the AI by properly emulating the Machine Educable Noughts And Crosses Engine (MENACE). The basic idea is simple: there is a two-level nested dictionary whose keys are all the non-game-over states reachable via normal gameplay; each value is a dictionary whose keys are the legal moves for that state and whose values are the moves' associated weights.
When the AI is about to make its move, it reads the current state of the board, retrieves the entry corresponding to that state in the dictionary, and chooses a legal move based on the moves' weights. It also remembers each state it encountered and the move it made there during each game.
When the game is over, the result is fed back to the AI. If the AI won, every move it made during the game gets a much greater chance of being made again when the same board states are encountered; for a tie the moves get a slightly greater chance of being repeated; and if it lost, the moves become less likely to be repeated.
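To make those two steps concrete, here is a minimal sketch of the weighted move selection and the post-game reinforcement. The helper names pick_move and reinforce are made up for illustration; the full code below does the same thing in TrueMenace.normal_move and TrueMenace.back_propagate, using deltas of +16 for a win, +3 for a tie and -2 for a loss.

import random
DELTAS = {"Win": 16, "Tie": 3, "Loss": -2}
def pick_move(weights: dict, state: str) -> int:
    # Weighted random choice among the legal moves stored for this state.
    entry = weights[state]
    return random.choices(list(entry), weights=list(entry.values()))[0]
def reinforce(weights: dict, history: list, result: str) -> None:
    # `history` is the list of (state, move) pairs played during one game.
    delta = DELTAS[result]
    for state, move in history:
        # Clamp at 1 so a move is never eliminated outright.
        weights[state][move] = max(weights[state][move] + delta, 1)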
I just learned that the technical term for this is reinforcement learning. In my simulations it works very well: letting the AI play 1048576 games against another AI that knows just one strategy, the learning AI consistently avoids losing in more than 95% of the games.
But there is a key problem: the AI doesn't know the aforementioned simple strategy. The game ends as soon as three of the same pieces appear on a single line, so if you have two pieces on a line and the third spot is empty, it is obvious that your next move should be that empty spot, to win the game.
And of course the same is true for the opponent: when the opponent has two pieces on a line, the third spot is empty, and it is your move, it is equally obvious that you should take the empty spot to prevent the opponent from winning.
That simple strategy is exactly what the simple AI uses: it applies the strategy whenever it is applicable and otherwise chooses an empty spot at random. I used the simple AI to train the advanced AI.
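To be explicit, the simple AI's decision rule amounts to the following sketch (simple_move is a hypothetical name; it relies on the fill_line helper and the random import from the full code below, and Trainer.cross implements the same rule):

def simple_move(board: str, empty_spots: list) -> int:
    # If either player has two pieces on a line and the third spot is empty,
    # take that spot (to win or to block); otherwise pick a random empty spot.
    for piece in ("O", "X"):
        spot = fill_line(board, piece)
        if spot is not None:
            return spot
    return random.choice(empty_spots)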
The problem is that when I tried to let the advanced AI use the simple strategy, its performance decreased. It seems obvious that not using the simple strategy is a serious flaw, yet guaranteeing its use somehow makes the AI lose more. I have run 6 kinds of tests: the AI either doesn't distinguish move order, always plays first, or always plays second, and in each case it either uses the strategy or doesn't (3 * 2). In every scenario, making the AI use the strategy makes it lose more.
from copy import deepcopy
from itertools import cycle, product
import random
from typing import List, Optional, Tuple
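# Each entry is (start, stop, step) slice parameters into the 9-character board
# string: the three rows, the three columns, and the two diagonals.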
LINES = (
(0, 3, 1),
(3, 6, 1),
(6, 9, 1),
(0, 7, 3),
(1, 8, 3),
(2, 9, 3),
(0, 9, 4),
(2, 7, 2),
)
GAMESTATES = {}
def is_playable(board: str) -> bool:
for start, stop, step in LINES:
line = board[start:stop:step]
if len(set(line)) == 1 and line[0] in {"O", "X"}:
return False
return " " in board and abs(board.count("O") - board.count("X")) <= 1
for board in product(" OX", repeat=9):
if is_playable(board):
c = 0
moves = []
for i, s in enumerate(board):
if s == " ":
moves.append(i)
c += 1
GAMESTATES["".join(board)] = {i: c for i in moves}
def check_state(board: str) -> Tuple[bool, Optional[str]]:
    # Returns (game_over, winner); winner is None for a tie or an unfinished game.
    for start, stop, step in LINES:
        line = board[start:stop:step]
        if len(set(line)) == 1 and (winner := line[0]) in {"O", "X"}:
            return True, winner
    return " " not in board, None
OTHER = {"O": "X", "X": "O"}
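# Separate state tables for the two move orders: all non-terminal states reachable
# when O moves first and when X moves first, again with initial weights equal to
# the number of empty cells.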
GAMESTATES_P1 = {}
GAMESTATES_P2 = {}
def tally_states(board: str, move: str, states: dict) -> None:
if board not in states and not check_state(board)[0]:
other = OTHER[move]
moves = []
c = 0
for i, s in enumerate(board):
if s == " ":
tally_states(board[:i] + move + board[i + 1 :], other, states)
moves.append(i)
c += 1
states[board] = {i: c for i in moves}
tally_states(" " * 9, "O", GAMESTATES_P1)
tally_states(" " * 9, "X", GAMESTATES_P2)
assert not GAMESTATES.keys() - (GAMESTATES_P1.keys() | GAMESTATES_P2.keys())
class Game_Board:
def __init__(self) -> None:
self.choices = list(range(9))
self.state = [" "] * 9
@property
def state_string(self) -> str:
return "".join(self.state)
@property
def alternate_state(self) -> str:
return "".join(OTHER.get(i, i) for i in self.state)
def submit(self, choice: int, player: str) -> None:
self.choices.remove(choice)
self.state[choice] = player
def reset(self) -> None:
self.__init__()
def fill_line(board: str, piece: str) -> Optional[int]:
    # Return the index that completes a line of `piece`, or None if there is none.
    for start, end, step in LINES:
        line = board[start:end:step]
        if line.count(piece) == 2 and " " in line:
            return start + line.index(" ") * step
    return None
def find_gaps(board: str, piece: str) -> List[int]:
    # Return every index that would complete a line of `piece`.
    gaps = []
    for start, end, step in LINES:
        line = board[start:end:step]
        if line.count(piece) == 2 and " " in line:
            gaps.append(start + line.index(" ") * step)
    return gaps
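# MENACE-style learner. `states` holds the per-state move weights; when `fill` is
# True, fill_move tries to take a spot that completes a line for either player
# before normal_move falls back to the ordinary weighted choice.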
class TrueMenace:
deltas = {"Win": 16, "Tie": 3, "Loss": -2}
def __init__(self, states: dict, board: Game_Board, fill: bool = False) -> None:
self.boards = []
self.moves = []
self.states = deepcopy(states)
self.board = board
self.fill = fill
self.move = None
def fill_move(self) -> None:
if not self.fill:
return
gaps = []
state = self.board.state_string
for p in ("O", "X"):
gaps.extend(find_gaps(state, p))
if gaps:
moves = self.states[state]
self.move = random.choices(gaps, weights=[moves[g] for g in gaps])[0]
def normal_move(self) -> None:
entry = self.states[self.board.state_string]
move = random.choices(self.board.choices, weights=entry.values())[0]
self.move = move
def get_move(self) -> int:
state = self.board.state_string
self.boards.append(state)
self.fill_move()
if not self.move:
self.normal_move()
move = self.move
self.moves.append(move)
self.move = None
return move
def back_propagate(self, state: str) -> None:
delta = self.deltas[state]
for board, move in zip(self.boards, self.moves):
entry = self.states[board]
weight = entry[move]
entry[move] = max(weight + delta, 1)
self.boards.clear()
self.moves.clear()
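# Runs the training games: O is always the learning TrueMenace player, X is the
# simple win/block-or-random opponent.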
class Trainer:
def __init__(self) -> None:
self.turns = cycle([self.nought, self.cross])
stats = {"games": 0, "O": 0, "X": 0}
self.stats = [stats.copy() for _ in range(6)]
self.board = Game_Board()
self.players = [
TrueMenace(GAMESTATES, self.board),
TrueMenace(GAMESTATES_P1, self.board),
TrueMenace(GAMESTATES_P2, self.board),
TrueMenace(GAMESTATES, self.board, True),
TrueMenace(GAMESTATES_P1, self.board, True),
TrueMenace(GAMESTATES_P2, self.board, True),
]
self.index = 0
def nought(self) -> None:
self.board.submit(self.players[self.index].get_move(), "O")
def cross(self) -> None:
state = self.board.state_string
for p in ("O", "X"):
if (pos := fill_line(state, p)) is not None:
self.board.submit(pos, "X")
break
else:
self.board.submit(random.choice(self.board.choices), "X")
def game(self) -> None:
for _ in range(9):
next(self.turns)()
_, winner = check_state(self.board.state_string)
if winner:
self.stats[self.index][winner] += 1
self.players[self.index].back_propagate(
"Win" if winner == "O" else "Loss"
)
break
else:
self.players[self.index].back_propagate("Tie")
self.stats[self.index]["games"] += 1
self.board.reset()
def train(self, n: int, cheat: bool = False) -> None:
self.index = 3 * cheat
for _ in range(n):
self.game()
def train_alt1(self, n: int, cheat: bool = False) -> None:
self.index = 1 + 3 * cheat
for _ in range(n):
self.turns = cycle([self.nought, self.cross])
self.game()
def train_alt2(self, n: int, cheat: bool = False) -> None:
self.index = 2 + 3 * cheat
for _ in range(n):
self.turns = cycle([self.cross, self.nought])
self.game()
trainer = Trainer()
for i, name in enumerate(("train", "train_alt1", "train_alt2")):
for b in (0, 1):
getattr(trainer, name)(1048576, b)
stats = trainer.stats[i + 3 * b]
print(name, b, stats)
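        # O's win rate and O's non-loss rate (wins plus ties)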
print(
stats["O"] / stats["games"], (stats["games"] - stats["X"]) / stats["games"]
)
train 0 {'games': 1048576, 'O': 723889, 'X': 3183}
0.6903543472290039 0.9969644546508789
train 1 {'games': 1048576, 'O': 679969, 'X': 25232}
0.6484689712524414 0.9759368896484375
train_alt1 0 {'games': 1048576, 'O': 1031051, 'X': 666}
0.9832868576049805 0.9993648529052734
train_alt1 1 {'games': 1048576, 'O': 932999, 'X': 1208}
0.8897771835327148 0.9988479614257812
train_alt2 0 {'games': 1048576, 'O': 621079, 'X': 27380}
0.5923070907592773 0.9738883972167969
train_alt2 1 {'games': 1048576, 'O': 518353, 'X': 62287}
0.4943399429321289 0.9405984878540039
It takes about 9 minutes for the simulation to run.
Why does ensuring that the AI uses the aforementioned strategy decrease its performance?
Update
This gets weirder. I let one AI play first for 1048576 games without using the aforementioned simple strategy of filling the gap. Then I set its fill attribute to True, so that it uses the strategy without fail, and let the same AI play an additional 1048576 games as player 1.
I then created another AI and, as a control, let it play 2097152 games as player 1 without using the simple strategy.
Here is what I found:
AI A, moving first, without using the strategy:
{'games': 1048576, 'O': 1031051, 'X': 666}
AI A, moving first, after enabling the strategy:
{'games': 2097152, 'O': 1439990, 'X': 856}
AI B, moving first, without using the strategy:
{'games': 2097152, 'O': 1945835, 'X': 932}
Without using the strategy, AI A wins 98.33% of its first 1048576 games. After enabling the strategy, the same AI wins dramatically less: its cumulative win rate over all 2097152 games drops to 68.66%, which means it won only about 39% of the second batch of 1048576 games (408939 wins).
In contrast, AI B, playing first without the strategy for all 2097152 games, wins 92.78% of the time. It has played twice as many games as AI A had in its first batch, yet its win rate is somehow lower than AI A's initial one; still, it plays much better than AI A does after AI A starts using the strategy.
Why does using the strategy decrease the AI's performance, and why does more training also seem to decrease performance?