Creating an MDP // Artificial Intelligence for 2D game w/ multiple terminals


I'm sure everyone has heard about the Berkeley Pac-Man AI projects at some point or another. A while ago I created a 2D platformer (non-scrolling) and figured it would be pretty cool to take some inspiration from that project and build an AI for my game instead of Pac-Man. That said, I've found myself very stuck. I've looked at several GitHub solutions for the Pac-Man projects as well as plenty of articles on implementing MDPs / reinforcement learning in Python, but I'm having a hard time relating them back to my game.

In my game, I have 10 levels. Each level has fruit, and once the agent grabs all the fruit, he completes the level and the next level starts. Here's an example stage (screenshot):

As you can see in the picture, my agent is the little squirrel and he has to grab all the cherries. On the ground there are also spikes that he can't walk on (or you lose a life). You can avoid spikes by jumping; a jump moves the agent 2 spaces to the left or right (depending on which way he's facing). Other than that you can move left, right, and up or down on a ladder. You can't jump across more than 1 space, so for the 2+ space gaps you see at the top you'd have to climb down the ladder and go around.

Additionally (not pictured above), there are enemies that move only left and right along a floor and that you have to dodge (you can jump over them or just avoid them); they're tracked on a grid I'll talk about below. So that's a bit about the game. If you need any more clarification, feel free to ask. Now let me go into what I've tried so far and see if someone can help me get something together.

In my code I have a grid that holds every space on the map and what it is (platform, regular spot, spike, reward/fruit, ladder, etc.). Using that, I created an action table (code below): a dictionary that stores, for each location, the moves the agent can make from it.

    # Build the action table: for every cell in the state grid, record the moves
    # the agent can make from it.  Tile codes in move_grid: 6 = ladder, 7 = spike,
    # 1/8/9/10 = walkable floor, 0/2/3/4/5 = cells the agent can't occupy.
    NON_WALKABLE = {0, 2, 3, 4, 5}
    WALKABLE = {1, 8, 9, 10}

    for r in range(len(state_grid)):
        for c in range(len(state_grid[r])):
            tile = move_grid[r][c]
            cell = state_grid[r][c]

            # Rows 0, 3, 6 and 9 can only be occupied while climbing a ladder.
            if r in (0, 3, 6, 9):
                actions[cell] = [('Down', 'Up')] if tile == 6 else 'None'
            elif tile in NON_WALKABLE:
                actions[cell] = 'None'
            elif tile == 6:  # ladder
                below = move_grid[r + 1][c]
                if below in {2, 3, 4, 5}:
                    # Top of a ladder, with solid ground in the cell below.
                    if 0 < c < 18:
                        actions[cell] = [('Left', 'Right', 'Up', 'Jump Left', 'Jump Right')]
                    elif c == 0:
                        actions[cell] = [('Right', 'Up', 'Jump Right')]
                    elif c == 18:
                        actions[cell] = [('Left', 'Up', 'Jump Left')]
                elif below == 6:
                    # Mid-ladder: can only climb.
                    actions[cell] = [('Down', 'Up')]
            elif tile in WALKABLE:
                below = move_grid[r + 1][c]
                if 0 < c < 18:
                    if below == 6:
                        actions[cell] = [('Down', 'Left', 'Right', 'Jump Left', 'Jump Right')]
                    else:
                        actions[cell] = [('Left', 'Right', 'Jump Left', 'Jump Right')]
                elif c == 0:
                    if below == 6:
                        actions[cell] = [('Down', 'Right', 'Jump Right')]
                    else:
                        actions[cell] = [('Right', 'Jump Right')]
                elif c == 18:
                    if below == 6:
                        actions[cell] = [('Down', 'Left', 'Jump Left')]
                    else:
                        actions[cell] = [('Left', 'Jump Left')]
            elif tile == 7:
                actions[cell] = 'Spike'
            else:
                actions[cell] = 'WTF'  # unrecognised tile code

At this point I'm stuck on what exactly needs to be passed to the MDP. I'm using the Berkeley MDP code and I don't know how to start wiring it in. I have most of the data (the grids, where everything is); I'm just not sure how to actually get the ball rolling.
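From reading the project's gridworld code, I think the shape of what I need is a wrapper like the one below. The method names are taken from the Berkeley mdp.py as I understand it, so they're worth double-checking against your copy; apply_action is a stand-in for my movement code, and I'm assuming the states (the values in state_grid) are just (row, col) cells.

    # Sketch: wrapping the level data in the Berkeley MarkovDecisionProcess
    # interface.  apply_action is a hypothetical helper that returns the cell
    # reached by taking an action from a state.
    import mdp  # the Berkeley project's mdp.py


    class PlatformerMDP(mdp.MarkovDecisionProcess):
        def __init__(self, states, action_table, reward_grid, start, fruit_cells):
            self.states = states              # list of (row, col) cells
            self.action_table = action_table  # the `actions` dict built above
            self.reward_grid = reward_grid
            self.start = start
            self.fruit_cells = fruit_cells    # cells that still hold fruit

        def getStates(self):
            return list(self.states)

        def getStartState(self):
            return self.start

        def getPossibleActions(self, state):
            entry = self.action_table.get(state)
            if entry is None or entry in ('None', 'Spike', 'WTF'):
                return ()
            return entry[0]  # the tuple of action names stored above

        def getTransitionStatesAndProbs(self, state, action):
            # Movement is deterministic: each action leads to exactly one cell.
            return [(apply_action(state, action), 1.0)]

        def getReward(self, state, action, nextState):
            r, c = nextState
            return self.reward_grid[r][c]

        def isTerminal(self, state):
            # For now, landing on any fruit ends the episode -- this is exactly
            # where the "multiple fruit" problem shows up.
            return state in self.fruit_cells

If that's roughly the right direction, I'd expect the project's value-iteration code to run against it directly, but I'm not sure.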

I have a boolean grid that tracks all the harmful objects (spikes and enemies) and is updated constantly since the enemies move every second.

I created a reward grid (a rough sketch of how it's rebuilt appears after this list) that sets:

  • Spikes: -5 reward
  • Falling off the map: -5 reward
  • Fruit: +5 reward
  • Everything else: -0.2 reward (a small step cost, since you want the agent to finish the level efficiently rather than wander around all day).

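In case it helps, here's roughly how that reward grid gets put together each tick from the tile grid and the hazard grid. TILE_FRUIT is a placeholder for whatever code my tile grid uses for an uncollected fruit.

    # Rough sketch of rebuilding the reward grid each tick.  hazard_grid is the
    # boolean grid of spikes/enemies mentioned above.
    def build_reward_grid(tile_grid, hazard_grid):
        rewards = []
        for r, row in enumerate(tile_grid):
            reward_row = []
            for c, tile in enumerate(row):
                if hazard_grid[r][c]:        # spike, or an enemy currently here
                    reward_row.append(-5.0)
                elif tile == TILE_FRUIT:     # fruit still on the map
                    reward_row.append(5.0)
                else:                        # every other cell: small step cost
                    reward_row.append(-0.2)
            rewards.append(reward_row)
        # Falling off the map is treated as a -5 terminal transition elsewhere,
        # not as a cell value here.
        return rewards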
Another part of my issue while researching solutions is that most examples are a car driving to a single position, so they only have one reward location, whereas my levels have multiple fruits each. I'm just super stuck and getting frustrated with this. I wanted to try my own thing, but if I can't do this I might as well just do the Pac-Man one, since there are several solutions online. I appreciate your time and help with this!

Edit: Here's me trying to make an example call and get the move grid back, although from my understanding (and obviously) it will have to change dynamically after each step the agent makes, since the enemies threaten different spots over time. (screenshot of the call)
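Concretely, what I'm attempting each tick is roughly the loop below. ValueIterationAgent is the one from the Berkeley valueIterationAgents.py, so the constructor arguments may not match exactly, and the game.* accessors are made-up names for my own game object.

    # Very rough per-tick control loop: rebuild the rewards around the current
    # enemy positions, re-solve the MDP, and take the greedy action.
    def act_one_tick(game, discount=0.9, iterations=50):
        hazards = game.current_hazards()                     # boolean grid
        rewards = build_reward_grid(game.tile_grid, hazards)
        level_mdp = PlatformerMDP(game.cells, actions, rewards,
                                  game.agent_cell, game.fruit_cells)
        solver = ValueIterationAgent(level_mdp, discount, iterations)
        return solver.getAction(game.agent_cell)             # move for this tick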

This is the result; you can see it's close, but it's clearly tripping up on the fact that there are multiple rewards. I feel like I might be close, but I'm not really sure, and I'm a little frustrated at this point. (screenshot of the resulting move grid)
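The best fix I've been able to come up with so far is to fold the set of uncollected fruit into the state itself, so picking up a fruit just moves the agent into a different "layer" of the state space. This is only a sketch of the idea; step is a stand-in for my movement code.

    # Sketch: a state is (agent cell, frozenset of fruit still on the map), so
    # collecting a fruit is an ordinary transition to a smaller fruit set.
    from typing import FrozenSet, Tuple

    Cell = Tuple[int, int]
    State = Tuple[Cell, FrozenSet[Cell]]

    def transition(state: State, action: str, step) -> Tuple[State, float]:
        pos, fruit_left = state
        new_pos = step(pos, action)          # where this action lands the agent
        if new_pos in fruit_left:
            reward = 5.0                     # picked up a fruit
            fruit_left = fruit_left - {new_pos}
        else:
            reward = -0.2                    # normal step cost
        return (new_pos, fruit_left), reward

    def is_terminal(state: State) -> bool:
        _, fruit_left = state
        return not fruit_left                # level complete: no fruit left

The downside is that the state space blows up with the number of fruit, but at least there would be a single well-defined goal (the empty fruit set) instead of several competing reward cells. Does that sound like the right direction, or is there a simpler way?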

1 Answer

Creating an artificial intelligence like this is largely about managing the state space. In your 2D platformer, the state space is managed by a map table: the agent can be at position x/y, and that position is tied to a reward given by the table. The idea behind Q-learning is to build up the state-action values from scratch through trial and error. In that sense, from a reinforcement-learning perspective, the task is already solved, congratulations: if the software contains a Q-table that stores the rewards, and the rewards are determined automatically, it is a reinforcement-learning agent.

Will the 2D platformer actually profit from this approach? Perhaps not, and that is why some frustration is understandable. The problem is that tying the player's raw map position to a reward is too mechanical. It won't produce a human-like policy but a non-working controller: the prediction is that the agent will fail the level, unable to avoid the spikes or collect the fruit.

Improving the AI is straightforward: invent a new state space. That means mapping the agent's situation to a Q-table that can be learned. Consider the agent's possible situations: it can be near a ladder, near a fruit, near spikes, or about to fall off an edge. These situations can be combined, so several can hold at the same time, and that qualitative description is what gets mapped to the Q-table. What is the difference from the previous state-space description? In the first version, the state of the agent was a plain x/y value; in the better approach, the state is given by linguistic variables ("near a ladder", "near a fruit"). Such a linguistic grounding also makes the algorithm easier to debug when something goes wrong with the agent, because the communication between the developer and the software has improved.
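As a rough illustration of that idea (the near_* helpers stand for whatever proximity checks the game already has; the update rule is standard Q-learning):

    # Q-table keyed by qualitative features instead of raw x/y coordinates.
    from collections import defaultdict
    import random

    ACTIONS = ['Left', 'Right', 'Up', 'Down', 'Jump Left', 'Jump Right']
    q_table = defaultdict(float)              # (features, action) -> value

    def featurize(agent_pos, level):
        # Each feature is a yes/no statement about the agent's surroundings.
        return (near_ladder(agent_pos, level),
                near_fruit(agent_pos, level),
                near_spike(agent_pos, level),
                about_to_fall(agent_pos, level))

    def choose_action(features, epsilon=0.1):
        if random.random() < epsilon:         # explore now and then
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: q_table[(features, a)])

    def learn(features, action, reward, next_features, alpha=0.5, gamma=0.9):
        best_next = max(q_table[(next_features, a)] for a in ACTIONS)
        q_table[(features, action)] += alpha * (
            reward + gamma * best_next - q_table[(features, action)])

With only a handful of binary features the Q-table stays tiny, and a misbehaving agent can be debugged simply by printing the feature tuple it saw.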