I've been trying to get my head around this for a couple of weekends now. I've been watching tutorials and reading up, but I'm still missing key bits of what's actually doable with a dataset. I've struggled to find any resources on actual dataset creation; all the tutorials just seem to use plug-and-play datasets without any reasoning behind the data.
I've been messing around in Google Colab with TensorFlow, Pandas and NumPy, trying to build a model that predicts football events in games. For example, this dataset is for predicting whether a player will have a shot in the game.
I initially had this all in a nested JSON format, so the first thing I tried was flattening it with the Pandas json_normalize() function.
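For context, here is a minimal sketch of what json_normalize() does with a nested record. The record below is hypothetical (the field names are only modelled on the columns described in this post), and it assumes nothing beyond Pandas itself. Nested objects become underscore-joined columns, while a list stays as a single list-valued cell.

```python
import pandas as pd

# Hypothetical nested record, made up for illustration only.
record = {
    "Position": "FW",
    "PlayerTeam": {"HomeOrAway": "Home", "AverageShotsPerGame": 9.6},
    "PlayerStyles": {"Strengths": {"Strong": ["Passing", "Aerial Duels"]}},
}

# Nested dicts are flattened into columns joined by `sep`;
# the list under "Strong" is left as one list-valued cell.
flat = pd.json_normalize([record], sep="_")
columns = sorted(flat.columns)
print(columns)
print(flat.loc[0, "PlayerStyles_Strengths_Strong"])
```

So one record is still one row at this point; the column count only grows once the nested values are expanded or encoded.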
The issue I had with this was that, when it came to the actual prediction, the model ended up with something like 20,000 input features.
So I tried to flatten my structure and make everything as generic as possible. I'm still struggling with the same thing though: the features are just exploding in size. I think my full dataset has 153 different columns, including the result column.
There is a mixture of data types, for example:
| Column Name | Type | Example Data | Notes |
|---|---|---|---|
| PlayerTeam_HomeOrAway | String | Home | Values are Home or Away. Could be converted to Number/Boolean |
| PlayerTeamGamesAnalysed | Number | 5 | |
| PlayerTeamAverageShotsPerGame | Decimal | 9.6 | |
| Position | String | FW | Different player positions, FW, DC, MC etc. |
| gamesWithShots_game_1_sub | Boolean | FALSE | |
| PlayerStyles_Strengths_Strong | List | Passing, Holding on to the ball, Aerial Duels | Can be different lengths depending on the player in question. |
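As a rough sketch of how the simpler column types in the table could be turned into numbers with Pandas: the two-row frame below is made-up sample data, and the `PlayerTeam_IsHome` column name is a hypothetical choice, not something from the real dataset.

```python
import pandas as pd

# Made-up sample rows using column names from the table above.
df = pd.DataFrame({
    "PlayerTeam_HomeOrAway": ["Home", "Away"],
    "Position": ["FW", "DC"],
    "gamesWithShots_game_1_sub": [False, True],
})

# A two-valued string fits in a single 0/1 column (hypothetical name).
df["PlayerTeam_IsHome"] = (df["PlayerTeam_HomeOrAway"] == "Home").astype(int)
df = df.drop(columns=["PlayerTeam_HomeOrAway"])

# Booleans convert directly to 0/1.
df["gamesWithShots_game_1_sub"] = df["gamesWithShots_game_1_sub"].astype(int)

# A small categorical becomes one 0/1 column per distinct value.
encoded = pd.get_dummies(df, columns=["Position"], dtype=int)
print(encoded)
```

Each input row is still exactly one output row here; only the number of columns changes.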
So I think the issue has to be columns such as PlayerStyles_Strengths_Strong. Everything else I've managed to break out of a nested structure into a single value. I've got a number of columns like this for weaknesses, team strengths etc.
I don't understand how to structure this data in the CSV file, however. I want this to be treated as a single record, but it seems to be one-hot encoding each list value into a new record. I may be completely wrong here; that's just from my initial research so far. I think this is why, when I then try to run a prediction using the same dataset structure, it tells me the features don't match up.
I'm not sure if it's something I need to do directly with Pandas or TensorFlow, or if it's a CSV structure issue.
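One possible cause of a features mismatch like this, offered as a guess rather than a diagnosis: if categorical columns are encoded separately for the training data and for the prediction input, the prediction input only produces columns for the values it happens to contain. The sketch below uses made-up data and assumes the encoding is done with Pandas; it records the training column list and reindexes the new data against it.

```python
import pandas as pd

# Made-up data: training sees three positions, the new input only one.
train = pd.DataFrame({"Position": ["FW", "DC", "MC"]})
new = pd.DataFrame({"Position": ["FW"]})

train_X = pd.get_dummies(train, columns=["Position"], dtype=int)
feature_columns = list(train_X.columns)  # would be saved alongside the model

# Encoded on its own, the new row gets only a Position_FW column.
new_X = pd.get_dummies(new, columns=["Position"], dtype=int)

# Align to the training layout: missing columns are added as 0,
# and the column order matches what the model was trained on.
new_X = new_X.reindex(columns=feature_columns, fill_value=0)
print(new_X)
```

The same alignment step would apply to any 1/0 trait columns derived from list-valued fields.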
My first solution idea was to add a column for every type of strength etc., then put a 1 or 0 in the field depending on whether the player/team has the trait. I'd write this into the Python script that converts the JSON to CSV. Before going through this laborious process, I thought I'd check whether there's something obvious I'm missing. My AI modelling knowledge is quite literally, as mentioned, what I've cobbled together from YouTube, Udemy and some Medium and GeeksforGeeks articles.
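For reference, the 1/0 column idea may not need to be written by hand. A minimal sketch, assuming each list is stored in a single CSV cell as a comma-and-space separated string (as in the example data in the table), with a hypothetical `Strong_` prefix for the new columns:

```python
import pandas as pd

# Made-up rows; the list lives in one cell as a ", "-separated string.
df = pd.DataFrame({
    "Position": ["FW", "DC"],
    "PlayerStyles_Strengths_Strong": [
        "Passing, Holding on to the ball, Aerial Duels",
        "Aerial Duels",
    ],
})

# One 0/1 column per distinct trait; rows keep their original index,
# so each player remains a single record.
strengths = df["PlayerStyles_Strengths_Strong"].str.get_dummies(sep=", ")
strengths = strengths.add_prefix("Strong_")  # hypothetical prefix

result = pd.concat(
    [df.drop(columns=["PlayerStyles_Strengths_Strong"]), strengths],
    axis=1,
)
print(result)
```

The number of columns added equals the number of distinct traits across the whole dataset, not the length of any one player's list, so a variable-length list still ends up as a fixed-width row.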
Thanks