Trouble understanding how to use list of String data in a Machine Learning dataset - Features expanded before making prediction

Asked by GHQWE on 28 March 2024 at 20:11

I've been trying to get my head around this for a couple of weekends now. I've been watching tutorials and reading up, but I'm still missing key bits of what's actually doable with a dataset. I've struggled to find any resources on actual dataset creation; all the tutorials just seem to use plug-and-play datasets without any reasoning behind the data.

I've been messing around trying to build a model to predict football events in games, using Google Colab with TensorFlow, Pandas and NumPy. For example, this dataset is meant to predict whether a player will have a shot in the game.

I initially had this all in a nested JSON format, and first tried flattening it with Pandas' json_normalize() function.

The issue I had with this was that when it came to the actual prediction, the model expected something like 20,000 input features.

So I tried to flatten my structure and make everything as generic as possible. I'm still struggling with the same thing though: the features are just exploding in size. I think my full dataset has 153 columns, including the result column.
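For context, here is a minimal sketch of what json_normalize() does to nested records. The records and field names below are made up for illustration and are not the real dataset. Nested objects become prefixed columns, while a list value stays as a single list-valued cell rather than being expanded.

```python
import pandas as pd

# Hypothetical nested records; the field names are illustrative only.
records = [
    {
        "player": {"name": "A", "position": "FW"},
        "team": {"homeOrAway": "Home", "avgShots": 9.6},
        "strengths": ["Passing", "Aerial Duels"],
    },
    {
        "player": {"name": "B", "position": "DC"},
        "team": {"homeOrAway": "Away", "avgShots": 7.1},
        "strengths": ["Tackling"],
    },
]

# Nested dicts are flattened into columns such as "player_position".
flat = pd.json_normalize(records, sep="_")

print(sorted(flat.columns))
# The list is kept as one Python list per row, still one row per record.
print(flat["strengths"].tolist())
```

The list column still has to be encoded separately before it can be fed to a model.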

There is a mixture of types of data

So for example:

Column Name	Type	Example Data	Notes
PlayerTeam_HomeOrAway	String	Home	Values are Home or Away. Could be converted to Number/Boolean
PlayerTeamGamesAnalysed	Number	5
PlayerTeamAverageShotsPerGame	Decimal	9.6
Position	String	FW	Different player positions, FW, DC, MC etc.
gamesWithShots_game_1_sub	Boolean	FALSE
PlayerStyles_Strengths_Strong	List	Passing, Holding on to the ball, Aerial Duels	Can be different lengths depending on the player in question.
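As a rough sketch of the conversion mentioned in the Notes column, two-valued and boolean columns can each become a single 0/1 feature. The sample rows and the new column name PlayerTeam_IsHome are assumptions for illustration.

```python
import pandas as pd

# Two made-up rows using column names from the table above.
df = pd.DataFrame({
    "PlayerTeam_HomeOrAway": ["Home", "Away"],
    "gamesWithShots_game_1_sub": [False, True],
})

# Home/Away has only two values, so one 0/1 column is enough.
# "PlayerTeam_IsHome" is a hypothetical name for the new column.
df["PlayerTeam_IsHome"] = (df["PlayerTeam_HomeOrAway"] == "Home").astype(int)
df = df.drop(columns="PlayerTeam_HomeOrAway")

# Booleans map directly onto 0/1.
df["gamesWithShots_game_1_sub"] = df["gamesWithShots_game_1_sub"].astype(int)

print(df)
```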

So I think the issue with this has to be the columns such as PlayerStyles_Strengths_Strong. Everything else I've managed to break out of a nested structure into a single value. I've got a number of columns like this for weaknesses, team strengths etc.

I don't understand how to structure this data in the CSV file, however. I want this to be treated as a single record, but it seems to be one-hot encoding each value into a new record. I may be completely wrong here; that's just from my initial research on it so far. That is why, when I then try to run a prediction using the same dataset structure, it tells me the features don't match up.

I'm not sure if it's something I need to do directly with Pandas or TensorFlow, or if it's a CSV structure issue.

My first solution idea was to add a column for every type of strength etc., then assign a 1 / 0 to the field depending on whether the player/team has the trait. I'd write this into the Python script that converts the JSON to CSV. Before going through this laborious process, I thought I'd see if there was something obvious I'm missing. My AI modelling knowledge is quite literally, as mentioned, what I've cobbled together from YouTube, Udemy and some Medium + GeeksforGeeks articles.
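One common cause of a feature mismatch at prediction time, which may or may not be what is happening here, is that one-hot encoding a small prediction frame produces fewer columns than the training frame did. A minimal sketch of one way to handle it, assuming pd.get_dummies is used for the encoding and using made-up rows: save the training column list and reindex the prediction frame to it.

```python
import pandas as pd

# Made-up training rows and a single row to predict on.
train = pd.DataFrame({
    "Position": ["FW", "DC", "MC"],
    "PlayerTeamGamesAnalysed": [5, 5, 4],
})
new = pd.DataFrame({"Position": ["FW"], "PlayerTeamGamesAnalysed": [5]})

# Encode the training data and remember the exact feature columns.
X_train = pd.get_dummies(train, columns=["Position"], dtype=int)
feature_columns = X_train.columns.tolist()

# Encoding the new row alone only yields Position_FW ...
X_new = pd.get_dummies(new, columns=["Position"], dtype=int)
# ... so align it to the training columns, filling absent ones with 0.
X_new = X_new.reindex(columns=feature_columns, fill_value=0)

print(feature_columns)
print(X_new)
```

Note that reindex also silently drops any column that was not seen during training, so unseen categories end up as all zeros.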
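The 1 / 0 column-per-trait idea described above is usually called multi-hot encoding, and it does not have to be written by hand. A minimal sketch, assuming the list is stored in the CSV as a single comma-separated string as in the example table: Series.str.get_dummies creates one column per distinct value and keeps one row per player. The second row and the "Strength_" prefix are made up for illustration.

```python
import pandas as pd

# Two made-up rows; the first strengths value comes from the example table.
df = pd.DataFrame({
    "Position": ["FW", "DC"],
    "PlayerStyles_Strengths_Strong": [
        "Passing, Holding on to the ball, Aerial Duels",
        "Aerial Duels",
    ],
})

# One 0/1 column per distinct strength, split on ", ".
strengths = df["PlayerStyles_Strengths_Strong"].str.get_dummies(sep=", ")
strengths = strengths.add_prefix("Strength_")  # hypothetical prefix

# Swap the list column for its encoded columns; the row count is unchanged.
encoded = pd.concat(
    [df.drop(columns="PlayerStyles_Strengths_Strong"), strengths], axis=1
)

print(encoded)
```

The number of added columns equals the number of distinct traits, not the number of combinations, so the feature count grows with the trait vocabulary rather than with the list lengths.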

Thanks
