Grouping Multiple Rows of Data For Use In scikit-learn Random Forest Machine Learning Model


I'm having a difficult time phrasing my question, so if there's anything unclear that I can improve upon, please let me know. My goal is ultimately to determine the location of an RF transmitter using a machine learning model. There are many other techniques that could be used to identify the source of an RF signal (including triangulation and time offsets between multiple receivers), but that's not the point. I want to see if I can make this work with an ML model.

I'm attempting to use a RandomForestClassifier from scikit-learn to build a model for identifying the source of an RF signal, given the signal strength on several receivers scattered across a known area. These receivers are all linked (via network) to a central database. The receivers are in fixed locations, but the transmitter could be anywhere, so the signal strength into a receiver primarily depends on whether the transmitter has direct line of sight to the receiver. Receivers measure signal strength (RSSI) from 1 to 255. A 0 means the receiver didn't hear anything, so it's not recorded in the database. The fact that it won't be recorded will be important in a moment. An RSSI of 255 indicates full scale into a particular receiver.

The database logs the receiver data every second (please see table below). Each group of time is a representation of what the signal looked like at each receiver at that time. As stated, if the signal wasn't heard on a receiver, it won't be logged into the database, so each group of time could have as few as 1 row, or as many as X rows, where X represents the total number of receivers in the system (e.g., if there are ten receivers listening on the same frequency and each receiver hears the signal, a row for all ten receivers will show up in the database, but if only three of those ten hear the signal, only three rows will be recorded in the database). Essentially, I'm trying to correlate what signal strengths look like in a database with known locations. For example, strong into Red and Green means the signal is likely coming from Foo, whereas strong signals into Red and Yellow, with a weak signal into Blue means the signal probably came from Bar. The known location data is built out manually by observing what a signal looks like when a transmitter is in a known location. It's a very tedious process.

The way the receiver data is logged (across multiple rows and never knowing how many rows will show up in the dataset) is causing an obvious challenge for me when I'm trying to model the data because the RandomForestClassifier looks at each row individually. I need the data to be grouped by date/time, but not knowing how many receivers are going to hear the signal at any given time makes it difficult for me to model the data in a more logical way. At least I haven't come up with any good ideas.

The table below contains a few seconds of signal data from a known location (Region A). Does anybody have any suggestions for how I could restructure this data to make it useful with the RandomForestClassifier from scikit-learn?

Receiver Name Time RSSI Location
Red 2024-03-21 20:37:58 182 Region A
Blue 2024-03-21 20:37:58 254 Region A
Green 2024-03-21 20:37:58 208 Region A
Red 2024-03-21 20:37:59 192 Region A
Blue 2024-03-21 20:37:59 254 Region A
Green 2024-03-21 20:37:59 215 Region A
Red 2024-03-21 20:38:00 202 Region A
Blue 2024-03-21 20:38:00 254 Region A
Green 2024-03-21 20:38:00 207 Region A
Yellow 2024-03-21 20:38:00 17 Region A
Red 2024-03-21 20:38:01 189 Region A
Blue 2024-03-21 20:38:01 254 Region A
Green 2024-03-21 20:38:01 225 Region A
Yellow 2024-03-21 20:38:01 16 Region A
Red 2024-03-21 20:38:02 204 Region A
Blue 2024-03-21 20:38:02 255 Region A
Green 2024-03-21 20:38:02 213 Region A
Yellow 2024-03-21 20:38:02 18 Region A
Red 2024-03-21 20:38:03 180 Region A
Blue 2024-03-21 20:38:03 254 Region A
Green 2024-03-21 20:38:03 214 Region A
Yellow 2024-03-21 20:38:03 13 Region A
Red 2024-03-21 20:38:04 182 Region A
Blue 2024-03-21 20:38:04 254 Region A
Green 2024-03-21 20:38:04 213 Region A
Yellow 2024-03-21 20:38:04 12 Region A

Below is the Python code I started with. It still looks at each row individually. I've also never worked with scikit-learn or Python, so I'm not confident anything below is correct:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

data = pd.read_csv("data/combined.csv", header=0)

label_encoder = LabelEncoder()
print(data.columns)
data["name_encoded"] = label_encoder.fit_transform(data["name"])
data["location_encoded"] = label_encoder.fit_transform(data["location"])

x = data[["rssi", "name_encoded"]]  # Features (rssi and encoded name)
y = data["location_encoded"]  # Target (encoded location)

# Split data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Create the Random Forest model
model = RandomForestClassifier(n_estimators=100)

# Train the model
model.fit(x_train, y_train)

# Make predictions on the testing set
y_pred = model.predict(x_test)

# Decode predictions
location_decoder = LabelEncoder()

location_decoder.fit(data["location"])  # Fit the decoder with original locations
predicted_locations = location_decoder.inverse_transform(y_pred)
print("Predicted locations:", predicted_locations)

Thank you in advance for any help.

fam-woodpecker On BEST ANSWER

I think it would be best to transform your dataset from long to wide, so that a single row has the following columns: "Time", "Red", "Yellow", "Green", "Blue", "Location". Remember, an input into the model (a row in your dataset) should contain all the features you want, so just imagine them as your columns.

I achieved this from your data in the following way:

# long to wide pivot
data = pd.pivot(df, index=["Time", "Location"], columns = "Receiver", values="RSSI")

# move "Time" and "Location" out of the index and back into columns
data.reset_index(drop=False, inplace=True)

# replace NaN with 0
data.fillna(0, inplace = True)

# order columns to have location last
data = data[["Time", *df.Receiver.unique(), "Location"]]

Here, df is a pandas DataFrame that I built from the table you shared above.

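This gives us the following (first five rows, shown in the column order right after the pivot and reset_index, before the final reorder):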

Time Location Blue Green Red Yellow
2024-03-21 20:37:58 Region A 254.0 208.0 182.0 0.0
2024-03-21 20:37:59 Region A 254.0 215.0 192.0 0.0
2024-03-21 20:38:00 Region A 254.0 207.0 202.0 17.0
2024-03-21 20:38:01 Region A 254.0 225.0 189.0 16.0
2024-03-21 20:38:02 Region A 255.0 213.0 204.0 18.0
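For anyone who wants to reproduce the pivot, here is a minimal self-contained sketch. It assumes the column names "Receiver", "Time", "RSSI" and "Location" (the names used in the pivot call above; your CSV may use different ones) and hard-codes two of the timestamps from the question's table.

```python
import pandas as pd

# Two timestamps copied from the question's table (long format).
# The column names are assumptions that match the pivot call above.
rows = [
    ("Red",    "2024-03-21 20:37:59", 192, "Region A"),
    ("Blue",   "2024-03-21 20:37:59", 254, "Region A"),
    ("Green",  "2024-03-21 20:37:59", 215, "Region A"),
    ("Red",    "2024-03-21 20:38:00", 202, "Region A"),
    ("Blue",   "2024-03-21 20:38:00", 254, "Region A"),
    ("Green",  "2024-03-21 20:38:00", 207, "Region A"),
    ("Yellow", "2024-03-21 20:38:00", 17,  "Region A"),
]
df = pd.DataFrame(rows, columns=["Receiver", "Time", "RSSI", "Location"])

# One row per (Time, Location), one column per receiver.
wide = df.pivot(index=["Time", "Location"], columns="Receiver", values="RSSI")

# A receiver that heard nothing has no row in the log, so it appears as NaN here;
# treat that as an RSSI of 0.
wide = wide.fillna(0).reset_index()
wide.columns.name = None  # drop the leftover "Receiver" axis label

print(wide)
```

If the same receiver can log more than one row for the same second, pivot raises a ValueError about duplicate entries. pd.pivot_table with an aggregation such as aggfunc="max" is one possible way around that.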

As for your training code, a few tweaks. Typical practice is to keep the label (target) as the final column of your dataset, so you can simply slice the DataFrame as

x = data.iloc[:,:-1]
y = data.iloc[:,-1]

Also, LabelEncoder should only be used for the target column. You used it on "name", which is a feature, and that is not correct. For features you would typically use something like OrdinalEncoder instead. We don't need that here anymore, because the receiver names are now columns in their own right. You also created a second LabelEncoder to decode the output, but you had already fitted one on the targets, so just reuse it.
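A small sketch of the difference between the two encoders, in case it helps. The "Region B" label is invented purely for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical labels; "Region B" is invented for illustration.
locations = ["Region A", "Region B", "Region A"]

# LabelEncoder is meant for the 1-D target column.
le = LabelEncoder()
y = le.fit_transform(locations)  # classes are sorted: Region A -> 0, Region B -> 1

# Reuse the same fitted encoder to turn encoded values back into names.
decoded = le.inverse_transform(y)

# OrdinalEncoder is the feature-side counterpart and expects 2-D input.
names = np.array([["Red"], ["Blue"], ["Green"]])
oe = OrdinalEncoder()
name_codes = oe.fit_transform(names)  # shape (3, 1), float codes

print(y, decoded, name_codes.ravel())
```

The fitted encoder keeps its mapping in le.classes_, which is why the same object can decode the model's predictions later.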

You'll also need to convert your time column into a number. Any method is fine really, so try just converting to int. Be wary that this number is very big. Many ML models do better with values between 0 and 1, so maybe consider using a scaler (tree-based models such as random forests are largely insensitive to feature scale, so this matters less here).
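To illustrate a few ways of turning timestamps into numbers, here is a rough sketch using three timestamps from the question. The manual min-max step at the end is only a stand-in for a proper scaler such as scikit-learn's MinMaxScaler.

```python
import pandas as pd

times = pd.to_datetime(pd.Series([
    "2024-03-21 20:37:58",
    "2024-03-21 20:37:59",
    "2024-03-21 20:38:00",
]))

# toordinal() counts whole days, so every sample from the same day gets the same value.
ordinals = times.apply(lambda t: t.toordinal())

# Seconds since the Unix epoch keep the time of day, but the numbers are large.
epoch_seconds = (times - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)

# Seconds since the first sample are small; dividing by the span maps them to [0, 1].
elapsed = (times - times.min()).dt.total_seconds()
scaled = elapsed / elapsed.max()

print(ordinals.nunique(), epoch_seconds.tolist(), scaled.tolist())
```

Note that the day-level ordinal cannot tell samples from the same day apart, so a seconds-based value is probably the better choice if you want the time of day to be visible to the model.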

Here is what I would do as a small update to your training code:

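import numpy as np

le = LabelEncoder()
y = le.fit_transform(y)

# work on a copy so the assignments below don't trigger SettingWithCopyWarning
x = x.copy()
x["Time"] = pd.to_datetime(x["Time"])
# toordinal() keeps only the date (day number), so the time of day is dropped
x["Time"] = x["Time"].apply(lambda t: t.toordinal())
x = np.array(x)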

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Create the Random Forest model
model = RandomForestClassifier(n_estimators=100)

# Train the model
model.fit(x_train, y_train)

# Make predictions on the testing set
y_pred = model.predict(x_test)

predicted_locations = le.inverse_transform(y_pred)
print("Predicted locations:", predicted_locations)
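Since the number of receivers that hear a signal varies, a new reading has to be put into the same wide layout before you can predict on it. Below is a rough sketch of one way to do that with reindex. The RECEIVERS list, the "Region B" rows and the new reading are all made up for illustration, and the Time column is left out to keep the example short.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

RECEIVERS = ["Red", "Blue", "Green", "Yellow"]  # assumed full list of receivers

# Hypothetical wide-format training data; the "Region B" rows are invented.
train = pd.DataFrame(
    [
        [182, 254, 208, 0, "Region A"],
        [192, 254, 215, 0, "Region A"],
        [202, 254, 207, 17, "Region A"],
        [189, 254, 225, 16, "Region A"],
        [20, 0, 35, 240, "Region B"],
        [25, 0, 30, 235, "Region B"],
        [18, 0, 40, 245, "Region B"],
        [22, 0, 28, 238, "Region B"],
    ],
    columns=RECEIVERS + ["Location"],
)

le = LabelEncoder()
y = le.fit_transform(train["Location"])
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(train[RECEIVERS], y)

# A new one-second reading: only three receivers heard the signal.
reading = pd.Series({"Red": 195, "Blue": 253, "Green": 211})

# reindex adds the silent receivers as 0 and puts the columns in training order.
features = reading.reindex(RECEIVERS, fill_value=0).to_frame().T

predicted = le.inverse_transform(model.predict(features))
print(predicted)
```

Keeping a fixed receiver list also guards against a receiver that never appears in a given batch of data, which would otherwise be missing as a column after the pivot.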