How to group lists and evaluate mean square error?

I'm writing a custom metric function, and here are the steps I implemented:

  1. I have a list of floats in preds and a list of int 0/1 values in target
  2. I round preds
  3. I need to group by the rounded preds
  4. Compute the mean target value for each group of preds
  5. Compute the MSE between the grouped preds and their mean target values

This is what df looks like before the groupby:

[image: df before groupby]

import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error

rounded = [np.round(x, 2) for x in preds]

df = pd.DataFrame({'target': target, 'preds': rounded})

df = df.groupby('preds')['target'].mean().to_frame().reset_index()

mse = mean_squared_error(df['target'], df['preds'])

And this is what it looks like after groupby and mean() (I can't display the groupby result itself properly):

[image: df after groupby and mean()]

Basically, I don't know how to do a groupby on two Python lists.

I did a groupby on one list like this:

from itertools import groupby
gr_list = [list(j) for i, j in groupby(rounded)]  # groups consecutive equal values only

But I have no clue how to group the second list based on the grouping in gr_list.

There are 2 best solutions below

Michael On

Not the cleanest code, but I managed to do it like this:

from collections import defaultdict

d = defaultdict(list)
for i, item in enumerate(rounded): # rounded is rounded preds
    d[item].append(target[i])

meanDict = {}
for k,v in d.items():
    meanDict[k] = sum(v)/ float(len(v))

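from sklearn.metrics import mean_squared_error

# keys are the rounded preds, values are the mean targets per group
keys, values = zip(*meanDict.items())

mse = mean_squared_error(values, keys)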
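
If you would rather stick with itertools.groupby as in the question, here is a minimal sketch of the same idea. It assumes you pair the two lists with zip and sort the pairs first, since itertools.groupby only merges consecutive equal keys. The toy data is made up for illustration, and the MSE is computed by hand instead of with sklearn:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical toy data standing in for the rounded preds and the 0/1 target
rounded = [0.1, 0.2, 0.1, 0.2, 0.3]
target = [1, 0, 0, 1, 1]

# itertools.groupby only merges consecutive equal keys, so sort the pairs by pred first
pairs = sorted(zip(rounded, target), key=itemgetter(0))

# Mean target per distinct rounded pred
means = {}
for pred, group in groupby(pairs, key=itemgetter(0)):
    targets = [t for _, t in group]
    means[pred] = sum(targets) / len(targets)

# Plain-Python MSE between each pred and its mean target
mse = sum((pred - mean) ** 2 for pred, mean in means.items()) / len(means)
print(means, mse)
```

Because each pred travels together with its target inside a pair, there is no need to group the second list separately.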
Laurent On

Here is a reproducible example of a more idiomatic way to do what you are trying to achieve, if I understand correctly:

import random
import pandas as pd

preds = [random.random() for _ in range(1_000)]
target = [random.randint(0, 1) for _ in range(1_000)]

df = pd.DataFrame({"preds": preds, "target": target})

# Steps 1 to 4 of your post
# ("mean" as a string avoids the FutureWarning that recent pandas versions raise for agg(np.mean))
df = df.round({"preds": 2}).groupby("preds").agg("mean").reset_index()

print(df)
# Output
     preds    target
0     0.00  1.000000
1     0.01  0.555556
2     0.02  0.375000
3     0.03  0.375000
4     0.04  0.416667
..     ...       ...
96    0.96  0.666667
97    0.97  0.500000
98    0.98  0.375000
99    0.99  0.461538
100   1.00  0.285714

from sklearn.metrics import mean_squared_error

# Step 5
print(mean_squared_error(df["preds"], df["target"]))  # e.g. 0.1084811098077257 (varies per run, no seed is set)
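
Since the question is about a custom metric, the steps above could be wrapped into one function. This is only a sketch: the name grouped_mse, the decimals parameter and the toy data are my own, and the MSE is computed directly with pandas, which should give the same number as mean_squared_error on the two columns:

```python
import pandas as pd


def grouped_mse(preds, target, decimals=2):
    """MSE between each rounded prediction and the mean target of its group."""
    df = pd.DataFrame({"preds": preds, "target": target})
    grouped = (
        df.round({"preds": decimals})
        .groupby("preds", as_index=False)["target"]
        .mean()
    )
    # Mean of squared differences between rounded preds and their mean target
    return float(((grouped["preds"] - grouped["target"]) ** 2).mean())


# Hypothetical toy data, chosen so that the groups are easy to check by hand
preds = [0.101, 0.099, 0.502, 0.498, 0.9]
target = [0, 1, 1, 0, 1]
print(grouped_mse(preds, target))
```

With this toy data the rounded preds fall into three groups (0.1, 0.5 and 0.9), so the result is the average of three squared differences.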