I am trying to understand the Pandas Groupby, but I'm currently seeing some behavior I don't understand. Basically, I have a dataset that looks like (only head shown):
userId movieId rating timestamp parsed_time
0 1 2 3.5 1112486027 2005-04-02 23:53:47
1 1 29 3.5 1112484676 2005-04-02 23:31:16
2 1 32 3.5 1112484819 2005-04-02 23:33:39
3 1 47 3.5 1112484727 2005-04-02 23:32:07
4 1 50 3.5 1112484580 2005-04-02 23:29:40
I have checked the dataset for NaN/null values, and there are none. Now, I would like to compute the average rating of each movie, as well as the standard deviation.
Getting the average rating is simple:
ratings = pd.read_csv('ratings.csv', sep=',')
average_rating = ratings[['movieId','rating']].groupby('movieId',as_index=False).mean()
average_ratings.rename(columns={'rating':'AverageRating'}, inplace=True)
which gives me something like:
movieId AverageRating
0 1 3.921240
1 2 3.211977
2 3 3.151040
3 4 2.861393
4 5 3.064592
So this is all fine and well, and what I expect from the combination of groupby()
and mean()
.
Now, I would like to do the same to compute the standard deviation of the movie ratings, and add this as a new column to the average_rating
df:
average_rating['StdDev'] = ratings[['movieId','rating']].groupby('movieId').std()
which gives me:
movieId AverageRating StdDev
0 1 3.921240 NaN
1 2 3.211977 0.889012
2 3 3.151040 0.951150
3 4 2.861393 1.006642
4 5 3.064592 1.095702
What puzzles me here, is the NaN that appears as the first entry in my StdDev column. If I extract manually the rows of, say movieId [1,2] and compute the mean and standard deviation just for those:
print('Mean movieID 1:')
print(ratings[ratings['movieId']==1]['rating'].mean())
print('StdDev movieID 1:')
print(ratings[ratings['movieId']==1]['rating'].std())
print('Mean movieID:')
print(ratings[ratings['movieId']==2]['rating'].mean())
print('StdDev movieID 2:')
print(ratings[ratings['movieId']==2]['rating'].std())
I get returned:
Mean movieID 1:
3.921240
StdDev movieID 1:
0.889012
Mean movieID 2:
3.211977
StdDev movieID 2:
0.951150
So to me it looks like the groupby.std()
for some reason skips the first index, replaces that with a NaN, and then fill in the correct values, but shifted by one index. I do not understand this behavior, and it's not what I would expect. Can anyone explain this behavior to me on the second use of groupby, and how to avoid it/get it to do what I wanted?
The problem happens not during the computation of the standard deviation, but when assigning the result to the new column
StdDev
. The is because pandas does assignment by index, implicitly.The code below should work because the result of both
groupby
operations is indexed onmovieId
:Of course, you should do both in one go:
More elegant (or at least more standard):