I am trying to understand the Pandas Groupby, but I'm currently seeing some behavior I don't understand. Basically, I have a dataset that looks like (only head shown):
userId movieId rating timestamp parsed_time
0 1 2 3.5 1112486027 2005-04-02 23:53:47
1 1 29 3.5 1112484676 2005-04-02 23:31:16
2 1 32 3.5 1112484819 2005-04-02 23:33:39
3 1 47 3.5 1112484727 2005-04-02 23:32:07
4 1 50 3.5 1112484580 2005-04-02 23:29:40
I have checked the dataset for NaN/null values, and there are none. Now, I would like to compute the average rating of each movie, as well as the standard deviation.
Getting the average rating is simple:
ratings = pd.read_csv('ratings.csv', sep=',')
average_rating = ratings[['movieId','rating']].groupby('movieId',as_index=False).mean()
average_ratings.rename(columns={'rating':'AverageRating'}, inplace=True)
which gives me something like:
movieId AverageRating
0 1 3.921240
1 2 3.211977
2 3 3.151040
3 4 2.861393
4 5 3.064592
So this is all fine and well, and what I expect from the combination of groupby() and mean().
Now, I would like to do the same to compute the standard deviation of the movie ratings, and add this as a new column to the average_rating df:
average_rating['StdDev'] = ratings[['movieId','rating']].groupby('movieId').std()
which gives me:
movieId AverageRating StdDev
0 1 3.921240 NaN
1 2 3.211977 0.889012
2 3 3.151040 0.951150
3 4 2.861393 1.006642
4 5 3.064592 1.095702
What puzzles me here, is the NaN that appears as the first entry in my StdDev column. If I extract manually the rows of, say movieId [1,2] and compute the mean and standard deviation just for those:
print('Mean movieID 1:')
print(ratings[ratings['movieId']==1]['rating'].mean())
print('StdDev movieID 1:')
print(ratings[ratings['movieId']==1]['rating'].std())
print('Mean movieID:')
print(ratings[ratings['movieId']==2]['rating'].mean())
print('StdDev movieID 2:')
print(ratings[ratings['movieId']==2]['rating'].std())
I get returned:
Mean movieID 1:
3.921240
StdDev movieID 1:
0.889012
Mean movieID 2:
3.211977
StdDev movieID 2:
0.951150
So to me it looks like the groupby.std() for some reason skips the first index, replaces that with a NaN, and then fill in the correct values, but shifted by one index. I do not understand this behavior, and it's not what I would expect. Can anyone explain this behavior to me on the second use of groupby, and how to avoid it/get it to do what I wanted?
The problem happens not during the computation of the standard deviation, but when assigning the result to the new column
StdDev. The is because pandas does assignment by index, implicitly.The code below should work because the result of both
groupbyoperations is indexed onmovieId:Of course, you should do both in one go:
More elegant (or at least more standard):