Apply functions to pandas groupby and indexing

328 Views Asked by At

I am trying to understand the Pandas Groupby, but I'm currently seeing some behavior I don't understand. Basically, I have a dataset that looks like (only head shown):

 userId movieId rating  timestamp   parsed_time
0   1   2       3.5     1112486027  2005-04-02 23:53:47
1   1   29      3.5     1112484676  2005-04-02 23:31:16
2   1   32      3.5     1112484819  2005-04-02 23:33:39
3   1   47      3.5     1112484727  2005-04-02 23:32:07
4   1   50      3.5     1112484580  2005-04-02 23:29:40

I have checked the dataset for NaN/null values, and there are none. Now, I would like to compute the average rating of each movie, as well as the standard deviation.

Getting the average rating is simple:

ratings = pd.read_csv('ratings.csv', sep=',')

average_rating = ratings[['movieId','rating']].groupby('movieId',as_index=False).mean()
average_ratings.rename(columns={'rating':'AverageRating'}, inplace=True)

which gives me something like:

 movieId    AverageRating
0   1     3.921240
1   2     3.211977
2   3     3.151040
3   4     2.861393
4   5     3.064592

So this is all fine and well, and what I expect from the combination of groupby() and mean(). Now, I would like to do the same to compute the standard deviation of the movie ratings, and add this as a new column to the average_rating df:

average_rating['StdDev'] = ratings[['movieId','rating']].groupby('movieId').std()

which gives me:

    movieId AverageRating   StdDev
0   1       3.921240    NaN
1   2       3.211977    0.889012
2   3       3.151040    0.951150
3   4       2.861393    1.006642
4   5       3.064592    1.095702

What puzzles me here, is the NaN that appears as the first entry in my StdDev column. If I extract manually the rows of, say movieId [1,2] and compute the mean and standard deviation just for those:

print('Mean movieID 1:')
print(ratings[ratings['movieId']==1]['rating'].mean())
print('StdDev movieID 1:')
print(ratings[ratings['movieId']==1]['rating'].std())
print('Mean movieID:')
print(ratings[ratings['movieId']==2]['rating'].mean())
print('StdDev movieID 2:')
print(ratings[ratings['movieId']==2]['rating'].std())

I get returned:

Mean movieID 1:
3.921240
StdDev movieID 1:
0.889012
Mean movieID 2:
3.211977
StdDev movieID 2:
0.951150

So to me it looks like the groupby.std() for some reason skips the first index, replaces that with a NaN, and then fill in the correct values, but shifted by one index. I do not understand this behavior, and it's not what I would expect. Can anyone explain this behavior to me on the second use of groupby, and how to avoid it/get it to do what I wanted?

2

There are 2 best solutions below

3
On BEST ANSWER

The problem happens not during the computation of the standard deviation, but when assigning the result to the new column StdDev. The is because pandas does assignment by index, implicitly.

The code below should work because the result of both groupby operations is indexed on movieId:

# note how I remove as_index=False
average_rating = ratings[['movieId','rating']].groupby('movieId').mean()
average_rating['StdDev'] = ratings[['movieId','rating']].groupby('movieId').std()

Of course, you should do both in one go:

ratings[['movieId','rating']].groupby('movieId').agg(['mean', 'std'])

More elegant (or at least more standard):

ratings.groupby('movieId')['rating'].agg(['mean', 'std'])
1
On

The key here is that in your first groupby, you included as_index=False, so the created df has a new serial index. In your secong groupby, you don't include the as_index argument, so it uses MovieID as the index.

When you then assign this as a column in average_ratings the indexes aren't referring to the same thing.

It just so happens in this case that it looks like your index has been shifted, because you have MovieIDs 1-5, and an integer index 0-4. The null value in the StdDev column merely reflects the fact that there isn't a Movie with ID = 0.