Standard deviation with groupby(multiple columns) Pandas

1.4k Views Asked by At

I am working with data from the California Air Resources Board.

site,monitor,date,start_hour,value,variable,units,quality,prelim,name 
5407,t,2014-01-01,0,3.00,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach 
5407,t,2014-01-01,1,1.54,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach 
5407,t,2014-01-01,2,3.76,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach 
5407,t,2014-01-01,3,5.98,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach 
5407,t,2014-01-01,4,8.09,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach 
5407,t,2014-01-01,5,12.05,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach 
5407,t,2014-01-01,6,12.55,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach 
...

df = pd.concat([pd.read_csv(file, header = 0) for file in f]) #merges all files into one dataframe
df.dropna(axis = 0, how = "all", subset = ['start_hour', 'variable'],
          inplace = True) #drops bottom columns without data in them, NaN

df.start_hour = pd.to_timedelta(df['start_hour'], unit = 'h')
df.date = pd.to_datetime(df.date)
df['datetime'] = df.date + df.start_hour
df.drop(columns=['date', 'start_hour'], inplace=True)
df['month'] = df.datetime.dt.month
df['day'] = df.datetime.dt.day
df['year'] = df.datetime.dt.year
df.set_index('datetime', inplace = True)
df =  df.rename(columns={'value':'conc'})

I have multiple years of hourly PM2.5 concentration data and am trying to prepare graphs that show the average monthly concentration over many years (different graphs for each month). Here's an image of the graph I've created thus far. [![Bombay Beach][1]][1] However, I want to add error bars to the average concentration line but I am having issues when attempting to calculate the standard deviation. I've created a new dataframe d_avg that includes the year, month, day, and average concentration of PM2.5; here's some of the data.

d_avg = df.groupby(['year', 'month', 'day'], as_index=False)['conc'].mean()
   year  month  day      conc
0  2014      1    1  9.644583
1  2014      1    2  4.945652
2  2014      1    3  4.345238
3  2014      1    4  5.047917
4  2014      1    5  5.212857
5  2014      1    6  2.095714

After this, I found the monthly average m_avg and created a datetime index to plot datetime vs monthly avg conc (refer above, black line).

m_avg = d_avg.groupby(['year','month'], as_index=False)['conc'].mean()
m_avg['datetime'] = pd.to_datetime(m_avg.year.astype(str) + m_avg.month.astype(str), format='%Y%m') + MonthEnd(1)
[In]: m_avg.head(6)
[Out]:
   year  month      conc   datetime
0  2014      1  4.330985 2014-01-31
1  2014      2  2.280096 2014-02-28
2  2014      3  4.464622 2014-03-31
3  2014      4  6.583759 2014-04-30
4  2014      5  9.069353 2014-05-31
5  2014      6  9.982330 2014-06-30

Now I want to calculate the standard deviation of the d_avg concentration, and I've tried multiple things:

sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].std()

sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].agg(np.std)

sd = d_avg['conc'].apply(lambda x: x.std())

However, each attempt has left me with the same error in the dataframe. I am unable to plot the standard deviation because I believe it is taking the standard deviation of the year and month too, which I am trying to group the data by. Here's what my resulting dataframe sd looks like:

        year     month        sd
0  44.877611  1.000000  1.795868
1  44.877611  1.414214  2.355055
2  44.877611  1.732051  2.597531
3  44.877611  2.000000  2.538749
4  44.877611  2.236068  5.456785
5  44.877611  2.449490  3.315546

Please help me! [1]: https://i.stack.imgur.com/ueVrG.png

2

There are 2 best solutions below

0
obscuredbyclouds On BEST ANSWER

I decided to dance around my issue since I couldn't figure out what was causing the problem. I merged the m_avg and sd dataframes and dropped the year and month columns that were causing me issues. See code below, lots of renaming.

d_avg = df.groupby(['year', 'month', 'day'], as_index=False)['conc'].mean()
m_avg = d_avg.groupby(['year','month'], as_index=False)['conc'].mean()
sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].std(ddof=0) 
sd = sd.rename(columns={"conc":"sd", "year":"wrongyr", "month":"wrongmth"})
m_avg_sd = pd.concat([m_avg, sd], axis = 1)
m_avg_sd.drop(columns=['wrongyr', 'wrongmth'], inplace = True)
m_avg_sd['datetime'] = pd.to_datetime(m_avg_sd.year.astype(str) + m_avg_sd.month.astype(str), format='%Y%m') + MonthEnd(1)

and here's the new dataframe:

m_avg_sd.head(5)
Out[2]: 
   year  month       conc         sd   datetime
0  2009      1  48.350105  18.394192 2009-01-31
1  2009      2  21.929383  16.293645 2009-02-28
2  2009      3  15.094729   6.821124 2009-03-31
3  2009      4  12.021009   4.391219 2009-04-30
4  2009      5  13.449100   4.081734 2009-05-31
5
DV82XL On

I tried to reproduce your error and it works fine for me. Here's my complete code sample, which is pretty much exactly the same as yours EXCEPT for the generation of the original dataframe. So I'd suspect that part of the code. Can you provide the code that creates the dataframe?

import pandas as pd

columns = ['year', 'month', 'day', 'conc']
data = [[2014, 1, 1, 2.0],
        [2014, 1, 1, 4.0],
        [2014, 1, 2, 6.0],
        [2014, 1, 2, 8.0],
        [2014, 2, 1, 2.0],
        [2014, 2, 1, 6.0],
        [2014, 2, 2, 10.0],
        [2014, 2, 2, 14.0]]

df = pd.DataFrame(data, columns=columns)
d_avg = df.groupby(['year', 'month', 'day'], as_index=False)['conc'].mean()
m_avg = d_avg.groupby(['year', 'month'], as_index=False)['conc'].mean()
m_std = d_avg.groupby(['year', 'month'], as_index=False)['conc'].std()

print(f'Concentrations:\n{df}\n')
print(f'Daily Average:\n{d_avg}\n')
print(f'Monthly Average:\n{m_avg}\n')
print(f'Standard Deviation:\n{m_std}\n')

Outputs:

Concentrations:
   year  month  day  conc
0  2014      1    1   2.0
1  2014      1    1   4.0
2  2014      1    2   6.0
3  2014      1    2   8.0
4  2014      2    1   2.0
5  2014      2    1   6.0
6  2014      2    2  10.0
7  2014      2    2  14.0

Daily Average:
   year  month  day  conc
0  2014      1    1   3.0
1  2014      1    2   7.0
2  2014      2    1   4.0
3  2014      2    2  12.0

Monthly Average:
   year  month  conc
0  2014      1   5.0
1  2014      2   8.0

Monthly Standard Deviation:
   year  month      conc
0  2014      1  2.828427
1  2014      2  5.656854