Selection over different columns after a groupby


I am new to pandas, so please bear with me. I have a DataFrame with year, state, and population data collected over many years and across many states.

For each year, I want to find the maximum population and the corresponding state.

example:

1995 Alabama xx; 1996 New York yy; 1997 Utah zz

I did a groupby and got the population for every state in each year. How do I iterate over all the years?

state_yearwise = df.groupby(["Year", "State"])["Pop"].max()
state_yearwise.head(10)
1990  Alabama        22.5
      Arizona        29.4
      Arkansas       16.2
      California     34.1

2016 South Dakota     14.1
     Tennessee        10.2
     Texas            17.4
     Utah             16.1

Then I tried:

df.loc[df["Pop"] == df["Pop"].max(), ["Year", "State", "Pop"]]

1992    Colorado  54.1

This gives me only one row: the maximum over all years and states. What I want is, for each year, the state that had the maximum population.

Suggestions?

There are 3 best solutions below

4
On

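Is this what you want? Note that it returns the row with the largest pop for each state; to get one row per year, group by "year" instead.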

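import pandas as pd

df = pd.DataFrame([{'state' : 'A', 'year' : 2000, 'pop' : 100},
    {'state' : 'A', 'year' : 2001, 'pop' : 110},
    {'state' : 'B', 'year' : 2000, 'pop' : 210},
    {'state' : 'B', 'year' : 2001, 'pop' : 200}])
# Largest pop per state, kept as a regular column rather than an index
maxpop = df.groupby("state",as_index=False)["pop"].max()
# Inner merge on the shared columns (state, pop) recovers the full matching rows
pd.merge(maxpop,df,how='inner')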

I see for df:

    pop state year
0   100 A     2000
1   110 A     2001
2   210 B     2000
3   200 B     2001

And for the final result:

  state pop year
0   A   110 2001
1   B   210 2000
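
The question asks for the top state per year rather than the top year per state. A minimal sketch of the same groupby-plus-merge idea keyed on year, assuming the toy columns state, year and pop from the example above (the names maxpop_by_year and per_year are made up for illustration):

```python
import pandas as pd

# Same toy data as in the example above
df = pd.DataFrame([
    {"state": "A", "year": 2000, "pop": 100},
    {"state": "A", "year": 2001, "pop": 110},
    {"state": "B", "year": 2000, "pop": 210},
    {"state": "B", "year": 2001, "pop": 200},
])

# Largest pop per year, one row per year
maxpop_by_year = df.groupby("year", as_index=False)["pop"].max()

# Inner merge on (year, pop) pulls the matching state back in
per_year = pd.merge(maxpop_by_year, df, on=["year", "pop"], how="inner")
print(per_year)
```

If two states tie for the maximum in a year, the merge keeps both rows.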

2
On

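You can use transform to broadcast each year's maximum back onto every row of that year, then compare it with pop. The result is a boolean mask that is True on the rows holding the yearly maximum. Passing the string 'max' rather than the built-in max avoids a deprecation warning in recent pandas versions.

idx = df.groupby(['year'])['pop'].transform('max') == df['pop']

Now you can index df with the mask: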

df[idx]

You get

   pop state  year
2  210     B  2000
3  200     B  2001

For the other DataFrame that you added to the question:

   Year        State  County   Pop
0  2015  Mississippi  Panola   6.4
1  2015  Mississippi  Newton   6.7
2  2015  Mississippi  Newton   6.7
3  2015         Utah  Monroe  12.1
4  2013      Alabama  Newton  10.4
5  2013      Alabama  Georgi   4.2

idx = df.groupby(['Year'])['Pop'].transform('max') == df['Pop']

df[idx]

You get

   Year    State  County   Pop
3  2015     Utah  Monroe  12.1
4  2013  Alabama  Newton  10.4
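
The mask approach keeps every row that ties for the yearly maximum. If exactly one row per year is wanted, a different technique, idxmax, may be a better fit. This is only a sketch on the sample frame above; the names top_idx and top_rows are made up for illustration:

```python
import pandas as pd

# Sample frame copied from the table above
df = pd.DataFrame({
    "Year": [2015, 2015, 2015, 2015, 2013, 2013],
    "State": ["Mississippi", "Mississippi", "Mississippi",
              "Utah", "Alabama", "Alabama"],
    "County": ["Panola", "Newton", "Newton", "Monroe", "Newton", "Georgi"],
    "Pop": [6.4, 6.7, 6.7, 12.1, 10.4, 4.2],
})

# Index label of the largest Pop within each year (first occurrence on ties)
top_idx = df.groupby("Year")["Pop"].idxmax()

# Select those rows: exactly one row per year, ordered by year
top_rows = df.loc[top_idx]
print(top_rows)
```

Unlike the boolean mask, the rows come back ordered by year rather than in their original order.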
0
On

Why not get rid of the groupby? You can use sort_values and drop_duplicates instead:

df.sort_values(['state','pop']).drop_duplicates('state',keep='last')
Out[164]: 
   pop state  year
1  110     A  2001
2  210     B  2000
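
The same trick can be pointed at the per-year requirement from the question by sorting and de-duplicating on year instead of state. A minimal sketch, assuming the toy columns state, year and pop (per_year is a made-up name):

```python
import pandas as pd

# Toy data with the same values as the example frame
df = pd.DataFrame({
    "state": ["A", "A", "B", "B"],
    "year": [2000, 2001, 2000, 2001],
    "pop": [100, 110, 210, 200],
})

# Sort so the largest pop is the last row within each year,
# then keep only that last row per year
per_year = df.sort_values(["year", "pop"]).drop_duplicates("year", keep="last")
print(per_year)
```

On ties this keeps a single row per year, whichever of the tied rows sorts last.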