KeyError when trying to access a newly assigned column in a pandas dataframe

None of the solutions in other KeyError posts addressed my problem, hence this question:

I have the following column in a Pandas DataFrame:

df['EventDate']

0        26-12-2016
1        23-12-2016
2        16-12-2016
3        15-12-2016
4        11-12-2016
5        10-12-2016
6        07-12-2016

I am trying to split each date and extract the four-digit year into another Series, using the command below:

trial=df["EventDate"].str.split("-",n=2,expand=True)

Column 2 of the result (the third column) holds the year values, so I assign it:

df.year=trial[2]

Checking the data type of the year column now:

type(df.year)
Out[80]: pandas.core.series.Series

Yes, it is a Pandas Series, assigned from trial[2] to df.year:

print(trial[2])
0        2016
1        2016
2        2016
3        2016
4        2016

Now I am trying to group by the year column, and that is where I get the error:

yearwise=df.groupby('year')

Traceback (most recent call last):

  File "<ipython-input-81-cf39b80933c4>", line 1, in <module>
    yearwise=df.groupby('year')

  File "C:\WINPYTH\python-3.5.4.amd64\lib\site-packages\pandas\core\generic.py", line 4416, in groupby
    **kwargs)

  File "C:\WINPYTH\python-3.5.4.amd64\lib\site-packages\pandas\core\groupby.py", line 1699, in groupby
    return klass(obj, by, **kwds)

  File "C:\WINPYTH\python-3.5.4.amd64\lib\site-packages\pandas\core\groupby.py", line 392, in __init__
    mutated=self.mutated)

  File "C:\WINPYTH\python-3.5.4.amd64\lib\site-packages\pandas\core\groupby.py", line 2690, in _get_grouper
    raise KeyError(gpr)

KeyError: 'year'

Can you please help me resolve this KeyError and group the DataFrame by the year column?

A THOUSAND thanks in advance for your answers.

Best answer

The fundamental misunderstanding here is that you think doing

df.year = ...

creates a column called year in df, but this is not true! Observe:

print(df)

         Col1
0  26-12-2016
1  23-12-2016
2  16-12-2016
3  15-12-2016
4  11-12-2016
5  10-12-2016
6  07-12-2016

df.year = df.Col1.str.split('-', n=2, expand=True)[2]

print(type(df.year))
<class 'pandas.core.series.Series'>

print(df) # where's 'year'??

         Col1
0  26-12-2016
1  23-12-2016
2  16-12-2016
3  15-12-2016
4  11-12-2016
5  10-12-2016
6  07-12-2016

So, what is df.year? It is an attribute of the df object, which is not the same as a column. In Python, you can assign attributes to most objects using dot notation, so this runs without raising an error. You can confirm this by printing df.__dict__:

print(df.__dict__)

{ ...
 'year': 0    2016
 1    2016
 2    2016
 3    2016
 4    2016
 5    2016
 6    2016
 Name: 2, dtype: object}
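
A quick way to check this yourself is to compare df.columns with df.__dict__ right after the dot assignment. The sketch below rebuilds a small frame like the one above (Col1 and the sample dates come from this answer; the warnings filter is my own addition, since recent pandas versions emit a UserWarning on this kind of assignment):

```python
import warnings

import pandas as pd

# Small frame shaped like the example above (first three rows only).
df = pd.DataFrame({"Col1": ["26-12-2016", "23-12-2016", "16-12-2016"]})

with warnings.catch_warnings():
    # Silence the UserWarning that recent pandas versions raise here,
    # just to keep the output clean.
    warnings.simplefilter("ignore")
    # Dot assignment: sets a plain Python attribute, not a column.
    df.year = df["Col1"].str.split("-", n=2, expand=True)[2]

is_column = "year" in df.columns      # is there a column named 'year'?
is_attribute = "year" in df.__dict__  # is it stored on the object instead?

print(is_column, is_attribute)
```

If is_column comes back False while is_attribute is True, the Series is attached to the object but is invisible to column-based methods such as groupby, which is what the KeyError is reporting.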

If you want to actually assign to a column, you'll need to use the [...] indexing syntax, like this:

df['year'] = df.Col1.str.split('-', 2, expand=True)[2]
print(df)

         Col1  year
0  26-12-2016  2016
1  23-12-2016  2016
2  16-12-2016  2016
3  15-12-2016  2016
4  11-12-2016  2016
5  10-12-2016  2016
6  07-12-2016  2016
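
With year stored as a real column, the groupby from the question should go through. Here is a minimal sketch, assuming the EventDate column name from the question and a few of its sample rows; the 2015 date is made up purely so that there is more than one group, and the pd.to_datetime variant is an alternative I am suggesting rather than something from the original post:

```python
import pandas as pd

# Three dates from the question plus one invented 2015 date (illustrative only).
df = pd.DataFrame(
    {"EventDate": ["26-12-2016", "23-12-2016", "16-12-2016", "30-11-2015"]}
)

# Bracket assignment creates an actual column, so groupby can find it.
df["year"] = df["EventDate"].str.split("-", n=2, expand=True)[2]
yearwise = df.groupby("year")
counts = yearwise.size()  # number of rows per year (years are strings here)
print(counts)

# Alternative: parse the dates and take the year as an integer.
parsed = pd.to_datetime(df["EventDate"], format="%d-%m-%Y")
df["year_int"] = parsed.dt.year
int_counts = df.groupby("year_int").size()
print(int_counts)
```

The string-splitting version keeps the year as text, while the to_datetime version gives integers and also validates the dates, which may or may not matter for your data.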