I often use df.loc[:,'col'] = arr to reassign columns rather than df['col'] = arr. This was recommended practice because, prior to the Copy-on-Write changes in pandas 2.0, we couldn't guarantee whether a view or a copy was returned in various cases, and using df['col'] can sometimes lead to accidental chained assignments and the infamous SettingWithCopyWarning [example].
However, since the changes to in-place operation when setting with .loc and .iloc were implemented in pandas 1.5.0, I've seen inconsistent behavior using df.loc[:,'col'] = new_arr: the code executes without any warnings or errors, but the column dtype is not modified as expected when trying to cast a column to a different datatype.
For example, I have a dataframe weather_df, which reads in the 'year' column with a dtype of 'float64' (there are missing values that default to NaN).
import pandas as pd
weather_df = pd.read_csv(weather_file)
I want to fill missing values with 0, and cast the column dtype to 'int32' instead.
The following code executes silently and replaces the NaN values with 0, but does not modify the dtype (weather_df['year'].dtype remains float64) in pandas 1.5+:
weather_df.loc[:,'year'] = weather_df['year'].fillna(0).astype('int32')
Frustratingly, the following code does modify the dtype of the weather_df 'year' column in pandas 1.5+, despite previously not being recommended practice:
weather_df['year'] = weather_df['year'].fillna(0).astype('int32')
Before the recent changes (1.?-1.4), both of these lines made equivalent updates to weather_df. My understanding from the documentation was that both lines should continue to work since I am setting the entire column: pandas should try the operation in place first, then fall back to casting when the in-place operation fails due to mismatching types (only because I am replacing the entire column). But that is not happening for .loc[:,'col'].
The following simplified example reproduces the problem for me...
Version 1.?-1.4
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'col1':[20.,19.5,21.,24.,23.,22.], 'col2':[2020.,2021.,2022.,2023.,np.nan,2019.]})
>>> print(df['col2'].dtype)
float64
>>> df.loc[:,'col2'] = df['col2'].fillna(0).astype('int32')
>>> print(df['col2'].dtype)
**int32**
>>> df = pd.DataFrame({'col1':[20.,19.5,21.,24.,23.,22.], 'col2':[2020.,2021.,2022.,2023.,np.nan,2019.]})
>>> df['col2'] = df['col2'].fillna(0).astype('int32')
>>> print(df['col2'].dtype)
int32
Version 1.5+
>>> df = pd.DataFrame({'col1':[20.,19.5,21.,24.,23.,22.], 'col2':[2020.,2021.,2022.,2023.,np.nan,2019.]})
>>> print(df['col2'].dtype)
float64
>>> df.loc[:,'col2'] = df['col2'].fillna(0).astype('int32')
>>> print(df['col2'].dtype)
**float64**
>>> df = pd.DataFrame({'col1':[20.,19.5,21.,24.,23.,22.], 'col2':[2020.,2021.,2022.,2023.,np.nan,2019.]})
>>> df['col2'] = df['col2'].fillna(0).astype('int32')
>>> print(df['col2'].dtype)
int32
Is this the intended behavior from now on, or is this a bug? Is there a workaround to use .loc and get the old behavior, or should I just forget the old advice about using .loc to avoid chained assignments?
Edit: I am running this on Linux x86_64 using Python version 3.8.3, and the problem occurs for any pandas version from 1.5.0 to 2.0.3 (latest). I've also clarified that fillna seems to work in both cases, but the typecast does not work using .loc.
Second edit: I have further clarified that this is not a post about the SettingWithCopyWarning. No warnings are raised by the example code. It's just that pandas 1.5+ causes the behavior of the two lines to diverge, because df.loc[:,'col2'] = df['col2'].fillna(0).astype('int32') no longer changes the dtype of 'col2', as it did before pandas 1.5. With pandas versions 1.? to 1.4, the second print statement of the example code outputs 'int32'. This could cause bugs (as it did for me) for anyone updating pandas from an older version.
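One way to catch this kind of silent divergence when upgrading is to check the dtype right after the cast. This is only a minimal sketch: cast_column is a hypothetical helper name I made up for illustration, not a pandas API, and it assumes that assigning with [] replaces the column as described above.

```python
import numpy as np
import pandas as pd


def cast_column(df, col, dtype, fill_value=0):
    """Hypothetical helper (not part of pandas): fill NaN, cast, and verify."""
    # Assign with [] so the column is replaced rather than set in place.
    df[col] = df[col].fillna(fill_value).astype(dtype)
    # Fail loudly instead of silently keeping the old dtype.
    if df[col].dtype != np.dtype(dtype):
        raise TypeError(f"{col!r} is {df[col].dtype}, expected {dtype}")
    return df


df = pd.DataFrame({"col2": [2020.0, 2021.0, np.nan, 2019.0]})
df = cast_column(df, "col2", "int32")
print(df["col2"].dtype)
```

The check costs one comparison per cast and turns a silent dtype mismatch into an immediate error.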
There is a subtle note about this in the docs under astype():
The same issue was raised for older versions of pandas (see this post about pandas 0.18). At some point (I have not been able to identify when), this was changed such that df.loc[:,'col'] and df['col'] produced the same result when using .astype() to cast a column to a new type. I know this because pandas versions 1.2-1.4 (at least) produce the result I expected, where df['col'].dtype was updated regardless of whether df.loc[:,'col'] or df['col'] was used on the left-hand side of the assignment.
However, with pandas versions 1.5-2.0.3 (Windows/Linux, Python 3.8), this changed again such that the loc() and astype() interaction follows the note in the docs (see this post regarding pandas 2.0.0).
In general, unless you are trying to cast your column to a new datatype using astype(), you can use loc() with a row mask of : to make assignments, but for type casting you will need to use [] for now.
I feel like this behavior should raise a warning to inform the user that the intended typecast fails using .loc[]. I will raise an issue on the pandas-dev GitHub and update this answer with a link.
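For anyone who needs the dtype to change, here is a minimal sketch of the options I'm aware of, using the example frame from the question. make_df is just a hypothetical helper for this sketch. DataFrame.isetitem only exists in pandas 1.5 and later, so that option assumes a recent version. The .loc line is included only for comparison, and what it prints depends on the installed pandas version.

```python
import numpy as np
import pandas as pd


def make_df():
    # Hypothetical helper: rebuilds the example frame from the question.
    return pd.DataFrame({
        "col1": [20.0, 19.5, 21.0, 24.0, 23.0, 22.0],
        "col2": [2020.0, 2021.0, 2022.0, 2023.0, np.nan, 2019.0],
    })


# For comparison: .loc sets values into the existing column.
# The dtype printed here depends on the pandas version
# (float64 on 1.5+ according to the question).
df_loc = make_df()
df_loc.loc[:, "col2"] = df_loc["col2"].fillna(0).astype("int32")
print("loc:      ", df_loc["col2"].dtype)

# Option 1: [] replaces the column, taking the dtype of the right-hand side.
df_brackets = make_df()
df_brackets["col2"] = df_brackets["col2"].fillna(0).astype("int32")
print("[]:       ", df_brackets["col2"].dtype)

# Option 2: isetitem (pandas 1.5+) replaces the column at an integer position.
df_isetitem = make_df()
position = df_isetitem.columns.get_loc("col2")
df_isetitem.isetitem(position, df_isetitem["col2"].fillna(0).astype("int32"))
print("isetitem: ", df_isetitem["col2"].dtype)

# Option 3: fill and cast on the frame itself, which returns a new DataFrame.
df_astype = make_df().fillna({"col2": 0}).astype({"col2": "int32"})
print("astype:   ", df_astype["col2"].dtype)
```

Options 1 and 2 modify the existing DataFrame object, while option 3 returns a new one, so it has to be assigned back to a variable.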