Issue setting entire column (and changing dtype) with .loc[:,'col'] in pandas 1.5+

I often use df.loc[:,'col'] = arr to reassign columns rather than df['col'] = arr. This was a recommended practice because, prior to the Copy-on-Write changes in pandas 2.0, we couldn't guarantee whether a view or a copy was returned in various cases, and using df['col'] can sometimes lead to accidental chained assignments and the infamous SettingWithCopyWarning [example].

However, since the changes to in-place operation when setting with .loc and .iloc were implemented in pandas 1.5.0, df.loc[:,'col'] = new_arr has behaved inconsistently for me: the code executes without any warnings or errors, but the column dtype is not modified as expected when I try to cast a column to a different datatype.

For example, I have a dataframe weather_df, which reads in the 'year' column with a dtype of 'float64' (there are missing values that default to NaN).

import pandas as pd
weather_df = pd.read_csv(weather_file)

I want to fill missing values with 0, and cast the column dtype to 'int32' instead.

The following code executes silently and replaces the NaN values with 0, but does not modify the type (weather_df['year'].dtype remains float64) for pandas 1.5+:

weather_df.loc[:,'year'] = weather_df['year'].fillna(0).astype('int32')

Frustratingly, the following code does modify the datatype of the weather_df 'year' column values in pandas 1.5+, despite previously being not recommended practice:

weather_df['year'] = weather_df['year'].fillna(0).astype('int32')

Before the recent changes (1.-1.4), both of these lines made equivalent updates to weather_df. My understanding from the documentation was that both of these lines should continue to work since I am setting the entire column. It should try to do the operation in place first, then fall back to casting when the in-place operation fails due to mismatching types (only because I am replacing the entire column), but that is not happening for .loc[:,'col'].

The following simplified example reproduces the problem for me...

Version 1.?-1.4

>>> import numpy as np
>>> df = pd.DataFrame({'col1':[20.,19.5,21.,24.,23.,22.], 'col2':[2020.,2021.,2022.,2023.,np.nan,2019.]})
>>> print(df['col2'].dtype)
float64
>>> df.loc[:,'col2'] = df['col2'].fillna(0).astype('int32')
>>> print(df['col2'].dtype)
**int32**
>>> df = pd.DataFrame({'col1':[20.,19.5,21.,24.,23.,22.], 'col2':[2020.,2021.,2022.,2023.,np.nan,2019.]})
>>> df['col2'] = df['col2'].fillna(0).astype('int32')
>>> print(df['col2'].dtype)
int32

Version 1.5+

>>> df = pd.DataFrame({'col1':[20.,19.5,21.,24.,23.,22.], 'col2':[2020.,2021.,2022.,2023.,np.nan,2019.]})
>>> print(df['col2'].dtype)
float64
>>> df.loc[:,'col2'] = df['col2'].fillna(0).astype('int32')
>>> print(df['col2'].dtype)
**float64**
>>> df = pd.DataFrame({'col1':[20.,19.5,21.,24.,23.,22.], 'col2':[2020.,2021.,2022.,2023.,np.nan,2019.]})
>>> df['col2'] = df['col2'].fillna(0).astype('int32')
>>> print(df['col2'].dtype)
int32

Is this the intended behavior from now on, or is this a bug? Is there a workaround to use .loc and get the old behavior, or should I just forget the old advice about using .loc to avoid chained assignments?

Edit: I am running this on Linux x86_64 using Python version 3.8.3, and the error occurs for any version 1.5.0 to 2.0.3 (latest). I've also clarified that fillna seems to work in both cases, but the typecast does not work using .loc.

Second edit: I have further clarified that this is not a post about the SettingWithCopyWarning. No warnings are raised by the example code. It's just that pandas 1.5+ causes the behavior of the two lines to diverge, because df.loc[:,'col2'] = df['col2'].fillna(0).astype('int32') no longer changes the dtype of 'col2' as it did before pandas version 1.5. With pandas version 1.something to 1.4, the second print statement of the example code prints 'int32'. This could cause bugs (as it did for me) for anyone updating pandas from an older version.

There are 1 best solutions below

There is a subtle note about this in the docs under astype():

When trying to convert a subset of columns to a specified type using astype() and loc(), upcasting occurs.

loc() tries to fit in what we are assigning to the current dtypes, while [] will overwrite them taking the dtype from the right hand side.

The same issue was raised for older versions of pandas (see this post about pandas 0.18). At some point (I have not been able to identify when), this was changed such that df.loc[:,'col'] and df['col'] produced the same result when using .astype() to cast a column to a new type. I know this because pandas versions 1.2-1.4 (at least) produce the result I expected where df['col'].dtype was updated regardless of whether df.loc[:,'col'] or df['col'] was used on the left hand side of the assignment.

However, with pandas version 1.5-2.0.3 (Windows/Linux, Python 3.8), this changed again such that the loc() and astype() interaction follows the note in the docs (see this post regarding pandas 2.0.0).
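To check which behavior a given installation has, a small script along these lines may help. It is only a sketch built on the example frame from the question; make_df is a helper name made up here. The dtype reported for the .loc case depends on the installed pandas version, so no expected output is shown for it.

```python
import numpy as np
import pandas as pd


def make_df():
    # Hypothetical helper: rebuilds the example frame from the question.
    return pd.DataFrame({
        "col1": [20.0, 19.5, 21.0, 24.0, 23.0, 22.0],
        "col2": [2020.0, 2021.0, 2022.0, 2023.0, np.nan, 2019.0],
    })


# Assignment through .loc with a full row slice.
df_loc = make_df()
df_loc.loc[:, "col2"] = df_loc["col2"].fillna(0).astype("int32")

# Assignment through [] with the same right-hand side.
df_brackets = make_df()
df_brackets["col2"] = df_brackets["col2"].fillna(0).astype("int32")

print("pandas version:", pd.__version__)
print(".loc dtype:", df_loc["col2"].dtype)      # version-dependent
print("[] dtype:  ", df_brackets["col2"].dtype)
```

In both cases the NaN is replaced by 0; only the resulting dtype of the column can differ.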

In general, unless you are trying to cast your column to a new datatype using astype(), you can use loc() with a full row slice (:) to make assignments, but for type casting you will need to use [] for now.
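If you would rather not assign through [] at all, two alternatives that return a new DataFrame instead of setting into the existing one are DataFrame.astype with a per-column dict and DataFrame.assign. This is a sketch on the example frame from the question, not something the docs recommend for this specific case; the names df_a and df_b are made up here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": [20.0, 19.5, 21.0, 24.0, 23.0, 22.0],
    "col2": [2020.0, 2021.0, 2022.0, 2023.0, np.nan, 2019.0],
})

# Option 1: fill only 'col2', then cast only 'col2' via a dict of dtypes.
df_a = df.fillna({"col2": 0}).astype({"col2": "int32"})

# Option 2: build the new column and attach it with assign().
df_b = df.assign(col2=df["col2"].fillna(0).astype("int32"))

print(df_a.dtypes)
print(df_b.dtypes)
```

Both options leave the original df untouched, so the result has to be bound to a name (or back to df) to take effect.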

I feel like this behavior should raise a warning to inform the user that the intended typecast fails using .loc[]. I will raise an issue on the pandas-dev GitHub and update this answer with a link.