Why do dtypes not change when updating columns in Pandas 2.x, when they did change in Pandas 1.x?

When changing the values and/or dtypes of specific columns, Pandas 2.x behaves differently from Pandas 1.x.

For example, on column e in the example below:

  • Pandas 1.x: Using pd.to_datetime to update the column will parse the date and change its dtype
  • Pandas 2.x: Using pd.to_datetime to update the column will parse the date but will not change its dtype

What change from Pandas 1.x to 2.x explains this behavior?

Example code

import pandas as pd

# Creates example DataFrame
df = pd.DataFrame({
    'a': ['1', '2'],
    'b': ['1.0', '2.0'],
    'c': ['True', 'False'],
    'd': ['2024-03-07', '2024-03-06'],
    'e': ['07/03/2024', '06/03/2024'],
    'f': ['aa', 'bb'],
})

# Changes dtypes of existing columns
df.loc[:, 'a'] = df.a.astype('int')
df.loc[:, 'b'] = df.b.astype('float')
# Note: any non-empty string, including 'False', becomes True here
df.loc[:, 'c'] = df.c.astype('bool')

# Parses and changes dates dtypes
df.loc[:, 'd'] = pd.to_datetime(df.d)
df.loc[:, 'e'] = pd.to_datetime(df.e, format='%d/%m/%Y')

# Changes values of existing columns
df.loc[:, 'f'] = df.f + 'cc'

# Creates new column
df.loc[:, 'g'] = [1, 2]

Results in Pandas 1.5.2

In [2]: df
Out[2]: 
   a    b     c          d          e     f  g
0  1  1.0  True 2024-03-07 2024-03-07  aacc  1
1  2  2.0  True 2024-03-06 2024-03-06  bbcc  2

In [3]: df.dtypes
Out[3]: 
a             int64
b           float64
c              bool
d    datetime64[ns]
e    datetime64[ns]
f            object
g             int64
dtype: object

Results in Pandas 2.1.4

In [2]: df
Out[2]: 
   a    b     c                    d                    e     f  g
0  1  1.0  True  2024-03-07 00:00:00  2024-03-07 00:00:00  aacc  1
1  2  2.0  True  2024-03-06 00:00:00  2024-03-06 00:00:00  bbcc  2

In [3]: df.dtypes
Out[3]: 
a    object
b    object
c    object
d    object
e    object
f    object
g     int64
dtype: object

e-motta On BEST ANSWER

From What’s new in 2.0.0 (April 3, 2023):

Changed behavior in setting values with df.loc[:, foo] = bar or df.iloc[:, foo] = bar, these now always attempt to set values inplace before falling back to casting (GH 45333).

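So in Pandas 2+, setting values with df.loc[:, col] = ... first tries to write them into the existing column in place. If that succeeds, no new column is created and the existing column's dtype is preserved. In the example, columns a to f start out as object dtype, which can hold any Python value, so the in-place write always succeeds and the dtype stays object. Column g does not exist beforehand, so it is created with the inferred dtype int64.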
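A minimal sketch of this in-place behaviour, assuming pandas 2.x (the one-column frame and the names column_dtype and still_strings are only for illustration). The column is created with an explicit object dtype to mirror the string columns from the question:

```python
import pandas as pd

# One object-dtype column holding strings, like column 'a' in the question
df = pd.DataFrame({'a': ['1', '2']}, dtype=object)

# In pandas 2.x the converted values are written into the existing column
df.loc[:, 'a'] = df['a'].astype('int64')

# The column keeps its object dtype...
column_dtype = df['a'].dtype
# ...but the stored values are no longer the original strings
still_strings = any(isinstance(v, str) for v in df['a'])

print(column_dtype, still_strings)
```

The values themselves are converted; only the dtype of the column that holds them is left unchanged.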

Compare this with df[foo] = bar, which replaces the column with a new one whose dtype is inferred from the values being set. So even in Pandas 2+, df['d'] = pd.to_datetime(df.d) gives column d the dtype datetime64[ns].
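As a rough illustration of that alternative, the sketch below uses plain column assignment on a reduced version of the question's DataFrame (only columns a and d; the names a_is_int and d_is_datetime are made up for the example). The dtype checks go through pd.api.types so they do not depend on the exact datetime resolution:

```python
import pandas as pd

# Reduced version of the DataFrame from the question
df = pd.DataFrame({
    'a': ['1', '2'],
    'd': ['2024-03-07', '2024-03-06'],
}, dtype=object)

# df[col] = ... replaces the column, so the dtype of the new values is kept
df['a'] = df['a'].astype('int64')
df['d'] = pd.to_datetime(df['d'])

a_is_int = pd.api.types.is_integer_dtype(df['a'])
d_is_datetime = pd.api.types.is_datetime64_any_dtype(df['d'])

print(df.dtypes)
```

Replacing df.loc[:, col] = ... with df[col] = ... in the question's code should therefore restore the dtypes seen under Pandas 1.x.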