From my understanding, pandas.DataFrame.apply does not apply changes in place, and we should use its return value to persist any changes. However, I've found the following inconsistent behavior.
Let's apply a dummy function to check that the original df remains untouched:
>>> import pandas as pd
>>> def foo(row: pd.Series):
...     row['b'] = '42'
...
>>> df = pd.DataFrame([('a0','b0'),('a1','b1')], columns=['a', 'b'])
>>> df.apply(foo, axis=1)
>>> df
    a   b
0  a0  b0
1  a1  b1
This behaves as expected. However, foo will apply the changes in place if we change the way we initialize the df:
>>> df2 = pd.DataFrame(columns=['a', 'b'])
>>> df2['a'] = ['a0','a1']
>>> df2['b'] = ['b0','b1']
>>> df2.apply(foo, axis=1)
>>> df2
    a   b
0  a0  42
1  a1  42
I've also noticed that this in-place modification does not happen if the columns' dtype is something other than 'object'. Why does apply() behave differently in these two contexts?
Python: 3.6.5
Pandas: 0.23.1
Interesting question! I believe the behavior you're seeing is an artifact of the way you use apply.
As you correctly indicate, apply is not intended to be used to modify a dataframe. However, since apply takes an arbitrary function, it can't guarantee that the function is free of side effects and will not change the dataframe. Here, you've found a great example of that behavior, because your function foo attempts to modify the row that it is passed by apply.
Using apply to modify a row can lead to these side effects, and it isn't best practice.
Instead, consider the idiomatic approach for apply. The function is often used to create a new column: pandas passes a row or a cell to the function you give as the first argument to apply, and the function's output is then stored in a column of your choice. I believe using apply this way would steer you away from this potentially troublesome area.
If you'd like to modify a dataframe row by row, take a look at iterrows and loc for the most idiomatic route.
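As a rough sketch of the two patterns mentioned above (this is not the original example from the answer): the helper name combine, the new column 'c', and the condition used in the loop are all hypothetical and chosen only for illustration. The first part uses apply to build a new column from apply's return value; the second part makes an explicit in-place change with iterrows and loc.

```python
import pandas as pd

df = pd.DataFrame([("a0", "b0"), ("a1", "b1")], columns=["a", "b"])


def combine(row: pd.Series) -> str:
    # Hypothetical helper: reads from the row and returns a new value.
    # It never assigns into the row, so it has no side effects on df.
    return row["a"] + "_" + row["b"]


# Pattern 1: keep apply's return value and store it in a new column.
df["c"] = df.apply(combine, axis=1)

# Pattern 2: make an explicit in-place change, one row at a time.
# iterrows yields (index, row) pairs; the write goes through df.loc,
# not through the row object.
for idx, row in df.iterrows():
    if row["a"] == "a1":  # illustrative condition (an assumption)
        df.loc[idx, "b"] = "42"
```

Writing through df.loc makes the modification explicit, so the result should not depend on whether the row passed around happens to be a view or a copy of the underlying data.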