Why does loc assignment fail unless I convert with to_numpy()?

I am using this csv.

import pandas as pd
import numpy as np

real_estate = pd.read_csv('real_estate.csv',index_col=0)

buckets = pd.cut(real_estate['X2 house age'],4,labels=False)

for i in range(len(real_estate['X2 house age'])):
    real_estate.loc[i,'X2 house age'] = buckets[i]

It gives me:

KeyError: 0

The error is raised by the line real_estate.loc[i,'X2 house age'] = buckets[i], on the very first iteration.

Why do I need to change the line to buckets = pd.cut(real_estate['X2 house age'],4,labels=False).to_numpy() to make it work?

mozway On

You shouldn't need the loop, just use:

real_estate['X2 house age'] = pd.cut(real_estate['X2 house age'], 4, labels=False)

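Your current approach fails because the index is not a range index starting from 0: with index_col=0 the No column becomes the index, and it starts at 1. buckets keeps that index, so buckets[0] is a label lookup that finds no label 0 and raises KeyError: 0. With .to_numpy() the lookup becomes positional and the loop runs, but real_estate.loc[i, ...] still addresses the labels 0, 1, …, so the values end up shifted relative to the rows they belong to.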

Output:

    X1 transaction date  X2 house age  X3 distance to the nearest MRT station  X4 number of convenience stores  X5 latitude  X6 longitude  Y house price of unit area
No                                                                                                                                                                   
1              2012.917             2                                84.87882                               10     24.98298     121.54024                        37.9
2              2012.917             1                               306.59470                                9     24.98034     121.53951                        42.2
3              2013.583             1                               561.98450                                5     24.98746     121.54391                        47.3
4              2013.500             1                               561.98450                                5     24.98746     121.54391                        54.8
5              2012.833             0                               390.56840                                5     24.97937     121.54245                        43.1
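
The index mismatch described above can be reproduced without the CSV. This is only a sketch: the frame df, its age column, and the four values are made-up stand-ins for real_estate and 'X2 house age', chosen so that the index starts at 1 like the No column.

```python
import pandas as pd

# Hypothetical toy data (not the real CSV); the index starts at 1
df = pd.DataFrame(
    {"age": [32.0, 19.5, 13.3, 5.0]},
    index=pd.Index([1, 2, 3, 4], name="No"),
)

buckets = pd.cut(df["age"], 4, labels=False)

# buckets carries the labels 1..4, so looking up label 0 fails
try:
    buckets[0]
    raised = False
except KeyError:
    raised = True

# The direct assignment aligns on the index, so no loop is needed
df["age"] = buckets
```

Here raised ends up True, which is the same KeyError: 0 as in the question, while the direct assignment puts each bucket on the row it was computed from.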
Vitalizzare On

Apart from the fact that we can assign the result directly to a new column, the main problem is the confusion between positional and labeled indexing.

You should either iterate over real_estate.index or address positional data with .iloc or .iat:

# labeled indexing
for i in real_estate.index:
    real_estate.loc[i,'X2 house age'] = buckets[i]

or

# positional indexing
pos_house_age = real_estate.columns.get_loc('X2 house age')
for i in range(len(real_estate)):
    real_estate.iloc[i, pos_house_age] = buckets.iloc[i]

where

buckets = pd.cut(real_estate['X2 house age'], 4, labels=False)

Using .to_numpy() causes labeled indexes to be erased, and after that buckets[i] is equivalent to positional indexing.
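
A small sketch of what that means for the original loop, assuming a made-up frame df with an age column and an index starting at 1 (stand-ins for the real data): the loop no longer raises, but .loc still treats i as a label, so a new row 0 appears and the remaining values land one row too early.

```python
import pandas as pd

# Hypothetical toy data (not the real CSV); the index starts at 1
df = pd.DataFrame(
    {"age": [32.0, 19.5, 13.3, 5.0]},
    index=pd.Index([1, 2, 3, 4], name="No"),
)

# Plain NumPy array: no labels any more, codes[i] is positional
codes = pd.cut(df["age"], 4, labels=False).to_numpy()

for i in range(len(df)):
    # i is used as a LABEL here: label 0 does not exist, so a row is added,
    # and label i receives the code that belongs to the row at position i
    df.loc[i, "age"] = codes[i]

has_new_row = 0 in df.index          # a row labeled 0 was created
last_row_untouched = df.loc[4, "age"] == 5.0  # label 4 never got its bucket
```

So the .to_numpy() version runs, but on data indexed from 1 it does not produce the intended result; iterating over real_estate.index or using .iloc avoids this.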


P.S. Just in case: pandas.cut(..., labels=False) doesn't change the index of the returned Series; it only replaces the category labels with integer category codes.
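
To illustrate the note above, here is a minimal comparison on a made-up Series named ages (the values and the index are assumptions for the example):

```python
import pandas as pd

# Hypothetical toy data with an index starting at 1
ages = pd.Series([32.0, 19.5, 13.3, 5.0], index=[1, 2, 3, 4])

binned = pd.cut(ages, 4)               # categorical Series of intervals
codes = pd.cut(ages, 4, labels=False)  # integer code of each interval

# Both results keep the index of the input
same_index = list(binned.index) == list(codes.index) == [1, 2, 3, 4]

# The integer codes match the category codes of the interval version
same_codes = binned.cat.codes.tolist() == codes.tolist()
```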