Performing object column manipulation in python

Question

151 Views Asked by AudioBubble At 17 August 2025 at 23:05

I have a dataset on Google Playstore data. It has twelve features (one float, the rest objects) and I would like to manipulate one of them a bit so that I can convert it to numeric form. The feature column I'm talking about is the Size column, and here's a snapshot of what it looks like:

As you can see, it's in string form, consisting of the number with the scale appended to it. Checking through the rest of the feature, I discovered that aside from megabytes (M), there are also some entries in kilobytes (K) and some entries where the size is the string "Varies according to device".

So my ultimate plan to deal with this is to:

1. Strip the last character from all the entries under Size.
2. Convert the convertible entries to floats.
3. Rescale the k entries by dividing them by 1000 so as to represent them properly.
4. Replace the "Varies according to device" entries with the mean of the feature.

I know how to do 1, 2 and 4, but 3 is giving me trouble because I'm not sure how to differentiate the k entries from the M ones and divide only those specific entries by 1000. If all of them were M or K, there'd be no issue as I've dealt with that before, but having to discriminate makes it trickier and I'm not sure what form the syntax should take (my attempts continuously throw errors).

By the way if anyone has a smarter way of going about this, I'd love to hear it. This is a learning exercise if anything!

Any help would be greatly appreciated. Thank you!!

------------------------Edit------------------------

A minimal reproducible example of an attempt would be

import pandas as pd

data = pd.read_csv("playstore-edited.csv",
                   index_col=("App"),
                   parse_dates=True,
                   infer_datetime_format=True)

x = data

var = [i[-1] for i in x.Size]
sar = dict(list(enumerate(var)))
ls = []
for i in sar:
    if sar[i]=="k":
        ls.append(i)
x.Size.loc[ls,"Size"]=x.Size.loc[ls,"Size"]/1000

This throws the following error:

IndexingError: Too many indexers

I know the last part of the code is off, but I'm not sure how to express what I want.
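
For reference, the error appears to come from the last line of the attempt: `x.Size` is already a Series, so `.loc` on it accepts a single indexer, and `ls` holds integer positions while the index holds `App` labels. The sketch below reproduces this on a made-up two-row frame (the app names and sizes are illustrative, not taken from the dataset) and shows a boolean mask as one possible way to select the k rows.

```python
import pandas as pd
from pandas.errors import IndexingError

# Made-up stand-in for the Play Store file, indexed by App like the original.
x = pd.DataFrame({"Size": ["5M", "6k"]},
                 index=pd.Index(["AppA", "AppB"], name="App"))

try:
    # x.Size is a Series (one axis), so a (rows, column) pair is one indexer too many.
    x.Size.loc[["AppB"], "Size"]
    raised = False
except IndexingError:
    raised = True

# A boolean mask selects the kilobyte rows without needing positions or labels.
is_kb = x["Size"].str.endswith("k")
numbers = x["Size"].str[:-1].astype(float)
numbers.loc[is_kb] = numbers.loc[is_kb] / 1000
x["Size"] = numbers
```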

There are 2 best solutions below

Mehdi Golzadeh On 15 October 2020 at 17:15

Process the Size column using a regex and then do your conversions:

import numpy as np
import pandas as pd

df = (
    df
    # extract the numeric part; entries with no number become NaN
    .assign(New_Size = lambda x: pd.to_numeric(
        x['Size'].str.replace('[A-Za-z ]+', '', regex=True), errors='coerce'))
    # extract the scale part (first run of letters)
    .assign(Scale = lambda x: x['Size'].str.extract('([A-Za-z]+)', expand=False))
    # convert KB to MB (upper() accepts both 'k' and 'K')
    .assign(New_Size = lambda x: np.where(x['Scale'].str.upper() == 'K', x['New_Size'] / 1000, x['New_Size']))
    # mark the converted rows as MB
    .assign(Scale = lambda x: np.where(x['Scale'].str.upper() == 'K', 'M', x['Scale']))
    # replace rows that have no size with the mean of the converted sizes
    .assign(Size = lambda x: np.where(x['Scale'] != 'M', x['New_Size'].mean(), x['New_Size']))
)
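
A variation on the same regex idea, as a rough sketch: capture the number and the unit in a single `str.extract` call with named groups, then multiply by a per-unit factor. The sample values, the names `parts`, `to_mb` and `Size_MB`, and the convention 1 M = 1000 k are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample; the real column is assumed to look like this.
df = pd.DataFrame({"Size": ["19M", "8.7M", "201k", "Varies according to device"]})

# One pass: capture the number and the unit letter into two columns.
# Rows that do not match the pattern get NaN in both columns.
parts = df["Size"].str.extract(r"^(?P<number>\d+(?:\.\d+)?)(?P<unit>[kKM])$")

# Factor that turns each unit into megabytes (assumed: 1 M = 1000 k).
to_mb = {"M": 1.0, "K": 0.001}

size_mb = parts["number"].astype(float) * parts["unit"].str.upper().map(to_mb)

# Fill the unmatched rows with the mean of the converted sizes.
df["Size_MB"] = size_mb.fillna(size_mb.mean())
```

Keeping the result in a new column leaves the original strings available for checking which rows were imputed.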

**Joel Leeb-du Toit** · Accepted Answer

As written in the comment: if you strip the final letter into a new column, you can then condition on that column for the division.

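import pandas as pd

df = pd.DataFrame({'APP': ['A', 'B'], 'Size': ['5M','6K']})
# keep the unit letter in a helper column
df['Scale'] = df['Size'].str[-1]
# float rather than int, so decimal sizes and the division below both work
df['Size'] = df['Size'].str[:-1].astype(float)
# rescale only the kilobyte rows to megabytes
df.loc[df['Scale'] == 'K', 'Size'] = df.loc[df['Scale'] == 'K', 'Size'] / 1000
df = df.drop('Scale', axis=1)
df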
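
The toy frame above only contains clean M and K entries. The sketch below extends the same strip-the-last-character approach to the four steps listed in the question, including lowercase k and the text-only entries. The helper name `size_to_mb` and the sample values are hypothetical, and the sketch assumes every valid entry ends in a single unit letter.

```python
import pandas as pd

def size_to_mb(size: pd.Series) -> pd.Series:
    """Hypothetical helper: convert strings like '19M' or '201k' to megabytes."""
    # Step 1: the last character is taken to be the unit.
    unit = size.str[-1].str.upper()
    # Step 2: convert what is left; anything non-numeric becomes NaN.
    number = pd.to_numeric(size.str[:-1], errors="coerce")
    # Step 3: rescale the kilobyte rows only.
    number = number.where(unit != "K", number / 1000)
    # Discard rows whose last character is not a known unit.
    number = number.where(unit.isin(["K", "M"]))
    # Step 4: replace the missing values with the mean of the rest.
    return number.fillna(number.mean())

sizes = pd.Series(["19M", "8.7M", "201k", "Varies according to device"])
converted = size_to_mb(sizes)
print(converted)
```

Whether the mean is a sensible fill value depends on the analysis; the median or leaving the entries as NaN are alternatives worth considering.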
