Performing object column manipulation in Python


I have a dataset on Google Playstore data. It has twelve features (one float, the rest objects) and I would like to manipulate one of them a bit so that I can convert it to numeric form. The feature column I'm talking about is the Size column, and here's a snapshot of what it looks like:

[Image: snapshot of the Size column]

As you can see, it's in string form, consisting of the number with the scale appended to it. Checking through the rest of the feature, I discovered that aside from megabytes (M), there are also some entries in kilobytes (K) and some entries where the size is the string "Varies according to device".

So my ultimate plan to deal with this is to:

  1. Strip the last character from all the entries under size.
  2. Convert the convertible entries to floats
  3. Rescale the k entries by dividing them by 1000 so as to represent them properly
  4. Replace the "Varies according to device" entries with the mean of the feature.

I know how to do 1, 2 and 4, but 3 is giving me trouble because I'm not sure how to differentiate the k entries from the M ones and divide only those specific entries by 1000. If all of them were M or K, there'd be no issue, as I've dealt with that before, but having to discriminate makes it trickier and I'm not sure what form the syntax should take (my attempts keep throwing errors).

By the way, if anyone has a smarter way of going about this, I'd love to hear it. This is a learning exercise if anything!

Any help would be greatly appreciated. Thank you!!

------------------------Edit------------------------

A minimal reproducible example of an attempt would be

import pandas as pd

data = pd.read_csv("playstore-edited.csv",
                   index_col=("App"),
                   parse_dates=True,
                   infer_datetime_format=True)

x = data

var = [i[-1] for i in x.Size]
sar = dict(list(enumerate(var)))
ls = []
for i in sar:
    if sar[i]=="k":
        ls.append(i)
x.Size.loc[ls,"Size"]=x.Size.loc[ls,"Size"]/1000

This throws the following error:

IndexingError: Too many indexers

I know the last part of the code is off, but I'm not sure how to express what I want.
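
For reference, one likely reading of the error: `x.Size` is already a Series, which has a single axis, so `.loc[ls, "Size"]` passes one indexer too many. In addition, `ls` holds integer positions while the frame is indexed by `App` labels. The following is only a minimal sketch of a boolean-mask alternative; the toy data and the names `is_k`, `k_values` and `k_as_mb` are assumptions made for illustration, not part of the original attempt.

```python
import pandas as pd

# Hypothetical toy data standing in for the Play Store file
x = pd.DataFrame(
    {"Size": ["19M", "201k", "Varies according to device"]},
    index=pd.Index(["A", "B", "C"], name="App"),
)

# Boolean mask marking the kilobyte rows
is_k = x["Size"].str.endswith("k")

# A Series has one axis, so .loc takes a single indexer here
k_values = x["Size"].loc[is_k]

# On the DataFrame, .loc takes (row selector, column label)
k_as_mb = x.loc[is_k, "Size"].str[:-1].astype(float) / 1000
print(k_as_mb)
```

The division is applied to the numeric part only, since the column still holds strings at this point.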


There are 2 best solutions below

Best answer

As written in the comment: If you strip the final letter to a new column you can then condition on that column for the division.

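import pandas as pd

df = pd.DataFrame({'APP': ['A', 'B'], 'Size': ['5M', '6K']})
# Keep the trailing unit letter in a helper column
df['Scale'] = df['Size'].str[-1]
# Use float rather than int: the division below produces fractional values
df['Size'] = df['Size'].str[:-1].astype(float)
# Rescale only the kilobyte rows to megabytes
df.loc[df['Scale'] == 'K', 'Size'] /= 1000
df = df.drop('Scale', axis=1)
df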
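
The snippet above assumes every entry ends in a unit letter. As a hedged extension covering step 4 of the question, the sketch below also handles the "Varies according to device" rows by coercing them to NaN and filling them with the mean of the known sizes. The toy data and the helper names `scale` and `size_mb` are assumptions for illustration, and matching the unit case-insensitively is a choice made here because the question mentions both K and k.

```python
import pandas as pd

# Hypothetical toy data; the real column has many more rows
df = pd.DataFrame({
    "APP": ["A", "B", "C"],
    "Size": ["5M", "500k", "Varies according to device"],
})

# Last character is the unit for sized rows ('E' for the "Varies..." rows)
scale = df["Size"].str[-1].str.upper()

# Non-numeric remainders (e.g. "Varies according to devic") become NaN
size_mb = pd.to_numeric(df["Size"].str[:-1], errors="coerce")

# Keep values where the unit is not K; divide the kilobyte rows by 1000
size_mb = size_mb.where(scale != "K", size_mb / 1000)

# Fill the unknown sizes with the mean of the known ones (NaN is skipped)
df["Size"] = size_mb.fillna(size_mb.mean())
print(df)
```

Whether the mean is a sensible fill value depends on the data; the median is a common alternative when sizes are skewed.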

Second answer

Process the Size column using a regex and then do your conversions:

import numpy as np
import pandas as pd

df = (
    df
    # extract the numeric part; non-numeric entries become NaN
    .assign(New_Size=lambda x: pd.to_numeric(
        x['Size'].str.replace(r'[A-Za-z\s]+', '', regex=True), errors='coerce'))
    # extract the scale part (first run of letters)
    .assign(Scale=lambda x: x['Size'].str.extract(r'([A-Za-z]+)', expand=False))
    # convert KB to MB
    .assign(Size=lambda x: np.where(x['Scale'].isin(['K', 'k']), x['New_Size'] / 1000, x['New_Size']))
    # mark the converted rows as MB
    .assign(Scale=lambda x: np.where(x['Scale'].isin(['K', 'k']), 'M', x['Scale']))
    # replace rows without a usable size with the mean of the converted Size column
    .assign(Size=lambda x: np.where(x['Scale'] != 'M', x['Size'].mean(), x['Size']))
)