How to convert a column of strings to numerical values?


I have this pandas dataframe from a query:

|    name    |    event    |
----------------------------
| name_1     | event_1     |
| name_1     | event_2     |
| name_2     | event_1     |

I need to convert the event column to numerical values, so that the result looks like this:

| name    | event_1 | event_2 |
-------------------------------
| name_1  | 1       | 0       |
| name_1  | 0       | 1       |
| name_2  | 1       | 0       |

In RapidMiner I can do this with the "Nominal to Numerical" operator, so I assume that converting the type of the column in Python would work, but I may be mistaken.

In the end, the idea is to sum the column values for rows with the same name, giving a table that looks like this:

| name    | event_1 | event_2 |
-------------------------------
| name_1  | 1       | 1       |
| name_2  | 1       | 0       |

Is there a function that returns what I expect?

Important: I can't do a simple count of the events because I don't know them in advance, and the events differ between users.

EDIT: Thanks, everyone. I can see there are multiple ways to do this. Which one of these is the most Pythonic?


2
On BEST ANSWER

Some ways of doing it

1)

In [366]: pd.crosstab(df.name, df.event)
Out[366]:
event   event_1  event_2
name
name_1        1        1
name_2        1        0

2)

In [367]: df.groupby(['name', 'event']).size().unstack(fill_value=0)
Out[367]:
event   event_1  event_2
name
name_1        1        1
name_2        1        0

3)

In [368]: df.pivot_table(index='name', columns='event', aggfunc=len, fill_value=0)
Out[368]:
event   event_1  event_2
name
name_1        1        1
name_2        1        0

4)

In [369]: df.assign(v=1).pivot(index='name', columns='event', values='v').fillna(0)
Out[369]:
event   event_1  event_2
name
name_1      1.0      1.0
name_2      1.0      0.0
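
One difference between these options is worth a quick check: the first three count repeated (name, event) pairs, while `pivot` only reshapes and refuses duplicates. A minimal sketch, assuming a hypothetical frame in which name_1 has event_1 twice (this data is not from the question):

```python
import pandas as pd

# Hypothetical data: name_1 has event_1 twice.
df = pd.DataFrame({
    "name": ["name_1", "name_1", "name_1", "name_2"],
    "event": ["event_1", "event_1", "event_2", "event_1"],
})

# crosstab counts every occurrence, so the repeated pair shows up as 2.
counts = pd.crosstab(df["name"], df["event"])
print(counts)

# pivot does no aggregation, so a repeated (name, event) pair raises ValueError.
try:
    df.assign(v=1).pivot(index="name", columns="event", values="v")
    pivot_failed = False
except ValueError as exc:
    pivot_failed = True
    print("pivot failed:", exc)
```

So if the same user can trigger the same event more than once, option 4 is probably not the one to pick.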
0
On

Option 1
pir1 and pir1_5

df.set_index('name').event.str.get_dummies()

        event_1  event_2
name                    
name_1        1        0
name_1        0        1
name_2        1        0

Then you could sum across the index

df.set_index('name').event.str.get_dummies().groupby(level=0).sum()

        event_1  event_2
name                    
name_1        1        1
name_2        1        0

Option 2
pir2
Or you could use a dot product

pd.get_dummies(df.name, dtype=int).T.dot(pd.get_dummies(df.event, dtype=int))

        event_1  event_2
name_1        1        1
name_2        1        0
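
To see why the dot product gives the counts, here is a small sketch with the intermediate indicator matrices spelled out. The variable names `names`, `events`, and `counts` are mine, and `dtype=int` is an assumption to keep the product numeric on pandas versions where `get_dummies` returns booleans:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["name_1", "name_1", "name_2"],
    "event": ["event_1", "event_2", "event_1"],
})

# One indicator column per distinct value, one row per original row.
names = pd.get_dummies(df["name"], dtype=int)    # shape (rows, n_names)
events = pd.get_dummies(df["event"], dtype=int)  # shape (rows, n_events)

# Entry (a, b) of names.T.dot(events) sums names[row, a] * events[row, b]
# over all rows; each product is 1 only when that row has name a and event b.
counts = names.T.dot(events)
print(counts)
```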

Option 3
pir3
Advanced Mode

i, r = pd.factorize(df.name.values)
j, c = pd.factorize(df.event.values)
n, m = r.size, c.size

b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)

pd.DataFrame(b, r, c)

        event_1  event_2
name_1        1        1
name_2        1        0
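
The trick in the advanced version is that each (name, event) pair is mapped to a single cell number in a flattened n x m grid, and `np.bincount` counts how often each cell number occurs. A commented sketch of the same steps on the question's data (the names `flat` and `counts` are mine):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["name_1", "name_1", "name_2"],
    "event": ["event_1", "event_2", "event_1"],
})

# factorize returns integer codes plus the unique values, in order of appearance.
i, r = pd.factorize(df["name"].values)   # codes per row, unique names
j, c = pd.factorize(df["event"].values)  # codes per row, unique events
n, m = r.size, c.size

# Row code i and column code j become one position in a row-major n x m grid.
flat = i * m + j

# Count occurrences of each position; minlength pads cells that never occur.
b = np.bincount(flat, minlength=n * m).reshape(n, m)

counts = pd.DataFrame(b, index=r, columns=c)
print(counts)
```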

Timing

res.plot(loglog=True)

res.div(res.min(1), 0)

            pir1      pir2  pir3      john1     john2      john3
10      9.948396  3.399913   1.0  20.478368  4.460466  10.642113
30      9.350524  2.681178   1.0  16.589248  3.847666   9.168907
100    11.414536  3.079463   1.0  18.076040  4.277752   9.949305
300    15.769594  2.940529   1.0  16.745889  3.945470   9.069265
1000   26.869451  2.617564   1.0  12.789570  3.236390   7.279205
3000   42.229542  2.099541   1.0   8.716600  2.429847   4.785814
10000  52.571678  1.716088   1.0   4.597598  1.691989   2.800455
30000  58.644764  1.469827   1.0   2.818744  1.535012   1.929452

Functions

pir1 = lambda df: df.set_index('name').event.str.get_dummies().groupby(level=0).sum()
pir1_5 = lambda df: pd.get_dummies(df.set_index('name').event).groupby(level=0).sum()
pir2 = lambda df: pd.get_dummies(df.name, dtype=int).T.dot(pd.get_dummies(df.event, dtype=int))

def pir3(df):
    i, r = pd.factorize(df.name.values)
    j, c = pd.factorize(df.event.values)
    n, m = r.size, c.size

    b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)

    return pd.DataFrame(b, r, c)

john1 = lambda df: pd.crosstab(df.name, df.event)
john2 = lambda df: df.groupby(['name', 'event']).size().unstack(fill_value=0)
john3 = lambda df: df.pivot_table(index='name', columns='event', aggfunc='size', fill_value=0)

Test

from timeit import timeit

res = pd.DataFrame(
    index=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    columns='pir1 pir2 pir3 john1 john2 john3'.split(),
    dtype=float
)

for i in res.index:
    d = pd.concat([df] * i, ignore_index=True)
    for j in res.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        res.at[i, j] = timeit(stmt, setp, number=100)
0
On

You are asking for the Pythonic way. I think that in Python the way to do this is a technique called one-hot encoding, which is well implemented in libraries such as sklearn. After one-hot encoding, you need to group your dataframe by the first column and apply the sum function.

Here is the code:

import pandas as pd  # the useful libraries
import numpy as np
from sklearn.preprocessing import LabelBinarizer  # from sklearn

# just reproduce your dataframe
dataset = pd.DataFrame([['name_1', 'event_1'], ['name_1', 'event_2'], ['name_2', 'event_1']], columns=['name', 'event'], index=[1, 2, 3])
data = dataset['event']
enc = LabelBinarizer(neg_label=0)
# with exactly two distinct events, fit_transform returns one 0/1 column that is 1 for 'event_2'
dataset['event_2'] = enc.fit_transform(data).ravel()
event_two = dataset['event_2']
dataset['event_1'] = (~event_two.astype(bool)).astype(np.int64)  # a trick to derive the event_1 column
dataset = dataset.groupby('name')[['event_1', 'event_2']].sum()  # sum only the indicator columns
dataset.reset_index(inplace=True)

The output (shown as an image in the original post) has one row per name with the summed indicator columns.
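
The "one-hot encode, then group by name and sum" idea can also be sketched with pandas alone, which avoids hard-coding the event names. This is not the code above but a swapped-in variant using `pd.get_dummies`; the third event, `event_3`, is hypothetical and only there to show that more than two events work:

```python
import pandas as pd

# Hypothetical data with three distinct events.
df = pd.DataFrame({
    "name": ["name_1", "name_1", "name_2", "name_2"],
    "event": ["event_1", "event_2", "event_1", "event_3"],
})

# Step 1: one-hot encode the event column (one 0/1 column per distinct event).
onehot = pd.get_dummies(df["event"], dtype=int)

# Step 2: group the indicator rows by name and sum them.
result = onehot.groupby(df["name"]).sum().reset_index()
print(result)
```

Because the columns come from the data, nothing needs to be known about the events beforehand.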