Dataframe conversion from Python to R: keep Python string as R chr not R factor

First, happy New Year to everybody, and happy coding in 2017.

I have a Python pandas dataframe that I need to convert to an R dataframe. My pandas dataframe looks like this (shown in the format of R's str() output):

'data.frame':   302 obs. of  19 variables:
 $ typ     : chr  "page" "area" "par" "line" ...
 $ id      : chr  "page_1" "block_1_1" "par_1_1" "line_1_1" ...
 $ page    : num  1 1 1 1 1 1 1 1 1 1 ...
 $ area    : num  NA 1 1 1 2 2 2 2 3 3 ...
 $ par     : num  NA NA 1 1 NA 2 2 2 NA 3 ...
 $ line    : num  NA NA NA 1 NA NA 2 2 NA NA ...
 $ x1      : num  0 0.02 36.91 36.91 0.03 ...
 $ y1      : num  0 26.1 4.2 4.2 26.1 ...
 $ x2      : num  100 5.95 36.92 36.92 5.97 ...
 $ y2      : num  100 26.09 8.29 8.29 44.54 ...
 $ length  : num  100 5.93 0.02 0.02 5.93 ...
 $ heigth  : num  100 0.01 4.09 4.09 18.44 ...
 $ txt     : chr  "" "" "" "" ...
 $ strong  : chr  "" "" "" "" ...
 $ special : chr  "" "" "" "" ...
 $ AVGx    : num  50 2.98 36.91 36.91 3 ...
 $ AVGy    : num  50 26.09 6.24 6.24 35.31 ...
 $ SC_NR   : chr  "41151000029" "41151000029" "41151000029" "41151000029" ...
 $ DOK_LFNR: chr  "640" "640" "640" "640" ...

I am using:

pandas2ri.activate() 
pandas2ri.py2ri(dataframe)

and I got the following R dataframe:

'data.frame':   302 obs. of  19 variables:
 $ typ     : Factor w/ 5 levels "area","line",..: 3 1 4 2 1 4 2 5 1 4 ...
 $ id      : Factor w/ 302 levels "block_1_1","block_1_10",..: 77 1 78 28 12 89 39 216 21 100 ...
 $ page    : num  1 1 1 1 1 1 1 1 1 1 ...
 $ area    : num  NA 1 1 1 2 2 2 2 3 3 ...
 $ par     : num  NA NA 1 1 NA 2 2 2 NA 3 ...
 $ line    : num  NA NA NA 1 NA NA 2 2 NA NA ...
 $ x1      : num  0 0.02 36.91 36.91 0.03 ...
 $ y1      : num  0 26.1 4.2 4.2 26.1 ...
 $ x2      : num  100 5.95 36.92 36.92 5.97 ...
 $ y2      : num  100 26.09 8.29 8.29 44.54 ...
 $ length  : num  100 5.93 0.02 0.02 5.93 ...
 $ heigth  : num  100 0.01 4.09 4.09 18.44 ...
 $ txt     : Factor w/ 189 levels "","[e]","{minutes}",..: 1 1 1 1 1 1 1 107 1 1 ...
 $ strong  : Factor w/ 3 levels "","0","1": 1 1 1 1 1 1 1 2 1 1 ...
 $ special : Factor w/ 1 level "": 1 1 1 1 1 1 1 1 1 1 ...
 $ AVGx    : num  50 2.98 36.91 36.91 3 ...
 $ AVGy    : num  50 26.09 6.24 6.24 35.31 ...
 $ SC_NR   : Factor w/ 1 level "41151000029": 1 1 1 1 1 1 1 1 1 1 ...
 $ DOK_LFNR: Factor w/ 1 level "640": 1 1 1 1 1 1 1 1 1 1 ...

The issue is that the string columns in the R dataframe are factors instead of chr. I managed to fix it with R code:

i <- sapply(df, is.factor)
df[i] <- lapply(df[i], as.character)

Is there a way to do that during the conversion directly?

I am using:

python 2.7.12
rpy2 2.8.2
pandas 0.18.1

Thanks, Fabien

There are 2 best solutions below

Answer 1 (best answer)

Consider converting the factor columns back to character from Python by importing R's base package. Apparently, pandas2ri.py2ri() relies on the defaults of R's data.frame(), which converts character vectors to factors. The code below uses the rclass attribute, as described in the rpy2 docs, to find the factor columns:

from rpy2.robjects import pandas2ri
from rpy2.robjects.packages import importr

base = importr('base')
pandas2ri.activate()
...

# CONVERT PANDAS DF TO R DF
rdf = pandas2ri.py2ri(pydf)

# FIND COLUMN INDEX OF EACH FACTOR IN DF
factors = [i for i,col in enumerate(rdf) if col.rclass[0] == 'factor']

# CONVERT COLS ITERATIVELY
for f in factors:
    rdf[f] = base.as_character(rdf[f])
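
If you need this in more than one place, the loop above can be wrapped in a small helper. This is only a sketch: the names factor_column_indices and factors_to_character are made up for illustration and are not part of rpy2, and the as_character argument is assumed to be base.as_character obtained from importr('base'). Like the code above, it only checks the first entry of rclass, so a column whose first class is something else (an ordered factor, for example) is left untouched.

```python
def factor_column_indices(rdf):
    """Return the 0-based positions of columns whose first R class is 'factor'."""
    return [i for i, col in enumerate(rdf) if col.rclass[0] == 'factor']


def factors_to_character(rdf, as_character):
    """Replace every factor column of rdf in place and return rdf.

    rdf          -- an rpy2 R data frame (hypothetical helper: anything that
                    iterates by column and supports integer item assignment)
    as_character -- a callable, assumed here to be base.as_character
    """
    for i in factor_column_indices(rdf):
        rdf[i] = as_character(rdf[i])
    return rdf
```

With the imports from the snippet above, it would be called as rdf = factors_to_character(pandas2ri.py2ri(pydf), base.as_character).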

Answer 2

I've googled a bit, but I can't seem to find good documentation for the pandas2ri.py2ri(dataframe) function.

The R data.frame function (and the as.data.frame function as well) has the boolean stringsAsFactors argument, which the documentation describes as:

logical: should character vectors be converted to factors? The ‘factory-fresh’ default is TRUE, but this can be changed by setting options(stringsAsFactors = FALSE).

I guess the pandas2ri.py2ri(dataframe) function supports this argument in some way, along with the other optional arguments of data.frame.

The full documentation of both R functions is available from within R via ?data.frame and ?as.data.frame.

I'm sorry I can't help more than this; I don't know Python or the pandas package ;(