How to change dtype of each dataframe column to float if it best fits?

477 Views Asked by At

I have an excel that has columns with values like 1.32131**, among columns of dtype string. As a result the dtypes of these columns in the dataframe are object. I have cleaned the asterisks from the dataframe and now I need to convert the dtypes of these columns to float64. I am aware of ways to do it if I define the columns that need to be changed or the desired dtypes for each column (like the functions mentioned here), but I have too many columns to use such solutions. Thus I am looking for a more efficient and clean way.

For example, if I wanted to convert to int64 I would use convert_dtypes(), but it seems that it doesn't support floats and it returns these columns with object dtype.

Then, if possible, convert to StringDtype, BooleanDtype or an appropriate integer extension type, otherwise leave as object.

Right now I am using the following script that works but I think it's to big for its purpose and it a bit slow.

# Create df and clean it
# note that the data exist in an excel normally and the dict is only for reproducibility purposes
dict = {'Name':['BPh1', 'BPh2', 'BPh3', 'BPh4', 'BPh5', 'BPh6', 'BPh7'], 'BBB':['2.00755**', '2.7766**', '0.490127**','0.490127**', '0.87667**', '0.899189**', '3.084**'], 'Buffer_solubility_mg_L':['0.00112934**','0.000798559**', '0.000218191**', '0.000122249**', '0.00382848**', '0.00109165**', '0.000665366**'], 'CYP_2C19_inhibition':['Inhibitor','Inhibitor','Non','Non','Inhibitor','Inhibitor',
'Inhibitor']}

ss =  pd.DataFrame(dict).replace("\*",'',regex=True)

# Convert dtype to float when possible
for col in ss.columns[1:]:
    print(col,'\n',ss[col].dtypes)
    try:
        ss[col] = pd.to_numeric(ss[col])
    except:
        pass
    print(ss[col].dtypes,'\n')

Is there a cleaner way to do this conversion?

1

There are 1 best solutions below

4
On

I'd change/clean the values before creating the dataframe, that way you're not first creating one, and then converting it to something else (might save a little bit of time as well). The advantage is that you can do it in a single line. I don't think you can get much faster than this, given the input data that you have to work with.

import pandas

# Create df and clean it
dict = {'Name':['BPh1', 'BPh2', 'BPh3', 'BPh4', 'BPh5', 'BPh6', 'BPh7'], 'BBB':['2.00755**', '2.7766**', '0.490127**','0.490127**', '0.87667**', '0.899189**', '3.084**'], 'Buffer_solubility_mg_L':['0.00112934**','0.000798559**', '0.000218191**', '0.000122249**', '0.00382848**', '0.00109165**', '0.000665366**'], 'CYP_2C19_inhibition':['Inhibitor','Inhibitor','Non','Non','Inhibitor','Inhibitor',
'Inhibitor']}

# Perform the conversion on creation
df = pandas.DataFrame(
    {
        col: pandas.to_numeric([v.replace("*", "") for v in values], errors="ignore")
        for col, values in dict.items()
    }
)