How to use pandas' df.get function for a dataframe column so that each row in the column maintains its own value?

145 Views Asked by At

To summarize as concisely as I can, I have data file containing a list of chemical compounds along with their ID numbers ("CID" numbers). My goal is to use pubchempy's pubchempy.get_properties function along with pandas' df.map function to essentially obtain the properties of each compound (there is one compound per row) using the "CID" number as an identifier. The parameters of pubchempy.get_properties is an identifier ("CID" number in this case) along with the property of the chemical that you want to obtain from the pubchem website (Molecular weight in this case).


This is the code that I have written currently:

import pandas as pd

import pubchempy

import numpy as np

df = pd.read_csv("Data.tsv.txt", sep="\t")

from pubchempy import get_properties

df['CID'] = df['CID'].astype(str).apply(lambda x: x.replace('.0',''))

df['CID'] = df['CID'].astype(str).apply(lambda x: x.replace('0',''))

df = df.drop(df[df.CID=='nan'].index)

df = df.drop( df.index.to_list()[5:] ,axis = 0 )

df['CID']= df['CID'].map(lambda x: get_properties(identifier=x, properties='MolecularWeight') if float(x) > 0 else pd.NA)

df = df.rename(columns={'CID.': 'MolecularWeight'})

print(df)

This is the output that I was initially getting for that column (only including a few rows, in reality, dataset is very big):

MolecularWeight

[{'CID': 5339, 'MolecularWeight': '398.4'}]

[{'CID': 3889, 'MolecularWeight': '520.5'}]

[{'CID': 2788, 'MolecularWeight': '305.50'}]

[{'CID': 1422517, 'MolecularWeight': '440.5'}]

.

.

.


Now, the code was somewhat working in that it is providing me with the molecular weight of the compound (398.4) but I didn't want all that extra bit of writing nor did I want the quote marks around the molecular weight number (both of these get in the way of the next bit of code that I plan to write).

So I then added this bit of code:

df['MolecularWeight'] = df.MolecularWeight[0][0].get('MolecularWeight')

This is the output that I am now getting:

MolecularWeight

398.4

398.4

398.4

398.4

.

.

.


What I want to do is pretty much exactly the same it's just that instead of getting the molecular weight of the first row in the MolecularWeight column and copying it onto all the other rows, I want to have the molecular weight value of each individual row in that column as the output.

What I was hoping to get is something like this:

MolecularWeight

398.4

520.5

305.50

440.5

.

.

.

Does anyone know how I can solve this issue? I've spent many hours trying to figure it out myself with no luck. I'd appreciate any help!


Few lines of text file:

NO. compound_name   IUPAC_name  SMILES  CID     Inchi   threshold   reference   group   comments
1   sulphasalazine  2-hydroxy-5-[[4-(pyridin-2-ylsulfamoyl)phenyl]diazenyl]benzoic acid O=C(O)c1cc(N=Nc2ccc(S(=O)(=O)Nc3ccccn3)cc2)ccc1O    5339    InChI=1S/C18H14N4O5S/c23-16-9-6-13(11-15(16)18(24)25)21-20-12-4-7-14(8-5-12)28(26,27)22-17-3-1-2-10-19-17/h1-11,23H,(H,19,22)(H,24,25)      R2|R2|R25|R46|  A   
2   moxalactam  7-[[2-carboxy-2-(4-hydroxyphenyl)acetyl]amino]-7-methoxy-3-[(1-methyltetrazol-5-yl)sulfanylmethyl]-8-oxo-5-oxa-1-azabicyclo[4.2.0]oct-2-ene-2-carboxylic acid   COC1(NC(=O)C(C(=O)O)c2ccc(O)cc2)C(=O)N2C(C(=O)O)=C(CSc3nnnn3C)COC21 3889    InChI=1S/C20H20N6O9S/c1-25-19(22-23-24-25)36-8-10-7-35-18-20(34-2,17(33)26(18)13(10)16(31)32)21-14(28)12(15(29)30)9-3-5-11(27)6-4-9/h3-6,12,18,27H,7-8H2,1-2H3,(H,21,28)(H,29,30)(H,31,32)      R25|    A   
3   clioquinol  5-chloro-7-iodoquinolin-8-ol    Oc1c(I)cc(Cl)c2cccnc12  2788    InChI=1S/C9H5ClINO/c10-6-4-7(11)9(13)8-5(6)2-1-3-12-8/h1-4,13H      R18|R26|R27|    A   
2

There are 2 best solutions below

1
Jan Alexander On

If you cast the column to float, that should help you: df['MolecularWeight'] = df['MolecularWeight'].astype(float).

4
BeRT2me On

It appears that you may want multiple properties from each CID:

props = ['HBondDonorCount', 'RotatableBondCount', 'MolecularWeight']
df2 = pd.DataFrame(get_properties(identifier=df.CID.to_list(), properties=props))
print(df2)

Output:

    CID HBondDonorCount  RotatableBondCount  MolecularWeight
0  5339           398.4                   3                6
1  3889           520.5                   4                9
2  2788          305.50                   1                0

You can then merge this information onto the original dataframe:

df = df.merge(df2)  # df = df.merge(pd.DataFrame(get_properties(identifier=df.CID.to_list(), properties=props)))
print(df)

...

   NO.   compound_name                                         IUPAC_name                                             SMILES   CID                                              Inchi  threshold       reference group  comments MolecularWeight  HBondDonorCount  RotatableBondCount
0    1  sulphasalazine  2-hydroxy-5-[[4-(pyridin-2-ylsulfamoyl)phenyl]...   O=C(O)c1cc(N=Nc2ccc(S(=O)(=O)Nc3ccccn3)cc2)ccc1O  5339  InChI=1S/C18H14N4O5S/c23-16-9-6-13(11-15(16)18...        NaN  R2|R2|R25|R46|     A       NaN           398.4                3                   6
1    2      moxalactam  7-[[2-carboxy-2-(4-hydroxyphenyl)acetyl]amino]...  COC1(NC(=O)C(C(=O)O)c2ccc(O)cc2)C(=O)N2C(C(=O)...  3889  InChI=1S/C20H20N6O9S/c1-25-19(22-23-24-25)36-8...        NaN            R25|     A       NaN           520.5                4                   9
2    3      clioquinol                       5-chloro-7-iodoquinolin-8-ol                             Oc1c(I)cc(Cl)c2cccnc12  2788  InChI=1S/C9H5ClINO/c10-6-4-7(11)9(13)8-5(6)2-1...        NaN    R18|R26|R27|     A       NaN          305.50                1                   0