Why do swifter/dask/ray use only one core for an apply task?


I have this function that I would like to apply to a large dataframe in parallel:

from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

from rdkit import RDLogger
RDLogger.DisableLog('rdApp.*')

def standardize_smiles(smiles):
    if smiles is None:
        return None

    # try:
    mol = Chem.MolFromSmiles(smiles)

    # remove Hs, disconnect metal atoms, normalize and reionize the molecule
    clean_mol = rdMolStandardize.Cleanup(mol)

    # if there are many fragments, get the "parent" (the actual mol we are interested in)
    parent_clean_mol = rdMolStandardize.FragmentParent(clean_mol)

    # try to neutralize the molecule
    uncharger = rdMolStandardize.Uncharger()  # annoying, but necessary as no convenience method exists
    uncharged_parent_clean_mol = uncharger.uncharge(parent_clean_mol)

    # note that no attempt is made at reionization at this step,
    # nor at ionization at some pH (RDKit has no pKa calculator);
    # the main aim is to represent all molecules from different sources
    # in a (single) standard way, for use in ML, catalogues, etc.

    te = rdMolStandardize.TautomerEnumerator()  # idem
    taut_uncharged_parent_clean_mol = te.Canonicalize(uncharged_parent_clean_mol)
    return Chem.MolToSmiles(taut_uncharged_parent_clean_mol)
    # except:
    #     return False

standardize_smiles('CCC')

'CCC'

However, neither Dask, nor Swifter, nor Ray gets the job done: for some reason, all of these frameworks end up using only a single CPU core.

Native Pandas

import pandas as pd

N = 1000
smiles_test = pd.DataFrame({'smiles': ['CCC'] * N})
smiles_test['standardized_smiles'] = smiles_test.smiles.apply(standardize_smiles)  # plain pandas apply, timed below

CPU times: user 3.58 s, sys: 0 ns, total: 3.58 s Wall time: 3.58 s

Swifter 1.3.4

smiles_test['standardized_smiles'] = smiles_test.smiles.swifter.allow_dask_on_strings(True).apply(standardize_smiles)

CPU times: user 892 ms, sys: 31.4 ms, total: 923 ms Wall time: 5.14 s

While this WORKS with the dummy data, it does not with the real data, which looks like this:

[screenshot of the real dataframe omitted: a column of longer, more complex SMILES strings]

The strings are a bit more complicated than the ones in the dummy data. It seems that swifter first needs some time to prepare the parallel execution, during which it uses only one core, and only then starts using more cores. However, on the real data it never uses more than 3 out of 8 cores.
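For completeness, this is the kind of explicit configuration I would expect to control the number of workers in swifter (a sketch using swifter's set_npartitions and set_dask_scheduler; the partition count and scheduler choice are assumptions, not settings I have verified on the real data):

import swifter  # noqa: F401  (registers the .swifter accessor on pandas objects)

smiles_test['standardized_smiles'] = (
    smiles_test['smiles']
    .swifter
    .set_npartitions(8)                         # ask for one partition per core (assumed 8 cores)
    .set_dask_scheduler(scheduler='processes')  # run on processes instead of threads
    .allow_dask_on_strings(True)
    .apply(standardize_smiles)
)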

I have the same issue with other frameworks such as Dask, Ray, and Modin.
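For reference, the Dask attempt looks roughly like this (a minimal sketch with dask.dataframe; the number of partitions and the 'processes' scheduler are assumptions, and the real call uses the actual dataframe instead of smiles_test):

import dask.dataframe as dd

# split the pandas frame into one partition per core (8 is an assumption)
ddf = dd.from_pandas(smiles_test, npartitions=8)

standardized = (
    ddf['smiles']
    .apply(standardize_smiles, meta=('smiles', 'object'))  # meta tells Dask the output dtype
    .compute(scheduler='processes')                          # use worker processes rather than threads
)
smiles_test['standardized_smiles'] = standardized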

Is there something I am missing here? Is there a problem when the dataframe contains strings? Why does the parallel execution take so much time even on a single machine with multiple cores? Or is there something about the RDKit library that makes the function above hard to parallelize?
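As a sanity check on that last point, a plain multiprocessing baseline (a sketch, not part of my original attempts) should be able to use all cores if the function itself parallelizes, since the inputs are just strings and standardize_smiles is a module-level function:

from multiprocessing import Pool

if __name__ == '__main__':
    # each worker process imports RDKit and runs standardize_smiles on a chunk of strings
    with Pool(processes=8) as pool:
        standardized = pool.map(standardize_smiles, smiles_test['smiles'], chunksize=100)
    smiles_test['standardized_smiles'] = standardized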
