How to extract elements from a list in pandas through regex?


I'm looking to extract the code that comes after 'accession' in this DataFrame. My DataFrame looks like this:

targets_list = pd.DataFrame(targets_df[['target_components', 'target_chembl_id']])

and each entry in the target_components column looks like the following:

[{'accession': 'O43451', 'component_description': 'Maltase-glucoamylase, intestinal', 'component_id': 434, 'component_type': 'PROTEIN', 'relationship': 'SINGLE PROTEIN', 'target_component_synonyms',...}]

I would just like to extract the code after 'accession'. Since I thought it was the first element of the list, I tried tgt = targets_list['target_components'][0][0], but this returns the whole first element of that list (the dictionary), not the accession code.

I can see that each row holds a list, but I don't know how to parse that list, get the accession code, and add it to a column. Maybe it is possible with regex? I'm not sure how regex works at all, though.


There are 4 best solutions below

Timus (BEST ANSWER)

You could try:

tgt = targets_list["target_components"].str[0].str["accession"]

Result for

targets_list = pd.DataFrame(
    {"target_components": [
        [{"accession": "O43451", "b": "c", "d": 1}],
        [{"accession": "012345", "b": "e", "d": 2}],
        [{"b": "f", "d": 3}],
        []]}
)
                              target_components
0  [{'accession': 'O43451', 'b': 'c', 'd': 1}]
1  [{'accession': '012345', 'b': 'e', 'd': 2}]
2                         [{'b': 'f', 'd': 3}]
3                                           []

is

0    O43451
1    012345
2      None
3       NaN
Name: target_components, dtype: object
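
To store the result as a new column, as the question asks, the same expression can be assigned back to the frame. A minimal sketch using the sample data above; the column name "accession" is an assumption chosen for illustration:

```python
import pandas as pd

# Sample data mirroring the example above.
targets_list = pd.DataFrame(
    {"target_components": [
        [{"accession": "O43451", "b": "c", "d": 1}],
        [{"accession": "012345", "b": "e", "d": 2}],
        [{"b": "f", "d": 3}],
        []]}
)

# .str[0] takes the first item of each list; .str["accession"] then looks up
# the key in that dict. Rows with an empty list or no such key end up missing.
targets_list["accession"] = (
    targets_list["target_components"].str[0].str["accession"]
)
```

Rows 2 and 3 hold a missing value in the new column, so they can be filtered afterwards with targets_list["accession"].notna() if needed.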

Vedant Pople

You can use Series.str.findall() or Series.str.extract() to get the accession code.

Refer to: Use regular expression to extract elements from a pandas data frame
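
As a rough sketch of the regex route: the pandas string methods operate on text, so the list column has to be converted to strings first. The sample data and the pattern below are assumptions for illustration. The pattern relies on how Python prints a dict with single-quoted keys and values, which makes this approach more fragile than indexing into the list directly.

```python
import pandas as pd

# Illustrative data: one row with an accession, one without.
targets_list = pd.DataFrame(
    {"target_components": [
        [{"accession": "O43451", "component_id": 434}],
        [{"component_id": 1}],
    ]}
)

# Regex methods need strings, so turn each list into its text form.
as_text = targets_list["target_components"].astype(str)

# Capture the characters between the quotes that follow 'accession': .
# expand=False returns a Series; rows without a match become NaN.
tgt = as_text.str.extract(r"'accession': '([^']+)'", expand=False)
```

Only the first match per row is returned; Series.str.findall() with the same pattern would collect every accession when a list holds several components.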

edmz

You can try this:

targets_list['target_components'].map(lambda x: x[0].get("accession") if x else '')

Ynjxsjmh

First, there is no need to call pd.DataFrame again to create a DataFrame from existing columns:

targets_list = targets_df[['target_components', 'target_chembl_id']]

Then you can use apply to access the element in each row of the column:

tgt = targets_list['target_components'].apply(lambda x: x[0]['accession'])
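
Note that the lambda above raises an IndexError on an empty list and a KeyError when 'accession' is absent. A possible variant that tolerates both cases is sketched below. The helper name first_accession, the sample rows, and the placeholder IDs are made up for illustration. The .copy() call is an addition so that the new column is written to an independent frame rather than to a selection of targets_df.

```python
import pandas as pd

def first_accession(components):
    """Return the 'accession' of the first component, or None if unavailable."""
    if not components:  # empty list
        return None
    return components[0].get("accession")  # None when the key is missing

# Illustrative stand-in for the real targets_df.
targets_df = pd.DataFrame({
    "target_components": [[{"accession": "O43451"}], [{"b": "f"}], []],
    "target_chembl_id": ["CHEMBL_A", "CHEMBL_B", "CHEMBL_C"],  # placeholder IDs
})

# Copy the selected columns so targets_list is independent of targets_df.
targets_list = targets_df[["target_components", "target_chembl_id"]].copy()
targets_list["accession"] = targets_list["target_components"].apply(first_accession)
```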