subset pandas df columns with partial string match OR match before "?" using lists of names

874 Views Asked by At

I hope someone might help me.

I have a dataframe that inculdes columns with similar names (see example data)

I have 3 additional lists of column names which include the original names of the columns (i.e. the string occurring before the question mark (see lists of column names)

I need to subset the df dataframe into 3 separate dataframes, based on matching the first part of the column names present in the 3 lists. The expected output at the bottom.

It has to be in lists (or something programmatic) as I have lots and lots of columns like this. I tried pattern matching but because some names are very similar, they match to multiple lists.

thank you in advance!

example data

df = {'id': ['1','2','3','4'],
        'ab? op':  ['green', 'red', 'blue', 'None'],
        'ab? 1': ['red', 'yellow', 'None', 'None'],
        'cd': ['L', 'XL', 'M','L'],
        'efab? cba' : ['husband', 'wife', 'husband', 'None'],
        'efab? 1':['son', 'grandparent', 'son', 'None'],
        'efab? 2':['None', 'son', 'None', 'None'],
        'fab? 4':['9', '10', '5', '3'], 
        'fab? po':['England', 'Scotland', 'Wales', 'NA'] }


df = pd.DataFrame(df, columns = ['id','ab? op', 'ab? 1', 'cd', 'efab? cba', 'efab? 1', 'efab? 2', 'fab? 4', 'fab? po'])

list of column names in other 3 data frames


df1_lst = ['ab', 'cd']
df2_lst = ['efab']
df3_lst = ['fab']

desired output

df1 = ['ab? op', 'ab? 1', 'cd']
df2 = ['efab? cba', 'efab? 1', 'efab? 2']
df3 = ['fab? 4', 'fab? po']
2

There are 2 best solutions below

2
On BEST ANSWER

You can form a dynamic regex for each df lists:

df_lists = [df1_lst, df2_lst, df3_lst]

result = [df.filter(regex=fr"\b({'|'.join(names)})\??") for names in df_lists]

e.g., for the first list, the regex is \b(ab|cd)\?? i.e. look for either ab or cd but they should be standalone from the left side (\b) and there might be an optional ? afterwards.

The desired entries are in the result list e.g.

>>> result[1]

  efab? cba      efab? 1 efab? 2
0   husband          son    None
1      wife  grandparent     son
2   husband          son    None
3      None         None    None
0
On

Split column names by ?, keep the first part and check if they are in list:

df1 = df.loc[:, df.columns.str.split('?').str[0].isin(df1_lst)]
df2 = df.loc[:, df.columns.str.split('?').str[0].isin(df2_lst)]
df3 = df.loc[:, df.columns.str.split('?').str[0].isin(df3_lst)]
>>> df1
  ab? op   ab? 1  cd
0  green     red   L
1    red  yellow  XL
2   blue    None   M
3   None    None   L

>>> df2
  efab? cba      efab? 1 efab? 2
0   husband          son    None
1      wife  grandparent     son
2   husband          son    None
3      None         None    None

>>> df3
  fab? 4   fab? po
0      9   England
1     10  Scotland
2      5     Wales
3      3        NA