Complex partial string matching in pandas


Given a dataframe with the following structure and values in the json_path column:

json_path                                                                  Reporting Group   Entity/Grouping
data.attributes.total.children.[0]                                         Christian Family  Abraham Family
data.attributes.total.children.[0].children.[0]                            Christian Family  In Estate
data.attributes.total.children.[0].children.[0].children.[0].children.[0]  Christian Family  Cash
data.attributes.total.children.[0].children.[0].children.[1].children.[0]  Christian Family  Investment Grade Fixed Income

How would I filter for the json_path rows that contain children four times? That is, I want to keep only the rows at index positions 2-3:

json_path                                                                  Reporting Group   Entity/Grouping
data.attributes.total.children.[0].children.[0].children.[0].children.[0]  Christian Family  Cash
data.attributes.total.children.[0].children.[0].children.[1].children.[0]  Christian Family  Investment Grade Fixed Income

I know how to obtain a partial match. However, the integers in the square brackets will be inconsistent, so my instinct is to count the instances of children (i.e., children appearing 4x) and use that count as the basis for the filter.

Any suggestions or resources on how I can achieve this?

Best answer

As you said, a naive approach is to count the occurrences of .children and compare the count with 4 to create a boolean mask, which can then be used to filter the rows:

df[df['json_path'].str.count(r'\.children').eq(4)]
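To see what the mask looks like step by step, here is a minimal, self-contained sketch. It rebuilds the sample data from the question by hand; the variable names children_count, mask and filtered are only for illustration.

```python
import pandas as pd

# Sample data copied from the question; the default RangeIndex labels the rows 0-3.
df = pd.DataFrame({
    "json_path": [
        "data.attributes.total.children.[0]",
        "data.attributes.total.children.[0].children.[0]",
        "data.attributes.total.children.[0].children.[0].children.[0].children.[0]",
        "data.attributes.total.children.[0].children.[0].children.[1].children.[0]",
    ],
    "Reporting Group": ["Christian Family"] * 4,
    "Entity/Grouping": [
        "Abraham Family",
        "In Estate",
        "Cash",
        "Investment Grade Fixed Income",
    ],
})

# Count the non-overlapping matches of the regex "\.children" in each path.
children_count = df["json_path"].str.count(r"\.children")
print(children_count.tolist())  # [1, 2, 4, 4]

# Boolean mask: True where the path contains ".children" exactly four times.
mask = children_count.eq(4)

# Boolean indexing keeps only the rows where the mask is True.
filtered = df[mask]
print(filtered.index.tolist())  # [2, 3]
```

Because the count ignores whatever sits inside the square brackets, the inconsistent integers do not affect the result.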

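A more robust approach is to check for four consecutive occurrences of a .children.[n] segment. Note that str.contains searches anywhere in the string, so paths nested more than four levels deep will also match: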

df[df['json_path'].str.contains(r'(?:\.children\.\[\d+\]){4}')]

                                                                   json_path   Reporting Group                Entity/Grouping
2  data.attributes.total.children.[0].children.[0].children.[0].children.[0]  Christian Family                           Cash
3  data.attributes.total.children.[0].children.[0].children.[1].children.[0]  Christian Family  Investment Grade Fixed Income
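To make the difference between the two approaches concrete, here is a rough sketch that compares them on the four paths from the question plus two hypothetical paths that are not in the original data: one nested five levels deep, and one where .children appears four times but is never followed by a bracketed index. Combining a {4} check with a negated {5} check is just one possible way to require exactly four levels; it is an assumption about what you want, not something the question specifies.

```python
import pandas as pd

paths = pd.Series([
    # The four paths from the question.
    "data.attributes.total.children.[0]",
    "data.attributes.total.children.[0].children.[0]",
    "data.attributes.total.children.[0].children.[0].children.[0].children.[0]",
    "data.attributes.total.children.[0].children.[0].children.[1].children.[0]",
    # Hypothetical: five nested levels.
    "data.attributes.total.children.[0].children.[0].children.[0].children.[0].children.[2]",
    # Hypothetical: ".children" four times, but never followed by ".[n]".
    "data.children.meta.children.meta.children.meta.children.meta",
])

# One ".children.[n]" segment. The group is non-capturing, (?:...), because
# str.contains warns when the pattern contains capturing groups.
segment = r"(?:\.children\.\[\d+\])"

# Naive approach: count ".children" anywhere in the string.
by_count = paths.str.count(r"\.children").eq(4)

# Regex approach: a run of four consecutive segments somewhere in the path.
at_least_four = paths.str.contains(segment + "{4}")

# Assumed refinement: exclude paths that also contain a run of five segments.
exactly_four = at_least_four & ~paths.str.contains(segment + "{5}")

print(by_count.tolist())       # [False, False, True, True, False, True]
print(at_least_four.tolist())  # [False, False, True, True, True, False]
print(exactly_four.tolist())   # [False, False, True, True, False, False]
```

On the data in the question all three masks select the same rows (2 and 3); they only diverge on inputs like the two hypothetical paths.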