Pandas: match a list of URLs to check dependencies


Given a list of URLs, I want to check, for each value in complete_path, whether it is a subfolder of the path in another row.

The criteria for a subfolder are:

  • A subfolder starts with, and fully contains, the URL of its parent row.
  • A subfolder contains more backslashes (\) than its parent.
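
The two criteria above can be sketched as a single predicate. This is only an illustration: the function name is_subfolder and its sep parameter are not part of the question. It assumes that requiring the separator directly after the parent prefix is what "fully contains" means, which also guarantees the extra backslash.

```python
def is_subfolder(child: str, parent: str, sep: str = "\\") -> bool:
    """Return True if `child` lies strictly below `parent` (hypothetical helper)."""
    # Requiring the separator right after the parent prefix rules out
    # look-alikes such as "BVB_Christy" versus "BVB", and it implies that
    # the child contains at least one more backslash than the parent.
    return child.startswith(parent + sep)


print(is_subfolder("Ajax\\991\\1", "Ajax"))  # True
print(is_subfolder("BVB_Christy", "BVB"))    # False
print(is_subfolder("Ajax", "Ajax"))          # False
```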

Here's a sample of my pandas DataFrame:

ID      complete_path
1       Ajax
2       Ajax\991\1
3       Ajax\991
4       BVB
5       BVB\Christy
6       BVB_Christy
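
For anyone who wants to reproduce the sample, one way to build it is sketched below. The variable name df is an assumption, chosen because the answers refer to a frame by that name.

```python
import pandas as pd

# Raw strings keep every backslash literal, so r"Ajax\991\1" holds
# exactly two backslashes.
df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5, 6],
        "complete_path": [
            "Ajax",
            r"Ajax\991\1",
            r"Ajax\991",
            "BVB",
            r"BVB\Christy",
            "BVB_Christy",
        ],
    }
)
print(df)
```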

Here's my expected output:

ID      complete_path  dependency
1       Ajax           None
2       Ajax\991\1     1,3
3       Ajax\991       1
4       BVB            None
5       BVB\Christy    4
6       BVB_Christy    None

There are 2 best solutions below


This sounds like a graph problem, and networkx is helpful here.

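import networkx as nx

# one row per path segment, e.g. ID 2 yields 'Ajax', '991' and '1'
new_df = (df.assign(path=df.complete_path.str.split('\\'))
   .explode('path')
)

# within each ID, flag every segment except the last one
base = new_df.duplicated('ID', keep='last')
# map each segment to the ID of the row whose last segment it is
# (this assumes the last segments are unique across rows)
new_df['path_id'] = new_df['path'].map(new_df.loc[~base].set_index('path')['ID'])

# create the graph, with one edge from each segment's owner to the row's ID
G = nx.from_pandas_edgelist(new_df, source='path_id',target='ID', create_using=nx.DiGraph)

# the ancestors of a node are its dependencies; an empty set becomes None
df['dependency'] = [nx.ancestors(G,i) or None for i in df['ID']]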

Output:

   ID complete_path dependency
0   1          Ajax       None
1   2    Ajax\991\1     {1, 3}
2   3      Ajax\991        {1}
3   4           BVB       None
4   5   BVB\Christy        {4}
5   6   BVB_Christy       None
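
The question asks for comma-separated IDs such as 1,3, while the column above holds Python sets. A small conversion could look like the sketch below. The helper name format_ids and the stand-in list are illustrative only, and sorting the IDs is an assumption, since a set has no order of its own.

```python
# Hypothetical stand-in for the values of the dependency column shown above.
dependency = [None, {1, 3}, {1}, None, {4}, None]


def format_ids(ids):
    # Keep None for rows without a parent; otherwise join the sorted IDs.
    if not ids:
        return None
    return ",".join(str(i) for i in sorted(ids))


formatted = [format_ids(d) for d in dependency]
print(formatted)
```

On the real frame, the same list comprehension can run over df['dependency'] and be assigned back to that column.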

Alternatively, link each path to its immediate parent only and let networkx collect the remaining ancestors:

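import networkx as nx

# parent path: everything before the last backslash (top-level paths map to themselves)
prepath = lambda x: x if "\\" not in x else "\\".join(x.split("\\")[:-1])
df = df.assign(prepath = df["complete_path"].apply(prepath))
# ID of the row that holds the parent path
df["source_ID"] = df["prepath"].map(df.set_index("complete_path")["ID"])
# one edge from each parent ID to its child ID; ancestors then give the full chain
g = nx.from_pandas_edgelist(df, source="source_ID", target="ID", create_using=nx.MultiDiGraph)
df["dependency"] = [nx.ancestors(g,i) or None for i in df["ID"]]
print(df[["ID","complete_path","dependency"]])

Output:
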
   ID complete_path dependency
0   1          Ajax       None
1   2    Ajax\991\1     {1, 3}
2   3      Ajax\991        {1}
3   4           BVB       None
4   5   BVB\Christy        {4}
5   6   BVB_Christy       None
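
Both answers rely on networkx. As a rough alternative, the sketch below applies the question's criteria directly by looking up every proper prefix of each path in a dictionary. It is not taken from either answer. The function name find_dependencies is hypothetical, and the approach assumes that complete_path values are unique. Unlike a parent-only edge list, it still finds a grandparent when the intermediate folder has no row of its own.

```python
def find_dependencies(ids, paths, sep="\\"):
    """Map each ID to the sorted IDs of rows holding one of its parent folders."""
    # Assumes every path occurs only once, so a plain dict lookup is enough.
    id_by_path = dict(zip(paths, ids))
    result = {}
    for row_id, path in zip(ids, paths):
        parts = path.split(sep)
        # Candidate ancestors: every proper prefix cut at a separator.
        prefixes = (sep.join(parts[:n]) for n in range(1, len(parts)))
        found = sorted(id_by_path[p] for p in prefixes if p in id_by_path)
        result[row_id] = found or None
    return result


ids = [1, 2, 3, 4, 5, 6]
paths = ["Ajax", "Ajax\\991\\1", "Ajax\\991", "BVB", "BVB\\Christy", "BVB_Christy"]
deps = find_dependencies(ids, paths)
print(deps)
```

With the frame from the question, the two columns df['ID'] and df['complete_path'] can be passed in place of the lists, and the result written back with a list comprehension over df['ID'].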