I have a graph that shows the connection of subdomains with their domains, but I want to select only the domains that are being queried more than 10 times.
B=nx.Graph()
B.add_nodes_from(data['subdomain'],bipartite=0)
B.add_nodes_from(data['domain'],bipartite=1)
B.add_edges_from([(row['subdomain'] , row['domain']) for idx,row in data.iterrows()])
print (B.degree(data['domain']).items())
and
print (B.degree(data['domain']).values())
give me the values that I need, but I don't know how to use them to produce the graph with only those data['domain'] nodes whose degree is higher than a threshold (for example 10).
The rest of the code for the graph construction:
pos = {node:[0, i] for i,node in enumerate(data['domain'])}
pos.update({node:[1, i] for i,node in enumerate(data['subdomain'])})
nx.draw(B, pos, with_labels=False)
nx.draw_networkx_labels(B, pos)
plt.show()
NOTE: Would it be easier to select those values before constructing the graph, and if so, how? I mean selecting the values in one dataframe column that correspond to many values in another dataframe column.
EDIT: So, I have these two dataframe columns, and the main idea is to find which domain names are mapped to by subdomains more than 10 times, and then select these domain names and process them further.
So after B.add_edges_from([(row['subdomain'] , row['domain']) for idx,row in data.iterrows()])
I get my graph, which looks chaotic because of the large amount of data.
First of all, I want to show on my graph only the nodes that have more than 10 edges, and then, from that new graph, I want to be able to select/store these domain names/nodes in a new dataframe.
What bothers me is that I don't know whether it is possible to select data out of a graph at all!
Without being able to see what is in data, and judging purely by the way the code is written at the moment, it appears to simply relate subdomains to domains in a simple Graph, and therefore B.degree([bunch of nodes]) would return a dictionary whose keys are nodes and whose values are node degrees (in NetworkX 1.x; from 2.0 onwards it returns a DegreeView of (node, degree) pairs, which dict() converts to the same mapping).
If all you are trying to do is induce a subgraph C from your original B whose nodes will be the domains with more than 10 subdomains, then you can do something like:
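A minimal sketch of that idea, assuming data has the subdomain and domain columns used in the question. The sample rows, the threshold variable and the name busy are made up for illustration:

```python
import networkx as nx
import pandas as pd

# Hypothetical stand-in for `data`: one row per subdomain/domain pair.
data = pd.DataFrame({
    "subdomain": ["s%d.example.org" % i for i in range(12)] + ["www.example.com"],
    "domain": ["example.org"] * 12 + ["example.com"],
})

B = nx.Graph()
B.add_nodes_from(data["subdomain"], bipartite=0)
B.add_nodes_from(data["domain"], bipartite=1)
B.add_edges_from(zip(data["subdomain"], data["domain"]))

threshold = 10
# dict() accepts both the plain dict returned by NetworkX 1.x
# and the DegreeView returned by NetworkX 2.x and later.
degrees = dict(B.degree(set(data["domain"])))

# Keep only the domain nodes whose degree exceeds the threshold.
busy = list(filter(lambda node: degrees[node] > threshold, degrees))

# Induce the subgraph C on those nodes.
C = B.subgraph(busy)
print(sorted(C.nodes()))
```

The list busy can also be turned back into a DataFrame, for example with pd.DataFrame({"domain": busy}), if the selected domains need further processing outside the graph.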
This basically uses subgraph to induce C, with the degree criterion enforced by Python's filter.
Keep in mind, however, that in a bipartite graph this is likely to return just a set of isolated nodes, because B associates subdomains with domains and the domains themselves will not have any connections between them.
If you are trying to retrieve the domains that have more than 10 subdomains and depict this later on as a set of star graphs (a domain in the middle with all of its subdomains around it), then the easiest way to retrieve those graphs would be to run a dfs_tree (or a bfs_tree).
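As a rough sketch of the star-graph approach, assuming every subdomain is attached to exactly one domain (otherwise the traversal would continue past the subdomain into other domains). The edge list pairs and the dictionary stars are hypothetical names:

```python
import networkx as nx

# Hypothetical edge list of (subdomain, domain) pairs.
pairs = [("s%d.example.org" % i, "example.org") for i in range(12)]
pairs.append(("www.example.com", "example.com"))

B = nx.Graph()
B.add_edges_from(pairs)

threshold = 10
domains = {domain for _, domain in pairs}
busy = [d for d in domains if B.degree(d) > threshold]

# One directed tree per busy domain, rooted at the domain,
# with its subdomains as the leaves.
stars = {d: nx.bfs_tree(B, d) for d in busy}

for domain, star in stars.items():
    print(domain, star.number_of_nodes() - 1, "subdomains")
```

Each value in stars is a separate DiGraph, so the stars can be drawn or exported one at a time instead of plotting the whole of B.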
If, on the other hand, data contains one line for each "hit" of the domain by one of its subdomains, then more than one edge between a subdomain/domain pair would be required, and for this reason you would need to start with a MultiGraph rather than a Graph.
Hope this helps, happy to amend the response if more details about data or the actual problem being dealt with are provided.
EDIT: In view of additional comments and an earlier relevant question, it is not clear if you want to proceed with a graph based solution or not.
If you want to proceed with a graph based solution, then the response above will filter those root domains which have more than a given number of subdomains attached to them.
You can, of course, do the same thing using a pandas DataFrame with something like:
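A minimal sketch of the pandas route, assuming columns named subdomain, domain and ip. The rows below are invented stand-ins for the real file, and threshold and subdomain_count are illustrative names:

```python
import pandas as pd

# Hypothetical stand-in for the CSV contents (columns: subdomain, domain, ip).
data = pd.DataFrame({
    "subdomain": ["s%d.example.org" % i for i in range(12)] + ["www.example.com"],
    "domain": ["example.org"] * 12 + ["example.com"],
    "ip": ["192.0.2.%d" % i for i in range(13)],
})

threshold = 10

# Count the distinct subdomains seen for each domain.
counts = data.groupby("domain")["subdomain"].nunique()

# Keep only the domains above the threshold, as a new DataFrame.
busyDomains = counts[counts > threshold].reset_index(name="subdomain_count")
print(busyDomains)
```

Using nunique() counts each subdomain once per domain; if every row is a separate "hit" and repeated rows should count, size() on the same groupby would count rows instead.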
With testFile.csv having columns subdomain, domain and ip, busyDomains is now another DataFrame. The main result here, using the test data from your previous question, is "example.org".
Hope this helps.