I have a database of 20000 domain names, including Top Level Domains, Second-level and lower level domains. For example
.biz stackoverflow.com ru.wikipedia.com
I would like to perform fast lookups to see if input URLs match any of these 20000. I could use Dictionary key or HashSet.Contains, but it is good for exact matches only. As the database contains also TLD names I would like acmeycompany.biz also return as match because of .biz TLD. On the other hand, fr.wikipedia.com should not match, because sub-domain is different.
Simply looping over the list and doing string based comparison is also not an option. If I have 1000 URLs to compare it's simply too slow. So it must be key based index lookups.
I was thinking of building a tree structure like below and then do key based look-ups, for example:
.com .wikipedia .ru .stackoverflow .biz
Then I can split input Url (sampledomain.com) to parts and do lookups like this .com -> .sampledomain
Could anybody point me to a sample how to do it? Or what are other alternatives? Any samples are appreciated.
Thanks!
This is how I started... It is vb.net code but you get the idea.
Public Class TreeNode
Sub New()
ChildNodes = New Dictionary(Of String, TreeNode)
End Sub
Public Property Key As String
Public Property ChildNodes As Dictionary(Of String, TreeNode)
End Class
Private Tree As New Dictionary(Of String, TreeNode)
Sub BuildTree()
For Each Link In Links
If Uri.IsWellFormedUriString(Link, UriKind.Absolute) Then
Dim Url As New Uri(Link)
Dim Domain As String
If Url.HostNameType = UriHostNameType.Dns Then
Domain = Url.Host.ToLower.Replace("www.", "")
Dim DomainParts() As String = Domain.Split(CChar("."))
'lets start from TLD
For Each Key In DomainParts.Reverse
'dont konw how to populate tree
Next
End If
End If
Next
End Sub
Function TreeLookup(Link As String) As Boolean
Dim Url As New Uri(Link)
Dim Domain As String
Dim IsMatch As Boolean = False
If Url.HostNameType = UriHostNameType.Dns Then
Domain = Url.Host.ToLower.Replace("www.", "")
Dim DomainParts() As String = Domain.Split(CChar("."))
Dim DomainPartsCount As Integer = DomainParts.Length
Dim Index As Integer = 0
For Each Key In DomainParts
Index += 1
'use recursive loop to see if
'returns true if directory contains key and we have reached to the last part of the domain name
If Index = DomainPartsCount Then
IsMatch = True
Exit For
End If
Next
End If
Return IsMatch
End Function
You may want to create a dictionary of dictionaries of hashmaps. The first dictionary can contain all the TLD entries paired with a dictionary of all the second-level domains that have that TLD. Then each second-level domain can contain a hashmap of all the lower level domains it contains. Each entry will also have a flag to indicate if that entry is actually in the database or just a placeholder for lower level entries. Like with the short list you used,
.com
is not actually in the list, but would still be an entry in the TLD so give access tostackoverflow.com
andwikipedia.com
(which itself would be a placeholder forru.wikipedia.com
). A lookup would then just start with the URLs TLD, then second and finally lower level if it needs to go that deep.I hope I understood your dilemma correctly and adequately explained my idea.
Edit: lowest level need only be a hashmap.
You will want to add an indicator in your Tree node to indicate if that node is a matched key or simply a stepping stone to get to the secondary/lower level domains.
To add the domain, you can do something like this (this could be made recursive if you like, but with only three levels its not a big deal):
To Look up a domain (again, could be made recursive):