dictionary/tree fast key based lookup

563 Views Asked by At

I have a database of 20000 domain names, including Top Level Domains, Second-level and lower level domains. For example

.biz stackoverflow.com ru.wikipedia.com

I would like to perform fast lookups to see if input URLs match any of these 20000. I could use Dictionary key or HashSet.Contains, but it is good for exact matches only. As the database contains also TLD names I would like acmeycompany.biz also return as match because of .biz TLD. On the other hand, fr.wikipedia.com should not match, because sub-domain is different.

Simply looping over the list and doing string based comparison is also not an option. If I have 1000 URLs to compare it's simply too slow. So it must be key based index lookups.

I was thinking of building a tree structure like below and then do key based look-ups, for example:

.com .wikipedia .ru .stackoverflow .biz

Then I can split input Url (sampledomain.com) to parts and do lookups like this .com -> .sampledomain

Could anybody point me to a sample how to do it? Or what are other alternatives? Any samples are appreciated.

Thanks!

This is how I started... It is vb.net code but you get the idea.

 Public Class TreeNode

    Sub New()
        ChildNodes = New Dictionary(Of String, TreeNode)
    End Sub

    Public Property Key As String
    Public Property ChildNodes As Dictionary(Of String, TreeNode)

End Class

Private Tree As New Dictionary(Of String, TreeNode)

Sub BuildTree()

    For Each Link In Links

        If Uri.IsWellFormedUriString(Link, UriKind.Absolute) Then

            Dim Url As New Uri(Link)
            Dim Domain As String

            If Url.HostNameType = UriHostNameType.Dns Then

                Domain = Url.Host.ToLower.Replace("www.", "")

                Dim DomainParts() As String = Domain.Split(CChar("."))

                'lets start from TLD
                For Each Key In DomainParts.Reverse

                    'dont konw how to populate tree

                Next

            End If

        End If

    Next

End Sub

Function TreeLookup(Link As String) As Boolean

    Dim Url As New Uri(Link)
    Dim Domain As String
    Dim IsMatch As Boolean = False

    If Url.HostNameType = UriHostNameType.Dns Then

        Domain = Url.Host.ToLower.Replace("www.", "")

        Dim DomainParts() As String = Domain.Split(CChar("."))
        Dim DomainPartsCount As Integer = DomainParts.Length
        Dim Index As Integer = 0


        For Each Key In DomainParts

            Index += 1

            'use recursive loop to see if 

            'returns true if directory contains key and we have reached to the last part of the domain name
            If Index = DomainPartsCount Then

                IsMatch = True
                Exit For

            End If

        Next

    End If

    Return IsMatch

End Function
2

There are 2 best solutions below

14
On

You may want to create a dictionary of dictionaries of hashmaps. The first dictionary can contain all the TLD entries paired with a dictionary of all the second-level domains that have that TLD. Then each second-level domain can contain a hashmap of all the lower level domains it contains. Each entry will also have a flag to indicate if that entry is actually in the database or just a placeholder for lower level entries. Like with the short list you used, .com is not actually in the list, but would still be an entry in the TLD so give access to stackoverflow.com and wikipedia.com (which itself would be a placeholder for ru.wikipedia.com). A lookup would then just start with the URLs TLD, then second and finally lower level if it needs to go that deep.

I hope I understood your dilemma correctly and adequately explained my idea.

Edit: lowest level need only be a hashmap.


You will want to add an indicator in your Tree node to indicate if that node is a matched key or simply a stepping stone to get to the secondary/lower level domains.

To add the domain, you can do something like this (this could be made recursive if you like, but with only three levels its not a big deal):

// TLD, SLD and LLD are the three levels of the current domain you are adding into the tree
if Tree does not contain the TLD
    Add TLD to the Tree with a new TreeNode

if SLD does not exist for the current domain
    Mark the Tree at TLD as a match
else   
    if Tree[TLD] does not contain the SLD
        Add SLD to the Tree[TLD] Node

    if LLD does not exist for the current domain
        Mark the Tree[TLD] at SLD as a match
    else   
        if Tree[TLD][SLD] does not contain the LLD
            Add LLD to the Tree[TLD][SLD] Node
            // Don't really need to mark the node
            // as every LLD would be a match
            // Probably would need to mark if made recursive

To Look up a domain (again, could be made recursive):

// TLD, SLD and LLD are for the domain you looking for
if Tree does not contain TLD
    you are done, no match
else
    if Tree[TLD] is marked
        done, match
    else
        if Tree[TLD] does not contain SLD
            done, no match
        else
            if Tree[TLD][SLD] is marked
                done, match
            else
                if Tree[TLD][SLD] contains LLD
                    done, match
                    // would need to check if the node
                    // is marked if made recursive
0
On

When you store items in your database, make each part of the URL have their own column. So the TLD, domain, and subdomain would each be their own column.

create table MyList
(
    [TLD] nvarchar(10) not null,
    [Domain] nvarchar(50) not null,
    [Subdomain] nvarchar(50) not null,

    unique ([TLD], [Domain], [Subdomain]) 
    --this means that you can't add the same data twice
)

Now use SQL to get the data you want.

select *
from MyList
where [TDL] = '.com'

This is the most efficient way to solve your problem, because the data will never leave your database before it will be filtered.

As for the reason for the table, read up on Database Normalization

You will have to do some data conversion if you are only storing your urls in a single column.