Canonical URL Extraction

318 Views Asked by At

I'm very much new to the VBA programming. I have a list of keywords that are to be sourced in a certain website and I want their sourced links as the end result. I want this to be happened for all the keywords in my list using a VBA program. I tried this with below code, but I'm not getting the desired result. Please advise where am I going wrong.

Sub GetCanonicalURL()
    On Error Resume Next

    Dim ie As New SHDocVw.InternetExplorer
    Dim mykeyword As String
    Dim result As String
    Dim lastrow As Integer
    Dim mylinks As Object
    Dim mylink As Object

    lastrow = Sheet1.Cells(Rows.Count, "A").End(xlUp).Row
    For i = 2 To lastrow
        mykeyword = Sheet1.Cells(i, 1).Value
        ie.Visible = True
        ie.navigate "https://en.wikipedia.org/wiki/Home_page"

        Do While ie.readyState <> READYSTATE_COMPLETE
        Loop

        ie.document.forms("searchform").elements("searchInput").Value = mykeyword
        ie.document.forms("searchform").elements("searchButton").Click
        result = ie.document.body.innerHTML
        Set HTML = CreateObject("htmlfile")
        HTML.body.innerHTML = result
        Set mylinks = HTML.getElementsByTagName("link")

        For Each mylink In mylinks
            If mylink.hasAttribute("canonical") Then
                Sheet1.Cells(i, "B").Value = mylink

        Next mylink

        If i = lastrow Then
            ie.Quit
        End If

    Next i

End Sub

In the above code, I'm stuck in that empty part of the program (yellow color highlighted - see screenshots), confused on which strategy to use to get the canonical URL of each sourced websites. Please see the screenshots of the images attached below.

canonical href link:
[canonical href link1

Excel sheet where result link to be reflected:
Excel sheet where result link to be reflected

Approach 1(Not so efficient) - Error Approach(Efficient) 2 - Error

1

There are 1 best solutions below

11
QHarr On

There are a number of issues with your code.

  1. You don't declare all your variables
  2. You rely on auto-instantiation of the internet explorer instance
  3. You keep trying to make ie.visible inside the loop as well as navigating to the same homepage. These activities can be done before the loop
  4. You are looking in the wrong place for the element of interest
  5. There are more efficient ways to retrieve the data you want using css selectors to target by attribute = value and querySelector to match only first node. This is more efficient as rel=canonical should only be in the head part of the html and be present only once.

See 2 approaches below to keep the head content. One approach similar to yours. Another is a more efficient approach.


Option 1: Less efficient as involves looping, greater code complexity, multiple node matching, late binding etc.

Option Explicit

Public Sub GetCanonicalURL1()

    Dim ie As SHDocVw.InternetExplorer, html As Object, lastrow As Long
    
    lastrow = Sheet1.Cells(Rows.Count, "A").End(xlUp).Row
    
    Set ie = New SHDocVw.InternetExplorer
    
    ie.Visible = True

    Dim i As Long
    
    For i = 2 To lastrow
        
        Dim mykeyword As String
        
        mykeyword = Sheet1.Cells(i, 1).Value
        
        ie.navigate "https://en.wikipedia.org/w/index.php?search=" & mykeyword
        
        Do While ie.readyState <> READYSTATE_COMPLETE Or ie.Busy: DoEvents: Loop

        Dim head As String, headAndBody As String
        
        head = ie.document.head.innerHTML        'canonical should be in head and only 1.
        headAndBody = ie.document.DocumentElement.innerHTML 'in case you wanted to know how to get both
        Set html = CreateObject("htmlfile")
        html.write head

        Dim mylinks As Object, mylink As Object

        Set mylinks = html.getElementsByTagName("link")
        'With loop. Inefficient.
        For Each mylink In mylinks
            If mylink.hasAttribute("rel") Then
                Dim linkText As String
                linkText = Trim$(mylink.getAttribute("rel"))
                If linkText = "canonical" Then
                    Sheet1.Cells(i, "B").Value = mylink.href 'assumes one canonical link
                    Exit For
                End If
            End If
        Next mylink
        
        If i = lastrow Then
            ie.Quit
        End If

    Next i

End Sub

Option 2: More efficient. Early binding, single node matching with optimized css selectors. Targeting only head html.

Option Explicit

Public Sub GetCanonicalURL()

    Dim ie As SHDocVw.InternetExplorer, lastrow As Long

    lastrow = Sheet1.Cells(Rows.Count, "A").End(xlUp).Row
    
    Set ie = New SHDocVw.InternetExplorer
    
    ie.Visible = True

    Dim i As Long, html2 As MSHTML.HTMLDocument
        
    Set html2 = New MSHTML.HTMLDocument
    
    For i = 2 To lastrow
        
        Dim mykeyword As String

        mykeyword = Sheet1.Cells(i, 1).Value
        ie.navigate "https://en.wikipedia.org/w/index.php?search=" & mykeyword

        Do While ie.readyState <> READYSTATE_COMPLETE Or ie.Busy: Loop

        Dim head As String
        
        head = ie.document.head.innerHTML        'canonical should be in head and only 1.
  
        'Alternate without loop. Efficient.
        html2.body.innerHTML = head
        Sheet1.Cells(i, "B").Value = html2.querySelector("[rel=canonical]").href 'seek only 1 match as there should be only 1 and others are ignored in SEO for example.
 
        If i = lastrow Then
            ie.Quit
        End If

    Next i

End Sub