Web Crawler in VBA

I am trying to program a web crawler using Visual Basic. I have a list of links stored in column 1 of an Excel file. The macro should open each link and add certain information from the website to the Excel file. Here is the first link (stored in cell A2).

The macro should identify the name of the hotel and insert it into column 2 (B2), the rating into column 3 (C2), and the address into column 4 (D2). This process could then be repeated with a loop for all the other links (all the websites have the same structure).

My code so far (I have not added the loop yet):

Sub Hoteldetails()

Dim IEexp As Object

Set IEexp = CreateObject("InternetExplorer.Application")
IEexp.Visible = True

Range("A2").Select
Selection.Hyperlinks(1).Follow NewWindow:=False, AddHistory:=True

End Sub

How can I "select" the specific data I want and insert it into the Excel file? I tried to record a macro via "Add Data", but was not able to import the data from the website. I also tried various example codes, but they did not work for my specific website.

Thanks a lot for any assistance!

There are 2 best solutions below

Answer 1

You have several options:

Option 1: IEObject

You can use the getElementBy... methods of the IE document object and then extract the data you need with string manipulation. There are two options for the string extraction:

  1. Extract a top-level element by name or by id, then use string-manipulation functions such as Mid, InStr, Left and Right
  2. Use a Regex (the VBScript.RegExp object available from VBA) to extract the data (recommended); a sketch follows below
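
For illustration, here is a minimal sketch of the Regex route. It assumes the hotel name sits in an element with id "HEADING" (as on the TripAdvisor page used elsewhere in this thread); the element id and the pattern would need adjusting to the real page structure:

Sub GetHotelNameWithRegex()
    Dim ie As Object, headingHtml As String, re As Object

    ' Late-bound Internet Explorer instance
    Set ie = CreateObject("InternetExplorer.Application")
    ie.Visible = False
    ie.Navigate Range("A2").Hyperlinks(1).Address

    ' Wait until the page has finished loading
    Do While ie.Busy Or ie.ReadyState <> 4
        DoEvents
    Loop

    ' Grab the raw HTML of the heading element ("HEADING" is an assumption)
    headingHtml = ie.Document.getElementById("HEADING").outerHTML

    ' Strip the markup with a VBScript regular expression
    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = "<[^>]+>"
    re.Global = True
    Range("B2").Value = Trim(re.Replace(headingHtml, ""))

    ie.Quit
End Sub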

Option 2: Scrape HTML Add-In

Some time ago I developed an add-in for Excel that allows you to easily scrape HTML data from within an Excel formula. The process is similar to the above, as you still need to create a suitable Regex. Below is an example for TripAdvisor.

The formula in B2 looks like this (A2 is the link, and the second argument is the Regex):

=GetElementByRegex(A2;"<h1 id=""HEADING"".*?>(?:(?:.|\n)*?)</div>((?:.|\n)*?)</h1>")

You can download the AddIn here: http://www.analystcave.com/excel-tools/excel-scrape-html-add/

Answer 2

tl;dr;

I am not going to do all the work for you, but this is fairly easy if the pages have the same structure.

You can issue a browserless XMLHTTP request to get a fast response, and then select the items of interest by id or by class name plus a collection index.

Here is an example using the link you provided; a sketch of how to adapt it into a loop over all links follows after the references at the end.


[Screenshot: webpage view]

[Screenshot: code output]


VBA:

Option Explicit
Public Sub GetInfo()
    Dim sResponse As String, HTML As New HTMLDocument

    ' Request the page without opening a browser
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.tripadvisor.co.uk/Hotel_Review-g198832-d236315-Reviews-Grand_Hotel_Kronenhof-Pontresina_Engadin_St_Moritz_Canton_of_Graubunden_Swiss_Alps.html", False
        .send
        sResponse = StrConv(.responseBody, vbUnicode)
    End With

    ' Discard anything before the DOCTYPE declaration
    sResponse = Mid$(sResponse, InStr(1, sResponse, "<!DOCTYPE "))

    ' Load the response into an HTML document and pick out the elements of interest
    With HTML
       .body.innerHTML = sResponse
       Debug.Print "HotelName: " & .getElementById("HEADING").innerText
       Debug.Print "Address: " & .getElementsByClassName("detail")(0).innerText
       Debug.Print "Rating: " & .getElementsByClassName("overallRating")(0).innerText
    End With
End Sub

References:

VBE > Tools > References > HTML Object Library
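
For completeness, here is a sketch of how the routine above might be adapted into the loop the question asks for, writing the hotel name, rating and address into columns B, C and D next to each link. It assumes the URLs sit as plain text in column A from row 2 down; if they are stored as hyperlink objects, use Cells(r, "A").Hyperlinks(1).Address instead.

Public Sub GetAllHotelDetails()
    Dim HTML As New HTMLDocument
    Dim http As Object, ws As Worksheet
    Dim sResponse As String, r As Long, lastRow As Long

    Set ws = ThisWorkbook.Worksheets(1)
    Set http = CreateObject("MSXML2.XMLHTTP")
    lastRow = ws.Cells(ws.Rows.Count, "A").End(xlUp).Row

    For r = 2 To lastRow
        ' Request the page for the link in column A of this row
        With http
            .Open "GET", ws.Cells(r, "A").Value, False
            .send
            sResponse = StrConv(.responseBody, vbUnicode)
        End With
        sResponse = Mid$(sResponse, InStr(1, sResponse, "<!DOCTYPE "))

        ' Parse the HTML and write the three fields next to the link
        With HTML
            .body.innerHTML = sResponse
            ws.Cells(r, "B").Value = .getElementById("HEADING").innerText
            ws.Cells(r, "C").Value = .getElementsByClassName("overallRating")(0).innerText
            ws.Cells(r, "D").Value = .getElementsByClassName("detail")(0).innerText
        End With
    Next r
End Sub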