Using regex in Classic ASP to get content of specific elements


I am loading some remote content and need to use regex to isolate the content of some tags.

 On Error Resume Next ' without this, the err.number check below can never fire
 Set xmlhttp = CreateObject("MSXML2.ServerXMLHTTP")
 xmlhttp.open "GET", url, False
 xmlhttp.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
 xmlhttp.setRequestHeader "Accept-Language", "en-us"
 xmlhttp.send "x=hello"
 status = xmlhttp.status
 If Err.Number <> 0 Or status <> 200 Then
     If status = 404 Then
         Response.Write "[EFERROR]Page does not exist (404)."
     ElseIf status = 401 Then
         Response.Write "[EFERROR]Access denied (401)."
     ElseIf status >= 500 And status < 600 Then
         Response.Write "[EFERROR]500 Internal Server Error on remote site."
     Else
         Response.Write "[EFERROR]Server is down or does not exist."
     End If
 Else
     data = xmlhttp.responseText
 End If

I basically need to get the content of the <title>Here is the title</title> tag, as well as the meta description, the meta keywords and some selected Open Graph meta data.

And finally I need to get the content of the first <h1>Heading</h1> and <p>Paragraph</p>

How can I parse the HTML data to get these things? Should I use regex?


There are 3 best solutions below


Use the Mid function combined with the InStr function. I built a function that finds the position of the opening and closing tags with InStr and then uses Mid to return the text wrapped between them:

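 Function GetInnerData(Data, TagOpen, TagClose)
   Dim OpenPos, ClosePos, StartPos
   OpenPos = InStr(1, Data, TagOpen, 1) ' 1 = case-insensitive compare
   If OpenPos > 0 Then
     StartPos = OpenPos + Len(TagOpen)
     ' Search for the closing tag only after the opening tag
     ClosePos = InStr(StartPos, Data, TagClose, 1)
     If ClosePos > 0 Then GetInnerData = Trim(Mid(Data, StartPos, ClosePos - StartPos))
   End If
 End Function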

When you run the function like this, it returns My Title:

<%=GetInnerData("any text <title>My Title</title> any text","<title>","</title>")%>

And in your case, You would do it like this:

 TitleData = GetInnerData(data,"<title>","</title>")

This will get the content of your <title> tag. Or:

 H1Data = GetInnerData(data,"<h1>","</h1>")

This will get the content in your <h1> tag.

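The InStr function returns the position of the first occurrence found in the data, so this function returns the content of the first matching tag, which is what you need. Note that the tag strings must match exactly: an opening tag that carries attributes, such as <h1 class="big">, will not be found by "<h1>".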
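
If the tags in the remote page carry attributes, the same InStr/Mid idea can be stretched a little further. The following is only a sketch, not part of the function above: GetTagContent is a hypothetical name, and it assumes that no attribute value contains a ">" character and that the element does not contain a nested element of the same name.

```vbscript
' Hypothetical helper: returns the inner text of the first <TagName ...> element,
' whether or not the opening tag carries attributes.
Function GetTagContent(Data, TagName)
    Dim OpenPos, StartPos, ClosePos, NextChar
    GetTagContent = ""
    OpenPos = InStr(1, Data, "<" & TagName, vbTextCompare)
    Do While OpenPos > 0
        ' Make sure "<p" is not just the start of "<pre" or "<param"
        NextChar = Mid(Data, OpenPos + Len(TagName) + 1, 1)
        If NextChar = ">" Or NextChar = " " Or NextChar = vbTab _
           Or NextChar = vbCr Or NextChar = vbLf Then Exit Do
        OpenPos = InStr(OpenPos + 1, Data, "<" & TagName, vbTextCompare)
    Loop
    If OpenPos = 0 Then Exit Function
    ' The content starts after the ">" that ends the opening tag
    StartPos = InStr(OpenPos, Data, ">")
    If StartPos = 0 Then Exit Function
    StartPos = StartPos + 1
    ClosePos = InStr(StartPos, Data, "</" & TagName & ">", vbTextCompare)
    If ClosePos > 0 Then GetTagContent = Trim(Mid(Data, StartPos, ClosePos - StartPos))
End Function
```

It would be called with the bare tag name, for example H1Data = GetTagContent(data, "h1"). Any markup nested inside the element is returned as-is.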


You may be able to use the .responseXML property to retrieve the content you want without using regex. Because you are looking for data inside <title>, <h1> and <p> tags, the document returned is presumably HTML. If that HTML is also well-formed XML (XHTML, for example), the response may already have been parsed for you and be accessible through .responseXML as soon as the request completes. For ordinary HTML that is not well-formed XML, this approach will return no result.

So you could try this:

Dim objData
Set objData = xmlhttp.responseXML.selectSingleNode("//*[local-name() = 'title']")

If objData Is Nothing Then
    Response.Write "# no result #<br />"
Else
    Response.Write "title: " & objData.Text & "<br />"
End If

Note though, that this XPath expression may not be the most efficient way to query an XML document (in case you want to process large amounts of data).
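
If the remote page is XHTML, a more direct query is possible by binding a prefix to the XHTML namespace. This is a sketch under assumptions: the page declares the standard XHTML namespace, the prefix x is an arbitrary choice, and xmlhttp is the request object from the question. As far as I know, documents created by MSXML 3.0 default to the older XSLPattern selection language, so the sketch switches to XPath explicitly.

```vbscript
Dim doc, titleNode
Set doc = xmlhttp.responseXML

' Switch from the legacy XSLPattern syntax to standard XPath
doc.setProperty "SelectionLanguage", "XPath"
' Bind the prefix "x" (arbitrary) to the XHTML namespace
doc.setProperty "SelectionNamespaces", "xmlns:x='http://www.w3.org/1999/xhtml'"

' A direct path avoids scanning every element in the document
Set titleNode = doc.selectSingleNode("/x:html/x:head/x:title")

If titleNode Is Nothing Then
    Response.Write "# no result #<br />"
Else
    Response.Write "title: " & Server.HTMLEncode(titleNode.text) & "<br />"
End If
```

If the page is plain HTML without a namespace declaration, the path would be /html/head/title instead, and the SelectionNamespaces line is not needed.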


I actually used this solution in the end, as it also solves the problem of tags that have class names (or other attributes) in the code.

Function GetFirstMatch(PatternToMatch, StringToSearch)
    Dim regEx, CurrentMatch, CurrentMatches

    Set regEx = New RegExp
    regEx.Pattern = PatternToMatch
    regEx.IgnoreCase = True
    regEx.Global = True
    regEx.MultiLine = True
    Set CurrentMatches = regEx.Execute(StringToSearch)

    GetFirstMatch = ""
    If CurrentMatches.Count >= 1 Then
        Set CurrentMatch = CurrentMatches(0)
        If CurrentMatch.SubMatches.Count >= 1 Then
            GetFirstMatch = CurrentMatch.SubMatches(0)
        End If
    End If
    Set regEx = Nothing
End Function

    ' clean_str is a custom clean-up helper that is not shown here
    title = clean_str(GetFirstMatch("<title[^>]*>([^<]+)</title>", data))
    firstpara = clean_str(GetFirstMatch("<p[^>]*>([^<]+)</p>", data))
    firsth1 = clean_str(GetFirstMatch("<h1[^>]*>([^<]+)</h1>", data))
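
The question also asked for the meta description, keywords and Open Graph data, which the patterns above do not cover. The same GetFirstMatch function could probably be reused for them. The sketch below is an assumption-laden example rather than a tested solution: GetMetaContent is a hypothetical helper, it expects double-quoted attributes with name (or property) appearing before content, and it does not escape regex metacharacters in the value it is given.

```vbscript
' Hypothetical helper built on GetFirstMatch from the code above.
' Assumes markup such as <meta name="description" content="...">
' with double quotes and the name/property attribute before content.
Function GetMetaContent(AttrName, AttrValue, StringToSearch)
    Dim pattern
    ' Resulting regex, e.g.: <meta[^>]*name="description"[^>]*content="([^"]*)"
    pattern = "<meta[^>]*" & AttrName & "=""" & AttrValue & """[^>]*content=""([^""]*)"""
    GetMetaContent = GetFirstMatch(pattern, StringToSearch)
End Function

description = GetMetaContent("name", "description", data)
keywords = GetMetaContent("name", "keywords", data)
ogTitle = GetMetaContent("property", "og:title", data)
```

Pages that put content before name, or that use single quotes, would need a second pattern. Also note that the ([^<]+) patterns used for title, p and h1 return an empty string when the element contains nested tags such as <b> or <a>.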