Parse URLs out of an HTML page


I have a string containing an HTML page downloaded via WinHttpReadData; the string is a plain char*.
I've been trying to figure out a way to extract only the URLs that appear on that page. To give you an example, imagine you search Google for the word WinHTTP and are presented with an HTML page full of links. I now need to check each link, extract it, and save it to a file.

I tried searching for HREF, http://, and other keywords and then extracting the string all the way to the closing </a>, but it's not really working; a simplified sketch of what I tried is below. It would also be nice to get the description text out of the link (e.g., from <a href="http://someurl.com/somepage.html">some text</a>, get some text), but that's not as important as the URL itself.
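Here is roughly the kind of scan I attempted (the function name is mine, and it only handles lowercase href with double quotes and no whitespace around '=', which is part of why it breaks on real pages):

    #include <stdio.h>
    #include <string.h>

    /* Naive scan: find each href=" and copy everything up to the
       closing quote. Misses single-quoted, unquoted, and
       mixed-case attributes. */
    static void extract_urls(const char *html, FILE *out)
    {
        const char *p = html;
        while ((p = strstr(p, "href=\"")) != NULL) {
            p += 6;                           /* skip past href=" */
            const char *end = strchr(p, '"'); /* find closing quote */
            if (end == NULL)
                break;                        /* malformed attribute */
            fprintf(out, "%.*s\n", (int)(end - p), p);
            p = end + 1;
        }
    }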

The tricky thing here is that I can't use third-party libraries, since I don't want to have to deal with licenses and the like.

Any ideas on how to do this in C (not C++)? Does WinHTTP itself provide a way to do this?

Thanks for the help


BEST ANSWER

Maybe you should go for the PCRE C API (available on the PCRE site).

The regex you'll need will look something like:

<a.*?href=["'](?<url>.*?)["'].*?>(?<name>.*?)</a>

This should capture the URL and the link text into the named groups <url> and <name>.
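A minimal sketch of driving that regex from C with the classic PCRE 1 API (link with -lpcre; the subject string here is just an illustrative stand-in for your downloaded page):

    #include <stdio.h>
    #include <string.h>
    #include <pcre.h>

    #define OVECCOUNT 30  /* multiple of 3; room for match + groups */

    int main(void)
    {
        const char *pattern =
            "<a.*?href=[\"'](?<url>.*?)[\"'].*?>(?<name>.*?)</a>";
        const char *subject =
            "<a href=\"http://someurl.com/somepage.html\">some text</a>";
        const char *error;
        int erroffset;
        int ovector[OVECCOUNT];

        /* CASELESS so HREF/Href also match; DOTALL so '.' spans newlines */
        pcre *re = pcre_compile(pattern, PCRE_CASELESS | PCRE_DOTALL,
                                &error, &erroffset, NULL);
        if (re == NULL) {
            fprintf(stderr, "compile failed at %d: %s\n", erroffset, error);
            return 1;
        }

        int offset = 0;
        int len = (int)strlen(subject);
        int rc;
        while ((rc = pcre_exec(re, NULL, subject, len, offset, 0,
                               ovector, OVECCOUNT)) > 0) {
            const char *url, *name;
            if (pcre_get_named_substring(re, subject, ovector, rc,
                                         "url", &url) >= 0) {
                if (pcre_get_named_substring(re, subject, ovector, rc,
                                             "name", &name) >= 0) {
                    printf("%s -> %s\n", url, name);
                    pcre_free_substring(name);
                }
                pcre_free_substring(url);
            }
            offset = ovector[1];  /* continue after this match */
        }

        pcre_free(re);
        return 0;
    }

The loop restarts each search at ovector[1] (the end of the previous match), so every anchor on the page is visited in turn.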