I'm currently using a fusion of urllib2, pyquery, and json to scrape a site, and now I find that I need to extract some data from JavaScript. One thought would be to use a JavaScript engine (like V8), but that seems like overkill for what I need. I would use regular expressions, but the expression for this seems way to complex.
JavaScript:
(function(){DOM.appendContent(this, HTML("<html>"));;})
I need to extract the <html>
, but I'm not entirely sure how to do so. The <html>
itself can contain basically every character under the sun, so [^"]
won't work.
Any thoughts?
Why regex? Can't you just use two substrings as you know how many characters you want to trim off the beginning and end?
As well as being quicker than a regex, it then doesn't matter if quotes inside
<html>
are escaped or not.