I am writing a web monitoring script in Python that looks at an archived version of a page, compares it to the current online version, and notifies me if there are any changes. I have the basics of this working, but I am running into a problem with sites that have a dynamic attribute in a web form. The page in general hasn't changed, but a hidden attribute in the form has, which triggers a notification.
Using Python's difflib on two HTML files with diff = difflib.unified_diff(content1, content2), I am able to get the truncated output below.
-<input type='hidden' value='contact-us' name='ufo-form-pagename' id='ufo-form-pagename'/><input type='hidden' value='927eea55b8e87e961314033fce84de4a1418504077' name='ufo-sign' id='ufo-sign'/>
+<input type='hidden' value='contact-us' name='ufo-form-pagename' id='ufo-form-pagename'/><input type='hidden' value='1ccb910cbb9dc0d6f6dd5ed99212df741418800872' name='ufo-sign' id='ufo-sign'/>
I would like to 'read' through this output and return the HTML attributes that do not have the same value, in this case value='927eea55b8e87e961314033fce84de4a1418504077' and value='1ccb910cbb9dc0d6f6dd5ed99212df741418800872'.
How would I go about doing this?
I am writing a web monitoring script in Python that looks at an archived version of a page, compares it to the current online version, and notifies me if there are any changes.
Didn't you just answer your own question? If there's a diff then the file changed. :)
It sounds like what you want to do is ignore certain classes of changes. If you're not interested in properly parsing the HTML, a naive hack would be to convert all whitespace to newlines and then run your diff. In this case the only difference you would see would be
value='927eea55...'
which you could have a regex pick up and ignore.
If you want to properly parse the HTML and do more intelligent differencing, LMGTFY:
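For the whitespace trick, here is a rough sketch of how it might look. Instead of literally converting whitespace to newlines and parsing the +/- lines of a unified diff, it splits both documents on whitespace and compares the token lists with difflib.SequenceMatcher, which comes to the same thing. The names changed_tokens, has_real_change and VOLATILE are made up for illustration, and the regex assumes the volatile attribute always looks like value='<long hex string>', which may not hold for other sites.

```python
import difflib
import re

# Assumption: tokens to ignore look like value='<32 or more hex chars>'.
# Adjust this pattern for the pages you actually monitor.
VOLATILE = re.compile(r"""value=['"][0-9a-f]{32,}['"]""")


def changed_tokens(old_html, new_html):
    """Return (removed, added) lists of whitespace-separated tokens that differ."""
    old_tokens = old_html.split()
    new_tokens = new_html.split()
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    removed, added = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # For 'replace', 'delete' and 'insert', collect both sides
            # (one of the slices is simply empty for delete/insert).
            removed.extend(old_tokens[i1:i2])
            added.extend(new_tokens[j1:j2])
    return removed, added


def has_real_change(old_html, new_html):
    """True if any differing token is something other than a volatile value."""
    removed, added = changed_tokens(old_html, new_html)
    return any(not VOLATILE.fullmatch(tok) for tok in removed + added)
```

changed_tokens(archived, current) gives you the attributes that differ, which is what the question asks for, and has_real_change(archived, current) can decide whether to send the notification. Note that splitting on whitespace will break up attribute values that themselves contain spaces, so treat this as a hack rather than a parser.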