Comparing two HTML files and returning the HTML tags that differ between the two

I am writing a web monitoring script in Python that looks at an archived version of a page, compares it to the current online version, and notifies me if there are any changes. I have the basics of this working, but I am running into a problem with sites that have a dynamic attribute in a web form. The page in general hasn't changed, but a hidden attribute in the form has, which triggers a notification.

Using Python's difflib on two HTML files with diff = difflib.unified_diff(content1, content2), I am able to get the truncated output below.

-<input type='hidden' value='contact-us' name='ufo-form-pagename' id='ufo-form-pagename'/><input type='hidden' value='927eea55b8e87e961314033fce84de4a1418504077' name='ufo-sign' id='ufo-sign'/>

+<input type='hidden' value='contact-us' name='ufo-form-pagename' id='ufo-form-pagename'/><input type='hidden' value='1ccb910cbb9dc0d6f6dd5ed99212df741418800872' name='ufo-sign' id='ufo-sign'/>

I would like to 'read' through this output and return the HTML attributes that do not have the same value, in this case value='927eea55b8e87e961314033fce84de4a1418504077' and value='1ccb910cbb9dc0d6f6dd5ed99212df741418800872'.

How would I go about doing this?

There is 1 best solution below

"I am writing a web monitoring script using Python that will look at an archived version of the page, compare it to the current, online version, and notify me if there are any changes."

Didn't you just answer your own question? If there's a diff then the file changed. :)

It sounds like what you want to do is ignore certain classes of changes. If you're not interested in properly parsing the HTML, a naive hack would be to convert all whitespace to newlines and then run your diff. In this case the only difference you would see would be value='927eea55...', which you could have a regex pick up and ignore.
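A rough sketch of that hack, under a few assumptions: splitting on whitespace into a list of tokens stands in for "one token per line", difflib.SequenceMatcher is used in place of parsing unified_diff text, and the VOLATILE pattern and both function names are made up for illustration. You would need to adapt the pattern to whatever changes on your pages.

```python
import difflib
import re

# Hypothetical pattern: a value attribute holding a long lowercase hex/digit token.
VOLATILE = re.compile(r"value='[0-9a-f]{32,}'")


def differing_tokens(old_html, new_html):
    """Return (removed, added) whitespace-separated tokens that differ."""
    old, new = old_html.split(), new_html.split()
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    removed, added = [], []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            removed.extend(old[i1:i2])
            added.extend(new[j1:j2])
    return removed, added


def is_real_change(old_html, new_html):
    """True if any differing token is something other than a volatile value."""
    removed, added = differing_tokens(old_html, new_html)
    return any(not VOLATILE.fullmatch(token) for token in removed + added)
```

With the two lines from the question, differing_tokens should give back just the two value='...' tokens, and is_real_change should report no real change because both match the ignore pattern. Note that this only works when the attributes are separated by whitespace, as they are here.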

If you want to properly parse the HTML and do more intelligent differencing, LMGTFY:
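For the parsed route, here is a minimal sketch using only the standard library's html.parser. The class and function names are made up for illustration, and it assumes both pages have the same tags and attributes in the same order, since it pairs them up positionally. If the structure itself changes, you would want something smarter.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Record (tag, attribute name, value) for every start tag, in document order."""

    def __init__(self):
        super().__init__()
        self.attributes = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <input ... />.
        for name, value in attrs:
            self.attributes.append((tag, name, value))


def collect_attributes(html):
    parser = AttributeCollector()
    parser.feed(html)
    parser.close()
    return parser.attributes


def changed_attributes(old_html, new_html):
    """Pair attributes positionally and return the pairs that differ."""
    old = collect_attributes(old_html)
    new = collect_attributes(new_html)
    return [(o, n) for o, n in zip(old, new) if o != n]
```

On the snippet from the question this should return a single pair, the old and new value attribute of the ufo-sign input, which you can then check against a list of attributes you want to ignore.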