Cleanup HTML using lxml and XPath in Python


I'm learning Python and the lxml toolkit. I need to process multiple .htm files in a local directory (recursively) and remove unwanted tags including their content (divs with IDs "box", "columnRight", "adbox", "footer", the div with class="box", plus all stylesheets and scripts). I can't figure out how to do this. I have code that lists all .htm files in the directory:

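#!/usr/bin/python
import os
from lxml import html
import lxml.html as lh

path = '/path/to/directory'
for root, dirs, files in os.walk(path):
    for name in files:
        if name.endswith(".htm"):
            # os.walk yields bare file names, so join each with its directory
            filename = os.path.join(root, name)
            doc = lh.parse(filename)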

So I need to add a part that builds a tree, processes the HTML and removes the unnecessary divs, like

for element in tree.xpath('//div[@id="header"]'):
    element.getparent().remove(element) 

How do I adjust the code for this?

html page example.

There is 1 best solution below

Jack Fleeting On

It's hard to tell without seeing your actual files, but try the following and see if it works:

First, you don't need both

from lxml import html
import lxml.html as lh

So you can drop the first. Then

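for root, dirs, files in os.walk(path):
    for name in files:
        if name.endswith(".htm"):
            # os.walk yields bare file names, so join each with its directory
            filename = os.path.join(root, name)
            tree = lh.parse(filename)
            # named doc_root so it does not shadow the root from os.walk
            doc_root = tree.getroot()
            for element in doc_root.xpath('//div[@id="header"]'):
                element.getparent().remove(element)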
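
To cover everything listed in the question (the four div ids, the div with class "box", plus scripts and stylesheets) and to save the result, the loop can be extended roughly as follows. This is only a sketch under some assumptions: the names clean_tree, clean_directory and UNWANTED are made up for illustration, class="box" is matched as an exact attribute value, "stylesheets" is taken to mean both style elements and link rel="stylesheet" elements, and each file is overwritten in place. It uses drop_tree() from lxml.html instead of getparent().remove(), because remove() also discards any text that directly follows the removed element.

```python
import os
import lxml.html as lh

# One XPath union for everything the question wants gone. The ids and the
# class come from the question; the exact @class match is an assumption.
UNWANTED = (
    '//div[@id="box" or @id="columnRight" or @id="adbox" or @id="footer"]'
    ' | //div[@class="box"]'
    ' | //script | //style | //link[@rel="stylesheet"]'
)


def clean_tree(doc_root):
    """Remove the unwanted elements from doc_root in place."""
    for element in doc_root.xpath(UNWANTED):
        # drop_tree() removes the element with its children and text,
        # but keeps the tail text that follows the element.
        element.drop_tree()
    return doc_root


def clean_directory(path):
    """Walk path recursively and clean every .htm file (overwrites files)."""
    for dirpath, dirnames, filenames in os.walk(path):
        for name in filenames:
            if not name.endswith(".htm"):
                continue
            filename = os.path.join(dirpath, name)
            tree = lh.parse(filename)
            doc_root = tree.getroot()
            if doc_root is None:
                # nothing could be parsed, e.g. an empty file
                continue
            clean_tree(doc_root)
            tree.write(filename, method="html")
```

Calling clean_directory('/path/to/directory') would then process the whole tree. Since the files are overwritten, it is probably worth trying it on a copy of the directory first.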