Method to parse multiple .html files and remove part of html code


What is the proper way to parse multiple .html files in a directory, then find and remove a specific block of HTML from each of them? For example, I need to remove the following HTML from all files:

    <div class="box">
        <h2>Book Search</h2>
        <div id="search">
            <form action="http://www.biology35.com/search.php" method="post">
                <input type="text" name="searchfor" class="txtField" />
                <input type="image" src="new/images/btn-go.png" name="Submit" value="Submit" class="button" />
                <div class="clear"><!-- --></div>
            </form>
        </div>
    </div>

I use the Geany 1.29 editor on Debian. A regex is probably not suitable for this because the block contains nested div elements. Would a shell script or Python be a better fit?

Answer by seagulf:

You can use the htql module for Python. For example, on a single HTML string:

html = """
something before
    <div class="box">
        <h2>Book Search</h2>
        <div id="search">
            <form action="http://www.biology35.com/search.php" method="post">
                <input type="text" name="searchfor" class="txtField" />
                <input type="image" src="new/images/btn-go.png" name="Submit" value="Submit" class="button" />
                <div class="clear"><!-- --></div>
            </form>
        </div>
    </div>

html after
"""

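import htql

# &delete drops the matched <div class='box'> element from the page;
# [0][0] takes the first field of the first result, i.e. the remaining HTML.
x = htql.query(html, "<div norecur (class='box') > &delete ")[0][0]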

You get:

>>> x
'\nsomething before\n    \n\nhtml after\n'