Is there a pure-Python tool to take some HTML and truncate it as close to a given length as possible, but make sure the resulting snippet is well-formed? For example, given this HTML:
<h1>This is a header</h1>
<p>This is a paragraph</p>
it would not produce:
<h1>This is a hea
but:
<h1>This is a header</h1>
or at least:
<h1>This is a hea</h1>
I can't find one that works, though I found one that relies on pullparser, which is both obsolete and dead.
I don't think you need a full-fledged parser - you only need to tokenize the input string into one of:
Once you have a stream of tokens like that, it's easy to use a stack to keep track of what tags need closing. I actually ran into this problem a while ago and wrote a small library to do this:
https://github.com/eentzel/htmltruncate.py
It's worked well for me and handles most of the corner cases, including arbitrarily nested markup, counting character entities as a single character, and returning an error on malformed markup.
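To make the tokenize-and-stack idea concrete, here is a minimal sketch. It is not the code from htmltruncate.py; the names `truncate_html`, `TOKEN_RE` and `VOID_TAGS` are made up for illustration, and I'm assuming a simple regex tokenizer is good enough (it does not understand comments, CDATA, or `>` inside attribute values).

```python
import re

# Elements that never take a closing tag, so they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

# A token is a tag, a character entity, or any other single character.
TOKEN_RE = re.compile(
    r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(/?)>|&#?\w+;|.", re.DOTALL)


def truncate_html(html, limit):
    """Keep at most `limit` visible characters, then close any open tags."""
    out = []
    open_tags = []  # stack of tag names that still need closing
    visible = 0     # text characters emitted so far (an entity counts as one)
    for match in TOKEN_RE.finditer(html):
        token = match.group(0)
        closing, name, self_closing = match.groups()
        if name is None:
            # Plain character or entity: stop once the budget is used up.
            if visible >= limit:
                break
            visible += 1
        elif closing:
            # A closing tag must match the most recently opened tag.
            if not open_tags or open_tags[-1] != name.lower():
                raise ValueError("unexpected closing tag: %s" % token)
            open_tags.pop()
        else:
            # Don't open new elements after the budget is spent.
            if visible >= limit:
                break
            if not self_closing and name.lower() not in VOID_TAGS:
                open_tags.append(name.lower())
        out.append(token)
    # Close whatever is still open, innermost first.
    out.extend("</%s>" % tag for tag in reversed(open_tags))
    return "".join(out)
```

With the HTML from the question, `truncate_html(html, 13)` returns `<h1>This is a hea</h1>`, and a mismatched closing tag raises `ValueError` instead of producing broken output.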
It will produce:
<h1>This is a hea</h1>
on your example. This could perhaps be changed, but it's hard in the general case - what if you're trying to truncate to 10 characters, but the <h1> tag isn't closed for another, say, 300 characters?