We have a glossary with up to 2000 terms, where each term may consist of one, two or three words (separated by whitespace or a dash).
Now we are looking for a solution for highlighting all terms inside a (longer) HTML document (up to 100 KB of HTML markup) in order to generate a static HTML page with the highlighted terms.
The constraints for a working solution are a large number of glossary terms and long HTML documents. What would be the blueprint for an efficient solution (within Python)?
Right now I am thinking about parsing the HTML document using lxml, iterating over all text nodes and then matching the contents within each text node against all glossary terms.
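A minimal sketch of that idea, under a few assumptions: all terms are combined into one compiled regex (longest first, so a three-word term wins over a shorter term it contains), and every text node is rewritten in a single pass instead of being searched once per term. To keep the example self-contained it walks the document with the standard library's `html.parser` rather than lxml; with lxml the same pattern would be applied to each node's text. The names `compile_glossary`, `GlossaryHighlighter`, `highlight` and the `glossary` CSS class are made up for illustration, and the glossary is assumed to be non-empty.

```python
import re
from html.parser import HTMLParser


def compile_glossary(terms):
    """Combine all glossary terms into one case-insensitive regex."""
    parts = []
    # Longest terms first, so multi-word terms are tried before shorter ones.
    for term in sorted(set(terms), key=len, reverse=True):
        words = [w for w in re.split(r"[\s-]+", term) if w]
        if words:
            # Words of a term may be separated by whitespace or a dash.
            parts.append(r"[\s-]+".join(re.escape(w) for w in words))
    # Lookarounds keep matches from starting or ending inside a longer word.
    return re.compile(r"(?<!\w)(?:" + "|".join(parts) + r")(?!\w)", re.IGNORECASE)


class GlossaryHighlighter(HTMLParser):
    """Re-emit the markup unchanged and wrap glossary terms in text nodes."""

    SKIP = {"script", "style"}  # never touch text inside these elements

    def __init__(self, pattern):
        super().__init__(convert_charrefs=False)  # keep entities as written
        self.pattern = pattern
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth:
            self.out.append(data)
        else:
            # Wrap every match; \g<0> keeps the original spelling and case.
            self.out.append(
                self.pattern.sub(r'<span class="glossary">\g<0></span>', data)
            )

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def highlight(html, pattern):
    """Return the HTML with glossary terms wrapped in <span> elements."""
    parser = GlossaryHighlighter(pattern)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Compile the pattern once with `compile_glossary(terms)` and reuse it for every document via `highlight(html, pattern)`. Attribute values and the contents of `script` and `style` elements are left alone, because only text nodes go through the regex.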
Client-side (browser) highlighting on the fly is not an option, since IE will complain about long-running scripts and abort with a script timeout, which makes it unusable in production.
Any better idea?
I think highlighting with client-side JavaScript is the best option. It saves your server processing time and bandwidth and, more importantly, keeps the HTML clean and usable for those who don't need the extra markup, for example when printing or converting to other formats.
To avoid timeouts, split the job into chunks and process them one at a time, scheduling each chunk with setTimeout so the browser regains control between chunks. Here's an example of this approach:
Use it like this:
Let me know if you have questions.