We have a glossary with up to 2000 terms (where each glossary term may consist of one, two or three words (either separated with whitespaces or a dash).
Now we are looking for a solution for highlighting all terms inside a (longer) HTML document (up to 100 KB of HTML markup) in order to generate a static HTML page with the highlighted terms.
The constraints for a working solution are: large number of glossary terms and long HTML documents...what would be the blueprint for an efficient solution (within Python).
Right now I am thinking about parsing the HTML document using lxml, iterating over all text nodes and then matching the contents within each text node against all glossary terms.
Client-side (browser) highlighting on the fly is not an option since IE will complain about long running scripts with a script timeout...so unusable for production use.
Any better idea?
How about going through each term in the glossary and then, for each term, using regex to find all occurrences in the HTML? You could replace each of those occurrences with the term wrapped in a span with a class "highlighted" that will be styled to have a background color.