I am storing the HTML from the body of emails in a SQL Server nvarchar(max) column. Is there any benefit in minimizing the HTML on the way in?
By minimizing I mean removing redundant white space and carriage returns/linefeeds in the HTML text stream. My terminology might not be quite right: I'm not looking at removing any HTML tags/comments or anything like that.
By benefit I mean efficiency of storage space and speed of insert/retrieval, so the benefits I care about are on the database side.
If it is worthwhile to do, what should I look out for (e.g. if I replace linefeeds with a single space, might it render the HTML incorrectly at a later time)?
You'd still need a full HTML parser to tell what's markup and what's not. Most browsers do a bit of 'fixing up' to make otherwise unpresentable HTML renderable, in ways that would be impossible to reproduce without fully parsing the tree.
Someone could easily stick in some bad HTML that goofs up your 'simple' parser, more often by mistake than by malice. Don't get into the business of fixing HTML: handle it verbatim and let the bad content hang itself.
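To make the risk concrete, here is a minimal Python sketch of the kind of 'simple' whitespace collapse the question describes. The email body, `naive_minify` and `pre_content` are made up for illustration, and the regex used to pull out the `<pre>` block only works for this tidy sample, which is rather the point. The byte count assumes nvarchar stores the text as UTF-16 at two bytes per character, which holds for this ASCII sample.

```python
import re


def naive_minify(html: str) -> str:
    """Collapse every run of whitespace (including CR/LF) into one space."""
    return re.sub(r"\s+", " ", html).strip()


def pre_content(html: str) -> str:
    """Return the text of the first <pre> block (illustration only, not a parser)."""
    match = re.search(r"<pre>(.*?)</pre>", html, flags=re.DOTALL)
    return match.group(1) if match else ""


# Hypothetical email body: the <pre> block relies on its spacing and line breaks.
original = (
    "<html>\r\n"
    "  <body>\r\n"
    "    <p>Order   summary:</p>\r\n"
    "    <pre>item    qty\r\n"
    "apple     3\r\n"
    "pear      1</pre>\r\n"
    "  </body>\r\n"
    "</html>\r\n"
)

minified = naive_minify(original)

# Whitespace inside <p> is collapsed by the browser anyway, so nothing is lost
# there, but the preformatted table is flattened onto a single line.
print(repr(pre_content(original)))
print(repr(pre_content(minified)))

# Rough storage saving, assuming 2 bytes per character (UTF-16).
saved = len(original.encode("utf-16-le")) - len(minified.encode("utf-16-le"))
print(f"bytes saved: {saved}")
```

The saving is real but small, and the minified copy no longer renders the same way. Whitespace can also be significant in places a regex cannot see, such as `<textarea>` content or elements styled with CSS `white-space: pre`, so doing this safely needs the full parser mentioned above.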