Dear Lazyweb,
I find myself needing to create excerpts of HTML text with full markup, which sounds like an easy problem (and it is, if you’re willing to solve it incorrectly) but is actually quite interesting and difficult to do correctly. The brief description of the problem: take arbitrary HTML and produce an excerpt containing the first N characters of the text, including the markup but counting only the characters of the text toward N. In other words, while “<b>bold</b>” is 11 bytes long, it only contributes 4 characters of text toward the N-character excerpt. This is why taking a simple substring of N bytes doesn’t work: it counts markup as characters and, worse, can break or unbalance tags, since the closing tags are likely to be truncated away.
My first thought for solving this problem simply (in under 5 minutes, which would be my definition of simple) was to write some code that walks the string, counting characters and stepping over markup, and truncates after the first N characters. That at least solves the “excerpt of N characters” portion of the problem. However, it leaves two problems unsolved: (1) what if the Nth character falls in the middle of a word, and (2) what do I do about any unclosed tags? Solving the word-boundary problem is simple: if the Nth character falls in the middle of a word, back up to the start of that word and truncate there. This does mean the excerpt will be shorter than N characters, but only by a word fragment, which is acceptable in most cases, since I’m guessing most words are shorter than 15 characters. But closing unclosed tags … in the odd edge case, that can get messy. If you’re starting out with well-formed HTML or XHTML, perhaps it’s a simple problem. But in the general case, we know the world’s HTML is far from clean: plenty of it is invalid, which means rules for closing tags that assume well-formed input are not going to work.
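For the curious, here’s a rough sketch of that walk-and-count approach in Tcl. It leans on exactly the assumptions I just complained about (tags regular enough to match with a simple regexp, no entities, no comments or CDATA, no special-casing of void elements like <br>), so treat it as an illustration rather than a robust solution:

    # Walk the string, copy tags through without counting them, count text
    # characters, and stop at n. Remember the output and the open-tag stack
    # at the last word boundary so we can back up if we stop mid-word.
    proc excerpt {html n} {
        set out ""
        set count 0
        set open {}      ;# stack of currently open tag names
        set wbOut ""     ;# output as of the last word boundary
        set wbOpen {}    ;# open-tag stack as of that boundary
        set len [string length $html]
        set i 0
        while {$i < $len && $count < $n} {
            set c [string index $html $i]
            if {$c eq "<"} {
                # Copy the whole tag through without counting it.
                set end [string first ">" $html $i]
                if {$end < 0} break
                set tag [string range $html $i $end]
                append out $tag
                if {[regexp {^</(\w+)} $tag -> name]} {
                    set idx [lsearch -exact $open $name]
                    if {$idx >= 0} { set open [lreplace $open $idx $idx] }
                } elseif {[regexp {^<(\w+)} $tag -> name] && ![string match "*/>" $tag]} {
                    lappend open $name
                }
                set i [expr {$end + 1}]
            } else {
                if {[string is space $c]} {
                    set wbOut $out
                    set wbOpen $open
                }
                append out $c
                incr count
                incr i
            }
        }
        # If we hit the limit mid-word, fall back to the last word boundary.
        if {$count == $n && $i < $len && ![string is space [string index $html $i]]
                && $wbOut ne ""} {
            set out $wbOut
            set open $wbOpen
        }
        # Close whatever is still open, innermost first.
        foreach name [lreverse $open] { append out "</$name>" }
        return $out
    }

The open-tag stack is what lets it close anything left dangling after the cut, and the word-boundary handling is simply “remember what the output looked like at the last space.”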
There’s a well-known solution to cleaning up HTML called TagSoup; it does a magnificent job and, as an added bonus, has been packaged for easy use from the command line. The only downside is that it’s in Java and I primarily work in Tcl. Now, I can execute stuff from Tcl and grab the output, but that’s far from desirable from a performance standpoint: firing up a JVM every time I need to sanitize an HTML string would be insane. Sure, I could go through the gyrations of writing a simple TCP server and making TagSoup available via network RPC, but that’d mean writing a mound of Java code, and that’s a deep rathole I want to avoid (it sure won’t take me 5 minutes). So, here’s my plea: Lazyweb, please, please implement TagSoup in Tcl.
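Just to show what the “fire up a JVM per call” version looks like (the jar path is a placeholder, and I’m assuming the command-line packaging filters standard input to standard output):

    proc clean_html {html} {
        # One JVM launch per call: fine for a one-off, insane in a loop.
        # Real code would also want to catch errors and stderr noise.
        return [exec java -jar /path/to/tagsoup.jar << $html]
    }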
In the meantime, I’m going to work on a simple implementation that’s based around a whole lot of assumptions about the input data I currently need to work with, but a solid, robust solution in the general case for Tcl would be really useful.
Thanks for the kind words about TagSoup. It was hard enough to write in Java; redoing it in pure Tcl is rather beyond me, I’m afraid, though I’d support anyone who wanted to do so. It would be much simpler, however, for someone who knows Java to make it into a network server, since the Parser object is reusable: you just need to listen on a socket and all that.
Hi, John! Thanks for stopping by!
I’m sure it would be “simple” to create a network server for TagSoup in Java … the minimum required features to implement would certainly be a short list.
So, any chance someone’s already hacked together a tagsoupd? :-)
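(And if a tagsoupd ever does materialize, the Tcl side of it could be tiny. Everything below is made up: the host, the port, and the dot-terminated protocol are pure speculation about what such a daemon might speak.)

    # Hypothetical client for a hypothetical tagsoupd: send the document
    # followed by a line containing a single ".", read the cleaned HTML back
    # until the server sends its own lone ".". (A document containing a bare
    # "." line would need escaping; this sketch ignores that.)
    proc tagsoup_clean {html {host localhost} {port 8089}} {
        set sock [socket $host $port]
        fconfigure $sock -encoding utf-8 -buffering line
        puts $sock $html
        puts $sock "."
        set clean ""
        while {[gets $sock line] >= 0 && $line ne "."} {
            append clean $line\n
        }
        close $sock
        return $clean
    }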