Archives for January 2006

The Engadget Bot … it’s ALIIIIIIVE!

Okay, I’ve been sitting on my hands for a day or two, itching to say something about this … but I resisted until Jason let the cat out of the bag first. I consider this a mental health project, something I could hack on for fun. So, when he asked about an Engadget AIM Bot back on December 29, I decided to start hacking on it. Four days later, at the end of my New Year’s vacation on January 2, it was up and running and mostly functional.

For folks who aren’t sure what I’m talking about, let me back up a step. Engadget is a gadget blog that is part of the Weblogs, Inc. Network. A number of people contribute articles to it, covering all sorts of news about gadgets and technology. Traditionally, you might subscribe to its syndication feeds through an aggregator and keep track of updates to the site that way. But what if notification of updates could be pushed to you via instant messaging instead? That’s one of the things the Engadget Bot does: it allows you to subscribe to any number of categories at Engadget and receive IM alerts when new entries are posted to those categories. For example, there are lots of new entries being posted in the CES category because the CES tradeshow is going on right now in Vegas. To subscribe, you’d send an IM to the screenname EngadgetBot with the message subscribe ces. Also, it seems that the bot has trouble sending IMs back to you if you don’t have it on your Buddy List, so it might be a good idea to add it to your Buddy List first.

Here’s a screenshot of an example interaction with the bot, receiving IM alerts and querying it for the latest headlines:

(That’s a screenshot of a Trillian window. Trillian is a multi-IM application for Wintel which I use regularly for my IM needs, so much so that I wrote a plugin for it called Tcllian which embeds my favorite scripting language, Tcl, so I can write scripts in Tcl that run inside Trillian.)

For the geeks in the audience, the Engadget Bot is written in Tcl … roughly 2,000 lines at this moment. For persistent data storage, I opted to use the lightweight SQLite 3, which has a really convenient Tcl binding. The source for the bot isn’t available, but it might be someday. The main reason is that the code is embarrassingly simple and I’d honestly be embarrassed to have folks looking at it until I can clean it up and make it presentable.

Anyway, I really enjoyed hacking on this and feel really proud to have gotten it working. It’s very simple, but I think it’s already very useful if you’re interested in Engadget’s content and keeping up to date. I get to build something really simple because all the hard stuff (feed syndication of Engadget content, the AIM messaging network, etc.) is already in place. This is just another example of what Web 2.0 mash-ups can enable folks to build.


UPDATE: My friend Og Maciel just blogged about the bot after I told him about it. Considering he’s on the Ubuntu team doing the translation to Brazilian Portuguese, it’s only natural that his entry is in Portuguese. Way cool!

Dear Lazyweb: Please implement TagSoup in pure Tcl

Dear Lazyweb,

I find myself needing to create excerpts of HTML text with full markup, which sounds like an easy problem to solve incorrectly but is actually quite interesting and difficult to do correctly. The brief description of the problem is: take arbitrary HTML and produce an excerpt containing the first N characters of the text, including the markup but only counting the characters in the text. In other words, while “<b>bold</b>” is 11 bytes long, it contributes only 4 characters of text toward the N-character excerpt. This is why taking a simple substring of N bytes doesn’t work: it counts any markup as characters and, worse, could break or unbalance tags, since the closing tags are likely to be truncated away.
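To make the counting rule concrete, here is a minimal Tcl sketch. The proc name textLength is made up for this illustration, and it assumes that everything between “<” and “>” is markup, that no stray “<” or “>” appears in the text itself, and that an entity such as “&amp;” simply counts as several characters.

```tcl
# Hypothetical helper: count only the characters of visible text,
# skipping everything between "<" and ">".
# Assumption: no literal "<" or ">" appears in the text itself.
proc textLength {html} {
    set count 0
    set inTag 0
    foreach ch [split $html ""] {
        if {$inTag} {
            # inside markup: look only for the end of the tag
            if {$ch eq ">"} { set inTag 0 }
        } elseif {$ch eq "<"} {
            set inTag 1
        } else {
            incr count
        }
    }
    return $count
}

puts [string length "<b>bold</b>"]   ;# 11, markup included
puts [textLength "<b>bold</b>"]      ;# 4, text only
```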

My first thought for solving this problem simply (under 5 minutes would be my definition of simple) would be to write some code that walks the string, counting characters and stepping over markup, and truncates after the first N characters. This at least solves the “excerpt of N characters” portion of the problem. However, this leaves two problems unsolved: (1) what if the Nth character falls in the middle of a word, and (2) what do I do about any unclosed tags? Solving the word-boundary problem is simple: if the Nth character falls in the middle of a word, back up to the start of that word and truncate there. This does mean our excerpt will be shorter than N characters, but only by a word fragment, which is acceptable in most cases, since I’m guessing most words are shorter than 15 characters. But closing unclosed tags … in the odd edge case, this can get messy. If you’re starting out with well-formed HTML or XHTML, perhaps it’s a simple problem. But in the general case, we know the world’s HTML is far from clean. Plenty of it is invalid, which means rules for closing tags that assume well-formed input are not going to work.
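For the easy case only, the walk-and-truncate idea can be sketched in Tcl roughly as follows. This is an illustration under loud assumptions, not a general solution: the proc name htmlExcerpt and the list of void elements are invented for the example, and the input is assumed to be well-formed and properly nested, with no comments, scripts, or “>” inside attribute values. A tag that directly follows the cut point is treated as a word boundary.

```tcl
# Hypothetical helper: keep the first $limit text characters of $html,
# markup included, back up to a word boundary, then close open tags.
# Assumption: $html is well-formed and properly nested.
proc htmlExcerpt {html limit} {
    set void {br hr img input meta link}  ;# assumed never to have a close tag
    set out ""          ;# excerpt built so far
    set open {}         ;# stack of currently open tag names
    set count 0         ;# text characters emitted so far
    set prevSpace 1     ;# was the last text character whitespace?
    set wordOut 0       ;# length of $out where the current word began
    set wordOpen {}     ;# tag stack at that same point
    set len [string length $html]
    set i 0
    while {$i < $len && $count < $limit} {
        set ch [string index $html $i]
        if {$ch eq "<"} {
            set end [string first ">" $html $i]
            if {$end < 0} break               ;# unterminated tag: stop here
            set tag [string range $html $i $end]
            if {[string match "</*" $tag]} {
                set open [lrange $open 0 end-1]   ;# closing tag: pop
            } elseif {[regexp {^<\s*([A-Za-z][A-Za-z0-9]*)} $tag -> name]} {
                set name [string tolower $name]
                if {[lsearch -exact $void $name] < 0 && ![string match "*/>" $tag]} {
                    lappend open $name            ;# opening tag: push
                }
            }
            append out $tag
            set i [expr {$end + 1}]
            continue
        }
        set isSpace [string is space $ch]
        if {$prevSpace && !$isSpace} {
            # a new word starts here: remember where, in case we must back up
            set wordOut [string length $out]
            set wordOpen $open
        }
        append out $ch
        incr count
        set prevSpace $isSpace
        incr i
    }
    # If the cut landed mid-word, back up to the start of that word.
    if {$i < $len} {
        set next [string index $html $i]
        if {!$prevSpace && $next ne "<" && ![string is space $next]} {
            set out [string range $out 0 [expr {$wordOut - 1}]]
            set open $wordOpen
        }
    }
    set out [string trimright $out]
    # Close whatever is still open, innermost first.
    for {set j [expr {[llength $open] - 1}]} {$j >= 0} {incr j -1} {
        append out "</[lindex $open $j]>"
    }
    return $out
}

puts [htmlExcerpt {<p>The <b>quick brown</b> fox jumps</p>} 12]
# prints: <p>The <b>quick</b></p>
```

Here the 12th text character lands inside “brown”, so the excerpt backs up to the end of “quick” and closes the two tags that are still open. Feed it invalid or mis-nested markup and the tag stack stops meaning anything, which is exactly where this gets messy.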

There’s a well-known solution to cleaning up HTML called TagSoup and it does a magnificent job and has been packaged for easy use from the command line, which is an added bonus. The only downside is that it’s in Java and I primarily work in Tcl. Now, I can execute stuff from Tcl and grab the output, but that’s far from desirable from a performance standpoint — firing up a JVM every time I need to sanitize some HTML string would be insane. Sure, I could go through the gyrations to write a simple TCP server and make TagSoup available via network RPC, but that’d mean writing a mound of Java code and that’s a deep rathole that I want to avoid (it sure won’t take me 5 minutes). So, here’s my plea: Lazyweb, please implement TagSoup in Tcl, please.

In the meantime, I’m going to work on a simple implementation that’s based around a whole lot of assumptions about the input data I currently need to work with, but a solid, robust solution in the general case for Tcl would be really useful.

del.icio.us/dossy links since December 26, 2005 at 09:05 AM

del.icio.us/dossy (RSS) links since December 26, 2005 at 09:05 AM: