Archives for January 2006

The Engadget Bot … it’s ALIIIIIIVE!

January 4, 2006 by Dossy Shiobara

Okay, I’ve been sitting on my hands for a day or two, itching to say something about this … but I resisted until Jason let the cat out of the bag first. I consider this a mental health project, something I could hack on for fun. So, when he asked about an Engadget AIM Bot back on December 29, I decided to start hacking on it. Four days later, at the end of my New Year’s vacation on January 2, it was up and running and mostly functional.

For folks who aren’t sure what I’m talking about, let me back up a step. Engadget is a gadget blog that is part of the Weblogs, Inc. Network. A number of people contribute articles to it, covering all sorts of news about gadgets and technology. Traditionally, you might subscribe to its syndication feeds through an aggregator and keep track of updates to the site that way. But what if notification of updates could be pushed to you via instant messaging instead? That’s one of the things the Engadget Bot does: it lets you subscribe to any number of categories at Engadget and receive IM alerts when new entries are posted to those categories. For example, there are lots of new entries being posted in the CES category because the CES tradeshow is going on right now in Vegas. To subscribe, you’d send an IM to the screenname EngadgetBot with the message subscribe ces. Also, it seems that the bot has trouble sending IMs back to you if you don’t have it on your Buddy List, so it might be a good idea to add it to your Buddy List first.

Here’s a screenshot of an example interaction with the bot, receiving IM alerts and querying it for the latest headlines:

(That’s a screenshot of a Trillian window. Trillian is a multi-IM application for Wintel which I use regularly for my IM needs, so much so that I wrote a plugin for it called Tcllian, which embeds my favorite scripting language, Tcl, so I can write scripts that run inside Trillian.)

For the geeks in the audience, the Engadget Bot is written in Tcl … roughly 2,000 lines at this moment. For persistent data storage, I opted to use the lightweight SQLite 3, which has a really convenient Tcl binding. The source for the bot isn’t available, but it might be someday. A lot of it has to do with the fact that the code is embarrassingly simple and I’d honestly be embarrassed to have folks looking at it until I can clean it up and make it presentable.

Anyway, I really enjoyed hacking on this and feel really proud to have gotten it working. It’s very simple, but I think it’s already very useful if you’re interested in Engadget’s content and keeping up to date. I get to build something really simple because all the hard stuff (feed syndication of Engadget content, the AIM messaging network, etc.) is already in place. This is just another example of what Web 2.0 mash-ups can enable folks to build.

UPDATE: My friend Og Maciel just blogged about the bot after I told him about it. Considering he’s on the Ubuntu team doing the translation to Brazilian Portuguese, it’s only natural that his entry is in Portuguese. Way cool!

Filed Under: Geeking out

Dear Lazyweb: Please implement TagSoup in pure Tcl

January 2, 2006 by Dossy Shiobara

Dear Lazyweb,

I find myself needing to create excerpts of HTML text with full markup, which sounds like an easy problem to solve incorrectly but is actually quite interesting and difficult to do correctly. The brief description of the problem is: take arbitrary HTML and produce an excerpt containing the first N characters of the text, including the markup but only counting the characters in the text. In other words, while “<b>bold</b>” is 11 bytes long, it only represents 4 characters of text towards the N character excerpt. This is why taking a simple substring of N bytes doesn’t work: it counts any markup as characters and worse, could break/unbalance tags, since the closing tags are likely to be truncated away.
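To make the byte-versus-character distinction concrete, here’s a tiny sketch (in Python rather than Tcl, purely for illustration). The `text_length` helper is a made-up name, and the regex is a crude approximation of “strip the tags” that assumes no `>` appears inside attribute values and does not decode entities:

```python
import re

def text_length(html: str) -> int:
    """Count only the characters outside of tags (crude, regex-based)."""
    return len(re.sub(r"<[^>]*>", "", html))

sample = "<b>bold</b>"

print(len(sample))          # length including markup
print(text_length(sample))  # length of the visible text only
print(sample[:6])           # a naive 6-character substring cuts the tag pair apart
```

The naive substring comes back as `<b>bol`: three of its six characters were spent on markup, and the closing tag is gone.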

My first thought for solving this problem simply (my definition of simple being “in under 5 minutes”) would be to write some code that walks the string, counting characters and stepping over markup, and truncating after the first N characters. This at least solves the “excerpt of N characters” portion of the problem. However, this leaves two problems unsolved: (1) what if the Nth character falls in the middle of a word, and (2) what do I do about any unclosed tags? Solving the word-boundary problem is simple: if the Nth character falls in the middle of a word, back up to the start of that word and truncate there. This does mean our excerpt will be shorter than N characters, but only by a word fragment, which is acceptable in most cases, since I’m guessing most words are shorter than 15 characters. But closing unclosed tags … in the odd edge case, this can get messy. If you’re starting out with well-formed HTML or XHTML, perhaps it’s a simple problem. But in the general case, we know the world’s HTML is far from clean. Plenty of it is invalid, which means rules for closing tags that assume well-formed input are not going to work.
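Here’s a rough sketch of that walk-and-count idea, in Python rather than Tcl purely for illustration. Everything named here (`excerpt`, `VOID_TAGS`, the regexes) is made up for the example, and it leans on exactly the assumptions that make the general case hard: tags are reasonably well nested, no `>` appears inside attribute values, entities are counted as several characters, and a word split across a tag boundary is not detected.

```python
import re

# Elements that never take a closing tag (a partial, assumed list).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}
TOKEN = re.compile(r"<[^>]*>|[^<]+")              # a tag, or a run of text
TAG_NAME = re.compile(r"</?\s*([A-Za-z][A-Za-z0-9]*)")

def excerpt(html: str, limit: int) -> str:
    """Return roughly the first `limit` text characters of `html`, markup included."""
    out = []         # pieces of the excerpt, in order
    open_tags = []   # stack of element names still waiting for a closing tag
    count = 0        # text characters emitted so far

    for match in TOKEN.finditer(html):
        token = match.group(0)
        if token.startswith("<"):
            name_match = TAG_NAME.match(token)
            if name_match is None:
                out.append(token)        # comment, doctype, etc.: pass through
                continue
            name = name_match.group(1).lower()
            if token.startswith("</"):
                if name not in open_tags:
                    continue             # stray closing tag: drop it
                # Close everything opened since the matching start tag.
                while True:
                    top = open_tags.pop()
                    out.append(f"</{top}>")
                    if top == name:
                        break
                continue
            if count >= limit:
                break                    # budget spent: open nothing new
            if name not in VOID_TAGS and not token.endswith("/>"):
                open_tags.append(name)
            out.append(token)
            continue

        remaining = limit - count
        if len(token) <= remaining:
            out.append(token)
            count += len(token)
            continue

        cut = token[:remaining]
        if not token[remaining].isspace():
            # The cut landed mid-word: back up to the start of that word.
            cut = re.sub(r"\S+$", "", cut)
        out.append(cut.rstrip())
        break

    # Close whatever is still open, innermost element first.
    out.extend(f"</{name}>" for name in reversed(open_tags))
    return "".join(out)

print(excerpt("<p>The <b>quick brown</b> fox jumps</p>", 12))
```

With a limit of 12, the cut lands inside “brown”, so the sketch backs up to “quick” and then closes the `<b>` and `<p>` it had open. Feed it genuinely messy markup, though, and the stack-based closing is precisely the part that falls over.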

There’s a well-known solution to cleaning up HTML called TagSoup and it does a magnificent job and has been packaged for easy use from the command line, which is an added bonus. The only downside is that it’s in Java and I primarily work in Tcl. Now, I can execute stuff from Tcl and grab the output, but that’s far from desirable from a performance standpoint — firing up a JVM every time I need to sanitize some HTML string would be insane. Sure, I could go through the gyrations to write a simple TCP server and make TagSoup available via network RPC, but that’d mean writing a mound of Java code and that’s a deep rathole that I want to avoid (it sure won’t take me 5 minutes). So, here’s my plea: Lazyweb, please implement TagSoup in Tcl, please.

In the meantime, I’m going to work on a simple implementation that’s based around a whole lot of assumptions about the input data I currently need to work with, but a solid, robust solution in the general case for Tcl would be really useful.

Filed Under: Geeking out

del.icio.us/dossy links since December 26, 2005 at 09:05 AM

January 2, 2006 by Dossy Shiobara

del.icio.us/dossy (RSS) links since December 26, 2005 at 09:05 AM:

ConceptNet

“ConceptNet is a freely available commonsense knowledgebase and natural-language-processing toolkit which supports many practical textual-reasoning tasks over real-world documents right out-of-the-box (without additional statistical training) […]” (via

Tags: language, library, python, semantic, software, tools

John Battelle’s Searchblog: What Happens When You Mashup RSS, IM, and Publishing Services?

John writes about MAKE’s bot.

Tags: aol, blog, bot, im, news

MAKE: Blog: The MAKEbot is here! (12/08/2005)

So, MAKE has an AIM bot.

Tags: aol, blog, bot, im, news

YouTube – SNL – The Chronic of Narnia Rap

Tags: 2005, comedy, humor, movie, satire, video

Free Online Translator

Has Dutch-to-English, which as of Dec 2005, Google Translate does not.

Tags: free, language, tools, translation, web

Xooglers: Let’s get a real database

Google apparently uses MySQL for AdWords, but only after migrating to a commercial RDBMS and then, rumor has it, switching back. Hat tip: John Sequeira.

Tags: blog, database, google, mysql, opensource

growabrain: The Holocaust Archives

Tags: history, politics, reference

Israel

“It is currently fashionable to demonize Adolf Hitler and the Germans who voted for him and his policies. However it is worth pointing out that Hitler original plan was not to kill Jews; he wanted to take their property and then kick them out of Europe.”

Tags: essay, history, politics

Chinese Horoscopes – The Dragon

Another good explanation of the dragons.

Tags: dragons, zodiac

Official Google Blog: About the AOL announcement

Summary: “Biased results? No way. / Indexing more of AOL’s content. / AOL will receive a credit towards advertising purchased through Google’s ad program.”

Tags: aol, blog, business, google

National Film Preservation Board (Library of Congress)

Tags: government, history, library, movie, reference

National Film Registry, 2004\5

FILMS SELECTED TO THE NATIONAL FILM REGISTRY, LIBRARY OF CONGRESS – 2005
(RHPS made the list of 25 titles!)

Tags: history, movie, reference

Filed Under: Links
