Dear Lazyweb: Please implement TagSoup in pure Tcl

Dear Lazyweb,

I find myself needing to create excerpts of HTML text with full markup, which sounds like an easy problem to solve incorrectly but is actually quite interesting and difficult to do correctly. The brief description of the problem is: take arbitrary HTML and produce an excerpt containing the first N characters of the text, including the markup but only counting the characters in the text. In other words, while “<b>bold</b>” is 11 bytes long, it only represents 4 characters of text towards the N character excerpt. This is why taking a simple substring of N bytes doesn’t work: it counts any markup as characters and worse, could break/unbalance tags, since the closing tags are likely to be truncated away.

My first thought to solving this problem simply — in under 5 minutes, would be my definition of simple — would be to write some code that walks the string, counting characters and stepping over markup, and truncating after the first N characters. This at least solves the “excerpt of N characters” portion of the problem. However, this leaves two problems unsolved: (1) what if the 100th character falls in the middle of a word, and (2) what do I do about any unclosed tags? Solving the word-boundary problem is simple: if the 100th character falls in the middle of the word, back up until the start of the word and truncate there. This does mean our excerpt will be less than N characters, but only by a word fragment which is acceptable in most cases, since I’m guessing most words are shorter than 15 characters. But, closing unclosed tags … in the odd edge case, this can get messy. If you’re starting out with well-formed HTML or XHTML, perhaps it’s a simple problem. But, in the general case, we know the world’s HTML is far from clean — plenty of it is invalid, which means creating rules to close tags based on the assumption that you’re working with well-formed input is not going to work.

There’s a well-known solution to cleaning up HTML called TagSoup and it does a magnificent job and has been packaged for easy use from the command line, which is an added bonus. The only downside is that it’s in Java and I primarily work in Tcl. Now, I can execute stuff from Tcl and grab the output, but that’s far from desirable from a performance standpoint — firing up a JVM every time I need to sanitize some HTML string would be insane. Sure, I could go through the gyrations to write a simple TCP server and make TagSoup available via network RPC, but that’d mean writing a mound of Java code and that’s a deep rathole that I want to avoid (it sure won’t take me 5 minutes). So, here’s my plea: Lazyweb, please implement TagSoup in Tcl, please.

In the meantime, I’m going to work on a simple implementation that’s based around a whole lot of assumptions about the input data I currently need to work with, but a solid, robust solution in the general case for Tcl would be really useful.

Is it wrong to poison candy? Is it more wrong than stealing?

With the proliferation of wireless computing technology (aka “wi-fi ethernet”) and people freely using other people’s open wireless networks — I have a hard time calling it “stealing” but it is — I started asking myself: is it morally wrong to set up a transparent HTTP proxy that injected malicious scripts into the HTTP response to exploit people’s computers who are using your wi-fi? My gut says that knowingly destroying other people’s computers is wrong, always wrong, even if they’re illegally trespassing on your wireless network and stealing your bandwidth without your permission. But, then I wonder if it’s a framing or context problem. Is it so obviously wrong because it’s happening in an intangible space and all highly theoretical? So, I tried to redescribe the problem in more mundane terms.

What if you were a candy store, and you had a serious shoplifting problem: where people would randomly walk into your store and help themselves to some of your candy without paying for it? Would it be wrong to poison a subset of candy and mark the poisoned goods in such a way that only you could identify them? If a real customer came along and wanted to purchase the candy, you’d recognize it as being poisoned and replace it with a clean version. But, if someone just came along and grabbed it and walked off with it, if they proceeded to eat it, then they got what they deserved? If they didn’t get permission to take your candy, you have no obligation or responsibility as to what happens to them if they steal it, right?

Are the two situations (unauthorized use of wi-fi vs. owning a candy store) really different? Is the aggressive defense mechanism acceptable in one situation but not the other? Are both unacceptable? Should the entire burden of securing a wireless network rest on the shoulders of the owner of the network, or should there be some responsibility and etiquette for people not to just assume that because a wi-fi network is unrestricted that it doesn’t make it open for public use? If I set up open wi-fi and want to signal that it’s open for public use, I’ll include “public” in the SSID to signal it as such.

What do you do with your wi-fi networks? Do you secure yours, or leave it wide open, or what? If you leave yours open, do you have a problem with people jumping onto it and using it? Have you ever had someone use your wireless network and send spam using it, or anything else you’d not want them to do, but you still want to leave it open so that other good people can use it when they need to?

A question to people reading my blog at Blogger: why?

For the past ten months, I’ve been reposting entries from my blog at dossy.org to this blog here at Blogger. I did this to see if it would encourage folks to post comments if they already had Blogger accounts, since due to a large volume of blog comment spam I decided to require sign-in in order to comment. However, there’s been hardly any comments being posted to the Blogger blog, so I wonder how useful it is for me to keep reposting the entries.

Unfortunately, Blogger doesn’t give me any stats about who’s subscribed to my blog’s feeds, so I really have no idea who’s even reading it over there. If you’re reading this over at my Blogger blog, please let me know by either emailing me or leaving a comment either here or in my own blog. I’d like to know why you’re reading it, if you are. If you’re subscribing to its feeds, I’d like to ask you to instead subscribe to my blog’s feeds: RSS 1.0, RSS 2.0, RSS 2.0 comments feed.

If it seems that nobody’s reading the Blogger blog or its feeds, I’ll likely stop reposting stuff there just to save me a bit of time, so if there’s a reason you read it instead of my blog at dossy.org, please let me know.

Remote arbitrary code execution vulnerability in MSIE. Anyone surprised?

This article at eWEEK.com is the reason why I’ve switched to using the Mozilla Firefox web browser for everything except a few web applications at work which only work with MSIE.

Read the article, but the gist of it is that a fully up-to-date Windows system, if browsing a malicious site containing the exploit, can be made to execute arbitrary code that the attacker has intended on your system. Quoting from the article:

The proof-of-concept exploit, which is available from the FrSirt site, currently launched the Windows Calculator (calc.exe) but can be easily modified by malicious hackers.

What I’d really like to see is someone modifying the proof-of-concept exploit to instead fetch a copy of Firefox, perform an unattended install of it, then rename IEXPLORE.EXE (the MSIE executable) to something else and replacing it with a copy of FIREFOX.EXE. Of course, I’m sure anyone who did such a thing could go to jail because there’s no distinction made between “good hacking” and “bad hacking” in our lovely legal system. You know, the same legal system that lets killers walk free but makes hackers go to jail.

Dossy’s Blog, now with del.icio.us links!

Something that I noticed about my recent posting funk is that I still find myself writing small blurbs about links when saving them to del.icio.us. I’ve been thinking about using the links to power a linkblog, and seeing Joe Grossberg go and set one up, it motivated me to hack one out for my own blog.

So, twice a day, 9:00 AM and 9:00 PM US/Eastern time, you’ll see a post auto-publish using my del.icio.us links. Here’s the first post generated by my script, to give you an example of what to expect. Of course, you could just subscribe to the RSS feed in your aggregator, if you don’t want to wait until the links get posted to my blog.

What do you think? Is auto-generating an entry twice a day, lame? Should I relegate these links to the sidebar of the blog instead? Do you just not care about what links I find interesting and I shouldn’t bother with a linkblog at all?

Oh, and the folks who are reading this via my crossposting to LiveJournal or Blogger: sorry, you won’t see the del.icio.us links, because my script only posts them to my actual blog.

The world gets closer to proving my “donut shaped universe” theory!

Back in the 1990’s, I formulated a theory that the universe is “donut shaped” — well, I said “toroidal” which most people I hung out with at the time would probably have to look up the definition of, so I avoided using the term. I had no scientific expression of my theory, no empirical evidence to prove it — matter of fact, I asserted that the truth value of my theory was unprovable from any other compatible theory because of its properties, nor any math to back it up. It just seemed like the only simple explanation that I could accept at the time. I didn’t even express it in any kind of written form to document it; I just threw it out during conversations that would turn to matters of physics or philosophy where it seemed appropriate to bring it up. Everyone I explained my theory to said I was just being silly and it made no sense, but it made perfect sense to me. Let me explain …

Suppose we want to believe that the known universe is either expanding or contracting inside a larger body of “space” in three dimensions. Suppose we formulated a theory about weak and strong forces between “things” that exist in the universe, for microscopic interactions between particles and macroscopic interactions between large bodies (humans, planets, etc.). What we want to believe is that the universe is finite but the space it occupies has all appearances of being infinite but is likely to be finite. Suppose we want to believe that at fundamental levels, things are spherical in nature but simple spheres and even oblates are too simple of a structure to explain what we observe. I first thought: Well, what if the universe was really the surface of a Moebius strip? You know, that clever strip where if you trace a line on one side all the way around you end up back where you started? You can travel “infinitely” far — or at least, at the surface, perceive it to be — on a segment that has a finite length. But, the universe isn’t a two-dimensional planar thing, and the half-twist inversion is hard to rationalize — too complex of a shape to explain simply. Well, how about a toroid, then? It’s a simple closed geometric shape in three dimensions with an interior space and an exterior surface, that offers symmetry which should make the math simple because you don’t have to have all sorts of exceptional explanations at the half-twist like you would in a Moebius strip. Sounds great, right? Except I’m not smart enough to take this theory any further and I can’t seem to get anyone else to understand what I see.

But, Steven D. Levitt over at the Freakonomics Blog recently wrote about a physicist named Lisa Randall and quoted a passage from her new book titled Warped Passages. (Oh, and if you haven’t heard about Freakonomics yet, check out their site and maybe even buy their book.) What made me do a double-take was the explanation that Randall gave, according to the quote by Levitt:

“[…] Dr. Randall and Dr. Sundrum’s model consisted of a pair of universes, four-dimensional branes, thinly separated by a five-dimensional space poetically called the bulk.

When they solved the equations for this setup, they discovered that the space between the branes would be warped. Objects, for example, would appear to grow larger or smaller and get less massive or more massive as they moved back and forth between the branes. […]”

While this isn’t the same thing as my “universe as a donut” theory, it gets close to describing what I’ve been trying to explain. Think of the branes as the donut and the hole as what they call the “bulk.” I don’t see why they need a pair of universes but I’m sure it’s to tie up some loose end they had to explain. That’s the problem with scientists: they have solutions looking for a problem. If you just solve the problem, you just get the actual solution.

Another thing that makes my theory useful and simple is that it not only explains the universe at the macroscopic end, but also explains microscopic particle behavior — quantum mechanics and all that. Recently, Randell Mills and his company BlackLight Power have come forth describing a new form of the simplest atom, hydrogen, calling it a “hydrino.” (For more background, read this article in the Guardian: Fuel’s paradise? Power source that turns physics on its head.) What’s so controversial is that Mills is describing something which current beliefs in quantum mechanics would assert is impossible. But, Mills apparently is actually demonstrating his findings: how do you argue with reality? Like a fool, that’s how.

Suppose the donut hole represents the proton. Suppose the donut represents the path around the proton that the electron can take. The only rules here is that the proton and electron can’t occupy the same position at the same time, and that the closer the electron gets the “faster” it travels around the center, and centrifugal force says that the electron will favor staying near the outer ring of the donut than the inner ring closer to the hole. The whole notion in quantum mechanics that there’s a fixed distance where the electron can’t get closer to the proton sounds foolish. It might take a lot of energy to do it, but okay, that’s fine. A lot more energy than is available? Possibly. But, space isn’t discrete, it’s continuous, as well as time. Heisenberg figured this out back in 1927 when he expressed his uncertainty principle. According to the Hydrino Study Group page, there’s a “1986 Herman Haus paper that explains how charged particles may undergo acceleration without radiation.” Suppose there actually is radiation but it’s practically unobservable because the radiation event happens closer to the center of the donut’s hole, which keeps it unobservable because as the radiation moves towards the outer edge, it gets absorbed back into the source of the radiation itself. In this way, you have the whole kinetic-potential energy conversion happening, like the swing of a pendulum, but it’s not observable. All we can observe is the acceleration, not the radiation. Why not, right?

Where am I going with all this? Well, I hope people like Randell Mills continues to try and solve problems and not work within solutions and finds a new source of safe energy that everyone wants to believe is “impossible” — you know, because the Earth is flat and all that stuff that scientists know with certainty. I hope that someone like Lisa Randall figures out my everything-is-a-donut theory and proves it for me, somehow. Granted, I think the former will happen sooner than the latter, but that’s fine by me. I don’t need to be proven right; people can continue to argue with reality. Like fools.

UPDATE: Found this article by Richard L. Marker on his Discrete Donut Twisted Chain (ddtc) Model of Space, Matter and Origin of Gravity.

UPDATE: Here’s an article from March 11, 2003: Universe as Doughnut: New Data, New Debate.

JMS announces Babylon 5 scripts will be available in published book form, October 2005

J. Michael Straczynski (of Babylon 5 fame) recently posted a message to rec.arts.sf.tv.babylon5.moderated announcing that there could be a new series on its way, and that he’s making the Babylon 5 scripts available for purchase in published book form.

According to JMS, there will be 14 books each containing 7 scripts. A bonus 15th book will be available for those who purchase the whole 14-book set. The whole thing will cost anywhere from $420 to $560. Go to babylon5scripts.com — currently, you can only sign up to receive an email alert, but when the scripts become available, this is the site to go to.

UPDATE: The scripts and other B5-related items are now available for sale in Joe’s CafePress store. Thanks ukbab5!

Tags:
,
,

Modular exponentiation in Tcl for DSA signature verification using mpexpr

(Originally, I wrote this up on the Tcl’ers Wiki but wanted to post it to my blog, too.)

In struggling with implementing DSA signature verification (aka FIPS 186-2), I discovered that math::bignum::powm is slow. Using this algorithm for modular exponentiation (i.e., x = a^b mod y), it yielded a slightly faster implementation:

proc _modexp_bignum {m e n} {
    set p [fromstr 1]
    set zero [fromstr 0]
    set one [fromstr 1]
    set two [fromstr 2]
    while {[gt $e $zero]} {
        if {[eq [mod $e $two] $one]} {
            set p [mod [mul $p $m] $n]
        }
        set m [mod [mul $m $m] $n]
        set e [div $e $two]
    }
    return $p
}

However, this is still quite slow for large values. So, I converted the inner-workings to use mpexpr and the speedup is tremendous:

proc _modexp_mpexpr {m e n} {
    foreach v {m e n} {
        set $v [mpexpr [tostr [set $v]]]
    }
    set p [mpexpr 1]
    while {[mpexpr $e > 0]} {
        if {[mpexpr $e % 2 == 1]} {
            set p [mpexpr $p * $m % $n]
        }
        set m [mpexpr $m * $m % $n]
        set e [mpexpr $e / 2]
    }
    return [fromstr $p]
}

Here’s my script that I used to benchmark performance:

package require math::bignum
package require Mpexpr

set g [math::bignum::fromstr 0x626d027839ea0a13413163a55b4cb500299d5522956cefcb3bff10f399ce2c2e71cb9de5fa24babf58e5b79521925c9cc42e9f6f464b088cc572af53e6d78802]
set u1 [math::bignum::fromstr 0xbf655bd046f0b35ec791b004804afcbb8ef7d69d]
set p [math::bignum::fromstr 0x8df2a494492276aa3d25759bb06869cbeac0d83afb8d0cf7cbb8324f0d7882e5d0762fc5b7210eafc2e9adac32ab7aac49693dfbf83724c2ec0736ee31c80291]

# contains ::dsa namespace with _modexp_bignum and _modexp_mpexpr inside.
source dsa.tcl

set start [clock seconds]
puts "math::bignum::powm  [time {math::bignum::powm $g $u1 $p} 5]"
puts "dsa::_modexp_bignum [time {dsa::_modexp_bignum $g $u1 $p} 5]"
puts "dsa::_modexp_mpexpr [time {dsa::_modexp_mpexpr $g $u1 $p} 5]"
set end [clock seconds]

puts "Total elapsed: [expr {$end - $start}] seconds."

Here’s the output:

math::bignum::powm  55341757 microseconds per iteration
dsa::_modexp_bignum 56942386 microseconds per iteration
dsa::_modexp_mpexpr 311979 microseconds per iteration
Total elapsed: 563 seconds.

As the timings show, the math::bignum::powm and _modexp_bignum are comparable, but the _modexp_mpexpr trashes them both.

Is Socialmarks just another del.icio.us or Yahoo! My Web 2.0?

While looking over my blog’s referrer log reports, I spotted some hits from ping.socialmarks.com, which I didn’t recognize. Apparently, Socialmarks is a new project by Matt Kaufman which seems to be in an invite-only beta right now. (I submitted my email address, perhaps I’ll be offered an opportunity to give it a test-drive while its still in beta.)

Looking over the brief dsecription on the Socialmarks site itself, and eyeballing the screenshot, I still can’t figure out what this service is going to be. Is it just another del.icio.us shared bookmarks site, or more like Yahoo! My Web 2.0 which has more focus on social networks of bookmarks? There’s mention of feeds so perhaps it will be a bit like Bloglines, too.

Whatever it is, looking at the screenshot, at least the user interface design is pleasant and clean, so if there are useful features, I’m really looking forward to trying it out.

ActiveState Adds Expect for Windows to ActiveTcl

Jeff Hobbs writes to the Tcl-announce mailing list that ActiveState has added Expect for Windows to ActiveTcl (Press Release). From the announcement, you can now go and download ActiveTcl 8.4.11 (for Windows) with Expect for Windows, for free.

Typically, only Unix system administrators have known about Expect and it’s ability simplify routine tasks through automation. Now, their Windows-laden counterparts can hopefully start to enjoy the same benefits. I’m hoping this will create an increase in the number of people learning the Tcl programming language, which happens to be one of my favorites. Maybe it could even lead to more folks getting interested in AOLserver again.

According to the announcement, ActiveState’s Expect for Windows is based on Expect 5.43, and “ActiveState will be working with the community to open source Expect for Windows in the near future.” This is great news — ActiveState has always maintained a strong committment to support and maintain its involvement in the open source community. Once those changes are integrated back into the open source version of Expect, I’m hoping a new build of Tclkit for Win32 won’t be far behind.

I’d like to take this opportunity to thank all the folks at ActiveState and in the open source community who have collectively made this all possible. It’s great to be a part of something so wonderful.