nsopenssl 3.0 beta stuck in a busy loop

Back on August 7th, Nathaniel Haggard reported a problem with nsopenssl where it repeatedly floods the server log with output. Janine Sisk confirmed that she was seeing the same thing. However, neither of them was able to put a finger on why it was happening or how to reproduce it, so I couldn’t do much about it at the time. On August 12th, I identified one issue with the sample config that ships with nsopenssl: “SSLv2” was omitted from the “protocols” list but (incorrectly) included in the “ciphersuite” list, which would crash the server when an SSL client attempted an SSLv2 connection. But this wasn’t the root cause of the problem.
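To make that config mismatch concrete, here is a minimal C sketch of a sanity check that flags a cipher list mentioning SSLv2 while the protocol list leaves it out. It is not code from nsopenssl: the function names `list_enables` and `sslv2_mismatch` are hypothetical, the token matching is a rough heuristic rather than OpenSSL's real cipher-string parser, and the separators are an assumption about how the two lists are written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if `list` contains `name` as a whole token.
 * Tokens are split on any character in `seps`. A leading '+' is skipped;
 * a token starting with '!' or '-' is an exclusion and never matches. */
bool list_enables(const char *list, const char *seps, const char *name)
{
    assert(list != NULL && seps != NULL && name != NULL);
    size_t name_len = strlen(name);
    const char *p = list;

    while (*p != '\0') {
        p += strspn(p, seps);            /* skip separators */
        size_t len = strcspn(p, seps);   /* length of the next token */
        if (len == 0) {
            break;                       /* only separators were left */
        }
        const char *tok = p;
        size_t tok_len = len;
        if (*tok == '+') {               /* "+SSLv2" still mentions SSLv2 */
            tok++;
            tok_len--;
        }
        if (tok_len == name_len && strncmp(tok, name, name_len) == 0) {
            return true;
        }
        p += len;
    }
    return false;
}

/* Hypothetical check: the cipher list mentions SSLv2 but the protocol
 * list does not, which is the inconsistency described above. */
bool sslv2_mismatch(const char *protocols, const char *ciphersuite)
{
    bool proto_has = list_enables(protocols, ", ", "SSLv2");
    bool cipher_has = list_enables(ciphersuite, ":, ", "SSLv2");
    return cipher_has && !proto_has;
}
```

For example, `sslv2_mismatch("SSLv3, TLSv1", "ALL:!ADH:+SSLv2")` reports a mismatch, while adding "SSLv2" to the first list or changing the cipher token to "!SSLv2" does not. Those list values are illustrative only and are not copied from the shipped sample config.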

Then, almost a week later on August 18th, Bruno Mattarollo brought up the issue again, but this time was different. Bruno indicated that he was able to reproduce the problem fairly reliably! He said,

“What I did, that triggered the error was click on a link and immediately click on another link without giving the server time to actually return the page, so I guess what’s happening is that there is no socket for nsopenssl to send the results to … right?”

Bruno and I spent the next few days trying to diagnose the problem (he even blogged about it). Along the way, I found some other unrelated issues, which I logged at SourceForge in Bug #1012892 along with patches against AOLserver 4.0.8a and 4.1.0a that address them. However, as I kept digging for the root cause of our nsopenssl issue, I realized that fixing the problem would not be a trivial change. The nsopenssl code needed some serious clean-up: I was having a hard time getting a grasp of what it was doing (or, more importantly, what it wasn’t doing).

So, today, I sat down and began to clean up the nsopenssl code. After several cigarettes and some head-scratching, I got the code to a state where I could really start tracing it in the debugger and see what was happening. What I found was that when the remote client abruptly terminates the SSL connection, the server notices (because SSL_write() fails), but because the browser requested an HTTP Keep-Alive connection, the server still returns the connection to the pool to read the next HTTP request. When it goes to read, SSL_read() fails (because there’s no peer connected), and so begins the error loop. I cleaned up the code to ensure that when an error occurs, we mark the SSL connection as “shut down” so the driver knows not to use it for Keep-Alive and will properly close the connection instead. I announced the fix at 4:13 PM today, and around 9:01 PM, Bruno logged in, applied the patch to nsopenssl, tested it, and verified that he could no longer reproduce the problem!
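The idea behind the fix can be sketched in a few lines of C. This is a simplified model and not the actual nsopenssl patch: the `SketchConn` struct and both function names are hypothetical, and it assumes only that SSL_read() and SSL_write() return a value greater than zero on success. Retryable conditions on non-blocking sockets are deliberately ignored here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-connection state; not nsopenssl's real structure. */
typedef struct {
    bool keepalive_requested;  /* client asked for HTTP Keep-Alive */
    bool shut_down;            /* set once any SSL I/O call has failed */
} SketchConn;

/* Record the result of an SSL_read()/SSL_write()-style call.
 * In this simplified model any result <= 0 means the peer is gone
 * or the connection is broken, so the connection is marked shut down. */
void sketch_note_io(SketchConn *conn, int io_result)
{
    assert(conn != NULL);
    if (io_result <= 0) {
        conn->shut_down = true;
    }
}

/* Decide whether the driver may return the connection to the
 * keep-alive pool. A connection marked shut down is never reused,
 * so it gets closed instead of being read from again. */
bool sketch_may_keepalive(const SketchConn *conn)
{
    assert(conn != NULL);
    return conn->keepalive_requested && !conn->shut_down;
}
```

With this in place, a failed write on a Keep-Alive connection flips `shut_down`, and the later keep-alive decision returns false rather than sending the dead connection back for another read, which is what broke the loop in the scenario Bruno described.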

I’m going to wait a few days for others to apply and test the patch, then I’ll commit the patches in Bug #1012892 to CVS. Noah Robin asked whether a similar fix could be backported to nsopenssl 2.x. I said it should be possible, and that once folks verify that the fix to nsopenssl 3.0 is complete, I would look into backporting it.

Overall, I’m hoping this makes nsopenssl 3.0 beta stable enough for us to consider the upcoming nsopenssl 3.0 beta 22 a release candidate. We’ll see …
