encodings and character sets

Janine Sisk at Furfly has been working hard to migrate her customers to the latest AOLserver 4.0.x release, and along the way she has been helping the project find bugs and answer migration questions. That has been incredibly helpful. The most recent issue involved encoding and character set settings.

Her problem, in a nutshell, is that users are entering data that was originally authored in Microsoft Word, which uses the Microsoft Windows code page 1252 (CP1252) character set. CP1252 contains code points for things like en dashes (0x96 = U+2013) and ellipses (0x85 = U+2026), among others. However, if CP1252-encoded data is converted to UTF-8 (RFC 2279) as though it were ISO-8859-1, it won’t render correctly: every byte in the range 0x80-0xFF, which includes the en dash and the ellipsis, is encoded using two bytes instead of just one. In this particular case, the en dash gets transcoded from 0x96 to the sequence 0xC2 0x96. The 0xC2 looks like this: Â or 쎂, depending on which one your browser rendered correctly. You may recognize it and wonder how it got into your own pages and how to remove it, too.
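
To make the byte arithmetic concrete, here is a small Python sketch of the failure. It is only an illustration, not AOLserver code: the sample string and the helper name mis_transcode are made up, and it assumes the submitted bytes are read as ISO-8859-1 before being written back out as UTF-8.

```python
# Illustration only (not AOLserver code). Assumes the CP1252 bytes are
# misread as ISO-8859-1 and then written out as UTF-8.

def mis_transcode(raw: bytes) -> bytes:
    """Hypothetical helper: decode as ISO-8859-1, re-encode as UTF-8."""
    return raw.decode("iso-8859-1").encode("utf-8")

# "1990", en dash, "2004", ellipsis, as Word would produce them in CP1252.
raw = b"1990\x962004\x85"

# What the author meant: the text contains U+2013 and U+2026.
intended = raw.decode("cp1252")

# What goes out on the wire: each high byte becomes a two-byte sequence.
sent = mis_transcode(raw)

# What a reader sees if those bytes are then displayed as CP1252.
seen = sent.decode("cp1252")

print(intended.encode("unicode_escape"))
print(sent)
print(seen.encode("unicode_escape"))
```

In this sketch the single byte 0x96 goes out as 0xC2 0x96, and a viewer that decodes the page as CP1252 shows a stray Â in front of every en dash and ellipsis.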

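Janine's solution was to instruct AOLserver to use ISO-8859-1 as the output character set, so that input the users originally wrote in CP1252 gets output from the server as ISO-8859-1, which is a close enough match, I guess. The two differ only in the range 0x80-0x9F, where ISO-8859-1 defines control characters and CP1252 defines printable ones such as the en dash, and browsers generally treat a page labeled ISO-8859-1 as CP1252 anyway. The configuration might look like this: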

ns_section "ns/parameters"
ns_param OutputCharset iso-8859-1
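
Here is a rough Python sketch of why that setting can help, again as an illustration rather than anything AOLserver actually runs. It assumes the server holds the text internally as if the input had been ISO-8859-1 (so the en dash sits there as U+0096), and that the browser displays the page as CP1252. The function name render is made up.

```python
# Illustration only. Models: browser input -> server -> browser display.

def render(raw: bytes, output_charset: str) -> str:
    """Hypothetical model of one round trip through the server."""
    # Assumption: the server reads the submitted bytes as ISO-8859-1.
    internal = raw.decode("iso-8859-1")
    # The server writes the response in the configured output charset.
    wire = internal.encode(output_charset)
    # Assumption: the browser displays the response bytes as CP1252.
    return wire.decode("cp1252")

en_dash_byte = b"\x96"  # en dash as typed in Word (CP1252)

# With UTF-8 output, the byte is expanded to two bytes on the way out.
broken = render(en_dash_byte, "utf-8")

# With ISO-8859-1 output, the original byte passes through unchanged.
fixed = render(en_dash_byte, "iso-8859-1")

print(broken.encode("unicode_escape"))
print(fixed.encode("unicode_escape"))
```

Under those assumptions, the ISO-8859-1 output setting works because the original CP1252 byte survives the round trip untouched, not because the server ever understood it as an en dash.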

I’m not sure how she has the other encoding and character set parameters configured, but I’ve asked, and I’ll update this entry with the details once I find out.
