Troubleshooting Unicode in the Khan Academy exercise framework

I started playing around with Khan Academy’s exercise framework. If I can figure it out well enough, I would apply it to teaching a foreign language (Thamil) rather than math. I had gotten the basic case figured out: a single-question, question-and-answer exercise. From there, I inserted Thamil characters and saved the HTML doc in the UTF-16 encoding, at which point I ran into serious errors. Fortunately, it didn’t take much more than learning the basics of Unicode and a little poking around to figure it out, fix the issue, and get a better understanding of how it all fits together.

In pictures

First, here is what the first sample exercise looked like.  You have to type the word “green” (no quotes) to get the problem right.

success with a basic KA exercise

When typing in the answer on the original exercise webpage, you have to type it as “green”, with quotes, since the text you provide is interpreted as JavaScript code by the KA exercise framework code.  (Side note: since the KA framework code is in JavaScript, too, it gives me the odd feeling that the framework code is one giant macro.)

With the first step done, I decided to mix in a little Thamil.  Here’s what I got instead:

first look after inserting Thamil text

All seemed well except for the representation of the Thamil characters.  The file was saved using the gEdit text editor in the UTF-8 encoding.  I vaguely remembered UTF-8 to be like fancy ASCII.  Figuring that Thamil characters were in the range of the spec where at least 2 bytes were necessary, I tried again, but this time saving the file in UTF-16:

after changing the page's encoding to UTF-16

Either the KA code or JS apparently didn’t like pages in UTF-16.  What got rendered was the original HTML as it stands before KA’s “macro” code runs.  Browsers did detect the encoding as UTF-16, but that doesn’t change the fact that the KA code most likely failed to execute.  At this point, I needed to review Unicode for more clues.

Background on Unicode

Probably the best 2 quick references for Unicode in the context of using it in programs are the following:

People may have come across Joel Spolsky’s post already, as I did a year or two ago.  It’s only now that I came across the other link above for the first time.  But I must admit that I found it to explain things better, even the quirks, although YMMV.  With that knowledge seemingly refreshed, it seemed as if UTF-16 was the only encoding that made sense to use, and the situation remained puzzling.

The breakthrough came when I saw that the UTF-8 file rendered fine in Mac OS X.  That meant that: 1) UTF-8 was not an impediment to rendering Thamil and other “complex” scripts, and 2) Unicode-aware OSes have long been able to decode the same Unicode codepoints from a UTF-8 byte stream as from a UTF-16 one.  UTF-8 and UTF-16 are just two ways of serializing the same codepoints: UTF-8 uses one to four bytes per codepoint, and for the “complex scripts” (‘Indic’, Arabic, etc.) that works out to two or three bytes each (three for Thamil), versus two bytes in UTF-16.  Also, for Thamil and other scripts, one glyph (what gets displayed to the screen as a single “character”) can be represented by one or more characters/codepoints.
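To make the byte counts concrete, here is a minimal sketch that measures one string three ways: in codepoints, in UTF-8 bytes, and in UTF-16 bytes. The helper name `encodingSizes` and the sample word are my own choices for illustration, not anything from the KA framework. It assumes an environment with the standard `TextEncoder` (modern browsers and Node.js).

```javascript
// Sample Thamil word chosen for illustration; it is not from the exercise.
// Written with escapes so the source file's own encoding cannot garble it.
const sampleWord = "\u0BAA\u0B9A\u0BCD\u0B9A\u0BC8"; // பச்சை

// Hypothetical helper: size of a string in codepoints and in bytes.
function encodingSizes(text) {
  return {
    // Array.from iterates a string by codepoint, not by UTF-16 code unit.
    codePoints: Array.from(text).length,
    // TextEncoder always produces UTF-8.
    utf8Bytes: new TextEncoder().encode(text).length,
    // JavaScript strings are sequences of 16-bit code units.
    utf16Bytes: text.length * 2,
  };
}

console.log(encodingSizes("green"));
console.log(encodingSizes(sampleWord));
```

For “green”, that is 5 codepoints, 5 UTF-8 bytes, and 10 UTF-16 bytes. For the Thamil sample it is 5 codepoints, 15 UTF-8 bytes, and 10 UTF-16 bytes. UTF-8 is the bulkier of the two for Thamil, but it represents every codepoint just as faithfully.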

In light of this, I checked what the default encodings were in the browsers I was using.  Sure enough, my browsers were still using ISO-8859-1/Latin-1 as the default encoding for some reason (I never changed it from the defaults?), and that encoding will crap out on codepoints it isn’t defined for, including Thamil.
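The garbling is easy to reproduce. The sketch below encodes a string as UTF-8 and then reads the bytes back the way ISO-8859-1 does, one byte per character. The function name `utf8ReadAsLatin1` and the sample word are hypothetical, picked only for this illustration. (Browsers actually treat the ISO-8859-1 label as windows-1252, which differs for a handful of byte values, so take this as an approximation.)

```javascript
// Sample Thamil word chosen for illustration; it is not from the exercise.
const thamilSample = "\u0BAA\u0B9A\u0BCD\u0B9A\u0BC8"; // பச்சை

// Hypothetical helper: what UTF-8 text looks like when every byte is
// interpreted as one ISO-8859-1 character (byte value == codepoint).
function utf8ReadAsLatin1(text) {
  const bytes = new TextEncoder().encode(text);
  return Array.from(bytes, (byte) => String.fromCharCode(byte)).join("");
}

// ASCII survives, because UTF-8 and Latin-1 agree on bytes 0x00-0x7F.
console.log(utf8ReadAsLatin1("green"));

// Each Thamil codepoint becomes three unrelated Latin-1 characters.
console.log(utf8ReadAsLatin1(thamilSample));
```

The first Thamil letter, U+0BAA, is the UTF-8 bytes E0 AE AA, which Latin-1 displays as “à®ª”. That kind of accented-letter soup is typical of UTF-8 text read as Latin-1.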

setting the browser's default encoding for interpreting pages

For good measure, I inserted the line

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

inside the HTML head element to give the browser a hint of what encoding to use in case it was set on “Auto-detect”.
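A quick way to check that the hint took effect is to compare the declared charset with the one the browser actually used. This is a rough sketch: `charsetFromContentType` and `actualPageEncoding` are hypothetical helper names, while `document.characterSet` is the standard DOM property that reports the encoding the browser settled on. Outside a browser (e.g. in Node.js) there is no `document`, so the second helper returns `null`.

```javascript
// Hypothetical helper: pull the charset parameter out of a Content-Type
// value such as the one in the meta tag's content attribute.
function charsetFromContentType(value) {
  const match = /charset\s*=\s*["']?([^\s;"']+)/i.exec(value);
  return match ? match[1].toLowerCase() : null;
}

// Hypothetical helper: the encoding the browser actually used for the
// current page, or null when not running in a browser.
function actualPageEncoding() {
  return typeof document !== "undefined" ? document.characterSet : null;
}

const declared = charsetFromContentType("text/html; charset=UTF-8");
console.log("declared:", declared);
console.log("used by browser:", actualPageEncoding());
```

If the two disagree (ignoring letter case), something other than the meta tag decided the encoding.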

In the end, what I learned was nothing that wasn’t already mentioned in those articles, but I guess it will stick this time around much better!  I still don’t know whether the trouble with UTF-16 lies in the KA code or in JavaScript itself, but fortunately that is moot.  And the final result, saved as a UTF-8 file with the UTF-8 content type meta tag:

the final result, after the fixes and workarounds

Update (12/1/11): It turns out that the Apache web server can (and does) by default declare the encoding of the files it serves as extended ASCII / Latin-1, i.e. ISO-8859-1, in the HTTP Content-Type header. When this conflicts with the encoding declared by the meta tag, browsers will generally use the web server’s reported encoding, since the real HTTP header takes precedence over the tag. This can be fixed by adding a .htaccess file overriding default Apache encoding settings with UTF-8.
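For reference, a minimal sketch of such a .htaccess file. It assumes the server configuration permits this override for the directory (AllowOverride FileInfo or broader). AddDefaultCharset is a standard Apache core directive that adds the charset parameter to text/plain and text/html responses.

```apache
# .htaccess in the directory that serves the exercise pages.
# Assumes the main server config allows it (AllowOverride FileInfo).
AddDefaultCharset UTF-8
```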