This interactive Unicode character table lets you browse through the vast sea of exotic characters you probably never knew your PC could display. You can click on characters & find out what other, similar characters you can substitute for them, and you can even find out what characters others have been browsing.

All this data comes from Unicode's NamesList.txt file. This is an awesome annotated listing of every printable character in every block that Unicode defines. (55,633 characters in all.)

Unicode provides several files that are considered normative, machine-readable lists of the block names, sections, & characters. NamesList.txt isn't supposed to be treated as one of them. But it's the only file I've found that includes the extensive annotations for cross-references, similar characters, alternate names, and alternate ways of constructing some characters by combining others. (If there is a normative file of all the annotations that are in NamesList.txt, I'd like to know where it is.)

This project has three main components: parsing the NamesList.txt file, serving up the initial webpage, and serving up new blocks of characters & statistics.

Parsing NamesList.txt is a one-time operation. It populates several tables with the names, codes, & annotations for the Unicode blocks, the sections within the blocks, and the characters themselves.
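Here's a minimal sketch in Python of how that one-time parse might go. The line format (`@@` block headers, `@` subheaders, tab-indented `=` and `x` annotation lines) is my reading of NamesList.txt, not its formal spec, and the function name is my own:

```python
import re

def parse_names_list(lines):
    """Minimal sketch of a NamesList.txt parser. Returns the blocks,
    sections, and characters (with their annotations) to be inserted
    into the DB tables."""
    blocks, sections, chars = [], [], []
    char = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("@@\t"):            # block: @@<TAB>start<TAB>name<TAB>end
            _, start, name, end = line.split("\t")
            blocks.append({"name": name, "start": int(start, 16), "end": int(end, 16)})
        elif line.startswith("@\t\t"):          # section subheader within the current block
            sections.append({"name": line[3:], "block": blocks[-1]["name"]})
        elif re.match(r"^[0-9A-F]{4,6}\t", line):  # character: code<TAB>NAME
            code, name = line.split("\t", 1)
            char = {"code": int(code, 16), "name": name, "aliases": [], "xrefs": []}
            chars.append(char)
        elif char and line.startswith("\t= "):  # "= alias" annotation
            char["aliases"].append(line[3:])
        elif char and line.startswith("\tx "):  # "x cross-reference" annotation
            char["xrefs"].append(line[3:])
        # "* comment" lines and other annotation kinds are skipped in this sketch
    return blocks, sections, chars
```

The real script would then INSERT these rows into the block, section, & character tables.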

This page builds the framework for the project page, inserts the user's current font preferences, and also fills out the listbox with the Unicode block names as found in the DB.

When the page is loaded, it makes an AHAH call to the server to fill out the initial Unicode block. This is like AJAX, except we return a block of fully-formed HTML instead of XML or JSON.

I accomplish AHAH by simply sending back XML, with the HTML data enclosed inside a CDATA section. This way I get the best of both worlds: I can use the same callback function to process the statistical information as XML and to receive the pre-formatted block for display as HTML. Works like a champ.

"People who chose this character also chose..."
This project uses a brute-force algorithm for tallying this statistic. (I guess it should be called a "click similarity".) When a user clicks on a character, we search their history for the last 10 characters they've clicked on that are different from this character. For each of those previous characters, we increment a hit counter in a record in the "chars_x_chars" table; each record means that the current character is related to that previous character. Think of a 55,633x55,633 matrix - the rows represent each possible "this" char, the columns each possible "previous" char. Each cell of the matrix would be a single combination of two characters that at least one person has chosen at some time.
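A sketch of that tally in Python (an in-memory Counter stands in for the chars_x_chars table, and whether a character that repeats within the last 10 should count twice is my guess):

```python
from collections import Counter

def record_click(history, this_char, hits, window=10):
    """Brute-force click-similarity tally: bump a hit counter for each of
    the last `window` previously-clicked characters that differ from the
    one just clicked, then record the new click."""
    previous = [c for c in reversed(history) if c != this_char][:window]
    for prev in previous:
        hits[(this_char, prev)] += 1   # "this char is related to prev char"
    history.append(this_char)
```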

This implies that the algorithm uses storage at a rate of O(n²), where n is the total number of characters (or SKUs, if this were an ecommerce site). For this project, that means a matrix of roughly 3 billion hit counters! However, the real-world storage requirements should be much lower, for two reasons:

  1. In a straightforward approach you would save two records for each combination of characters: one for "this char is related to the previous char" and one for "the previous char is related to this char". But that's redundant - the resulting matrix of possible combinations is symmetric around the diagonal. So instead we save one record, setting the "this" char to whichever one has the smaller character code. In effect we're only saving the bottom-left half of the matrix. When we look up the click similarity for a character, we need a slightly more complex query than before:
    SELECT * FROM chars_x_chars WHERE this_char='$nThisChar' OR other_char='$nThisChar' ORDER BY num_hits DESC LIMIT 10

    This returns the 10 characters most click-similar to $nThisChar, whether their character codes are higher or lower than $nThisChar's. So the theoretical maximum table size is now "only" about 1.5 billion records.

  2. Since we're storing this information in a database, we only save the character combinations that have actually occurred at least once - in effect, we're storing the matrix as a sparse array. So as long as most possible combinations never get chosen, the table size depends more on the total number of choices people make than on the universe of all possible character combinations. Granted, as more people use the character table, more & more combinations will occur at least once; if the page became very popular, the chars_x_chars table might conceivably grow to hundreds of millions of records, or even approach a billion. So far there's no risk of that happening, and I doubt this site will ever get so popular that a large percentage of the possible combinations get clicked on by somebody. But it's conceivable.

    At a site like Amazon, on the other hand, they've probably had to confront the spectre of O(n²) already. I wonder how they handle that.
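The two savings above can be sketched together in Python: a Counter keyed by the sorted pair of codes is the sparse, lower-triangle "matrix", and the lookup scans both sides of the key just like the query does. (The names are mine, not the real schema.)

```python
from collections import Counter

# Sparse lower-triangle storage: only pairs that actually occur take up
# space, and sorting the key stores each unordered pair exactly once.
hits = Counter()

def bump(char_a, char_b):
    """Increment the click-similarity counter for an unordered pair."""
    lo, hi = sorted((char_a, char_b))
    hits[(lo, hi)] += 1            # one record per pair, smaller code first

def most_similar(this_char, limit=10):
    """Equivalent of the SELECT: find this_char on either side of the
    pair key, then rank partners by hit count."""
    related = [
        (pair[0] if pair[1] == this_char else pair[1], n)
        for pair, n in hits.items()
        if this_char in pair
    ]
    return sorted(related, key=lambda t: -t[1])[:limit]
```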