Mar 19 2016
 

As you may or may not know, Unicode is the standard for encoding text into “computer speak”. Before it, different manufacturers came up with many different encodings of characters (graphemes and other symbols), all of which were severely limited. Almost all of them included the normal Roman alphabet plus a variety of other symbols. Amongst other things, this caused two main problems.

Firstly, it excluded large parts of the world that use other writing systems from using computers sensibly. Computers are difficult enough to use (at least at first) without having to learn a whole new writing system. And even then, storing a document by transliterating it is not ideal: you have changed the original and introduced another possible source of errors.

Secondly, there was the problem of errors introduced into documents when moving them from one computer system to another. For example, it was not unknown for a “#” to become a “£” (or, very similarly, for “£” to appear as “Â£”). Far more extreme examples exist.
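The second kind of corruption is easy to reproduce. Here is a minimal sketch in Python, assuming the text was written out as UTF-8 and then read back by a system expecting Latin-1. The function name `misread` and that particular pair of encodings are my own choices for illustration.

```python
def misread(text, written_as="utf-8", read_as="latin-1"):
    """Encode text one way and decode it another, as a confused system might."""
    return text.encode(written_as).decode(read_as)


# The pound sign is two bytes in UTF-8 (0xC2 0xA3); Latin-1 treats each
# byte as a separate character, so a stray extra character appears.
garbled = misread("£")
print(garbled)

# Reversing the mistake recovers the original, provided nothing was lost.
print(misread(garbled, written_as="latin-1", read_as="utf-8"))
```

Plain ASCII characters such as “A” survive this particular mix-up untouched, which is partly why such errors can go unnoticed for a long time.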

Of course, because of the enormous number of symbols in Unicode, there is a great deal of fun to be had with it, not infrequently by poking fun at Unicode for including ridiculous symbols. And why not? There’s no harm in having a bit of fun:

ɥʇıpǝɹǝɯ ǝʞıɯ
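Upside-down text like this is not a special rendering mode. It is just ordinary Unicode letters that happen to look like rotated Roman ones, written in reverse order. Here is a rough sketch of how it could be produced, assuming a hand-made lookup table that covers only the letters needed here (the names `FLIP` and `flip` are made up for this example):

```python
# Each Roman letter maps to a Unicode character that resembles it rotated
# by 180 degrees. Only the letters needed for this example are listed.
FLIP = {
    "d": "p", "e": "ǝ", "h": "ɥ", "i": "ı",
    "k": "ʞ", "m": "ɯ", "r": "ɹ", "t": "ʇ",
}


def flip(text):
    """Reverse the text and swap each letter for its upside-down lookalike."""
    return "".join(FLIP.get(ch, ch) for ch in reversed(text.lower()))


print(flip("mike meredith"))
```

Characters missing from the table (such as the space) pass through unchanged.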

However, it is worth pointing out that the Unicode standard is a serious business, and slipping in “fun” symbols is not likely to happen. Although I was not directly involved, I did help out one of the people who pushed for the inclusion of medieval Slavonic characters within the Unicode standard, and it is not a trivial process.

And now for some “fun” Unicode characters…

þ

The Thorn. The English letter that got away. Before the age of printing, English had an additional letter (I’m over-simplifying here) which was used instead of the digraph “th”, so words such as “the” would have been spelt “þe”. By the time that printing had arrived, the shape of the thorn letter was becoming more like a “y”, and because the printers imported their equipment from countries that did not have a thorn, the printed books tended to use “y” instead of “þ”. Which is of course where we get “Ye olde Shoppe” from.

Of course, it was confusing printing “ye” when we said “þe” (or “the”), so the printers settled on “the”.

So why is þ in Unicode? Because you cannot discuss the letter without including it, and perhaps more importantly cannot encode a historical document that used þ without an encoding for it.

☃ and probably ☕

Ah! The snowman (and the cup of coffee). What sense of fun allowed these symbols into the standard?

Well, in the Unicode standard the snowman is grouped with weather symbols, so it almost certainly came from a TV station’s encoding standard for weather forecasts. And you cannot claim to be a universal standard for text encoding without including the symbols found in other encodings.

‽

The interrobang. The punctuation symbol used (if rarely) for signalling both a question and an exclamation: What the bleep are you doing‽

Whilst not commonly used today, it had a spell of popularity in the 1960s, and so there are documents that need encoding that use this symbol.

☠, ⚠, ☢

These look like fun, don’t they? They certainly do to me, but in fact they are international symbols for various dangers: poison (☠), a warning of a general nature (⚠), and radioactivity (☢). All pretty serious stuff, and you really don’t want those symbols garbled in a document.
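Every one of the symbols so far has a formal name and a number (its “code point”) in the standard, however frivolous it looks. As a small sketch, Python’s standard `unicodedata` module can look them up. The helper name `describe` is my own, and the list is simply the symbols discussed above.

```python
import unicodedata


def describe(code_point):
    """Return the code point in U+XXXX form followed by its official name."""
    return f"U+{code_point:04X} {unicodedata.name(chr(code_point))}"


# Thorn, snowman, cup of coffee, interrobang, and the three danger symbols.
for cp in (0x00FE, 0x2603, 0x2615, 0x203D, 0x2620, 0x26A0, 0x2622):
    print(chr(cp), describe(cp))
```

The names are deliberately dull: the cup of coffee, for instance, is officially just a hot beverage.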

This is a Thai “letter” and I picked it out because it’s made fun of elsewhere, but it stands for all the non-European symbols used in language.

It may look kind of funny, but it probably doesn’t to someone who knows Thai. To put it another way, if you told me that we’re not going to include the “M” in a character encoding because it looks too silly, I’d be very, very annoyed (my name contains two of ’em).

And yes I can type all of the above and the following into a text terminal 😃

 

2016-03-19_1119