Tuesday, September 12, 2017

UNUM Updated to Unicode 10, HTML5

I have just posted version 2.1 of UNUM. This release, the first since 2006, updates the database of Unicode characters to Unicode version 10.0.0 (June 2017) and adds, for the first time, full support for the entire set of Chinese, Japanese, and Korean (CJK) ideographic characters, for a total of 136,755 characters. CJK characters are identified by their nomenclature in commonly used lexicons and, where specified in the Unicode database, by English definitions.

The table of HTML named character references (the sequences like “&amp;lt;” you use in HTML source code when you need to represent a character which has a syntactic meaning in HTML or which can't be directly included in a file with the character encoding you're using to write it) has been updated to the list published by the World Wide Web Consortium (W3C) for HTML5.

It used to be that HTML named character references were a convenient text-based shorthand so that, for example, if your keyboard or content management system didn't have a direct way to specify the Unicode character for a right single quote, you could write “&amp;rsquo;” instead of “&amp;#8217;”, the numeric code for the character. This was handy, made the HTML easier to understand, and made perfect sense and so, of course, it had to be “improved”. Now, you can specify the same character as any of “&amp;rsquo;”, “&amp;CloseCurlyQuote;”, or “&amp;rsquor;”. “Close Curly Quote”: are there also Larry and Moe quotes? Now, apparently to accommodate dim people who can't remember or be bothered to look up the standard character references which have been in use for more than a decade (and how many of them are writing HTML, anyway?), we have lost the ability to provide a unique HTML character reference for Unicode code points which have them. In other words, the mapping from code points to named character references has gone from one-to-one to one-to-many.
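To see the one-to-many mapping for yourself, here is a minimal sketch in Python (not the Perl that UNUM is written in) which inverts the HTML5 table shipped in the standard library's `html.entities` module. The function name `names_by_character` is made up for this illustration, and the sketch assumes the standard library's table matches the W3C list.

```python
from collections import defaultdict
from html.entities import html5  # maps "name;" (and a few legacy "name") to text


def names_by_character():
    """Invert the HTML5 table: replacement text -> sorted list of names."""
    inverse = defaultdict(set)
    for name, text in html5.items():
        # A few legacy names appear both with and without the trailing
        # semicolon ("amp" and "amp;"); strip it so each name counts once.
        inverse[text].add(name.rstrip(";"))
    return {text: sorted(names) for text, names in inverse.items()}


table = names_by_character()

# All the names which stand for U+2019 RIGHT SINGLE QUOTATION MARK.
print(", ".join(table["\u2019"]))
```

The names printed for U+2019 should include `rsquo`, `rsquor`, and `CloseCurlyQuote`, which is exactly the ambiguity complained about above.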

Further, named character references have been extended from a symbolic nomenclature for Unicode code points to specify logical character definitions which are composed of multiple (all the current ones specify only two) code points which are combined to generate the character. For example, the character reference “&amp;NotGreaterGreater;” (rendered “≫̸”), which stands for the mathematical symbol “not much greater than”, is actually composed of code points U+226B (MUCH GREATER-THAN) and U+0338 (COMBINING LONG SOLIDUS OVERLAY).
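The composite references can be inspected the same way. The following Python sketch (again, not UNUM's own Perl code) expands a reference into its code points and collects every reference which expands to more than one. It uses `NotGreaterGreater;`, an HTML5 name which expands to the two code points mentioned above; the helper `code_points` is hypothetical.

```python
import unicodedata
from html.entities import html5


def code_points(reference_name):
    """Return (code point, Unicode name) pairs for a name like 'lt;'."""
    text = html5[reference_name]
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
            for ch in text]


# Every reference whose replacement text is more than one code point.
composites = {name: text for name, text in html5.items() if len(text) > 1}

# Show the components one per line, in table order.
for code, name in code_points("NotGreaterGreater;"):
    print(code, name)
```

Checking `len(text)` for each entry in `composites` is a quick way to confirm that the current composite references all consist of exactly two code points.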

Previously, UNUM could assume a one-to-one mapping between HTML character references and Unicode code points, but thanks to these innovations this is no longer the case. Now, when a character is displayed, if it has more than one HTML name, they are all displayed in the HTML column, separated by commas. If the user looks up a composite character reference, all of the Unicode code points which make it up are displayed, one per line, in the order specified in the W3C specification.

The addition of the CJK characters makes the code point definition table, which was already large, simply colossal. The Perl code for UNUM including this table is now almost eight megabytes. To cope with this, there is now a compressed version of UNUM in which the table is compressed with the bzip2 utility, which reduces the size of the program to less than a megabyte. This requires a modern version of Perl and a Unix-like system on which bzip2 is installed. Users who lack these prerequisites may download an uncompressed version of UNUM, which will work in almost any environment which can run Perl.
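The general idea of carrying a large table in bzip2-compressed form and expanding it when the program starts can be sketched as follows. This is an illustration in Python using its `bz2` module, not UNUM's actual Perl implementation; the tab-separated table layout, the function names, and the sample entries are all invented for the example.

```python
import bz2


def pack_table(lines):
    """Compress a list of 'hex code point<TAB>name' lines with bzip2."""
    return bz2.compress("\n".join(lines).encode("utf-8"))


def unpack_table(blob):
    """Decompress the blob back into a {code point: name} dictionary."""
    table = {}
    for line in bz2.decompress(blob).decode("utf-8").splitlines():
        code, name = line.split("\t", 1)
        table[int(code, 16)] = name
    return table


# Made-up sample data standing in for the real character table.
sample = [f"{cp:04X}\tCJK UNIFIED IDEOGRAPH-{cp:04X}"
          for cp in range(0x4E00, 0x4E00 + 5000)]

blob = pack_table(sample)
raw_size = len("\n".join(sample).encode("utf-8"))
print(f"uncompressed: {raw_size} bytes, compressed: {len(blob)} bytes")
```

Highly repetitive text such as a table of character names compresses very well, which is why the trade of a little start-up time for a much smaller download is attractive.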

UNUM Documentation and Download Page

Update: Version 2.2 improves compatibility of the compressed version of the utility by automatically falling back to Perl's core IO::Uncompress::Bunzip2 module if the host system does not have bunzip2 installed. (2017-09-19 18:47 UTC)
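The fallback strategy described in the update might look something like this in outline. The sketch is in Python rather than Perl, with Python's built-in `bz2` module playing the role of IO::Uncompress::Bunzip2; the function name `decompress` is hypothetical, and the details of how UNUM itself invokes bunzip2 are not shown here.

```python
import bz2
import shutil
import subprocess


def decompress(blob):
    """Decompress bzip2 data, preferring an external bunzip2 if present."""
    tool = shutil.which("bunzip2")
    if tool is not None:
        try:
            # With no file arguments, bunzip2 reads standard input;
            # -c sends the decompressed data to standard output.
            result = subprocess.run(
                [tool, "-c"],
                input=blob,
                stdout=subprocess.PIPE,
                stderr=subprocess.DEVNULL,
                check=True,
            )
            return result.stdout
        except (OSError, subprocess.CalledProcessError):
            pass  # Fall through to the built-in decompressor.
    # No usable external program: use the library module instead.
    return bz2.decompress(blob)


print(decompress(bz2.compress(b"UNUM table")))
```

Either path yields the same bytes, so the caller need not know which one was taken.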

Posted at 23:35