Hebrew Bible Updated to Unicode, XHTML Strict (Fourmilog: None Dare Call It Reason)


Tuesday, July 17, 2018

Hebrew Bible Updated to Unicode, XHTML Strict

The Web edition of the Hebrew Bible has been available at Fourmilab since 1998. It originally required a browser extension to support downloadable fonts. When this became obsolete, a second edition, released in 2002, used the ISO 8859-8 character set, which includes the ASCII Latin character set and Hebrew letters (but no vowel signs). Most Web browsers at the time supported this character set, although some required the installation of a "language pack" or font in order to display it.

At the time, I remarked that when Unicode became widely adopted, all of the complexity of special character sets for each language would evaporate, as we'd have a single character encoding which could handle all commonly-used languages (and many obscure ones, as well). Now, in 2018, we have made our landfall on that happy shore. The vast majority of widely-used operating systems and Web browsers support Unicode and provide at least one font with characters for the major languages.

I have just released a third edition of the Fourmilab Hebrew Bible, in which all documents use Unicode for all text, using the UTF-8 representation which now accounts for more than 90% of traffic on the Web. Any browser which supports Unicode and includes a font providing the Hebrew character set will be able to display these documents without any special configuration required—it should just work.
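To make the change of encoding concrete, here is a minimal Python sketch of re-encoding ISO 8859-8 bytes as UTF-8. It is an illustration only, not the tooling actually used to produce the third edition; the function name and sample bytes are made up for the example.

```python
# Sketch only: re-encode ISO 8859-8 bytes as UTF-8.
# The function name and the sample data are hypothetical.

def iso8859_8_to_utf8(raw: bytes) -> bytes:
    """Decode bytes as ISO 8859-8 and re-encode the resulting text as UTF-8."""
    return raw.decode("iso-8859-8").encode("utf-8")

# 0xE0 is the letter alef in ISO 8859-8; ASCII bytes pass through unchanged.
legacy = b"alef: \xe0"
converted = iso8859_8_to_utf8(legacy)
print(converted)  # alef becomes the two-byte UTF-8 sequence D7 90
```

Each Hebrew letter occupies one byte in ISO 8859-8 and two bytes in UTF-8, while the ASCII portion of a document is byte-for-byte identical in both.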

I have also updated all documents to the XHTML 1.0 Strict standard. I prefer this standard to HTML5 for documents which do not require features of the latter standard (such as embedded audio and video or the canvas element) since, being well-formed XML, XHTML documents can easily be parsed by computer programs which wish to process their content.
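As a rough illustration of why well-formed XML is convenient, the following Python sketch pulls the text of an element out of an XHTML document by its id using only the standard library XML parser. The sample markup is hypothetical; the element names and layout of the real Hebrew Bible documents are not assumed to match it.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML fragment; the real documents' structure may differ.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Sample</title></head>
<body>
<h2 id="c1">Chapter 1</h2>
<p id="v1:1">verse text one</p>
<p id="v1:2">verse text two</p>
</body>
</html>"""

def text_by_id(xhtml, element_id):
    """Return the text content of the element with the given id, or None."""
    root = ET.fromstring(xhtml)
    for el in root.iter():
        if el.get("id") == element_id:
            return "".join(el.itertext())
    return None

print(text_by_id(SAMPLE, "v1:2"))
```

No HTML-specific tag-soup parser is needed, which is the point of the preference stated above. One caveat: a plain XML parser like this one will reject named entities such as `&nbsp;` unless they are declared, so a document that uses them needs extra handling.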

You can cite a chapter within a book of the Bible with a URL like:

http://www.fourmilab.ch/etexts/www/hebrew/Bible/?Exodus.html#c10

or an individual verse with:

http://www.fourmilab.ch/etexts/www/hebrew/Bible/?Exodus.html#v5:7

Previous editions of the Hebrew Bible did not require the “c” or “v” before the chapter or chapter:verse; this is a requirement of XHTML, in which the “id=” attribute must not start with a digit. For compatibility with existing citations, the “c” or “v” may be omitted, but in direct URLs citing the book document itself, they must be supplied.
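The citation scheme can be sketched in a few lines of Python. The helper names are hypothetical, and the rule for upgrading an old-style fragment (a colon means a verse, otherwise a chapter) is an assumption drawn from the two URL forms shown above, not a description of how the site itself handles legacy citations.

```python
# Sketch only: build citation URLs in the form shown above.
BASE = "http://www.fourmilab.ch/etexts/www/hebrew/Bible/"

def chapter_url(book, chapter):
    """URL citing a chapter, e.g. Exodus chapter 10."""
    return f"{BASE}?{book}.html#c{chapter}"

def verse_url(book, chapter, verse):
    """URL citing a single verse, e.g. Exodus 5:7."""
    return f"{BASE}?{book}.html#v{chapter}:{verse}"

def normalize_fragment(fragment):
    """Add the 'c' or 'v' prefix to an old-style fragment that starts with a digit."""
    if fragment[:1].isdigit():
        return ("v" if ":" in fragment else "c") + fragment
    return fragment

print(chapter_url("Exodus", 10))
print(verse_url("Exodus", 5, 7))
print(normalize_fragment("5:7"))
```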

This edition of the Hebrew Bible, like its predecessors, does not rely upon the so-called “Unicode Bidirectional Algorithm”. Instead, characters appear in the source HTML documents in the order they are presented in the page, with Hebrew text being explicitly reversed in order to read from right to left. In my experience, getting involved with automatic bidirectional text handling is the royal road to madness, and programmers who wish to keep what little hair remains after half a century unscrewing the inscrutable trust their instincts about things to avoid. Hebrew text, which would otherwise automatically be rendered right-to-left by the browser, is explicitly surrounded by HTML tags:

<bdo dir="ltr">תישארב</bdo>

to override the default direction based upon the characters; in the example, the text is the first word of Genesis. (You can also override the directionality of text by prefixing the Unicode LRO [U+202D] or RLO [U+202E] character and appending a PDF [U+202C] to the string. I chose to use the XHTML override tag since it makes the intent clearer when processing the document with a program.)
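Both override techniques can be sketched in Python as follows. This is an illustration, not the Perl code used to build the documents; the function names are hypothetical, and the simple string reversal assumes unpointed text with no combining marks, since a vowel sign would otherwise end up detached from its base letter.

```python
# Sketch only: store Hebrew in visual (left-to-right) order and force LTR display.
LRO = "\u202d"  # LEFT-TO-RIGHT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING

def visual_ltr_bdo(logical):
    """Reverse logical-order text and wrap it in an explicit bdo override."""
    return '<bdo dir="ltr">' + logical[::-1] + "</bdo>"

def visual_ltr_controls(logical):
    """The same idea, using Unicode control characters instead of markup."""
    return LRO + logical[::-1] + PDF

# First word of Genesis in logical (reading) order, consonants only:
# bet, resh, alef, shin, yod, tav.
bereshit = "\u05d1\u05e8\u05d0\u05e9\u05d9\u05ea"

print(visual_ltr_bdo(bereshit))
```

The markup form leaves a visible trace in the source, whereas the control characters are invisible in most editors, which is one reason a program reading the files may find the tag easier to deal with.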

To fully appreciate the insanity that Unicode bidirectional mode can induce in the minds of authors of multilingual documents, consider the following simplified HTML code for a sentence from the Hebrew Bible help file.

One writes:
100 as &#1511;,
101 as &#1488;&#1511;,
110 as &#1497;&#1511;, and
111 as &#1488;&#1497;&#1511;.

Want to guess how the browser renders this? Go ahead, guess. What you get is:

One writes: 100 as ק, 101 as אק, 110 as יק, and 111 as איק.

What? Why?? This way leads to the asylum. If you wrap the Hebrew with:

One writes:
100 as <bdo dir="ltr">&#1511;</bdo>,
101 as <bdo dir="ltr">&#1488;&#1511;</bdo>,
110 as <bdo dir="ltr">&#1497;&#1511;</bdo>, and
111 as <bdo dir="ltr">&#1488;&#1497;&#1511;</bdo>.

you get the desired:

One writes: 100 as ק, 101 as אק, 110 as יק, and 111 as איק.
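For readers who want a hint as to the "Why??", the Python sketch below prints the bidirectional class that Unicode assigns to each character in a piece of the unwrapped sentence. It does not implement the bidirectional algorithm; it only shows the inputs the algorithm works from.

```python
import unicodedata

def bidi_classes(text):
    """Pair each character with its Unicode bidirectional class."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]

# The letter qof followed by the punctuation, number, and Latin word after it.
sample = "\u05e7, 101 as"
for ch, cls in bidi_classes(sample):
    print(repr(ch), cls)
```

The Hebrew letter is strongly right-to-left (R) and the Latin letters are strongly left-to-right (L), but the comma (CS), the space (WS), and the digits (EN) have no strong direction of their own. The algorithm resolves them from their neighbours, so the punctuation and number that follow a Hebrew letter can be pulled into its right-to-left run. An explicit override removes that guesswork.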

In these examples, I have used HTML text entities (such as “&#1488;”) in the interest of comprehensibility. If you use actual Unicode characters and edit with a text editor such as Geany which infers text direction from the characters adjacent to the cursor, things get even more bewildering. The Hebrew Bible files contain Unicode characters, not text entities, but I only process them with custom Perl programs, never with a text editor.

In case somebody needs it, the ISO 8859-8 edition remains available.

Posted at July 17, 2018 13:53

Fourmilog: None Dare Call It Reason

John Walker's Fourmilab Change Log
