Miscellaneous (The recode reference manual)

12 Various other charsets

Even if these charsets were originally added to recode for handling texts written in French, they find other uses. We did use them a lot for writing French diacriticised texts in the past, so recode knows how to handle these particularly well for French texts.

12.1 World Wide Web representations

Character entities have been introduced by SGML and made widely popular through HTML, the markup language in use for the World Wide Web, or Web or WWW for short. For representing unusual characters, HTML texts use special sequences, beginning with an ampersand & and ending with a semicolon ;. The sequence may itself start with a number sign # and be followed by digits, so forming a numeric character reference, or else be an alphabetic identifier, so forming a character entity reference.

The HTML standards have been revised into different HTML levels over time, and the list of allowable character entities differs among them. The later XML, meant to simplify many things, has an option (‘standalone=yes’) which greatly restricts that list. The recode library is able to convert character references between their mnemonic form and their numeric form, depending on the targeted HTML standard level. It can also, of course, convert between HTML and various other charsets.

Here is a list of those HTML variants which recode supports. Some notes have been provided by François Yergeau yergeau@alis.com.

Printable characters from Latin-1 may be used directly in an HTML text. However, partly because people have deficient keyboards, partly because people want to transmit HTML texts over non 8-bit clean channels while not using MIME, it is common (yet debatable) to use character entity references even for Latin-1 characters, when they fall outside ASCII (that is, when they have the 8th bit set).

When you recode from another charset to HTML, beware that all occurrences of double quotes, ampersands, and left or right angle brackets are translated into special sequences. However, in practice, people often use ampersands and angle brackets in the other charset for introducing HTML commands, compromising it: it is not pure HTML, nor is it pure other charset. Because these particular translations can be rather inconvenient, they may be specifically inhibited through the command option ‘-d’ (see Mixed).

Codes not having a mnemonic entity are output by recode using the ‘&#nnn;’ notation, where nnn is a decimal representation of the UCS code value. When there is an entity name for a character, it is always preferred over a numeric character reference. ASCII printable characters are always generated directly. So is the newline. While reading HTML, recode also accepts numeric character references as alternate writings, even when written as hexadecimal numbers, as in ‘&#xfffd;’. This is documented in:

When recode translates to HTML, the translation occurs according to the HTML level as selected by the goal charset. When translating from HTML, recode not only accepts the character entity references known at that level, but also those of all other levels, as well as a few alternative special sequences, to be forgiving to files using other HTML standards.
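The output and input rules described above can be sketched in Python. This is only an illustration, not recode's implementation: the function name `to_html` is hypothetical, and the entity table is the HTML 4.01 one shipped in Python's standard `html.entities` module, assumed here as a stand-in for recode's per-level tables.

```python
import html
from html.entities import codepoint2name

def to_html(text):
    """Encode text with the policy sketched above (illustration only)."""
    pieces = []
    for ch in text:
        code = ord(ch)
        if code in codepoint2name:
            # An entity name exists (this includes ", &, < and >): prefer it.
            pieces.append("&" + codepoint2name[code] + ";")
        elif ch == "\n" or 32 <= code <= 126:
            # ASCII printable characters and the newline are output directly.
            pieces.append(ch)
        else:
            # No mnemonic entity: decimal numeric character reference.
            pieces.append("&#%d;" % code)
    return "".join(pieces)

print(to_html('caf\u00e9 & "th\u00e9" \u0100'))

# Reading: decimal, hexadecimal and mnemonic forms all denote the same character.
print(html.unescape("&#233; &#xe9; &eacute;"))
```

The standard `html.unescape` function is used for the reading direction because, like the behaviour described above, it accepts mnemonic, decimal and hexadecimal references interchangeably.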

The recode program can be used to normalise an HTML file using oldish conventions. For example, it accepts ‘&AE;’, as this once was a valid writing, somewhere. However, it should always produce ‘&AElig;’ instead of ‘&AE;’. Yet, this is not completely true. If one does:

the operation will be optimised into a mere copy, and you can get ‘&AE;’ this way, if you had some in your input file. But if you explicitly defeat the optimisation, like this maybe:

12.2 LaTeX macro calls

This charset is available in recode under the name LaTeX and has ltex as an alias. It is used for ASCII files coded to be read by LaTeX or, in certain cases, by TeX.

Whenever you recode from another charset to LaTeX, beware that all occurrences of backslashes \ are translated into the string ‘\backslash{}’. However, in practice, people often use backslashes in the other charset for introducing TeX commands, compromising it: it is not pure TeX, nor is it pure other charset. Because this translation of backslashes into ‘\backslash{}’ can be rather inconvenient, it may be inhibited through the command option ‘-d’ (see Mixed).
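The effect of the backslash rule, and of inhibiting it, can be pictured with a small Python sketch. The function name `escape_backslashes` and its `diacritics_only` parameter are hypothetical and only mimic the behaviour described above; they are not part of recode.

```python
def escape_backslashes(text, diacritics_only=False):
    """Mimic the backslash translation described above (illustration only)."""
    if diacritics_only:
        # Rule inhibited: backslashes already introducing TeX commands are kept.
        return text
    # Default: every backslash becomes the string \backslash{}.
    return text.replace("\\", "\\backslash{}")

print(escape_backslashes(r"C:\temp"))
print(escape_backslashes(r"\emph{word}", diacritics_only=True))
```

With the rule active, a text that already contains TeX commands would see every command mangled, which is why inhibiting it matters for mixed input.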

12.3 GNU project documentation files

This charset is available in recode under the name Texinfo and has texi and ti for aliases. It is used by the GNU project for its documentation. Texinfo files may be converted into Info files by the makeinfo program and into nice printed manuals by the TeX system.

Although recode can convert other charsets to Texinfo, it cannot yet read Texinfo files. Usages are also changing between versions of Texinfo these days, and recode only partially succeeds in correctly following these changes. So, for now, Texinfo support in recode should be considered as work still in progress (!).

12.4 Vietnamese charsets

We are currently experimenting with the implementation, in recode, of a few character sets and transliterated forms to handle the Vietnamese language. They are quite briefly summarised here.

Still lacking for Vietnamese in recode are the charsets CP1129 and CP1258.

12.5 African charsets

Some African character sets are available for a few languages, when these are heavily used in countries where French is also currently spoken.

One African charset is usable for Bambara, Ewondo and Fulfude, as well as for French. This charset is available in recode under the name AFRFUL-102-BPI_OCIL. Accepted aliases are bambara, bra, ewondo and fulfude. Transliterated forms of the same are available under the name AFRFUL-103-BPI_OCIL. Accepted aliases are t-bambara, t-bra, t-ewondo and t-fulfude.

Another African charset is usable for Lingala, Sango and Wolof, as well as for French. This charset is available in recode under the name AFRLIN-104-BPI_OCIL. Accepted aliases are lingala, lin, sango and wolof. Transliterated forms of the same are available under the name AFRLIN-105-BPI_OCIL. Accepted aliases are t-lingala, t-lin, t-sango and t-wolof.

To ease exchange with ISO-8859-1, there is a charset conveying transliterated forms for Latin-1 in a way which is compatible with the other African charsets in this series. This charset is available in recode under the name AFRL1-101-BPI_OCIL. Accepted aliases are t-fra and t-francais.

12.6 Cyrillic and other charsets

The following Cyrillic charsets are already available in recode through RFC 1345 tables: CP1251 with aliases 1251, ms-cyrl and windows-1251; CSN_369103 with aliases ISO-IR-139 and KOI8_L2; ECMA-cyrillic with aliases ECMA-113, ECMA-113:1986 and iso-ir-111; IBM880 with aliases 880, CP880 and EBCDIC-Cyrillic; INIS-cyrillic with alias iso-ir-51; ISO-8859-5 with aliases cyrillic, ISO-8859-5:1988 and iso-ir-144; KOI-7; KOI-8 with alias GOST_19768-74; KOI8-R; KOI8-RU and finally KOI8-U.

There seems to remain some confusion in Roman charsets for Cyrillic languages, and because a few users requested it repeatedly, recode now offers special services in that area. Consider these charsets as experimental and debatable, as the extraneous tables describing them are still a bit fuzzy or non-standard. Hopefully, in the long run, these charsets will be covered in Keld Simonsen’s works to the satisfaction of everybody, and this section will merely disappear.

12.7 Easy French conventions

This charset is available in recode under the name Texte and has txte for an alias. It is a seven-bit code, identical to ASCII-BS, save for French diacritics which are noted using a slightly different convention.

At text entry time, these conventions provide a little speed up. At read time, they slightly improve the readability over a few alternate ways of coding diacritics. Of course, it would be better to have a specialised keyboard to make direct eight-bit entries, and fonts for immediately displaying eight-bit ISO Latin-1 characters. But not everybody is so fortunate. In a few mailing environments, sadly enough, it still happens that the eighth bit is often wilfully destroyed.

Easy French has been in use in France for a while. I only slightly adapted it (the diaeresis option) to make it more comfortable to several usages in Québec originating from Université de Montréal. In fact, the main problem for me was not necessarily to invent Easy French, but to recognise the “best” convention to use (best not being defined here), and to try to solve the main pitfalls associated with the selected convention. Shortly said, we have:

There is no attempt at expressing the ae and oe diphthongs. French also uses tildes over n and a, but seldom, and this is not represented either. In some countries, : is used instead of " to mark diaeresis. recode supports only one convention per call, depending on the ‘-c’ option of the recode command. French quotes (sometimes called “angle quotes”) are noted the same way English quotes are noted in TeX, id est by `` and ''. No effort has been made to preserve Latin ligatures (æ, œ), which are representable in several other charsets. So, these ligatures may be lost through Easy French conventions.
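As a rough illustration of how such a convention might be decoded, here is a naive Python sketch. It is not recode's recogniser. It assumes, following the example sentences of footnote 14, that a grave accent is written as a backquote after the letter and a diaeresis as " (or : when the colon convention is selected) after the vowel, and it applies only the single apostrophe rule mentioned in that footnote. The function name `decode_easy_french` and its `colons` parameter are hypothetical.

```python
import re
import unicodedata

GRAVE = "\u0300"      # combining grave accent
DIAERESIS = "\u0308"  # combining diaeresis

def compose(base, combining):
    """Merge a base letter and a combining mark into one precomposed character."""
    return unicodedata.normalize("NFC", base + combining)

def decode_easy_french(text, colons=False):
    """Naive decoder for two Easy French marks (illustration only)."""
    mark = ":" if colons else '"'
    # A letter followed by a backquote takes a grave accent.
    text = re.sub(r"([aeuAEU])`", lambda m: compose(m.group(1), GRAVE), text)
    # A vowel followed by the diaeresis mark takes a diaeresis, except in
    # "'ai" (as in j'ai), where the apostrophe signals the verb.
    pattern = r"(?<!'a)([eiuEIU])" + re.escape(mark)
    return re.sub(pattern, lambda m: compose(m.group(1), DIAERESIS), text)

print(decode_easy_french('"Ai"e!  Voici le proble`me que j\'ai"'))
print(decode_easy_french("Ai:e!  Voici le proble`me que j'ai:", colons=True))
```

The lookbehind keeps the closing quote (or colon) after j'ai from being read as a diaeresis, which is the kind of language knowledge a real recogniser needs far more of.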

The convention is prone to losing information, because the diacritic meaning overloads some characters that already have other uses. To alleviate this, some knowledge of the French language is built into the recognition routines. So, the following subtleties are systematically obeyed by the various recognisers.

12.8 Mule as a multiplexed charset

This version of recode barely starts supporting multiplexed or super-charsets, that is, those encoding methods by which a single text stream may contain a combination of more than one constituent charset. The only multiplexed charset in recode is Mule, and even then, it is only very partially implemented: the only correspondence available is with Latin-1. The author quickly implemented this only because he needed it for himself. However, Mule support is intended to become more complete in subsequent releases of recode.

Multiplexed charsets are not to be confused with mixed charset texts (see Mixed). For mixed charset input, the rules for determining which charset is current, at any given place, are rather informal, and derived from the semantics of what the file contains. On the other hand, multiplexed charsets are designed to be interpreted fairly precisely, and quite independently of any informational context.

The spelling Mule originally stands for multilingual enhancement to GNU Emacs. It is the result of a collective effort orchestrated by Handa Ken’ichi since 1993. When Mule got rewritten in the main development stream of GNU Emacs 20, the FSF renamed it MULE, meaning multilingual environment in GNU Emacs. Even if the charset Mule is meant to stay internal to GNU Emacs, it sometimes breaks loose in external files, and as a consequence, a recoding tool is sometimes needed. Within Emacs, Mule comes with Leim, which stands for libraries of emacs input methods. One of these libraries is named quail¹⁶.

Footnotes

(13)

There are supposed to be seven words in this case. So, one is missing.

(14)

Look at one of the following sentences (the second has to be interpreted with the ‘-c’ option):

"Ai"e!  Voici le proble`me que j'ai"
Ai:e!  Voici le proble`me que j'ai:

There is an ambiguity between aï, the small animal, and the indicative future of avoir (first person singular), when followed by what could be a diaeresis mark. Fortunately, the case is solved by the fact that an apostrophe always precedes the verb and almost never the animal.

(15)

I did not pay attention to proper nouns, but this one showed up as being fairly evident.

(16)

Usually, quail means quail egg in Japanese, while egg alone is usually chicken egg. Both quail egg and chicken egg are popular food in Japan. The quail input system was so named because it is smaller than the previous EGG system. As for EGG, it is the translation of TAMAGO. This word comes from the Japanese sentence takusan matasete gomennasai, meaning sorry to have kept you waiting so long. Of course, the publication of EGG has been delayed many times… (Story by Takahashi Naoto)
