
12 Various other charsets

Although these charsets were originally added to recode for handling texts written in French, they have found other uses. They were used extensively in the past for writing French texts with diacritics, so recode handles them particularly well for French.



12.1 World Wide Web representations

Character entities were introduced by SGML and made widely popular through HTML, the markup language in use for the World Wide Web, or Web or WWW for short. For representing unusual characters, HTML texts use special sequences, beginning with an ampersand & and ending with a semicolon ;. The sequence may itself start with a number sign # and be followed by digits, so forming a numeric character reference, or else be an alphabetic identifier, so forming a character entity reference.

The HTML standards have been revised into different HTML levels over time, and the list of allowable character entities differs among them. The later XML, meant to simplify many things, has an option (‘standalone=yes’) which greatly restricts that list. The recode library is able to convert character references between their mnemonic form and their numeric form, depending on the targeted HTML standard level. It can also, of course, convert between HTML and various other charsets.

Here is a list of those HTML variants which recode supports. Some notes have been provided by François Yergeau yergeau@alis.com.

XML-standalone

This charset is available in recode under the name XML-standalone, with h0 as an acceptable alias. It is documented in section 4.1 of http://www.w3.org/TR/REC-xml. It only knows ‘&amp;’, ‘&gt;’, ‘&lt;’, ‘&quot;’ and ‘&apos;’.

HTML_1.1

This charset is available in recode under the name HTML_1.1, with h1 as an acceptable alias. HTML 1.0 was never really documented.

HTML_2.0

This charset is available in recode under the name HTML_2.0, and has RFC1866, 1866 and h2 for aliases. HTML 2.0 entities are listed in RFC 1866. Basically, there is an entity for each alphabetical character in the right part of ISO 8859-1. In addition, there are four entities for syntax-significant ASCII characters: ‘&amp;’, ‘&gt;’, ‘&lt;’ and ‘&quot;’.

HTML-i18n

This charset is available in recode under the name HTML-i18n, and has RFC2070 and 2070 for aliases. RFC 2070 added entities to cover the whole right part of ISO 8859-1. The list is conveniently accessible at http://www.alis.com:8085/ietf/html/html-latin1.sgml. In addition, four i18n-related entities were added: ‘&zwnj;’ (‘&#8204;’), ‘&zwj;’ (‘&#8205;’), ‘&lrm;’ (‘&#8206;’) and ‘&rlm;’ (‘&#8207;’).

HTML_3.2

This charset is available in recode under the name HTML_3.2, with h3 as an acceptable alias. HTML 3.2 took up the full Latin-1 list but not the i18n-related entities from RFC 2070.

HTML_4.0

This charset is available in recode under the name HTML_4.0, and has h4 and h for aliases. Beware that the particular alias h is not tied to HTML 4.0, but to the highest HTML level supported by recode; so it might later represent HTML level 5 if this is ever created. HTML 4.0 has the whole Latin-1 list, a set of entities for symbols, mathematical symbols, and Greek letters, and another set for markup-significant and internationalization characters comprising the 4 ASCII entities, the 4 i18n-related entities from RFC 2070, plus some more. See http://www.w3.org/TR/REC-html40/sgml/entities.html.

Printable characters from Latin-1 may be used directly in an HTML text. However, partly because people have deficient keyboards, partly because people want to transmit HTML texts over non 8-bit clean channels while not using MIME, it is common (yet debatable) to use character entity references even for Latin-1 characters, when they fall outside ASCII (that is, when they have the 8th bit set).

When you recode from another charset to HTML, beware that all occurrences of double quotes, ampersands, and left or right angle brackets are translated into special sequences. However, in practice, people often use ampersands and angle brackets in the other charset for introducing HTML commands, compromising it: it is not pure HTML, nor is it pure other charset. Because these particular translations can be rather inconvenient, they may be specifically inhibited through the command option ‘-d’ (see Mixed).

Codes not having a mnemonic entity are output by recode using the ‘&#nnn;’ notation, where nnn is a decimal representation of the UCS code value. When there is an entity name for a character, it is always preferred over a numeric character reference. ASCII printable characters are always generated directly. So is the newline. While reading HTML, recode supports numeric character references as alternate writings, even when written as hexadecimal numbers, as in ‘&#xfffd;’. This is documented in:
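The output rules just described can be sketched in a few lines of Python. This is only an illustration, not the code recode uses: the function name to_html is hypothetical, the entity table is the HTML 4 list shipped with Python's standard html.entities module, and passing ASCII control characters other than the newline through unchanged is an assumption of the sketch.

```python
import html
from html.entities import codepoint2name


def to_html(text):
    """Encode text as HTML using the output rules described above."""
    pieces = []
    for ch in text:
        code = ord(ch)
        if code in codepoint2name:
            # An entity name exists, so it is preferred over a numeric reference.
            pieces.append("&" + codepoint2name[code] + ";")
        elif code < 0x80:
            # Printable ASCII and the newline are generated directly.
            # Assumption: other ASCII control characters are passed through too.
            pieces.append(ch)
        else:
            # No mnemonic entity: decimal representation of the UCS code value.
            pieces.append("&#%d;" % code)
    return "".join(pieces)


print(to_html("\u00c6sop & \u0416\n"), end="")
# Reading accepts hexadecimal numeric references as well.
print(html.unescape("&#xe9;"))
```

Note that the double quote, ampersand and angle brackets all have entity names in this table, so they are translated as well, as explained earlier.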

http://www.w3.org/TR/REC-html40/intro/sgmltut.html#h-3.2.3

When recode translates to HTML, the translation occurs according to the HTML level as selected by the goal charset. When translating from HTML, recode not only accepts the character entity references known at that level, but also those of all other levels, as well as a few alternative special sequences, to be forgiving to files using other HTML standards.

The recode program can be used to normalise an HTML file using oldish conventions. For example, it accepts ‘&AE;’, as this once was a valid writing, somewhere. However, it should always produce ‘&AElig;’ instead of ‘&AE;’. Yet, this is not completely true. If one does:

recode h3..h3 < input

the operation will be optimised into a mere copy, and you can get ‘&AE;’ this way, if you had some in your input file. But if you explicitly defeat the optimisation, like this maybe:

recode h3..u2,u2..h3 < input

then ‘&AE;’ should be normalised into ‘&AElig;’ by the operation.
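The effect of forcing the text through an intermediate charset can be pictured as a decode step followed by an encode step. The Python sketch below is an illustration under stated assumptions, not a description of recode internals: the names decode, encode, normalise and LEGACY_ALIASES are hypothetical, the alias table contains only the ‘&AE;’ case mentioned above, and the entity list is the HTML 4 one from Python's html.entities module.

```python
import re
from html.entities import codepoint2name, name2codepoint

# Hypothetical table of obsolete writings accepted on input only.
LEGACY_ALIASES = {"AE": 0xC6}

_REFERENCE = re.compile(r"&(#[xX][0-9A-Fa-f]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def decode(text):
    """Replace every recognised reference by the character it stands for."""
    def replace(match):
        body = match.group(1)
        if body[:2] in ("#x", "#X"):
            return chr(int(body[2:], 16))
        if body.startswith("#"):
            return chr(int(body[1:]))
        code = name2codepoint.get(body, LEGACY_ALIASES.get(body))
        # Unknown names are left untouched by this sketch.
        return match.group(0) if code is None else chr(code)
    return _REFERENCE.sub(replace, text)


def encode(text):
    """Write characters back, preferring entity names over numbers."""
    pieces = []
    for ch in text:
        code = ord(ch)
        if code in codepoint2name:
            pieces.append("&" + codepoint2name[code] + ";")
        elif code < 0x80:
            pieces.append(ch)
        else:
            pieces.append("&#%d;" % code)
    return "".join(pieces)


def normalise(text):
    """Decode then encode, so that alternate writings collapse to one form."""
    return encode(decode(text))


print(normalise("&AE; &#198; &#xC6;"))
```

A mere copy, by contrast, never goes through decode, which is why the old writing survives when the operation is optimised away.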



12.2 LaTeX macro calls

This charset is available in recode under the name LaTeX and has ltex as an alias. It is used for ASCII files coded to be read by LaTeX or, in certain cases, by TeX.

Whenever you recode from another charset to LaTeX, beware that all occurrences of backslashes \ are translated into the string ‘\backslash{}’. However, in practice, people often use backslashes in the other charset for introducing TeX commands, compromising it: it is not pure TeX, nor is it pure other charset. Because this translation of backslashes into ‘\backslash{}’ can be rather inconvenient, it may be inhibited through the command option ‘-d’ (see Mixed).



12.3 GNU project documentation files

This charset is available in recode under the name Texinfo and has texi and ti for aliases. It is used by the GNU project for its documentation. Texinfo files may be converted into Info files by the makeinfo program and into nice printed manuals by the TeX system.

Although recode can convert other charsets to Texinfo, it cannot read Texinfo files yet. Usage also keeps changing between versions of Texinfo, and recode only partially succeeds in following these changes correctly. So, for now, Texinfo support in recode should be considered as work still in progress (!).



12.4 Vietnamese charsets

We are currently experimenting with the implementation, in recode, of a few character sets and transliterated forms to handle the Vietnamese language. They are briefly summarised here.

TCVN

The TCVN charset has an incomplete name. It might be one of the three charsets VN1, VN2 or VN3. Yet VN2 might be a second version of VISCII. To be clarified.

VISCII

This is an 8-bit character set which seems to be rather popular for writing Vietnamese.

VPS

This is an 8-bit character set for Vietnamese. Not much reference material is available.

VIQR

The VIQR convention is a 7-bit, ASCII transliteration for Vietnamese.

VNI

The VNI convention is an 8-bit, Latin-1 transliteration for Vietnamese.

Still lacking for Vietnamese in recode are the charsets CP1129 and CP1258.



12.5 African charsets

Some African character sets are available for a few languages, when these are heavily used in countries where French is also currently spoken.

One African charset is usable for Bambara, Ewondo and Fulfude, as well as for French. This charset is available in recode under the name AFRFUL-102-BPI_OCIL. Accepted aliases are bambara, bra, ewondo and fulfude. Transliterated forms of the same are available under the name AFRFUL-103-BPI_OCIL. Accepted aliases are t-bambara, t-bra, t-ewondo and t-fulfude.

Another African charset is usable for Lingala, Sango and Wolof, as well as for French. This charset is available in recode under the name AFRLIN-104-BPI_OCIL. Accepted aliases are lingala, lin, sango and wolof. Transliterated forms of the same are available under the name AFRLIN-105-BPI_OCIL. Accepted aliases are t-lingala, t-lin, t-sango and t-wolof.

To ease exchange with ISO-8859-1, there is a charset conveying transliterated forms for Latin-1 in a way which is compatible with the other African charsets in this series. This charset is available in recode under the name AFRL1-101-BPI_OCIL. Accepted aliases are t-fra and t-francais.



12.6 Cyrillic and other charsets

The following Cyrillic charsets are already available in recode through RFC 1345 tables: CP1251 with aliases 1251, ms-cyrl and windows-1251; CSN_369103 with aliases ISO-IR-139 and KOI8_L2; ECMA-cyrillic with aliases ECMA-113, ECMA-113:1986 and iso-ir-111; IBM880 with aliases 880, CP880 and EBCDIC-Cyrillic; INIS-cyrillic with alias iso-ir-51; ISO-8859-5 with aliases cyrillic, ISO-8859-5:1988 and iso-ir-144; KOI-7; KOI-8 with alias GOST_19768-74; KOI8-R; KOI8-RU and finally KOI8-U.

Some confusion seems to remain in Roman charsets for Cyrillic languages, and because a few users requested it repeatedly, recode now offers special services in that area. Consider these charsets as experimental and debatable, as the extraneous tables describing them are still a bit fuzzy or non-standard. Hopefully, in the long run, these charsets will be covered in Keld Simonsen’s works to the satisfaction of everybody, and this section will simply disappear.

KEYBCS2

This charset is available under the name KEYBCS2, with Kamenicky as an accepted alias.

CORK

This charset is available under the name CORK, with T1 as an accepted alias.

KOI-8_CS2

This charset is available under the name KOI-8_CS2.



12.7 Easy French conventions

This charset is available in recode under the name Texte and has txte for an alias. It is a seven-bit code, identical to ASCII-BS, save for French diacritics which are noted using a slightly different convention.

At text entry time, these conventions provide a little speed up. At read time, they slightly improve the readability over a few alternate ways of coding diacritics. Of course, it would be better to have a specialised keyboard to make direct eight-bit entries and fonts for immediately displaying eight-bit ISO Latin-1 characters. But not everybody is so fortunate. In a few mailing environments, sadly enough, it still happens that the eighth bit is willfully destroyed.

Easy French has been in use in France for a while. I only slightly adapted it (the diaeresis option) to make it more comfortable to several usages in Québec originating from Université de Montréal. In fact, the main problem for me was not necessarily to invent Easy French, but to recognise the “best” convention to use (best not being defined here), and to try to solve the main pitfalls associated with the selected convention. In short, we have:

e'

for e (and some other vowels) with an acute accent,

e`

for e (and some other vowels) with a grave accent,

e^

for e (and some other vowels) with a circumflex accent,

e"

for e (and some other vowels) with a diaeresis,

c,

for c with a cedilla.

There is no attempt at expressing the ae and oe diphthongs. French also uses tildes over n and a, but seldom, and this is not represented either. In some countries, : is used instead of " to mark diaeresis. recode supports only one convention per call, depending on the ‘-c’ option of the recode command. French quotes (sometimes called “angle quotes”) are noted the same way English quotes are noted in TeX, that is, by `` and ''. No effort has been made to preserve Latin ligatures (æ, œ) which are representable in several other charsets. So, these ligatures may be lost through Easy French conventions.

The convention is prone to losing information, because the diacritic meaning overloads some characters that already have other uses. To alleviate this, some knowledge of the French language is built into the recognition routines. So, the following subtleties are systematically obeyed by the various recognisers.

  1. A comma which follows a c is interpreted as a cedilla only if it is followed by one of the vowels a, o or u.
  2. A single quote which follows an e does not necessarily mean an acute accent if it is followed by a single other one. For example:
    e'

    will give an e with an acute accent.

    e''

    will give a simple e, with a closing quotation mark.

    e'''

    will give an e with an acute accent, followed by a closing quotation mark.

    There is a problem induced by this convention if there are English quotations within a French text. In sentences like:

    There's a meeting at Archie's restaurant.
    

    the single quotes will be mistaken twice for acute accents. So English contractions and suffix possessives could be mangled.

  3. A double quote or colon, depending on the ‘-c’ option, which follows a vowel is interpreted as a diaeresis only if it is followed by another letter. But there are in French several words that end with a diaeresis, and the recode library is aware of them. There are words ending in “igue”, either feminine words without a relative masculine (besaiguë and ciguë), or feminine words with a relative masculine(13) (aiguë, ambiguë, contiguë, exiguë, subaiguë and suraiguë). There are also words not ending in “igue”, but instead, either ending in “i”(14), ending in “e” (canoë) or ending in “u”(15) (Esaü).

    Just to complete this topic, note that it would be wrong to make a rule for all words ending in “igue” as needing a diaeresis, as there are counter-examples (becfigue, bèsigue, bigue, bordigue, bourdigue, brigue, contre-digue, digue, d’intrigue, fatigue, figue, garrigue, gigue, igue, intrigue, ligue, prodigue, sarigue and zigue).
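The three rules above can be tried out with a small Python sketch. It is an illustration only and does not reproduce the recode recognisers: the function name decode_easy_french is hypothetical, only lower case letters are handled, the choice of which vowels accept which accent is an assumption based on ordinary French spelling, runs of more than three single quotes are resolved by parity (another assumption), and the list of words ending with a diaeresis is left out.

```python
# Assumed accent tables; the text above only says "e (and some other vowels)".
ACCENTS = {
    "'": {"e": "é"},
    "`": {"a": "à", "e": "è", "u": "ù"},
    "^": {"a": "â", "e": "ê", "i": "î", "o": "ô", "u": "û"},
}
DIAERESIS = {"e": "ë", "i": "ï", "u": "ü"}


def decode_easy_french(text, diaeresis_mark='"'):
    """Turn Easy French notation into accented characters."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1:i + 2]      # empty string at end of text
        after = text[i + 2:i + 3]
        if text.startswith("``", i):
            out.append("«")
            i += 2
        elif text.startswith("''", i):
            out.append("»")
            i += 2
        elif ch == "c" and nxt == "," and after in ("a", "o", "u"):
            # Rule 1: a cedilla only before a, o or u.
            out.append("ç")
            i += 2
        elif nxt == "'" and ch in ACCENTS["'"]:
            # Rule 2: count the single quotes which follow the vowel.
            run = 0
            while text[i + 1 + run:i + 2 + run] == "'":
                run += 1
            if run % 2 == 1:
                out.append(ACCENTS["'"][ch])
                i += 2
            else:
                out.append(ch)
                i += 1
        elif nxt in ("`", "^") and ch in ACCENTS[nxt]:
            out.append(ACCENTS[nxt][ch])
            i += 2
        elif nxt == diaeresis_mark and ch in DIAERESIS and after.isalpha():
            # Rule 3: a diaeresis only if another letter follows.
            out.append(DIAERESIS[ch])
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


print(decode_easy_french("``e'te'''"))
print(decode_easy_french("There's a meeting at Archie's restaurant."))
```

The second call shows the pitfall with English text mentioned under rule 2: both apostrophes are taken for acute accents.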



12.8 Mule as a multiplexed charset

This version of recode has barely started supporting multiplexed or super-charsets, that is, those encoding methods by which a single text stream may contain a combination of more than one constituent charset. The only multiplexed charset in recode is Mule, and even then, it is only very partially implemented: the only correspondence available is with Latin-1. The author implemented this quickly, only because he needed it for himself. However, Mule support is intended to become more complete in subsequent releases of recode.

Multiplexed charsets are not to be confused with mixed charset texts (see Mixed). For mixed charset input, the rules for determining which charset is current, at any given place, are rather informal, and derived from the semantics of what the file contains. On the other hand, multiplexed charsets are designed to be interpreted fairly precisely, and quite independently of any informational context.

The name Mule originally stood for multilingual enhancement to GNU Emacs; it is the result of a collective effort orchestrated by Handa Ken’ichi since 1993. When Mule got rewritten in the main development stream of GNU Emacs 20, the FSF renamed it MULE, meaning multilingual environment in GNU Emacs. Even if the charset Mule is meant to stay internal to GNU Emacs, it sometimes breaks loose in external files, and as a consequence, a recoding tool is sometimes needed. Within Emacs, Mule comes with Leim, which stands for libraries of emacs input methods. One of these libraries is named quail(16).


Footnotes

(13)

There are supposed to be seven words in this case. So, one is missing.

(14)

Look at one of the following sentences (the second has to be interpreted with the ‘-c’ option):

"Ai"e!  Voici le proble`me que j'ai"
Ai:e!  Voici le proble`me que j'ai:

There is an ambiguity between aï, the small animal, and the present indicative of avoir (first person singular), when followed by what could be a diaeresis mark. Fortunately, the case is solved by the fact that an apostrophe always precedes the verb and almost never the animal.

(15)

I did not pay attention to proper nouns, but this one showed up as being fairly evident.

(16)

Usually, quail means quail egg in Japanese, while egg alone is usually chicken egg. Both quail egg and chicken egg are popular food in Japan. The quail input system was so named because it is smaller than the previous EGG system. As for EGG, it is the translation of TAMAGO. This word comes from the Japanese sentence takusan matasete gomennasai, meaning sorry to have let you wait so long. Of course, the publication of EGG has been delayed many times… (Story by Takahashi Naoto)

