Although these charsets were originally added to recode for handling texts
written in French, they have found other uses. We used them a lot in the past
for writing French texts with diacritics, so recode handles them particularly
well for French texts.
• HTML: World Wide Web representations
• LaTeX: LaTeX macro calls
• Texinfo: GNU project documentation files
• Vietnamese
• African: African charsets
• Others
• Texte: Easy French conventions
• Mule: Mule as a multiplexed charset
Character entities were introduced by SGML and made widely popular through HTML, the markup language in use for the World Wide Web (Web or WWW for short). For representing unusual characters, HTML texts use special sequences beginning with an ampersand & and ending with a semicolon ;. The sequence may itself start with a number sign # followed by digits, forming a numeric character reference, or else be an alphabetic identifier, forming a character entity reference.
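As a rough illustration of the two reference styles, here is a minimal Python sketch. It relies on Python's standard html module rather than on recode itself, and the helper names decode and to_numeric are made up for this example.

```python
import html
from html.entities import name2codepoint

# Two ways of writing the same character, U+00E9 (e with acute accent).
NAMED = "&eacute;"    # character entity reference
NUMERIC = "&#233;"    # numeric character reference, decimal


def decode(reference: str) -> str:
    """Decode HTML character references with the standard library."""
    return html.unescape(reference)


def to_numeric(name: str) -> str:
    """Rewrite an entity name (without & and ;) as a decimal reference."""
    return f"&#{name2codepoint[name]};"
```

Both NAMED and NUMERIC decode to the same single character, and to_numeric("eacute") rebuilds the numeric form from the mnemonic one.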
The HTML standards have been revised into different HTML levels over time,
and the list of allowable character entities differs between them. The later
XML, meant to simplify many things, has an option (‘standalone=yes’) which
greatly restricts that list. The recode library is able to convert character
references between their mnemonic form and their numeric form, depending on
the targeted HTML standard level. It can also, of course, convert between HTML
and various other charsets.
Here is a list of the HTML variants which recode supports.
Some notes have been provided by François Yergeau yergeau@alis.com.
XML-standalone
This charset is available in recode under the name XML-standalone, with h0
as an acceptable alias. It is documented in section 4.1 of
http://www.w3.org/TR/REC-xml. It only knows ‘&amp;’, ‘&gt;’, ‘&lt;’,
‘&quot;’ and ‘&apos;’.
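To make this five-entity restriction concrete, here is a small Python sketch built on xml.sax.saxutils from the standard library. It is not how recode works internally, and the names to_xml and from_xml are invented for the example.

```python
from xml.sax.saxutils import escape, unescape

# escape() always handles &, < and >. The two quote characters are passed
# as extra mappings so that all five predefined XML entities are covered.
QUOTES = {'"': "&quot;", "'": "&apos;"}
UNQUOTES = {entity: char for char, entity in QUOTES.items()}


def to_xml(text: str) -> str:
    """Escape only the five characters that have a predefined entity."""
    return escape(text, QUOTES)


def from_xml(text: str) -> str:
    """Reverse to_xml(); any other entity reference is left untouched."""
    return unescape(text, UNQUOTES)
```

Characters outside these five, accented letters for instance, pass through unchanged in both directions.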
HTML_1.1
This charset is available in recode under the name HTML_1.1, with h1 as an
acceptable alias. HTML 1.0 was never really documented.
HTML_2.0
This charset is available in recode under the name HTML_2.0, and has
RFC1866, 1866 and h2 for aliases. HTML 2.0 entities are listed in RFC 1866.
Basically, there is an entity for each alphabetical character in the right
part of ISO 8859-1. In addition, there are four entities for
syntax-significant ASCII characters: ‘&amp;’, ‘&gt;’, ‘&lt;’ and ‘&quot;’.
HTML-i18n
This charset is available in recode under the name HTML-i18n, and has
RFC2070 and 2070 for aliases. RFC 2070 added entities to cover the whole
right part of ISO 8859-1. The list is conveniently accessible at
http://www.alis.com:8085/ietf/html/html-latin1.sgml. In addition, four
i18n-related entities were added: ‘&zwnj;’ (‘&#8204;’), ‘&zwj;’ (‘&#8205;’),
‘&lrm;’ (‘&#8206;’) and ‘&rlm;’ (‘&#8207;’).
HTML_3.2
This charset is available in recode under the name HTML_3.2, with h3 as an
acceptable alias. HTML 3.2 took up the full Latin-1 list but not the
i18n-related entities from RFC 2070.
HTML_4.0
This charset is available in recode under the name HTML_4.0, and has h4 and
h for aliases. Beware that the particular alias h is not tied to HTML 4.0,
but to the highest HTML level supported by recode; so it might later
represent HTML level 5 if this is ever created. HTML 4.0 has the whole
Latin-1 list, a set of entities for symbols, mathematical symbols, and Greek
letters, and another set for markup-significant and internationalization
characters comprising the 4 ASCII entities, the 4 i18n-related entities from
RFC 2070 plus some more.
See http://www.w3.org/TR/REC-html40/sgml/entities.html.
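The grouping of entities described for these levels can be explored with a short Python sketch. It uses the name2codepoint table from Python's html.entities module, which to my knowledge follows the HTML 4 entity list. The classify function and its three labels are assumptions made for this illustration only.

```python
from html.entities import name2codepoint

# The four entities for syntax-significant ASCII characters.
ASCII_FOUR = {"amp", "gt", "lt", "quot"}


def classify(name: str) -> str:
    """Place an entity name into one of three rough groups."""
    codepoint = name2codepoint[name]
    if name in ASCII_FOUR:
        return "ascii"
    if 0xA0 <= codepoint <= 0xFF:
        return "latin1"   # right part of ISO 8859-1
    return "other"        # symbols, Greek letters, i18n marks and so on
```

For instance, classify("eacute") falls in the Latin-1 group, while classify("alpha") and classify("zwnj") fall in the remaining group.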
Printable characters from Latin-1 may be used directly in an HTML text. However, partly because people have deficient keyboards, partly because people want to transmit HTML texts over non 8-bit clean channels while not using MIME, it is common (yet debatable) to use character entity references even for Latin-1 characters, when they fall outside ASCII (that is, when they have the 8th bit set).
When you recode from another charset to HTML, beware that all occurrences of
double quotes, ampersands, and left or right angle brackets are translated
into special sequences. However, in practice, people often use ampersands
and angle brackets in the other charset for introducing HTML commands,
compromising it: it is not pure HTML, nor is it purely the other charset.
Since these particular translations can be rather inconvenient, they may be
specifically inhibited through the command option ‘-d’ (see Mixed).
Codes not having a mnemonic entity are output by recode using the ‘&#nnn;’
notation, where nnn is a decimal representation of the UCS code value. When
there is an entity name for a character, it is always preferred over a
numeric character reference. ASCII printable characters are always generated
directly. So is the newline. While reading HTML, recode supports numeric
character references as alternate writings, even when written as hexadecimal
numbers, as in ‘&#xfffd;’. This is documented in:
http://www.w3.org/TR/REC-html40/intro/sgmltut.html#h-3.2.3
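The output rules above can be sketched in a few lines of Python. This is only an approximation under stated assumptions: it uses Python's codepoint2name table in place of recode's own tables, it treats the four markup-significant ASCII characters as always escaped, and it lets every other 7-bit character through, including control characters. The name to_html is hypothetical.

```python
import html
from html.entities import codepoint2name

# Double quote, ampersand and angle brackets are always written as entities.
MARKUP = {ord(char) for char in '"&<>'}


def to_html(text: str) -> str:
    """Prefer an entity name, fall back on a decimal reference."""
    pieces = []
    for char in text:
        codepoint = ord(char)
        if codepoint < 128 and codepoint not in MARKUP:
            pieces.append(char)                       # generated directly
        elif codepoint in codepoint2name:
            pieces.append(f"&{codepoint2name[codepoint]};")
        else:
            pieces.append(f"&#{codepoint};")          # no mnemonic entity
    return "".join(pieces)


def from_html(text: str) -> str:
    """Read named, decimal and hexadecimal references alike."""
    return html.unescape(text)
```

A character such as U+0151, which has no entity name in that table, comes out as a decimal reference, while from_html accepts ‘&#xE9;’ just as well as ‘&eacute;’.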
When recode translates to HTML, the translation occurs according to the HTML
level as selected by the goal charset. When translating from HTML, recode
not only accepts the character entity references known at that level, but
also those of all other levels, as well as a few alternative special
sequences, to be forgiving to files using other HTML standards.
The recode program can be used to normalise an HTML file using oldish
conventions. For example, it accepts ‘&AE;’, as this once was a valid
writing, somewhere. However, it should always produce ‘&AElig;’ instead of
‘&AE;’. Yet, this is not completely true. If one does:
recode h3..h3 < input
the operation will be optimised into a mere copy, and you can get ‘&AE;’ this way, if you had some in your input file. But if you explicitly defeat the optimisation, like this maybe:
recode h3..u2,u2..h3 < input
then ‘&AE;’ should be normalised into ‘&AElig;’ by the operation.
This charset is available in recode under the name LaTeX and has ltex as an
alias. It is used for ASCII files coded to be read by LaTeX or, in certain
cases, by TeX.
Whenever you recode from another charset to LaTeX, beware that all
occurrences of backslashes \ are translated into the string
‘\backslash{}’. However, in practice, people often use backslashes in the
other charset for introducing TeX commands, compromising it: it is not pure
TeX, nor is it purely the other charset. Since this translation of
backslashes into ‘\backslash{}’ can be rather inconvenient, it may be
inhibited through the command option ‘-d’ (see Mixed).
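The effect on backslashes can be pictured with a tiny Python sketch. The function name and its keep_markup parameter are hypothetical; keep_markup merely stands in for the idea behind the ‘-d’ option and does not reproduce recode's actual implementation.

```python
def escape_backslashes(text: str, keep_markup: bool = False) -> str:
    """Rewrite each backslash as \\backslash{} unless markup is kept.

    keep_markup is a made-up flag for this example: when true, existing
    backslashes are assumed to introduce TeX commands and are left alone.
    """
    if keep_markup:
        return text
    return text.replace("\\", "\\backslash{}")
```

With the default setting, a literal path such as C:\tmp becomes C:\backslash{}tmp, whereas with keep_markup set, a command such as \emph{x} survives as written.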
This charset is available in recode under the name Texinfo and has texi and
ti for aliases. It is used by the GNU project for its documentation. Texinfo
files may be converted into Info files by the makeinfo program and into nice
printed manuals by the TeX system.
Even if recode may transform other charsets to Texinfo, it cannot read
Texinfo files yet. Usages also keep changing between versions of Texinfo,
and recode only partially succeeds in correctly following these changes. So,
for now, Texinfo support in recode should be considered as work still in
progress (!).
We are currently experimenting with the implementation, in recode, of a few
character sets and transliterated forms to handle the Vietnamese language.
They are briefly summarised here.
TCVN
The TCVN charset has an incomplete name. It might be one of the three
charsets VN1, VN2 or VN3. Yet VN2 might be a second version of VISCII. To be
clarified.
VISCII
This is an 8-bit character set which seems to be rather popular for writing Vietnamese.
VPS
This is an 8-bit character set for Vietnamese. Not much reference material is available.
VIQR
The VIQR convention is a 7-bit, ASCII transliteration for Vietnamese.
VNI
The VNI convention is an 8-bit, Latin-1 transliteration for Vietnamese.
Still lacking for Vietnamese in recode are the charsets CP1129 and CP1258.
Some African character sets are available for a few languages, when these are heavily used in countries where French is also currently spoken.
One African charset is usable for Bambara, Ewondo and Fulfude, as well as
for French. This charset is available in recode under the name
AFRFUL-102-BPI_OCIL. Accepted aliases are bambara, bra, ewondo and fulfude.
Transliterated forms of the same are available under the name
AFRFUL-103-BPI_OCIL. Accepted aliases are t-bambara, t-bra, t-ewondo and
t-fulfude.
Another African charset is usable for Lingala, Sango and Wolof, as well as
for French. This charset is available in recode under the name
AFRLIN-104-BPI_OCIL. Accepted aliases are lingala, lin, sango and wolof.
Transliterated forms of the same are available under the name
AFRLIN-105-BPI_OCIL. Accepted aliases are t-lingala, t-lin, t-sango and
t-wolof.
To ease exchange with ISO-8859-1, there is a charset conveying
transliterated forms for Latin-1 in a way which is compatible with the other
African charsets in this series. This charset is available in recode under
the name AFRL1-101-BPI_OCIL. Accepted aliases are t-fra and t-francais.
The following Cyrillic charsets are already available in recode through
RFC 1345 tables: CP1251 with aliases 1251, ms-cyrl and windows-1251;
CSN_369103 with aliases ISO-IR-139 and KOI8_L2; ECMA-cyrillic with aliases
ECMA-113, ECMA-113:1986 and iso-ir-111; IBM880 with aliases 880, CP880 and
EBCDIC-Cyrillic; INIS-cyrillic with alias iso-ir-51; ISO-8859-5 with aliases
cyrillic, ISO-8859-5:1988 and iso-ir-144; KOI-7; KOI-8 with alias
GOST_19768-74; KOI8-R; KOI8-RU and finally KOI8-U.
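For readers who want to see what charset aliases and a Cyrillic conversion look like in practice, here is a Python sketch. It uses Python's own codecs registry, which is a different tool with its own alias table, so it only overlaps with recode for a few names such as windows-1251 and iso-ir-144. The helper names are invented for the example.

```python
import codecs


def same_charset(first: str, second: str) -> bool:
    """Tell whether two names resolve to the same Python codec."""
    return codecs.lookup(first).name == codecs.lookup(second).name


def convert(data: bytes, before: str, after: str) -> bytes:
    """Decode bytes from one charset and encode them into another."""
    return data.decode(before).encode(after)


# Six Cyrillic letters (the Russian word for "hello") encoded in CP1251.
GREETING_CP1251 = b"\xcf\xf0\xe8\xe2\xe5\xf2"
```

Converting GREETING_CP1251 from cp1251 to koi8-r and then decoding the result as koi8-r yields the same six letters, which is the kind of round trip recode performs between the charsets listed above.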
There still seems to be some confusion about Roman charsets for Cyrillic
languages, and because a few users requested it repeatedly, recode now
offers special services in that area. Consider these charsets as
experimental and debatable, as the external tables describing them are still
a bit fuzzy or non-standard. Hopefully, in the long run, these charsets will
be covered in Keld Simonsen’s works to the satisfaction of everybody, and
this section will simply disappear.
KEYBCS2
This charset is available under the name KEYBCS2, with Kamenicky as an
accepted alias.
CORK
This charset is available under the name CORK, with T1 as an accepted alias.
KOI-8_CS2
This charset is available under the name KOI-8_CS2.
This charset is available in recode under the name Texte and has txte for an
alias. It is a seven-bit code, identical to ASCII-BS, save for French
diacritics which are noted using a slightly different convention.
At text entry time, these conventions provide a little speed up. At read time, they slightly improve the readability over a few alternate ways of coding diacritics. Of course, it would be better to have a specialised keyboard to make direct eight-bit entries, and fonts for immediately displaying eight-bit ISO Latin-1 characters. But not everybody is so fortunate. In a few mailing environments, sadly enough, it still happens that the eighth bit is wilfully destroyed.
Easy French has been in use in France for a while. I only slightly adapted it (the diaeresis option) to make it more comfortable for several usages in Québec originating from Université de Montréal. In fact, the main problem for me was not necessarily to invent Easy French, but to recognise the “best” convention to use (best not being defined, here) and to try to solve the main pitfalls associated with the selected convention. In short, we have:
e' for e (and some other vowels) with an acute accent,
e` for e (and some other vowels) with a grave accent,
e^ for e (and some other vowels) with a circumflex accent,
e" for e (and some other vowels) with a diaeresis,
c, for c with a cedilla.
There is no attempt at expressing the ae and oe diphthongs. French also uses
tildes over n and a, but seldom, and this is not represented either. In some
countries, : is used instead of " to mark diaeresis. recode supports only
one convention per call, depending on the ‘-c’ option of the recode command.
French quotes (sometimes called “angle quotes”) are noted the same way
English quotes are noted in TeX, that is, by `` and ''.
No effort has been put to preserve Latin ligatures (æ, œ)
which are representable in several other charsets. So, these ligatures
may be lost through Easy French conventions.
The convention is prone to losing information, because the diacritic meaning overloads some characters that already have other uses. To alleviate this, some knowledge of the French language is built into the recognition routines. So, the following subtleties are systematically obeyed by the various recognisers.
e' will give an e with an acute accent.
e'' will give a simple e, with a closing quotation mark.
e''' will give an e with an acute accent, followed by a closing quotation mark.
There is a problem induced by this convention if there are English quotations within a French text. In sentences like:
There's a meeting at Archie's restaurant.
the single quotes will be mistaken twice for acute accents. So English contractions and suffix possessives could be mangled.
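The three apostrophe cases and the English pitfall can be reproduced with a deliberately small Python sketch. It is an assumption-laden toy, not recode's recogniser: it only looks at the letter e, it uses » for the closing quotation mark, and it leaves runs of four or more apostrophes untouched since the text does not say how they are treated. The name decode_e_quotes is hypothetical.

```python
import re

CLOSING_QUOTE = "\u00bb"   # the closing French quote, written '' in Texte
E_ACUTE = "\u00e9"


def decode_e_quotes(text: str) -> str:
    """Resolve one to three apostrophes following the letter e."""

    def resolve(match: re.Match) -> str:
        count = len(match.group(1))
        if count == 1:
            return E_ACUTE                    # acute accent
        if count == 2:
            return "e" + CLOSING_QUOTE        # plain e, then closing quote
        return E_ACUTE + CLOSING_QUOTE        # accented e, then closing quote

    # The lookahead refuses runs longer than three apostrophes.
    return re.sub(r"e('{1,3})(?!')", resolve, text)
```

Feeding the English sentence quoted above through decode_e_quotes turns "There's" and "Archie's" into words carrying an acute accent, which is exactly the mangling the text warns about.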
French has words that end with a diaeresis, and the recode library is aware
of them. There are words ending in “igue”, either feminine words without a
relative masculine (besaiguë and ciguë), or feminine words with a relative
masculine (13) (aiguë, ambiguë, contiguë, exiguë, subaiguë and suraiguë).
There are also words not ending in “igue”, but instead either ending in
“i” (14), ending in “e” (canoë) or ending in “u” (15) (Esaü).
Just to complete this topic, note that it would be wrong to make a rule for all words ending in “igue” as needing a diaeresis, as there are counter-examples (becfigue, bèsigue, bigue, bordigue, bourdigue, brigue, contre-digue, digue, d’intrigue, fatigue, figue, garrigue, gigue, igue, intrigue, ligue, prodigue, sarigue and zigue).
This version of recode barely starts supporting multiplexed or
super-charsets, that is, those encoding methods by which a single text
stream may contain a combination of more than one constituent charset. The
only multiplexed charset in recode is Mule, and even then, it is only very
partially implemented: the only correspondence available is with Latin-1.
The author quickly implemented this only because he needed it for himself.
However, Mule support is intended to become more complete in subsequent
releases of recode.
Multiplexed charsets are not to be confused with mixed charset texts (see Mixed). For mixed charset input, the rules for distinguishing which charset is current at any given place are rather informal, and derived from the semantics of what the file contains. Multiplexed charsets, on the other hand, are designed to be interpreted fairly precisely, and quite independently of any informational context.
The spelling Mule originally stands for multilingual enhancement to GNU
Emacs; it is the result of a collective effort orchestrated by Handa
Ken’ichi since 1993. When Mule got rewritten in the main development stream
of GNU Emacs 20, the FSF renamed it MULE, meaning multilingual environment
in GNU Emacs. Even if the charset Mule is meant to stay internal to GNU
Emacs, it sometimes breaks loose in external files, and as a consequence, a
recoding tool is sometimes needed. Within Emacs, Mule comes with Leim, which
stands for libraries of emacs input methods. One of these libraries is named
quail (16).
(13) There are supposed to be seven words in this case. So, one is missing.
(14) Look at one of the following sentences (the second has to be interpreted with the ‘-c’ option):
"Ai"e! Voici le proble`me que j'ai"
Ai:e! Voici le proble`me que j'ai:
There is an ambiguity between the name of a small animal and the present indicative of avoir (first person singular), when followed by what could be a diaeresis mark. Fortunately, the case is solved by the fact that an apostrophe always precedes the verb and almost never the animal.
(15) I did not pay attention to proper nouns, but this one showed up as being fairly evident.
(16) Usually, quail means quail egg in Japanese, while egg alone is usually
chicken egg. Both quail egg and chicken egg are popular food in Japan. The
quail input system was so named because it is smaller than the previous EGG
system. As for EGG, it is the translation of TAMAGO. This word comes from
the Japanese sentence takusan matasete gomennasai, meaning sorry to have
kept you waiting so long. Of course, the publication of EGG has been delayed
many times… (Story by Takahashi Naoto)