Next: Invoking recode, Previous: Tutorial, Up: Top [Contents][Index]
A few terms are used over and over in this manual, our wise reader will
learn their meaning right away. Both ISO (International Organization for
Standardisation) and IETF (Internet Engineering Task Force) have their
own terminology, this document does not try to stick to either one in a
strict way, while it does not want to throw more confusion in the field.
On the other hand, it would not be efficient using paraphrases all the time,
so recode
coins a few short words, which are explained below.
A charset, in the context of recode
, is a particular association
between computer codes on one side, and a repertoire of intended characters
on the other side. Codes are usually taken from a set of consecutive
small integers, starting at 0. Some characters have a graphical appearance
(glyph) or displayable effect, others have special uses like, for example,
to control devices or to interact with neighbouring codes to specify them
more precisely. So, a charset is roughly one of those tables,
giving a meaning to each of the codes from the set of allowable values.
MIME also uses the term charset with approximately the same meaning.
It does not exactly corresponds to what ISO calls a coded
character set, that is, a set of characters with an encoding for them.
An coded character set does not necessarily use all available code positions,
while a MIME charset usually tries to specify them all. A MIME charset
might be the union of a few disjoint coded character sets.
A surface is a term used in recode
only, and is a short for
surface transformation of a charset stream. This is any kind of mapping,
usually reversible, which associates physical bits in some medium for
a stream of characters taken from one or more charsets (usually one).
A surface is a kind of varnish added over a charset so it fits in actual
bits and bytes. How end of lines are exactly encoded is not really
pertinent to the charset, and so, there is surface for end of lines.
Base64
is also a surface, as we may encode any charset in it.
Other examples would DES
enciphering, or gzip
compression
(even if recode
does not offer them currently): these are ways to give
a real life to theoretical charsets. The trivial surface consists
into putting characters into fixed width little chunks of bits, usually
eight such bits per character. But things are not always that simple.
This recode
library, and the program by that name, have the purpose
of converting files between various charsets and surfaces. When this
cannot be done in exact ways, as it is often the case, the program may
get rid of the offending characters or fall back on approximations.
This library recognises or produces around 175 such charsets under 500
names, and handle a dozen surfaces. Since it can convert each charset to
almost any other one, many thousands of different conversions are possible.
The recode
program and library do not usually know how to split and
sort out textual and non-textual information which may be mixed in a single
input file. For example, there is no surface which currently addresses the
problem of how lines are blocked into physical records, when the blocking
information is added as binary markers or counters within files. So,
recode
should be given textual streams which are rather pure.
This tool pays special attention to superimposition of diacritics for some French representations. This orientation is mostly historical, it does not impair the usefulness, generality or extensibility of the program. ‘recode’ is both a French and English word. For those who pay attention to those things, the proper pronunciation is French (that is, ‘racud’, with ‘a’ like in ‘above’, and ‘u’ like in ‘cut’).
The program recode
has been written by François Pinard.
With time, it got to reuse works from other contributors, and notably,
those of Keld Simonsen and Bruno Haible.
• Charset overview | Overview of charsets | |
• Surface overview | Overview of surfaces | |
• Contributing | Contributions and bug reports |
Next: Surface overview, Previous: Introduction, Up: Introduction [Contents][Index]
Recoding is currently possible between many charsets, the bulk of which is
described by RFC 1345 tables or available in the iconv
library.
See Tabular, and see libiconv. The recode
library also
handles some charsets in some specialised ways. These are:
The introduction of RFC 1345 in recode
has brought with it a few
charsets having the functionality of older ones, but yet being different
in subtle ways. The effects have not been fully investigated yet, so for
now, clashes are avoided, the old and new charsets are kept well separate.
Conversion is possible between almost any pair of charsets. Here is a
list of the exceptions. One may not recode from the flat
,
count-characters
or dump-with-names
charsets, nor from
or to the data
, tree
or :libiconv:
charsets.
Also, if we except the data
and tree
pseudo-charsets, charsets
and surfaces live in disjoint recoding spaces, one cannot really transform
a surface into a charset or vice-versa, as surfaces are only meant to be
applied over charsets, or removed from them.
Next: Contributing, Previous: Charset overview, Up: Introduction [Contents][Index]
For various practical considerations, it sometimes happens that the codes making up a text, written in a particular charset, cannot simply be put out in a file one after another without creating problems or breaking other things. Sometimes, 8-bit codes cannot be written on a 7-bit medium, variable length codes need kind of envelopes, newlines require special treatment, etc. We sometimes have to apply surfaces to a stream of codes, which surfaces are kind of tricks used to fit the charset into those practical constraints. Moreover, similar surfaces or tricks may be useful for many unrelated charsets, and many surfaces can be used at once over a single charset.
So, recode
has machinery to describe a combination of a charset with
surfaces used over it in a file. We would use the expression pure
charset for referring to a charset free of any surface, that is, the
conceptual association between integer codes and character intents.
It is not always clear if some transformation will yield a charset or a
surface, especially for those transformations which are only meaningful
over a single charset. The recode
library is not overly picky as
identifying surfaces as such: when it is practical to consider a specialised
surface as if it were a charset, this is preferred, and done.
Previous: Surface overview, Up: Introduction [Contents][Index]
Even being the recode
author and current maintainer, I am no
specialist in charset standards. I only made recode
along the
years to solve my own needs, but felt it was applicable for the needs
of others. Some FSF people liked the program structure and suggested
to make it more widely available. I often rely on recode
users
suggestions to decide what is best to be done next.
Properly protecting recode
about possible copyright fights is a
pain for me and for contributors, but we cannot avoid addressing the issue
in the long run. Besides, the Free Software Foundation, which mandates
the GNU project, is very sensible to this matter. GNU standards suggest
that we stay cautious before looking at copyrighted code. The safest and
simplest way for me is to gather ideas and reprogram them anew, even if
this might slow me down considerably. For contributions going beyond a
few lines of code here and there, the FSF definitely requires employer
disclaimers and copyright assignments in writing.
When you contribute something to recode
, please explain what
it is about. Do not take for granted that I know those charsets which
are familiar to you. Once again, I’m no expert, and you have to help me.
Your explanations could well find their way into this documentation, too.
Also, for contributing new charsets or new surfaces, as much as possible,
please provide good, solid, verifiable references for the tables you
used1.
Many users contributed to recode
already, I am grateful to them for
their interest and involvement. Some suggestions can be integrated quickly
while some others have to be delayed, I have to draw a line somewhere when
time comes to make a new release, about what would go in it and what would
go in the next.
Please send suggestions, documentation errors and bug reports to recode-bugs@iro.umontreal.ca or, if you prefer, directly to pinard@iro.umontreal.ca, François Pinard. Do not be afraid to report details, because this program is the mere aggregation of hundreds of details.
I’m not prone at accepting a charset you just invented, and which nobody uses yet: convince your friends and community first!
Previous: Surface overview, Up: Introduction [Contents][Index]