Introduction (The recode reference manual)

2 Terminology and purpose

A few terms are used over and over in this manual, our wise reader will learn their meaning right away. Both ISO (International Organization for Standardisation) and IETF (Internet Engineering Task Force) have their own terminology, this document does not try to stick to either one in a strict way, while it does not want to throw more confusion in the field. On the other hand, it would not be efficient using paraphrases all the time, so recode coins a few short words, which are explained below.

A charset, in the context of recode, is a particular association between computer codes on one side, and a repertoire of intended characters on the other side. Codes are usually taken from a set of consecutive small integers, starting at 0. Some characters have a graphical appearance (glyph) or displayable effect, others have special uses like, for example, to control devices or to interact with neighbouring codes to specify them more precisely. So, a charset is roughly one of those tables, giving a meaning to each of the codes from the set of allowable values. MIME also uses the term charset with approximately the same meaning. It does not exactly corresponds to what ISO calls a coded character set, that is, a set of characters with an encoding for them. An coded character set does not necessarily use all available code positions, while a MIME charset usually tries to specify them all. A MIME charset might be the union of a few disjoint coded character sets.

A surface is a term used in recode only, and is a short for surface transformation of a charset stream. This is any kind of mapping, usually reversible, which associates physical bits in some medium for a stream of characters taken from one or more charsets (usually one). A surface is a kind of varnish added over a charset so it fits in actual bits and bytes. How end of lines are exactly encoded is not really pertinent to the charset, and so, there is surface for end of lines. Base64 is also a surface, as we may encode any charset in it. Other examples would DES enciphering, or gzip compression (even if recode does not offer them currently): these are ways to give a real life to theoretical charsets. The trivial surface consists into putting characters into fixed width little chunks of bits, usually eight such bits per character. But things are not always that simple.

This recode library, and the program by that name, have the purpose of converting files between various charsets and surfaces. When this cannot be done in exact ways, as it is often the case, the program may get rid of the offending characters or fall back on approximations. This library recognises or produces around 175 such charsets under 500 names, and handle a dozen surfaces. Since it can convert each charset to almost any other one, many thousands of different conversions are possible.

The recode program and library do not usually know how to split and sort out textual and non-textual information which may be mixed in a single input file. For example, there is no surface which currently addresses the problem of how lines are blocked into physical records, when the blocking information is added as binary markers or counters within files. So, recode should be given textual streams which are rather pure.

This tool pays special attention to superimposition of diacritics for some French representations. This orientation is mostly historical, it does not impair the usefulness, generality or extensibility of the program. ‘recode’ is both a French and English word. For those who pay attention to those things, the proper pronunciation is French (that is, ‘racud’, with ‘a’ like in ‘above’, and ‘u’ like in ‘cut’).

The program recode has been written by François Pinard. With time, it got to reuse works from other contributors, and notably, those of Keld Simonsen and Bruno Haible.

2.1 Overview of charsets

Recoding is currently possible between many charsets, the bulk of which is described by RFC 1345 tables or available in the iconv library. See Tabular, and see libiconv. The recode library also handles some charsets in some specialised ways. These are:

6-bit charsets based on CDC display code: 6/12 code from NOS; bang-bang code from Université de Montréal;
7-bit ASCII: without any diacritics, or else: using backspace for overstriking; Unisys’ Icon convention; TeX/LaTeX coding; easy French conventions for electronic mail;
8-bit extensions to ASCII: ISO Latin-1, Atari ST code, IBM’s code for the PC, Apple’s code for the Macintosh;
8-bit non-ASCII codes: three flavours of EBCDIC;
16-bit or 31-bit universal characters, and their transfer encodings.

The introduction of RFC 1345 in recode has brought with it a few charsets having the functionality of older ones, but yet being different in subtle ways. The effects have not been fully investigated yet, so for now, clashes are avoided, the old and new charsets are kept well separate.

Conversion is possible between almost any pair of charsets. Here is a list of the exceptions. One may not recode from the flat, count-characters or dump-with-names charsets, nor from or to the data, tree or :libiconv: charsets. Also, if we except the data and tree pseudo-charsets, charsets and surfaces live in disjoint recoding spaces, one cannot really transform a surface into a charset or vice-versa, as surfaces are only meant to be applied over charsets, or removed from them.

2.2 Overview of surfaces

For various practical considerations, it sometimes happens that the codes making up a text, written in a particular charset, cannot simply be put out in a file one after another without creating problems or breaking other things. Sometimes, 8-bit codes cannot be written on a 7-bit medium, variable length codes need kind of envelopes, newlines require special treatment, etc. We sometimes have to apply surfaces to a stream of codes, which surfaces are kind of tricks used to fit the charset into those practical constraints. Moreover, similar surfaces or tricks may be useful for many unrelated charsets, and many surfaces can be used at once over a single charset.

So, recode has machinery to describe a combination of a charset with surfaces used over it in a file. We would use the expression pure charset for referring to a charset free of any surface, that is, the conceptual association between integer codes and character intents.

It is not always clear if some transformation will yield a charset or a surface, especially for those transformations which are only meaningful over a single charset. The recode library is not overly picky as identifying surfaces as such: when it is practical to consider a specialised surface as if it were a charset, this is preferred, and done.

2.3 Contributions and bug reports

Even being the recode author and current maintainer, I am no specialist in charset standards. I only made recode along the years to solve my own needs, but felt it was applicable for the needs of others. Some FSF people liked the program structure and suggested to make it more widely available. I often rely on recode users suggestions to decide what is best to be done next.

Properly protecting recode about possible copyright fights is a pain for me and for contributors, but we cannot avoid addressing the issue in the long run. Besides, the Free Software Foundation, which mandates the GNU project, is very sensible to this matter. GNU standards suggest that we stay cautious before looking at copyrighted code. The safest and simplest way for me is to gather ideas and reprogram them anew, even if this might slow me down considerably. For contributions going beyond a few lines of code here and there, the FSF definitely requires employer disclaimers and copyright assignments in writing.

When you contribute something to recode, please explain what it is about. Do not take for granted that I know those charsets which are familiar to you. Once again, I’m no expert, and you have to help me. Your explanations could well find their way into this documentation, too. Also, for contributing new charsets or new surfaces, as much as possible, please provide good, solid, verifiable references for the tables you used¹.

Many users contributed to recode already, I am grateful to them for their interest and involvement. Some suggestions can be integrated quickly while some others have to be delayed, I have to draw a line somewhere when time comes to make a new release, about what would go in it and what would go in the next.

Please send suggestions, documentation errors and bug reports to recode-bugs@iro.umontreal.ca or, if you prefer, directly to pinard@iro.umontreal.ca, François Pinard. Do not be afraid to report details, because this program is the mere aggregation of hundreds of details.

Footnotes

(1)

I’m not prone at accepting a charset you just invented, and which nobody uses yet: convince your friends and community first!

• Charset overview		Overview of charsets
• Surface overview		Overview of surfaces
• Contributing		Contributions and bug reports