Surfaces (The recode reference manual)

13 All about surfaces

The trivial surface consists of using a fixed number of bits (often eight) for each character; together, the bits hold the integer value of the index for the character in its charset table. There are many kinds of surfaces beyond the trivial one, all having the purpose of increasing selected qualities for storage or transmission. For example, surfaces might increase the resistance to channel limits (Base64), the transmission speed (gzip), the information privacy (DES), the conformance to operating system conventions (CR-LF), the blocking into records (VB), and surely other things as well¹⁷. Many surfaces may be applied to a stream of characters from a charset. The order of application of surfaces is important, and surfaces should be removed in the reverse order of their application.
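The ordering rule can be illustrated outside of recode. The following is only a sketch using Python's standard gzip and base64 modules, not recode itself (which, as the footnote says, does not provide a gzip surface); the function names apply_surfaces and remove_surfaces are made up for the illustration.

```python
import base64
import gzip


def apply_surfaces(data: bytes) -> bytes:
    """Apply two surfaces in order: compression first, then Base64."""
    return base64.b64encode(gzip.compress(data))


def remove_surfaces(wrapped: bytes) -> bytes:
    """Remove the surfaces in the reverse order: Base64 first, then compression."""
    return gzip.decompress(base64.b64decode(wrapped))


# Some 8-bit text (the two non-ASCII bytes are arbitrary sample values).
original = b"some 8-bit text: \xe9\xe8\n" * 20
wrapped = apply_surfaces(original)
restored = remove_surfaces(wrapped)
```

Removing the surfaces in the same order as they were applied (decompressing before decoding Base64) would fail, since the outermost layer of the wrapped data is Base64 text and not a gzip stream.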

Even if surfaces may generally be applied to various charsets, some surfaces were specifically designed for a particular charset, and would not make much sense if applied to other charsets. In such cases, these conceptual surfaces have been implemented as recode charsets, instead of as surfaces. This choice leads to cleaner syntax and usage. See Universal.

Surfaces are implemented within recode as special charsets which may only transform to or from the data or tree special charsets. Clever users may use this knowledge for writing surface names in requests exactly as if they were pure charsets, when the only need is to change surfaces without any kind of recoding between real charsets. In such contexts, either data or tree may also be used as if it were some kind of generic, anonymous charset: the request ‘data..surface’ merely adds the given surface, while the request ‘surface..data’ removes it.

The recode library distinguishes between mere data surfaces, and structural surfaces, also called tree surfaces for short. Structural surfaces might allow, in the long run, transformations between a few specialised representations of structural information like MIME parts, Perl or Python initialisers, LISP S-expressions, XML, Emacs outlines, etc.

We are still experimenting with surfaces in recode. The concept opens the doors to many avenues; it is not clear yet which ones are worth pursuing, and which should be abandoned. In particular, implementation of structural surfaces is barely starting, and there is not even a commitment that tree surfaces will stay in recode if they prove to be more cumbersome than useful. This chapter presents all surfaces currently available.

13.1 Permuting groups of bytes

A permutation is a surface transformation which reorders groups of eight-bit bytes. A 21 permutation exchanges pairs of successive bytes. If the text contains an odd number of bytes, the last byte is merely copied. A 4321 permutation inverts the order of quadruples of bytes. If the text does not contain a multiple of four bytes, the remaining bytes are nevertheless permuted as 321 if there are three bytes, 21 if there are two bytes, or merely copied otherwise.

21: This surface is available in recode under the name 21-Permutation and has swabytes for an alias.
4321: This surface is available in recode under the name 4321-Permutation.
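
The behaviour described above, including the handling of leftover bytes, may be sketched as follows. This is an illustration in Python under the rules stated in this section, not the library's own code; the names permute_groups, permute_21 and permute_4321 are hypothetical.

```python
def permute_groups(data: bytes, size: int) -> bytes:
    """Reverse the bytes inside each successive group of `size` bytes.

    A short final group is reversed as well, so three leftover bytes come
    out as 321, two as 21, and a single leftover byte is merely copied.
    """
    out = bytearray()
    for start in range(0, len(data), size):
        out += data[start:start + size][::-1]
    return bytes(out)


def permute_21(data: bytes) -> bytes:
    """Exchange pairs of successive bytes."""
    return permute_groups(data, 2)


def permute_4321(data: bytes) -> bytes:
    """Invert the order of quadruples of bytes."""
    return permute_groups(data, 4)
```

Since reversing a group twice restores it, applying and removing such a surface are the same operation in this sketch.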

13.2 Representation for end of lines

The same charset might slightly differ, from one system to another, for the single fact that end of lines are not represented identically on all systems. The representation for an end of line within recode is the ASCII or UCS code with value 10, or LF. Other conventions for representing end of lines are available through surfaces.

CR

This convention is popular on Apple’s Macintosh machines. When this surface is applied, each line is terminated by CR, which has ASCII value 13. Unless the library is operating in strict mode, adding or removing the surface will in fact exchange CR and LF, for better reversibility. In strict mode, however, the exchange does not happen: any CR will be copied verbatim while applying the surface, and any LF will be copied verbatim while removing it.

This surface is available in recode under the name CR; it does not have any aliases. This is the implied surface for the Apple Macintosh related charsets.
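
The difference between the two modes may be sketched as follows. This is a Python illustration of the rules stated above, not the library's implementation; the function names and the strict parameter are assumptions made for the example.

```python
# Translation table exchanging CR (13) and LF (10).
_SWAP = bytes.maketrans(b"\r\n", b"\n\r")


def add_cr(data: bytes, strict: bool = False) -> bytes:
    """Apply the CR surface to text whose lines end with LF."""
    if strict:
        # LF becomes CR; any CR already present is copied verbatim.
        return data.replace(b"\n", b"\r")
    # Non-strict: exchange CR and LF, which keeps the operation reversible.
    return data.translate(_SWAP)


def remove_cr(data: bytes, strict: bool = False) -> bytes:
    """Remove the CR surface, giving back text whose lines end with LF."""
    if strict:
        # CR becomes LF; any LF already present is copied verbatim.
        return data.replace(b"\r", b"\n")
    return data.translate(_SWAP)
```

With a text containing a stray CR, such as b"a\nb\r", the non-strict round trip restores the original bytes, while the strict round trip yields b"a\nb\n": the stray CR can no longer be told apart from a line ending.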

CR-LF

This convention is popular on Microsoft systems running on IBM PCs and compatibles. When this surface is applied, each line is terminated by a sequence of two characters: one CR followed by one LF, in that order.

For compatibility with oldish MS-DOS systems, removing a CR-LF surface will discard the first encountered C-z, which has ASCII value 26, and everything following it in the text. Adding this surface will not, however, append a C-z to the result.

This surface is available in recode under the name CR-LF and has cl for an alias. This is the implied surface for the IBM or Microsoft related charsets or code pages.
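
The asymmetry around C-z may be sketched as follows. This is a Python illustration only; how recode treats a lone CR or a lone LF while removing the surface is not stated here, so the sketch simply leaves them untouched, and the function names are hypothetical.

```python
CTRL_Z = b"\x1a"  # C-z, ASCII value 26


def add_crlf(data: bytes) -> bytes:
    """Apply the CR-LF surface; no C-z is appended to the result."""
    return data.replace(b"\n", b"\r\n")


def remove_crlf(data: bytes) -> bytes:
    """Remove the CR-LF surface, discarding the first C-z and all that follows."""
    cut = data.find(CTRL_Z)
    if cut != -1:
        data = data[:cut]
    # Assumption: only complete CR-LF pairs are converted back to LF.
    return data.replace(b"\r\n", b"\n")
```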

Some other charsets might have their own representation for an end of line, which is different from LF. For example, this is the case of various EBCDIC charsets, or Icon-QNX. The recoding of end of lines is intimately tied into such charsets, so it is not available separately as surfaces.

13.3 MIME contents encodings

RFC 2045 defines two 7-bit surfaces, meant to prepare 8-bit messages for transmission. Base64 is especially usable for binary entities, while Quoted-Printable is especially usable for text entities, in those cases where the lower 128 characters of the underlying charset coincide with ASCII.

Base64: This surface is available in recode under the name Base64, with b64 and 64 as acceptable aliases.
Quoted-Printable: This surface is available in recode under the name Quoted-Printable, with quote-printable and QP as acceptable aliases.
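
To get a feeling for what these two encodings do to 8-bit text, here is a small sketch using Python's standard base64 and quopri modules rather than recode. Details such as line wrapping may differ from what recode produces; the sample bytes are arbitrary.

```python
import base64
import quopri

# ISO-8859-1 bytes for "deja vu" with accents: 0xE9 and 0xE0 are 8-bit values.
text = b"d\xe9j\xe0 vu\n"

# Base64 turns every group of three bytes into four ASCII characters.
b64 = base64.encodebytes(text)

# Quoted-Printable keeps ASCII characters readable and writes each
# 8-bit byte as "=" followed by two hexadecimal digits.
qp = quopri.encodestring(text)

# Both encodings are reversible.
from_b64 = base64.decodebytes(b64)
from_qp = quopri.decodestring(qp)
```

The Quoted-Printable result stays mostly legible for text like this, which is why it suits text entities, while Base64 treats all bytes alike and suits binary entities.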

Note that UTF-7, which may also be considered as a MIME surface, is provided as a genuine charset instead, as it necessarily relates to UCS-2 and nothing else. See UTF-7.

Here is a little historical note, which also shows the three levels of acceptance of Internet standards. MIME changed from a “Proposed Standard” (RFC 1341–1344, 1992) to a “Draft Standard” (RFC 1521–1523) in 1993, and was recycled as a “Draft Standard” in 1996-11. It is not yet a “Full Standard”.

13.4 Interpreted character dumps

Dumps are surfaces meant to express, in ways which are a bit more readable, the bit patterns used to represent characters. They allow the inspection or debugging of character streams, and they may also help a bit in producing C source code which, once compiled, would hold in memory a copy of the original coding. However, recode does not attempt, in any way, to produce complete C source files in dumps. User hand editing or Makefile trickery is still needed for adding missing lines. Dumps may be given in decimal, hexadecimal or octal, and be based on chunks of either one, two or four eight-bit bytes. Formatting has been chosen to respect the C language syntax for number constants, with commas and newlines inserted appropriately.

However, when dumping two or four byte chunks, the last chunk may be incomplete. This is observable through the usage of narrower expression for that last chunk only. Such a shorter chunk would not be compiled properly within a C initialiser, as all members of an array share a single type, and so, have identical sizes.
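
The narrower last chunk may be pictured with the following sketch of a hexadecimal dump. It is written in Python for illustration and is not recode's code: the function name hex_dump, the number of items per line, the use of upper case digits, and the byte order within a chunk are all assumptions.

```python
def hex_dump(data: bytes, chunk: int = 2, per_line: int = 8) -> str:
    """Express `data` as C-style hexadecimal constants, `chunk` bytes each."""
    items = []
    for start in range(0, len(data), chunk):
        piece = data[start:start + chunk]
        # Two digits per byte, so a short final piece gives a narrower constant.
        items.append("0x" + piece.hex().upper())
    lines = [", ".join(items[i:i + per_line])
             for i in range(0, len(items), per_line)]
    return ",\n".join(lines) + "\n"


# Five bytes dumped as pairs: the last constant covers a single byte.
dumped = hex_dump(b"Hello", chunk=2)
```

Here dumped is "0x4865, 0x6C6C, 0x6F" followed by a newline. Placed in an array of 16-bit integers, the last constant would be stored as 0x006F, which is not the same thing as a single trailing byte.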

Octal-1

This surface corresponds to an octal expression of each input byte.

It is available in recode under the name Octal-1, with o1 and o as acceptable aliases.

Octal-2

This surface corresponds to an octal expression of each pair of input bytes, except for the last pair, which may be short.

It is available in recode under the name Octal-2 and has o2 for an alias.

Octal-4

This surface corresponds to an octal expression of each quadruple of input bytes, except for the last quadruple, which may be short.

It is available in recode under the name Octal-4 and has o4 for an alias.

Decimal-1

This surface corresponds to a decimal expression of each input byte.

It is available in recode under the name Decimal-1, with d1 and d as acceptable aliases.

Decimal-2

This surface corresponds to a decimal expression of each pair of input bytes, except for the last pair, which may be short.

It is available in recode under the name Decimal-2 and has d2 for an alias.

Decimal-4

This surface corresponds to a decimal expression of each quadruple of input bytes, except for the last quadruple, which may be short.

It is available in recode under the name Decimal-4 and has d4 for an alias.

Hexadecimal-1

This surface corresponds to a hexadecimal expression of each input byte.

It is available in recode under the name Hexadecimal-1, with x1 and x as acceptable aliases.

Hexadecimal-2

This surface corresponds to a hexadecimal expression of each pair of input bytes, except for the last pair, which may be short.

It is available in recode under the name Hexadecimal-2, with x2 for an alias.

Hexadecimal-4

This surface corresponds to a hexadecimal expression of each quadruple of input bytes, except for the last quadruple, which may be short.

It is available in recode under the name Hexadecimal-4, with x4 for an alias.

When removing a dump surface, that is, when reading dump results back into a sequence of bytes, the narrower expression for a short last chunk is recognised, so dumping is a fully reversible operation. However, in case you want to produce dumps by other means than through recode, beware that for decimal dumps, the library has to rely on the number of spaces to establish the original byte size of the chunk.

Although the library might report reversibility errors, removing a dump surface is a rather forgiving process: one may mix bases, group a variable number of data per source line, or use shorter chunks in places other than at the far end. Also, source lines not beginning with a number are skipped. So, recode should often be able to read a whole C header file, wrapping the results of a previous dump, and regenerate the original byte string.
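
The forgiving nature of dump removal may be pictured with the sketch below, which only handles hexadecimal constants. It is a Python illustration under simplifying assumptions, not a description of recode's parser: the function name is hypothetical, a line is considered only when its very first character is a digit, and the size of each chunk is inferred from the number of hexadecimal digits.

```python
import re

# A C-style hexadecimal constant such as 0x4865.
_HEX = re.compile(r"0[xX]([0-9A-Fa-f]+)")


def read_hex_dump(text: str) -> bytes:
    """Rebuild bytes from hexadecimal constants, skipping non-numeric lines."""
    out = bytearray()
    for line in text.splitlines():
        if not line or not line[0].isdigit():
            continue  # declaration lines, braces and blank lines are skipped
        for digits in _HEX.findall(line):
            if len(digits) % 2:
                digits = "0" + digits  # complete an odd count of digits
            out += bytes.fromhex(digits)
    return bytes(out)


# A dump wrapped by hand into something resembling a C header.
header = (
    "static const unsigned short text[] = {\n"
    "0x4865, 0x6C6C,\n"
    "0x6F\n"
    "};\n"
)
recovered = read_hex_dump(header)
```

In this sketch, recovered holds the five bytes of "Hello", even though the constants are spread unevenly over the lines and are surrounded by C syntax.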

13.5 Artificial data for testing

A few pseudo-surfaces exist to generate debugging data out of thin air. These surfaces are only meant for the expert recode user, and are only useful in a few contexts, like for generating binary permutations from the recoding or acting on them.

Debugging surfaces, when removed, insert their generated data at the beginning of the output stream, and copy all the input stream after the generated data, unchanged. This strange removal constraint comes from the fact that debugging surfaces are usually specified in the before position instead of the after position within a request. With debugging surfaces, one often recodes file /dev/null in filter mode. Specifying many debugging surfaces at once has an accumulation effect on the output, and since surfaces are removed from right to left, each generating its data at the beginning of previous output, the net effect is an impression that debugging surfaces are generated from left to right, each appending to the result of the previous. In any case, any real input data gets appended after what was generated.

test7: When removed, this surface produces 128 single bytes, the first having value 0, the second having value 1, and so forth until all 128 values have been generated.
test8: When removed, this surface produces 256 single bytes, the first having value 0, the second having value 1, and so forth until all 256 values have been generated.
test15: When removed, this surface produces 64509 double bytes, the first having value 0, the second having value 1, and so forth until all values have been generated, but excluding risky UCS-2 values, like all codes from the surrogate UCS-2 area (for UTF-16), the byte order mark, and values known as invalid UCS-2.
test16: When removed, this surface produces 65536 double bytes, the first having value 0, the second having value 1, and so forth until all 65536 values have been generated.
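
The data generated by test7, test8 and test16, and the accumulation effect described above, may be sketched as follows. This is a Python illustration, not the library's code: the big-endian order of each double byte is an assumption, test15 is left out because the exact list of excluded values is not given here, and remove_debug_surfaces is a hypothetical helper.

```python
import struct


def test7() -> bytes:
    """128 single bytes, with values 0 to 127."""
    return bytes(range(128))


def test8() -> bytes:
    """256 single bytes, with values 0 to 255."""
    return bytes(range(256))


def test16() -> bytes:
    """65536 double bytes, with values 0 to 65535 (big-endian assumed)."""
    return b"".join(struct.pack(">H", value) for value in range(65536))


def remove_debug_surfaces(generators, real_input: bytes = b"") -> bytes:
    """Remove debugging surfaces from right to left.

    Each removal inserts its generated data at the beginning of what the
    previous removals produced, so the final output reads left to right.
    """
    out = real_input
    for generate in reversed(generators):
        out = generate() + out
    return out


combined = remove_debug_surfaces([test7, test8], real_input=b"tail")
```

Here combined starts with the 128 bytes of test7, continues with the 256 bytes of test8, and ends with the real input.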

As an example, the command ‘recode l5/test8..dump < /dev/null’ is a convoluted way to produce an output similar to ‘recode -lf l5’. It says to generate all 256 possible bytes and interpret them as ISO-8859-9 codes, while converting them to UCS-2. The resulting UCS-2 characters are dumped one per line, each accompanied by its explanatory name.

Footnotes

(17)

These are mere examples to explain the concept; recode actually has only Base64 and CR-LF.
