The trivial surface consists of using a fixed number of bits
(often eight) for each character; the bits together hold the integer
value of the index for the character in its charset table. There are
many kinds of surfaces beyond the trivial one, all having the purpose
of improving selected qualities for storage or transmission.
For example, surfaces might increase the resistance to channel limits
(Base64), the transmission speed (gzip), the information privacy (DES),
the conformance to operating system conventions (CR-LF), the blocking
into records (VB), and surely other things as well [17].
Many surfaces may be applied to a stream of characters from a charset.
The order of application of surfaces is important, and surfaces
should be removed in the reverse order of their application.
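The ordering rule can be illustrated outside of recode. The following is a minimal sketch using only the Python standard library, not the recode library; the function names are hypothetical, and gzip stands in for a conceptual surface that recode itself does not provide. It applies a compression layer and then a Base64 layer, and shows that removal only works in the reverse order.

```python
import base64
import gzip


def apply_surfaces(data: bytes) -> bytes:
    """Apply gzip first, then Base64 on top of the compressed bytes."""
    return base64.b64encode(gzip.compress(data))


def remove_surfaces(data: bytes) -> bytes:
    """Remove the layers in reverse order: Base64 first, then gzip."""
    return gzip.decompress(base64.b64decode(data))


text = b"Surfaces stack like layers.\n"
wrapped = apply_surfaces(text)

# Removing in reverse order of application restores the original bytes.
assert remove_surfaces(wrapped) == text

# Removing in the wrong order fails: the Base64 text is not a gzip stream.
try:
    gzip.decompress(wrapped)
except OSError:
    wrong_order_failed = True
else:
    wrong_order_failed = False
```

Here `wrong_order_failed` ends up true, because the outermost layer has to be peeled off before the inner one becomes readable.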
Even if surfaces may generally be applied to various charsets, some
surfaces were specifically designed for a particular charset, and would
not make much sense if applied to other charsets. In such cases, these
conceptual surfaces have been implemented as recode charsets,
instead of as surfaces. This choice yields cleaner syntax
and usage. See Universal.
Surfaces are implemented within recode as special charsets
which may only transform to or from the data or tree
special charsets. Clever users may use this knowledge for writing
surface names in requests exactly as if they were pure charsets, when
the only need is to change surfaces without any kind of recoding between
real charsets. In such contexts, either data or tree may
also be used as if it were some kind of generic, anonymous charset: the
request ‘data..surface’ merely adds the given surface,
while the request ‘surface..data’ removes it.
The recode library distinguishes between mere data surfaces and
structural surfaces, also called tree surfaces for short. Structural
surfaces might allow, in the long run, transformations between a few
specialised representations of structural information like MIME parts,
Perl or Python initialisers, LISP S-expressions, XML, Emacs outlines, etc.
We are still experimenting with surfaces in recode. The concept opens
the doors to many avenues; it is not clear yet which ones are worth pursuing,
and which should be abandoned. In particular, implementation of structural
surfaces is barely starting, and there is not even a commitment that tree
surfaces will stay in recode if they prove to be more cumbersome
than useful. This chapter presents all surfaces currently available.
• Permutations: Permuting groups of bytes
• End lines: Representation for end of lines
• MIME: MIME contents encodings
• Dump: Interpreted character dumps
• Test: Artificial data for testing
A permutation is a surface transformation which reorders groups of eight-bit bytes. A 21 permutation exchanges pairs of successive bytes. If the text contains an odd number of bytes, the last byte is merely copied. A 4321 permutation inverts the order of quadruples of bytes. If the text does not contain a multiple of four bytes, the remaining bytes are nevertheless permuted as 321 if there are three bytes, 21 if there are two bytes, or merely copied otherwise.
21
This surface is available in recode under the name 21-Permutation and has swabytes for an alias.
4321
This surface is available in recode under the name 4321-Permutation.
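The two permutations described above can be sketched in a few lines of Python. This is an illustration of the stated byte reordering only, not recode's implementation; `permute`, `permute_21` and `permute_4321` are hypothetical names.

```python
def permute(data: bytes, group: int) -> bytes:
    """Reverse the bytes within each successive group of `group` bytes.

    A short final group is reversed as is, so three leftover bytes are
    permuted as 321, two as 21, and a single byte is merely copied.
    """
    out = bytearray()
    for start in range(0, len(data), group):
        out += data[start:start + group][::-1]
    return bytes(out)


def permute_21(data: bytes) -> bytes:
    """Exchange pairs of successive bytes."""
    return permute(data, 2)


def permute_4321(data: bytes) -> bytes:
    """Invert the order of quadruples of bytes."""
    return permute(data, 4)


swapped = permute_21(b"abcde")        # odd length: last byte copied
reversed4 = permute_4321(b"abcdefg")  # three leftover bytes: permuted as 321
```

Since the group boundaries do not move, applying the same permutation a second time restores the original text, which is why adding and removing such a surface are the same operation in this sketch.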
The same charset might differ slightly from one system to another, for
the sole reason that end of lines are not represented identically on all
systems. The representation for an end of line within recode
is the ASCII or UCS code with value 10, or LF. Other
conventions for representing end of lines are available through surfaces.
CR
This convention is popular on Apple’s Macintosh machines. When this
surface is applied, each line is terminated by CR, which has
ASCII value 13. Unless the library is operating in strict mode,
adding or removing the surface will in fact exchange CR and
LF, for better reversibility. However, in strict mode, the exchange
does not happen: any CR will be copied verbatim while applying
the surface, and any LF will be copied verbatim while removing it.
This surface is available in recode under the name CR;
it does not have any aliases. This is the implied surface for the Apple
Macintosh related charsets.
CR-LF
This convention is popular on Microsoft systems running on IBM PCs and compatibles. When this surface is applied, each line is terminated by a sequence of two characters: one CR followed by one LF, in that order.
For compatibility with oldish MS-DOS systems, removing a CR-LF
surface will discard the first encountered C-z, which has
ASCII value 26, and everything following it in the text.
Adding this surface will not, however, append a C-z to the result.
This surface is available in recode under the name CR-LF
and has cl for an alias. This is the implied surface for the IBM
or Microsoft related charsets or code pages.
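The behaviour of these two surfaces can be approximated as follows. This is a minimal sketch in Python with hypothetical function names, not recode's code; in particular, how recode treats a stray CR or a bare LF inside CR-LF text is not stated here, so the sketch simply makes the plainest choice.

```python
CTRL_Z = b"\x1a"  # C-z, ASCII value 26

# Translation table exchanging CR (13) and LF (10), as in non-strict mode.
_SWAP_CR_LF = bytes.maketrans(b"\r\n", b"\n\r")


def toggle_cr(data: bytes) -> bytes:
    """Non-strict CR surface: adding or removing both exchange CR and LF."""
    return data.translate(_SWAP_CR_LF)


def add_crlf(data: bytes) -> bytes:
    """Terminate each line with CR followed by LF; no C-z is appended."""
    return data.replace(b"\n", b"\r\n")


def remove_crlf(data: bytes) -> bytes:
    """Drop the first C-z and everything after it, then map CR-LF to LF."""
    cut = data.find(CTRL_Z)
    if cut != -1:
        data = data[:cut]
    return data.replace(b"\r\n", b"\n")


dos_text = add_crlf(b"one\ntwo\n")
unix_text = remove_crlf(dos_text + CTRL_Z + b"padding")
```

The exchange in `toggle_cr` is what makes the non-strict CR surface reversible even when the text already contains some CR characters.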
Some other charsets might have their own representation for an end of
line, which is different from LF. For example, this is the case
of various EBCDIC charsets, or Icon-QNX. The recoding of
end of lines is intimately tied into such charsets, so it is not available
separately as a surface.
RFC 2045 defines two 7-bit surfaces, meant to prepare 8-bit messages for transmission. Base64 is especially usable for binary entities, while Quoted-Printable is especially usable for text entities, in those cases where the lower 128 characters of the underlying charset coincide with ASCII.
Base64
This surface is available in recode under the name Base64,
with b64 and 64 as acceptable aliases.
Quoted-Printable
This surface is available in recode under the name Quoted-Printable,
with quote-printable and QP as acceptable aliases.
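To get a feel for what these two encodings do to data, the following sketch uses Python's standard `base64` and `quopri` modules rather than recode itself. The sample bytes are arbitrary, and the line-wrapping details of recode's own output may differ from what these modules produce.

```python
import base64
import quopri

# Arbitrary binary data, including bytes with the eighth bit set.
binary = bytes([0, 255, 16, 128])

# Mostly-ASCII text with one 8-bit character (e-acute in ISO-8859-1).
text = "café = coffee\n".encode("latin-1")

b64 = base64.b64encode(binary)
qp = quopri.encodestring(text)

# Both results are pure 7-bit data, and both encodings are reversible.
b64_is_7bit = max(b64) < 128
qp_is_7bit = max(qp) < 128
b64_back = base64.b64decode(b64)
qp_back = quopri.decodestring(qp)
```

Quoted-Printable leaves the ASCII letters readable and only escapes the few bytes that need it, which is why it suits text, whereas Base64 rewrites every byte and suits binary data.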
Note that UTF-7, which may also be considered as a MIME surface,
is provided as a genuine charset instead, as it necessarily relates to
UCS-2 and nothing else. See UTF-7.
A little historical note, also showing the three levels of acceptance of Internet standards. MIME changed from a “Proposed Standard” (RFC 1341–1344, 1992) to a “Draft Standard” (RFC 1521–1523) in 1993, and was recycled as a “Draft Standard” in 1996-11. It is not yet a “Full Standard”.
Dumps are surfaces meant to express, in ways which are a bit more readable,
the bit patterns used to represent characters. They allow the inspection
or debugging of character streams, and they may also help a bit with the
production of C source code which, once compiled, would hold in memory a
copy of the original coding. However, recode does not attempt, in
any way, to produce complete C source files in dumps. User hand editing
or Makefile trickery is still needed for adding missing lines.
Dumps may be given in decimal, hexadecimal and octal, and be based on
chunks of either one, two or four eight-bit bytes. Formatting has been
chosen to respect the C language syntax for number constants, with commas
and newlines inserted appropriately.
However, when dumping two or four byte chunks, the last chunk may be incomplete. This is observable through the usage of a narrower expression for that last chunk only. Such a shorter chunk would not be compiled properly within a C initialiser, as all members of an array share a single type, and so have identical sizes.
Octal-1
This surface corresponds to an octal expression of each input byte.
It is available in recode under the name Octal-1,
with o1 and o as acceptable aliases.
Octal-2
This surface corresponds to an octal expression of each pair of input bytes, except for the last pair, which may be short.
It is available in recode under the name Octal-2 and has o2 for an alias.
Octal-4
This surface corresponds to an octal expression of each quadruple of input bytes, except for the last quadruple, which may be short.
It is available in recode under the name Octal-4 and has o4 for an alias.
Decimal-1
This surface corresponds to a decimal expression of each input byte.
It is available in recode under the name Decimal-1,
with d1 and d as acceptable aliases.
Decimal-2
This surface corresponds to a decimal expression of each pair of input bytes, except for the last pair, which may be short.
It is available in recode under the name Decimal-2 and has d2 for an alias.
Decimal-4
This surface corresponds to a decimal expression of each quadruple of input bytes, except for the last quadruple, which may be short.
It is available in recode under the name Decimal-4 and has d4 for an alias.
Hexadecimal-1
This surface corresponds to a hexadecimal expression of each input byte.
It is available in recode under the name Hexadecimal-1,
with x1 and x as acceptable aliases.
Hexadecimal-2
This surface corresponds to a hexadecimal expression of each pair of input bytes, except for the last pair, which may be short.
It is available in recode under the name Hexadecimal-2 and has x2 for an alias.
Hexadecimal-4
This surface corresponds to a hexadecimal expression of each quadruple of input bytes, except for the last quadruple, which may be short.
It is available in recode under the name Hexadecimal-4 and has x4 for an alias.
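The general shape of a hexadecimal dump, including the narrower last chunk, might look like the following Python sketch. It is only an illustration: `hex_dump` is a hypothetical name, and the number of items per line, the letter case, the spacing, and the reading of each chunk as a big-endian number are all assumptions, not recode's documented layout.

```python
def hex_dump(data: bytes, size: int = 1, per_line: int = 8) -> str:
    """Render `data` as C-style hexadecimal constants, `size` bytes per chunk.

    The last chunk may hold fewer than `size` bytes; it is then written with
    fewer digits, which is the "narrower expression" for a short chunk.
    """
    if not data:
        return ""
    items = []
    for start in range(0, len(data), size):
        chunk = data[start:start + size]
        items.append("0x" + chunk.hex().upper())
    lines = [
        ", ".join(items[i:i + per_line])
        for i in range(0, len(items), per_line)
    ]
    return ",\n".join(lines) + "\n"


one_byte = hex_dump(b"ABC", size=1)   # three one-byte constants
two_byte = hex_dump(b"ABC", size=2)   # one full pair, then a short chunk
```

In `two_byte`, the trailing `0x43` has two digits where a full pair has four, which is exactly what would make it unsuitable as a member of a C array of two-byte values.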
When removing a dump surface, that is, when reading dump results back
into a sequence of bytes, the narrower expression for a short last chunk
is recognised, so dumping is a fully reversible operation. However, in
case you want to produce dumps by other means than through recode,
beware that for decimal dumps, the library has to rely on the number of
spaces to establish the original byte size of the chunk.
Although the library might report reversibility errors, removing a dump
surface is a rather forgiving process: one may mix bases, group a variable
number of data per source line, or use shorter chunks in places other
than at the far end. Also, source lines not beginning with a number are
skipped. So, recode should often be able to read a whole C header file,
wrapping the results of a previous dump, and regenerate the original byte
string.
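The forgiving reading described above can be approximated with a small parser. The sketch below is written in Python under simplifying assumptions: it only handles one-byte values, it follows C syntax for choosing the base (a `0x` prefix for hexadecimal, a leading `0` for octal, decimal otherwise), and it stops at the first non-numeric token on a line. The names `parse_c_int` and `read_byte_dump` are hypothetical, and recode's real reader also infers chunk sizes, which this sketch does not attempt.

```python
import re

_NUMBER = re.compile(r"0[xX][0-9a-fA-F]+|[0-9]+")


def parse_c_int(token: str) -> int:
    """Interpret a token as a C integer constant (hex, octal or decimal)."""
    if token[:2].lower() == "0x":
        return int(token, 16)
    if len(token) > 1 and token[0] == "0":
        return int(token, 8)
    return int(token, 10)


def read_byte_dump(text: str) -> bytes:
    """Rebuild bytes from a dump of one-byte values, skipping other lines."""
    out = bytearray()
    for line in text.splitlines():
        stripped = line.lstrip()
        if not stripped or not stripped[0].isdigit():
            continue  # lines not beginning with a number are skipped
        for token in stripped.replace(",", " ").split():
            if not _NUMBER.fullmatch(token):
                break  # stop at trailing comments or other C syntax
            out.append(parse_c_int(token))  # ValueError if above 255
    return bytes(out)


header = (
    "static const unsigned char data[] = {\n"
    "0x48, 101, 0154,\n"
    "  0x6C, 111\n"
    "};\n"
)
restored = read_byte_dump(header)
```

The sample mixes hexadecimal, decimal and octal constants and wraps them in C declaration lines, yet the original five bytes are still recovered.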
A few pseudo-surfaces exist to generate debugging data out of thin air.
These surfaces are only meant for the expert recode user, and are
only useful in a few contexts, like for generating binary permutations
from the recoding or acting on them.
Debugging surfaces, when removed, insert their generated data at the beginning of the output stream, and copy all the input stream after the generated data, unchanged. This strange removal constraint comes from the fact that debugging surfaces are usually specified in the before position instead of the after position within a request. With debugging surfaces, one often recodes the file /dev/null in filter mode.
Specifying many debugging surfaces at once has an accumulation effect on the output. Since surfaces are removed from right to left, each generating its data at the beginning of the previous output, the net effect is an impression that debugging surfaces are generated from left to right, each appending to the result of the previous. In any case, any real input data gets appended after what was generated.
test7
When removed, this surface produces 128 single bytes, the first having value 0, the second having value 1, and so forth until all 128 values have been generated.
test8
When removed, this surface produces 256 single bytes, the first having value 0, the second having value 1, and so forth until all 256 values have been generated.
test15
When removed, this surface produces 64509 double bytes, the first having
value 0, the second having value 1, and so forth until all values have been
generated, but excluding risky UCS-2 values, like all codes from
the surrogate UCS-2 area (for UTF-16), the byte order mark,
and values known as invalid UCS-2.
test16
When removed, this surface produces 65536 double bytes, the first having value 0, the second having value 1, and so forth until all 65536 values have been generated.
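The generated data and the accumulation effect can be modelled in Python as follows. This is a sketch with hypothetical names, not recode's implementation; it assumes that double bytes are written most significant byte first, and it leaves out test15 because the exact list of excluded values is not spelled out here.

```python
# Number of values and bytes per value for each modelled surface.
SIZES = {"test7": (128, 1), "test8": (256, 1), "test16": (65536, 2)}


def generate(name: str) -> bytes:
    """Produce the data a debugging surface generates when removed."""
    count, width = SIZES[name]
    # Assumption: multi-byte values are emitted in big-endian order.
    return b"".join(value.to_bytes(width, "big") for value in range(count))


def remove_debug_surfaces(names: list[str], stream: bytes = b"") -> bytes:
    """Remove surfaces right to left, each prepending its generated data."""
    for name in reversed(names):
        stream = generate(name) + stream
    return stream


# Equivalent in spirit to feeding an empty input, as with /dev/null.
combined = remove_debug_surfaces(["test7", "test8"], b"tail")
```

Although `test8` is processed first, its data ends up after that of `test7`, so the result reads as if the surfaces had been generated from left to right, with the real input at the very end.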
As an example, the command ‘recode l5/test8..dump < /dev/null’ is a
convoluted way to produce an output similar to ‘recode -lf l5’. It says
to generate all possible 256 bytes and interpret them as ISO-8859-9
codes, while converting them to UCS-2. The resulting UCS-2
characters are dumped one per line, each accompanied by its explanatory name.
[17] These are mere examples to explain the concept;
recode only has Base64 and CR-LF, actually.