Internals of recode
The following explanations of the internals of recode should help people
who want to dive into the recode sources to add new charsets. Adding a new
charset does not require much knowledge of the overall organisation of
recode: you can concentrate on your new charset and let the rest of the
recode mechanics take care of interconnecting it with all the other charsets.
If you intend to modify recode seriously, be aware that you may need other
GNU tools which were not required when you first installed recode. If you
modify or create any .l file, you need Flex and a capable awk such as mawk,
GNU awk, or nawk. If you modify the documentation (and you should!), you
need makeinfo. If you are really audacious, you may also want Perl for
modifying the tabular processing, and m4, Autoconf, Automake and libtool
for adjusting configuration matters.
• Main flow: Overall organisation
• New charsets: Adding new charsets
• New surfaces: Adding new surfaces
• Design: Comments on the library design
Overall organisation
The recode mechanics evolved slowly over many years, and it would be
tedious to explain all the problems I met and the mistakes I made along the
way that yielded the current behaviour. Surely, one of the key choices was
to stop trying to do all conversions in memory, one line or one buffer at a
time. It has been fruitful to use the character stream paradigm, and the
elementary recoding steps now convert one whole stream into another. Most
of the control complexity in recode exists so that each elementary recoding
step stays simple, making it easier to add new ones. The whole point of
recode, as I see it, is to provide a comfortable nest for growing new
charset conversions.
While initialising all conversion modules, the main recode driver
constructs a table of all the available conversion routines (single steps),
recording for each one its starting charset and its ending charset. If we
consider these charsets as the nodes of a directed graph, each single step
may be seen as an oriented arc from one node to another. A cost is
attributed to each arc: for example, a high penalty is given to single
steps which are prone to losing characters, a lower penalty to those which
need to study more than one input character to produce an output character,
and so on.
Given a starting code and a goal code, recode computes the most economical
route through the elementary recodings, that is, the best sequence of
conversions that will transform the input charset into the final charset.
To speed up execution, recode looks for subsequences of conversions which
are simple enough to be merged, and then dynamically creates new single
steps to represent these mergings.
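To make the idea concrete, here is a small standalone sketch, not recode's
actual code, of a cheapest-route search over such a table of weighted arcs;
the charset names, penalty values, and function names are invented for the
illustration.

     /* Standalone sketch only: single steps seen as weighted arcs of a
        directed graph whose nodes are charsets, and a cheapest-route
        search over them.  All names and penalty values are invented.  */

     #include <stdio.h>

     #define CHARSETS 4
     #define INFINITE 1000000

     static const char *charset_name[CHARSETS] =
       { "Latin-1", "UCS-2", "IBM-PC", "HTML" };

     /* cost[i][j] is the penalty of the single step from charset i to
        charset j, or INFINITE when no such single step was declared.  */
     static int cost[CHARSETS][CHARSETS];

     static void
     declare_arc (int before, int after, int penalty)
     {
       cost[before][after] = penalty;
     }

     /* Dijkstra-style search for the total penalty of the cheapest
        route, or INFINITE when the goal cannot be reached at all.  */
     static int
     cheapest_route (int start, int goal)
     {
       int best[CHARSETS], done[CHARSETS] = { 0 };

       for (int i = 0; i < CHARSETS; i++)
         best[i] = INFINITE;
       best[start] = 0;

       for (int round = 0; round < CHARSETS; round++)
         {
           int node = -1;

           for (int i = 0; i < CHARSETS; i++)
             if (!done[i] && (node < 0 || best[i] < best[node]))
               node = i;
           if (node < 0 || best[node] == INFINITE)
             break;
           done[node] = 1;
           for (int i = 0; i < CHARSETS; i++)
             if (cost[node][i] != INFINITE
                 && best[node] + cost[node][i] < best[i])
               best[i] = best[node] + cost[node][i];
         }
       return best[goal];
     }

     int
     main (void)
     {
       for (int i = 0; i < CHARSETS; i++)
         for (int j = 0; j < CHARSETS; j++)
           cost[i][j] = INFINITE;

       /* Cheap reversible steps in and out of UCS-2, one costlier step
          needing several output characters, one lossy direct step.  */
       declare_arc (0, 1, 1);    /* Latin-1 -> UCS-2 */
       declare_arc (1, 0, 1);    /* UCS-2 -> Latin-1 */
       declare_arc (1, 2, 1);    /* UCS-2 -> IBM-PC */
       declare_arc (1, 3, 3);    /* UCS-2 -> HTML */
       declare_arc (0, 2, 10);   /* lossy direct Latin-1 -> IBM-PC */

       printf ("%s to %s: total penalty %d\n",
               charset_name[0], charset_name[2], cheapest_route (0, 2));
       return 0;
     }

Compiled on its own, this prints a total penalty of 2 for Latin-1 to
IBM-PC: the two-step route through UCS-2 beats the direct but lossy step.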
A double step in recode is a special concept representing a sequence of two
single steps, the output of the first single step being the special charset
UCS-2, and the input of the second single step being also UCS-2. Special
recode machinery dynamically produces efficient, reversible, mergeable
single steps out of these double steps.
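Again purely as an illustration of the idea, merging two table-driven
single steps that meet at UCS-2 amounts to composing their tables once,
when the merged step is built, rather than translating through UCS-2 for
every recoded character; the four-entry charsets below are invented.

     /* Illustrative sketch only: fuse two single steps that meet at
        UCS-2 into one direct code-to-code table.  The tables are made
        up and much smaller than real charset tables.  */

     #include <stdio.h>

     /* First step: tiny charset A to UCS-2.  */
     static const unsigned short a_to_ucs2[4] =
       { 0x0041, 0x00C9, 0x00E8, 0x20AC };

     /* Second step: UCS-2 to tiny charset B, sketched as a search
        through B's own table (a real module would index or hash).  */
     static const unsigned short b_to_ucs2[4] =
       { 0x20AC, 0x00E8, 0x0041, 0x00C9 };

     static int
     ucs2_to_b (unsigned short ucs2)
     {
       for (int code = 0; code < 4; code++)
         if (b_to_ucs2[code] == ucs2)
           return code;
       return -1;                  /* untranslatable */
     }

     int
     main (void)
     {
       int a_to_b[4];

       /* The fused single step: go through UCS-2 once, at
          table-building time, instead of once per character.  */
       for (int code = 0; code < 4; code++)
         a_to_b[code] = ucs2_to_b (a_to_ucs2[code]);

       for (int code = 0; code < 4; code++)
         printf ("A:%d -> B:%d\n", code, a_to_b[code]);
       return 0;
     }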
I made some statistics about how many internal recoding steps are required
between any two charsets chosen at random. The initial recoding layout,
before optimisation, always uses between 1 and 5 steps. Optimisation can
sometimes produce mere copies, which are counted as no step at all; in
other cases, it is unable to save any step. The number of steps after
optimisation currently lies between 0 and 5. Of course, the expected number
of steps is affected by optimisation: it drops from 2.8 to 1.8, meaning
that recode performs a theoretical average of a bit less than two steps per
recoding job. This looks good. These figures were computed using reversible
recodings. In strict mode, optimisation might be somewhat defeated: the
number of steps then runs between 1 and 6, both before and after
optimisation, and the expected number of steps decreases by a smaller
amount, from 2.2 to 1.3. This is still manageable.
Adding new charsets
The main part of recode is written in C, as are most single steps. A few
single steps need to recognise sequences of multiple characters; these are
often better written in Flex. It is easy for a programmer to add a new
charset to recode. All it requires is writing a few functions kept in a
single .c file, adjusting Makefile.am, and remaking recode.

One of these functions should convert from some previously existing charset
to the new one. Any existing charset will do, but try to select one so that
you will not lose too much information while converting. The other function
should convert from the new charset to some existing one. You do not have
to select the same existing charset as for the previous routine; once
again, pick one for which you will not lose too much information while
converting.
If, for either of these two functions, you have to read multiple bytes of
the old charset before recognising the character to produce, you might
prefer programming it in Flex, in a separate .l file. Prototype your C or
Flex files after one of those which already exist, so as to keep the
sources uniform. Besides, at make time, all .l files are automatically
merged into a single big one by the script mergelex.awk.
There are a few hidden rules about how to write new recode modules, which
allow the automatic creation of decsteps.h and initsteps.h at make time,
and the proper merging of all Flex files. Mimetism is a simple approach
which relieves me of explaining all these rules! Start with a module
closely resembling what you intend to do. Here is some advice for picking a
model.

First decide whether your new charset module is to be driven by algorithms
or by tables. For algorithmic recodings, see iconqnx.c for C code, or
txtelat1.l for Flex code. For table-driven recodings, see ebcdic.c for
one-to-one style recodings, lat1html.c for one-to-many style recodings, or
atarist.c for double-step style recodings. Just select an example in the
style that best fits your application.
Each of your source files should have its own initialisation function,
named module_charset, which is meant to be executed quickly, once, prior to
any recoding. It should declare the names of your charsets and the single
steps (or elementary recodings) you provide, by calling declare_step one or
more times. Besides the charset names, declare_step expects a description
of the recoding quality (see recodext.h) and two functions which you also
provide.
The first such function has the purpose of allocating structures,
pre-conditioning conversion tables, etc. It is also the way of further
modifying the STEP structure. This function is executed if and only if the
single step is retained in an actual recoding sequence. If you do not need
such delayed initialisation, merely use NULL for this function argument.
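A minimal module skeleton might thus be organised along the following
lines. This is only a sketch: the real prototypes of declare_step, the STEP
structure, and the step functions are declared in recodext.h and differ
from the simplified stand-ins used here, which exist only so that the
fragment compiles on its own.

     /* Sketch of a new charset module.  The real declarations live in
        recodext.h; the stand-in STEP type, the shape of declare_step
        and the charset names below are simplified assumptions.  */

     #include <stdio.h>

     typedef struct step
     {
       /* recode's real STEP carries much more: quality, tables, ...  */
       const void *step_table;
     } STEP;

     /* Stand-in for recode's registration routine: it merely reports
        what would be recorded in the driver's table of single steps.  */
     static void
     declare_step (const char *before, const char *after,
                   int (*delayed_init) (STEP *),
                   void (*transform) (const STEP *, FILE *, FILE *))
     {
       printf ("registered %s -> %s (delayed init: %s)\n",
               before, after, delayed_init ? "yes" : "no");
       (void) transform;
     }

     /* The elementary recoding itself, converting one whole stream.  */
     static void
     mycharset_from_latin1 (const STEP *step, FILE *input, FILE *output)
     {
       int character;

       (void) step;
       while ((character = getc (input)) != EOF)
         putc (character, output);   /* a real module translates here */
     }

     /* The once-only initialisation function described above.  */
     static void
     module_mycharset (void)
     {
       /* NULL: this module needs no delayed initialisation.  */
       declare_step ("Latin-1", "MYCHARSET", NULL, mycharset_from_latin1);
     }

     int
     main (void)
     {
       module_mycharset ();
       return 0;
     }

The transform routine shown here is the ‘second function’ which the next
paragraph discusses.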
The second function executes the elementary recoding on a whole file. There
are a few cases when you can spare writing this function (a small sketch
follows the list):

• use file_one_to_one, with a delayed initialisation presetting the STEP
field one_to_one to the predefined value one_to_same;

• use file_one_to_one, with a delayed initialisation presetting the STEP
field one_to_one with your own table;

• use file_one_to_many, with a delayed initialisation presetting the STEP
field one_to_many with your own table.
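The second case in the list could be sketched as follows; only the names
file_one_to_one, STEP and one_to_one come from recode itself, while the
field layout, the prototypes and the table contents are simplified
assumptions made so that the sketch compiles and runs on its own.

     /* Sketch of the "preset one_to_one, then reuse file_one_to_one"
        shortcut.  See recodext.h for the real declarations.  */

     #include <stdbool.h>
     #include <stdio.h>

     typedef struct step
     {
       const unsigned char *one_to_one;   /* 256-entry recoding table */
     } STEP;

     /* Invented table: identity except that two codes trade places.
        A real table would be generated or carefully hand-filled.  */
     static unsigned char latin1_to_mycharset[256];

     /* Delayed initialisation: runs only when this single step is
        retained in the recoding sequence actually being performed.  */
     static bool
     init_latin1_mycharset (STEP *step)
     {
       for (int code = 0; code < 256; code++)
         latin1_to_mycharset[code] = code;
       latin1_to_mycharset[0xA4] = 0x80;   /* invented repositioning */
       latin1_to_mycharset[0x80] = 0xA4;

       step->one_to_one = latin1_to_mycharset;
       return true;
     }

     /* Stand-in for recode's predefined file_one_to_one routine.  */
     static void
     file_one_to_one (const STEP *step, FILE *input, FILE *output)
     {
       int character;

       while ((character = getc (input)) != EOF)
         putc (step->one_to_one[character], output);
     }

     int
     main (void)
     {
       STEP step;

       init_latin1_mycharset (&step);
       file_one_to_one (&step, stdin, stdout);
       return 0;
     }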
If you have a recoding table handy in a suitable format but do not use one
of the predefined recoding functions, it is still a good idea to use a
delayed initialisation to save it anyway, because the recode option ‘-h’
will take advantage of this information when available.
Finally, edit Makefile.am to add the source file name of your routines to
the C_STEPS or L_STEPS macro definition, depending on whether your routines
are written in C or in Flex.
Adding new surfaces
Adding a new surface is technically quite similar to adding a new charset.
See New charsets. A surface is provided as a set of two transformations:
one from the predefined special charset data or tree to the new surface,
meant to apply the surface; the other from the new surface to the
predefined special charset data or tree, meant to remove the surface.
Internally in recode, the function declare_step specially recognises when a
charset is related in this way to data or tree, and then takes the
appropriate actions so that the charset does indeed get installed as a
surface.
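To give a concrete feel for what such a pair of transformations does, here
is a standalone sketch of an invented end-of-line surface: applying it
inserts carriage returns, removing it strips them. It does not use recode's
declaration machinery at all; a real surface module would register both
directions through declare_step, with data as the other endpoint.

     /* Standalone illustration of the two directions a surface
        provides, using an invented CR-LF surface.  */

     #include <stdio.h>

     /* Apply: insert a carriage return before each line feed.  */
     static void
     apply_crlf_surface (FILE *input, FILE *output)
     {
       int character;

       while ((character = getc (input)) != EOF)
         {
           if (character == '\n')
             putc ('\r', output);
           putc (character, output);
         }
     }

     /* Remove: drop a carriage return when a line feed follows it.  */
     static void
     remove_crlf_surface (FILE *input, FILE *output)
     {
       int character;

       while ((character = getc (input)) != EOF)
         {
           if (character == '\r')
             {
               int next = getc (input);

               if (next != '\n')
                 putc ('\r', output);
               if (next == EOF)
                 break;
               character = next;
             }
           putc (character, output);
         }
     }

     int
     main (int argc, char **argv)
     {
       if (argc > 1)
         remove_crlf_surface (stdin, stdout);   /* any argument: remove */
       else
         apply_crlf_surface (stdin, stdout);    /* no argument: apply */
       return 0;
     }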
Comments on the library design
There are many different approaches to reducing the system requirements for
handling all the tables needed in the recode library. One of them is to
keep the tables in an external format and only read them in on demand.
After having pondered this for a while, I finally decided against it,
mainly because it involves its own kind of installation complexity, and it
is not clear to me that it would be as interesting as I first imagined.
It looks more efficient to have all tables and algorithms already mapped
into virtual memory from the start of execution, yet not loaded into actual
memory, than to go through many disk accesses to open various data files
once the program has already started, as other solutions would require.
Using a shared library also has the indirect effect of making various
algorithms handily available, right in the same modules that provide the
tables. This greatly alleviates the burden of maintenance.
Of course, I would like to later make an exception for a few tables built
locally by users for their own particular needs once recode is installed;
recode should just go and fetch them. But I do not perceive this as very
urgent, even though it is useful enough to be worth implementing.
Currently, all tables needed for recoding are precompiled into binaries,
and all these binaries are then made into a shared library. As an initial
step, I turned recode into a main program and a non-shared library; this
allowed me to tidy up the API, get rid of all global variables, and so on.
It required a surprising amount of massaging of the program sources. But
once things were cleaned up enough, it was easy to use Gordon Matzigkeit’s
libtool package and take advantage of the Automake interface to neatly turn
the non-shared library into a shared one.
Sites linking with the recode library whose systems do not support any form
of shared libraries might end up with bulky executables. Surely, the recode
library will have to be used statically there, and might not be very nicely
usable on such systems. It seems that progress has a price for those who
are slow at it.
There is a locality problem I have not addressed yet. Currently, the recode
library takes many cycles to initialise itself, calling each module in turn
so it can set up its associated knowledge about charsets, aliases,
elementary steps, recoding weights, and so on. Only then is the recoding
sequence decided from the given command. I would not be surprised if
initialisation took a perceivable fraction of a second on slower machines.
One thing to do, most probably not right in version 3.5 but in the version
after, would be to have recode pre-load all tables and dump them at
installation time. The result would then be compiled and added to the
library. This would spare many initialisation cycles but, more importantly,
it would avoid calling all the library modules, scattered through virtual
memory, and so possibly causing many spurious page exceptions each time
initialisation is requested, which is at least once per program execution.
It would be simpler, and I would like it, if something like ISO 10646 were
used as a turning template for all charsets in recode. Even if I think it
could help to a certain extent, I am still not fully sure it would be
sufficient in all cases. Moreover, some people disagree about using
ISO 10646 as the central charset, to the point that I cannot totally ignore
them; and surely, recode is not a means for me to force my own opinions on
people. I would like recode to be practical more than dogmatic, and to
reflect usage more than religions.
Currently, if you ask recode to go from charset1 to charset2, both chosen
at random, it is highly probable that the best path will quickly be found
as:

     charset1..UCS-2..charset2

That is, it will almost always use the UCS as a trampoline between
charsets. However, the UCS-2 step will immediately be optimised out, and
charset1..charset2 will often be performed in a single step, through a
permutation table generated on the fly for the circumstance (see footnote
18 below).
In the few cases where UCS-2 is not selected as the conceptual
intermediate, I plan to study whether it could be made so. But I guess some
cases will remain where UCS-2 is not a proper choice. Even if UCS is often
the good choice, I do not intend to forcefully restrain recode around UCS-2
(nor UCS-4) for now. We might come to that one day, but it will come out of
the natural evolution of recode. It will then reflect a fact, rather than a
preset dogma.
What about iconv?
The iconv routine and library allow for converting characters from an input
buffer to an output buffer, synchronously advancing both buffer cursors. If
the output buffer is not big enough to receive all of the conversion, the
routine returns with the input cursor set at the position where the
conversion could later be resumed, and the output cursor set to indicate
how far the output buffer has been filled. Although this scheme is simple
and nice, the recode library does not currently offer it. Why not?
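For reference, the calling convention being described is that of the
standard POSIX iconv interface; the charset names below are only examples,
and error handling is kept to a minimum.

     /* The buffer-and-cursor convention of POSIX iconv.  */

     #include <stdio.h>
     #include <iconv.h>

     int
     main (void)
     {
       char input[] = "Hello, world";
       char output[64];

       char *in_cursor = input;
       char *out_cursor = output;
       size_t in_left = sizeof input - 1;
       size_t out_left = sizeof output;

       iconv_t conversion = iconv_open ("UCS-2", "ISO-8859-1");
       if (conversion == (iconv_t) -1)
         {
           perror ("iconv_open");
           return 1;
         }

       /* On success both cursors have advanced; when the output buffer
          is too small, the input cursor tells where a later call could
          resume once output space has been freed.  */
       if (iconv (conversion, &in_cursor, &in_left,
                  &out_cursor, &out_left) == (size_t) -1)
         perror ("iconv");

       printf ("%zu input bytes left, %zu output bytes produced\n",
               in_left, (size_t) (out_cursor - output));

       iconv_close (conversion);
       return 0;
     }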
When long sequences of decodings, stepwise recodings, and re-encodings are
involved, as happens in real life, synchronising the input buffer back to
where it should have stopped, once the output buffer becomes full, is a
difficult problem. Oh, we could make it simpler at the expense of losing
space or speed: by inserting markers between each input character and
counting them at the output end; by processing only one character at a time
through the whole sequence; or by repeatedly attempting to recode various
subsets of the input buffer, binary searching on their length until the
output just fits. The overhead of such solutions looks fully prohibitive to
me, and the gain very minimal. I do not see a real advantage, nowadays, in
imposing a fixed length on an output buffer. It makes things so much
simpler and more efficient to just let the output buffer size float a bit.
Of course, if the above problem were solved, the iconv library could easily
be emulated, given that recode has similar knowledge about charsets. Solved
or not, the iconv program remains trivial to write (given similar knowledge
about charsets). I also presume that the genxlt program would be easy too,
but I do not have detailed enough specifications of it to be sure.
Many years ago, recode used a similar scheme, and I found it rather hard to
manage in some cases. I rethought the overall structure of recode to get
away from that scheme, and have never regretted it.

I perceive iconv as an artificial solution which surely has some elegance
and virtues, but I do not find it really useful as it stands: one always
has to wrap iconv into something more refined, extending it for real cases.
From past experience, I think it is unduly hard to fully implement this
scheme. It would be awkward to go through contortions for the sole purpose
of implementing exactly its specification, without real, fairly sound
reasons (other than the fact that some people once thought it was worth
standardising). It is much better to aim immediately for the refinement we
need, without uselessly forcing ourselves into the dubious detour that
iconv represents.
Some may argue that if recode used a comprehensive charset as a turning
template, as discussed in a previous point, this would make iconv easier to
implement. Some may be tempted to say that the cases which are hard to
handle are not really needed, nor interesting, anyway. I feel, and fear a
bit, some pressure for recode to be split into the part that fits the iconv
model well and the part that does not, considering this second part less
important, with the idea of maybe dropping it one of these days. My guess
is that users of the recode library, whatever its form, would not like such
arbitrary limitations. In the long run, we should not have to explain to
our users that some recodings may not be made available just because they
do not fit the simple model we had in mind when we implemented it. Instead,
we should try to stay open to the difficulties of real life. There are
still a lot of complex needs, for Asian users say, that recode does not
currently address, while it should. Not only should the doors stay open, we
should force them wider!
Footnote 18: If strict mapping is requested, another efficient device will
be used instead of a permutation.