
14 Internal aspects

The following explanations of the internals of recode should help people who want to dive into the recode sources to add new charsets. Adding new charsets does not require much knowledge of the overall organisation of recode. You can instead concentrate on your new charset, letting the remainder of the recode mechanics take care of interconnecting it with all other charsets.

If you intend to play seriously at modifying recode, beware that you may need some other GNU tools which were not required when you first installed recode. If you modify or create any .l file, then you need Flex, and a better awk such as mawk, GNU awk, or nawk. If you modify the documentation (and you should!), you need makeinfo. If you are really audacious, you may also want Perl for modifying tabular processing, then m4, Autoconf, Automake and libtool for adjusting configuration matters.



14.1 Overall organisation

The recode mechanics slowly evolved over many years, and it would be tedious to explain all the problems I met and the mistakes I made along the way, which led to the current behaviour. Surely, one of the key choices was to stop trying to do all conversions in memory, one line or one buffer at a time. It has been fruitful to use the character stream paradigm, and the elementary recoding steps now convert a whole stream to another. Most of the control complexity in recode exists so that each elementary recoding step stays simple, making it easier to add new ones. The whole point of recode, as I see it, is providing a comfortable nest for growing new charset conversions.

While initialising all conversion modules, the main recode driver constructs a table giving all the conversion routines available (single steps) and, for each, the starting charset and the ending charset. If we consider these charsets as the nodes of a directed graph, each single step may be considered as an oriented arc from one node to another. A cost is attributed to each arc: for example, a high penalty is given to single steps which are prone to losing characters, a lower penalty is given to those which need to study more than one input character to produce an output character, etc.

Given a starting code and a goal code, recode computes the most economical route through the elementary recodings, that is, the best sequence of conversions that will transform the input charset into the final charset. To speed up execution, recode looks for subsequences of conversions which are simple enough to be merged, and then dynamically creates new single steps to represent these mergings.
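The route selection may be pictured with a small sketch. It is not taken from the recode sources: the `struct step` layout, the `MAX_CHARSETS` limit and the function `cheapest_route` are hypothetical names chosen for illustration, and the relaxation loop is one textbook way of finding a cheapest path, not necessarily the one recode uses. Charsets are plain integer indexes, and costs are assumed to be small and non-negative.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical model of one single step: an arc between two charsets. */
struct step {
    int before;   /* index of the starting charset */
    int after;    /* index of the ending charset */
    int cost;     /* penalty: higher for lossy or slower steps */
};

#define MAX_CHARSETS 16   /* arbitrary limit for this sketch */

/* Return the cheapest total cost of going from charset `start` to
   charset `goal`, or -1 when no route exists.  When `next_step` is not
   NULL, next_step[c] receives the index of the step to take from
   charset c on its cheapest route to `goal` (-1 if there is none).
   Charset indexes stored in `steps` must be below `n_charsets`. */
int cheapest_route(const struct step *steps, size_t n_steps,
                   int n_charsets, int start, int goal, int *next_step)
{
    int best[MAX_CHARSETS];

    assert(n_charsets > 0 && n_charsets <= MAX_CHARSETS);
    assert(start >= 0 && start < n_charsets);
    assert(goal >= 0 && goal < n_charsets);

    for (int c = 0; c < n_charsets; c++) {
        best[c] = INT_MAX;
        if (next_step)
            next_step[c] = -1;
    }
    best[goal] = 0;

    /* Relax every arc repeatedly, working backwards from the goal;
       n_charsets - 1 passes are always enough. */
    for (int pass = 0; pass < n_charsets - 1; pass++) {
        int changed = 0;
        for (size_t i = 0; i < n_steps; i++) {
            const struct step *s = &steps[i];
            if (best[s->after] == INT_MAX)
                continue;               /* goal not reachable from there yet */
            if (best[s->after] + s->cost < best[s->before]) {
                best[s->before] = best[s->after] + s->cost;
                if (next_step)
                    next_step[s->before] = (int)i;
                changed = 1;
            }
        }
        if (!changed)
            break;
    }
    return best[start] == INT_MAX ? -1 : best[start];
}
```

Following `next_step` from the starting charset until the goal is reached yields the sequence of single steps; a direct but lossy arc with a high penalty loses against two cheaper arcs whose penalties add up to less.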

A double step in recode is a special concept representing a sequence of two single steps, the output of the first single step being the special charset UCS-2, the input of the second single step being also UCS-2. Special recode machinery dynamically produces efficient, reversible, merge-able single steps out of these double steps.
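As a rough illustration of what such a merge achieves, the sketch below collapses two table lookups through UCS-2 into a single byte-to-byte table. It is a simplification under stated assumptions: the name `merge_double_step`, the `NOT_MAPPED` marker and the table layout are hypothetical, both charsets are assumed to be 8-bit, and no attempt is made here to make the result reversible as the real machinery does.

```c
#include <assert.h>
#include <stdint.h>

#define NOT_MAPPED 0xFFFF   /* hypothetical marker: byte has no UCS-2 value */

/* Merge "A to UCS-2" followed by "UCS-2 to B" into one table.
   a_to_ucs2[i] and b_to_ucs2[j] give the UCS-2 value of each byte of
   charsets A and B.  merged[i] receives the byte of B carrying the same
   UCS-2 value as byte i of A, or -1 when B has no such character.
   Returns how many bytes of A could be mapped. */
int merge_double_step(const uint16_t a_to_ucs2[256],
                      const uint16_t b_to_ucs2[256],
                      int merged[256])
{
    int mapped = 0;

    assert(a_to_ucs2 && b_to_ucs2 && merged);

    for (int i = 0; i < 256; i++) {
        merged[i] = -1;
        if (a_to_ucs2[i] == NOT_MAPPED)
            continue;
        /* Look for the first byte of B with the same UCS-2 value. */
        for (int j = 0; j < 256; j++) {
            if (b_to_ucs2[j] == a_to_ucs2[i]) {
                merged[i] = j;
                mapped++;
                break;
            }
        }
    }
    return mapped;
}
```

Once the merged table exists, recoding from A to B costs one lookup per byte, with no UCS-2 intermediate stream. The entries left at -1 are exactly the characters that a reversible merge would have to place somewhere.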

I gathered some statistics about how many internal recoding steps are required between any two charsets chosen at random. The initial recoding layout, before optimisation, always uses between 1 and 5 steps. Optimisation sometimes produces mere copies, which are counted as no steps at all. In other cases, optimisation is unable to save any step. The number of steps after optimisation is currently between 0 and 5. Of course, the expected number of steps is affected by optimisation: it drops from 2.8 to 1.8. This means that recode uses a theoretical average of a bit less than two steps per recoding job. This looks good. This was computed using reversible recodings. In strict mode, optimisation might be defeated somewhat. The number of steps runs between 1 and 6, both before and after optimisation, and the expected number of steps decreases by a lesser amount, going from 2.2 to 1.3. This is still manageable.



14.2 Adding new charsets

The main part of recode is written in C, as are most single steps. A few single steps need to recognise sequences of multiple characters, and these are often better written in Flex. It is easy for a programmer to add a new charset to recode. All it requires is writing a few functions kept in a single .c file, adjusting Makefile.am and remaking recode.

One of the functions should convert from any previous charset to the new one. Any previous charset will do, but try to select it so you will not lose too much information while converting. The other function should convert from the new charset to any older one. You do not have to select the same old charset as the one you selected for the previous routine. Once again, select any charset for which you will not lose too much information while converting.

If, for either of these two functions, you have to read multiple bytes of the old charset before recognising the character to produce, you might prefer programming it in Flex in a separate .l file. Prototype your C or Flex files after one of those which exist already, so as to keep the sources uniform. Besides, at make time, all .l files are automatically merged into a single big one by the script mergelex.awk.

There are a few hidden rules about how to write new recode modules, which allow the automatic creation of decsteps.h and initsteps.h at make time and the proper merging of all Flex files. Mimicry is a simple approach which relieves me of explaining all these rules! Start with a module closely resembling what you intend to do. Here is some advice for picking a model. First decide whether your new charset module is to be driven by algorithms rather than by tables. For algorithmic recodings, see iconqnx.c for C code, or txtelat1.l for Flex code. For table driven recodings, see ebcdic.c for one-to-one style recodings, lat1html.c for one-to-many style recodings, or atarist.c for double-step style recodings. Just select an example from the style that best fits your application.

Each of your source files should have its own initialisation function, named module_charset, which is meant to be executed quickly once, prior to any recoding. It should declare the name of your charsets and the single steps (or elementary recodings) you provide, by calling declare_step one or more times. Besides the charset names, declare_step expects a description of the recoding quality (see recodext.h) and two functions you also provide.

The first such function has the purpose of allocating structures, pre-conditioning conversion tables, etc. It is also the way of further modifying the STEP structure. This function is executed if and only if the single step is retained in an actual recoding sequence. If you do not need such delayed initialisation, merely use NULL for the function argument.

The second function executes the elementary recoding on a whole file. There are a few cases when you can spare writing this function:

If you have a recoding table handy in a suitable format but do not use one of the predefined recoding functions, it is still a good idea to use a delayed initialisation to save it anyway, because recode option ‘-h’ will take advantage of this information when available.
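The division of labour between the two functions may be sketched as follows. Nothing here reproduces the real STEP structure or the prototypes found in recodext.h: `struct sketch_step`, `init_toy_table` and `transform_byte_to_byte` are hypothetical names, and the table merely upper-cases ASCII letters as a stand-in for a real one-to-one charset table.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-step data; the real STEP structure
   is richer than this. */
struct sketch_step {
    unsigned char *table;   /* byte-to-byte table, built on demand */
};

/* Delayed initialisation: allocate and fill the table.  Meant to run
   only when the step is retained in an actual recoding sequence.
   Returns 1 on success, 0 when memory is exhausted. */
int init_toy_table(struct sketch_step *step)
{
    assert(step != NULL);
    step->table = malloc(256);
    if (step->table == NULL)
        return 0;
    for (int i = 0; i < 256; i++)
        step->table[i] = (unsigned char)i;          /* identity by default */
    for (int c = 'a'; c <= 'z'; c++)
        step->table[c] = (unsigned char)(c - 'a' + 'A');  /* toy mapping */
    return 1;
}

/* Elementary recoding: convert the whole input stream to the output
   stream, one table lookup per byte.  Returns 1 on success, 0 on a
   write error. */
int transform_byte_to_byte(const struct sketch_step *step,
                           FILE *in, FILE *out)
{
    int c;

    assert(step != NULL && step->table != NULL);
    while ((c = getc(in)) != EOF)
        if (putc(step->table[c], out) == EOF)
            return 0;
    return 1;
}
```

Keeping the table in the step data, rather than hiding it inside the transform function, is what lets other parts of the program inspect it later, which is the point made above about saving the table during delayed initialisation.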

Finally, edit Makefile.am to add the source file name of your routines to the C_STEPS or L_STEPS macro definition, depending on whether your routines are written in C or in Flex.



14.3 Adding new surfaces

Adding a new surface is technically quite similar to adding a new charset. See New charsets. A surface is provided as a set of two transformations: one from the predefined special charset data or tree to the new surface, meant to apply the surface, the other from the new surface to the predefined special charset data or tree, meant to remove the surface.

Internally in recode, the function declare_step specifically recognises when a charset is related in this way to data or tree, and then takes appropriate action so that the charset is indeed installed as a surface.



14.4 Comments on the library design


Footnotes

(18)

If strict mapping is requested, another efficient device will be used instead of a permutation.

