
4 A recoding library

The program named recode is just an application of its recoding library. The recoding library is available separately for other C programs. A good way to acquire some familiarity with the recoding library is to get acquainted with the recode program itself.

To use the recoding library once it is installed, a C program needs to have a line:

#include <recode.h>

near its beginning, and the user should have ‘-lrecode’ on the linking call, so modules from the recoding library are found.

The library is still under development. As it stands, it contains four identifiable sets of routines: the outer level functions, the request level functions, the task level functions and the charset level functions. These are discussed in separate sections.

In most applications, using the recoding library effectively should rarely require studying anything beyond the main initialisation function at outer level and the various functions at request level.



4.1 Outer level functions

The outer level functions mainly prepare the whole recoding library for use, or perform actions which are unrelated to specific recodings. Here is an example of a program which does not really do anything useful.

#include <stdlib.h>
#include <stdbool.h>
#include <recode.h>

const char *program_name;

int
main (int argc, char *const *argv)
{
  program_name = argv[0];
  RECODE_OUTER outer = recode_new_outer (true);

  recode_delete_outer (outer);
  exit (0);
}

The header file <recode.h> declares an opaque RECODE_OUTER structure, which the programmer should use for allocating a variable in his program (let’s assume the programmer is a male, here, no prejudice intended). This ‘outer’ variable is given as a first argument to all outer level functions.

The <recode.h> header file uses the Boolean type set up by the system header file <stdbool.h>. But this header file is still fairly new in C standards, and likely does not exist everywhere. If your system does not offer this system header file yet, the proper compilation of the <recode.h> file could be guaranteed by replacing the inclusion line of <stdbool.h> with:

typedef enum {false = 0, true = 1} bool;

People wanting wider portability, or Autoconf lovers, might arrange their configure.in so that they can write something more general, like:

#if STDC_HEADERS
# include <stdlib.h>
#endif

/* Some systems do not define EXIT_*, even with STDC_HEADERS.  */
#ifndef EXIT_SUCCESS
# define EXIT_SUCCESS 0
#endif
#ifndef EXIT_FAILURE
# define EXIT_FAILURE 1
#endif
/* The following test is to work around the gross typo in systems like Sony
   NEWS-OS Release 4.0C, whereby EXIT_FAILURE is defined to 0, not 1.  */
#if !EXIT_FAILURE
# undef EXIT_FAILURE
# define EXIT_FAILURE 1
#endif

#if HAVE_STDBOOL_H
# include <stdbool.h>
#else
typedef enum {false = 0, true = 1} bool;
#endif

#include <recode.h>

const char *program_name;

int
main (int argc, char *const *argv)
{
  program_name = argv[0];
  RECODE_OUTER outer = recode_new_outer (true);

  recode_delete_outer (outer);
  exit (EXIT_SUCCESS);
}

but we will not insist on such details in the examples to come.



4.2 Request level functions

The request level functions are meant to cover most recoding needs programmers may have; they should provide all usual functionality. Their API is almost stable by now. To get started with request level functions, here is a full example of a program whose sole job is to filter ibmpc code on its standard input into latin1 code on its standard output.

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <recode.h>

const char *program_name;

int
main (int argc, char *const *argv)
{
  program_name = argv[0];
  RECODE_OUTER outer = recode_new_outer (true);
  RECODE_REQUEST request = recode_new_request (outer);
  bool success;

  recode_scan_request (request, "ibmpc..latin1");

  success = recode_file_to_file (request, stdin, stdout);

  recode_delete_request (request);
  recode_delete_outer (outer);

  exit (success ? 0 : 1);
}

The header file <recode.h> declares a RECODE_REQUEST structure, which the programmer should use for allocating a variable in his program. This request variable is given as a first argument to all request level functions, and in most cases, may be considered as opaque.

The following special function is still subject to change:

void recode_format_table (request, language, "name");

and is no longer documented for the time being.



4.3 Task level functions

The task level functions are used internally by the request level functions; they allow more explicit control over files and memory buffers holding input and output to recoding processes. The interface specification of task level functions is still subject to some change.

To get started with task level functions, here is a full example of a program whose sole job is to filter ibmpc code on its standard input into latin1 code on its standard output. That is, this program has the same goal as the one from the previous section, but goes about it a bit differently.

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <recodext.h>

const char *program_name;

int
main (int argc, char *const *argv)
{
  program_name = argv[0];
  RECODE_OUTER outer = recode_new_outer (false);
  RECODE_REQUEST request = recode_new_request (outer);
  RECODE_TASK task;
  bool success;

  recode_scan_request (request, "ibmpc..latin1");

  task = recode_new_task (request);
  task->input.file = "";
  task->output.file = "";
  success = recode_perform_task (task);

  recode_delete_task (task);
  recode_delete_request (request);
  recode_delete_outer (outer);

  exit (success ? 0 : 1);
}

The header file <recode.h> declares a RECODE_TASK structure, which the programmer should use for allocating a variable in his program. This task variable is given as a first argument to all task level functions. The programmer ought to change and possibly consult a few fields in this structure, using special functions.



4.4 Charset level functions

Many functions are internal to the recoding library. Some of them have been made external and available, because the recode program had to retain all its previous functionality while being transformed into a mere application of the recoding library. These functions are not really documented here for the time being, as we hope that many of them will vanish over time. Once this set of routines stabilises, it would be convenient to document them as an API for handling charset names and contents.

RECODE_CHARSET find_charset (name, cleaning-type);
bool list_all_charsets (charset);
bool list_concise_charset (charset, list-format);
bool list_full_charset (charset);


4.5 Handling errors

The recode program, while using the recode library, needs to control whether recoding problems are reported or not, and then reflect these in the exit status. The program should also instruct the library whether the recoding should be abruptly interrupted when an error is met (so sparing processing when it is known in advance that a wrong result would be discarded anyway), or if it should proceed nevertheless. Here is how the library groups errors into levels, listed here in order of increasing severity.

RECODE_NO_ERROR

No error was met on previous library calls.

RECODE_NOT_CANONICAL

The input text was using one of the many alternative codings for some phenomenon, but not the one recode would have canonically generated. So, if the reverse recoding is later attempted, it would produce a text having the same meaning as the original text, yet not being byte identical.

For example, a Base64 block in which end-of-lines appear elsewhere than at every 76 characters is not canonical. An e-circumflex in TeX which is coded as ‘\^{e}’ instead of ‘\^e’ is not canonical.
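As a rough illustration of the Base64 case, the following sketch checks whether line breaks fall exactly every 76 characters. The helper name base64_breaks_are_canonical is hypothetical and not part of the recoding library; it assumes LF line endings and only looks at line lengths, not at the Base64 alphabet or padding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the recoding library.  Return true
   if every line of TEXT but the last holds exactly 76 characters, and
   the last line holds at most 76.  Lines are assumed to end with LF.  */
bool
base64_breaks_are_canonical (const char *text)
{
  assert (text != NULL);

  while (*text)
    {
      const char *end = strchr (text, '\n');
      size_t length = end ? (size_t) (end - text) : strlen (text);

      if (length > 76)
        return false;
      /* A short line is acceptable only if nothing follows it.  */
      if (length < 76 && end && end[1] != '\0')
        return false;
      if (!end)
        break;
      text = end + 1;
    }
  return true;
}
```

A block that fails this test may still decode to the same bytes as a canonical one, which is why such input would only be flagged at the lowest error level rather than rejected.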

RECODE_AMBIGUOUS_OUTPUT

It has been discovered that if the reverse recoding was attempted on the text output by this recoding, we would not obtain the original text, only because an ambiguity was generated by accident in the output text. This ambiguity would then cause the wrong interpretation to be taken.

Here are a few examples. If the Latin-1 sequence ‘e^’ is converted to Easy French and back, the result will be interpreted as e-circumflex and so, will not reflect the intent of the original two characters. Recoding an IBM-PC text to Latin-1 and back, where the input text contained an isolated LF, will have a spurious CR inserted before the LF.

Currently, there are many cases in the library where the production of ambiguous output is not properly detected, as it is sometimes a difficult problem to accomplish this detection, or to do it speedily.
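The isolated LF example can be reproduced with a small model of the end-of-line step alone. This is only a sketch: the two helpers below are hypothetical, are not taken from the recoding library, and ignore the character mapping between IBM-PC and Latin-1 entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers modelling only the end-of-line step.  Each one
   writes a NUL-terminated result into OUT and returns its length.
   OUT must have room for the result: strlen (in) + 1 bytes for
   crlf_to_lf, and 2 * strlen (in) + 1 bytes for lf_to_crlf.  */

size_t
crlf_to_lf (const char *in, char *out)
{
  size_t length = 0;

  assert (in != NULL && out != NULL);
  for (; *in; in++)
    {
      /* Drop a CR only when it is immediately followed by LF.  */
      if (in[0] == '\r' && in[1] == '\n')
        continue;
      out[length++] = *in;
    }
  out[length] = '\0';
  return length;
}

size_t
lf_to_crlf (const char *in, char *out)
{
  size_t length = 0;

  assert (in != NULL && out != NULL);
  for (; *in; in++)
    {
      /* Every LF gets a CR, whether or not the original text had one.  */
      if (*in == '\n')
        out[length++] = '\r';
      out[length++] = *in;
    }
  out[length] = '\0';
  return length;
}
```

Feeding "one\r\ntwo\nthree" through crlf_to_lf and then lf_to_crlf yields "one\r\ntwo\r\nthree": the isolated LF after "two" comes back with a CR it never had, so the round trip is not byte identical.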

RECODE_UNTRANSLATABLE

One or more input characters could not be recoded, because there is just no representation for these characters in the output charset.

Here are a few examples. Non-strict mode often allows recode to compute on-the-fly mappings for unrepresentable characters, but strict mode prohibits such attribution of reversible translations: so strict mode might often trigger such an error. Most UCS-2 codes used to represent Asian characters cannot be expressed in various Latin charsets.

RECODE_INVALID_INPUT

The input text does not comply with the coding it is declared to hold. So, there is no way by which a reverse recoding would reproduce this text, because recode should never produce invalid output.

Here are a few examples. In strict mode, ASCII text is not allowed to contain characters with the eighth bit set. UTF-8 encodings ought to be minimal (7).
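To make the minimality requirement concrete, here is a sketch of a check for overlong UTF-8 sequences. The function utf8_is_minimal is hypothetical and does not come from the recoding library; it assumes sequences of at most four bytes and leaves out other validity rules, such as surrogates and the upper limit of the code space.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of the recoding library.  Return true
   if BYTES holds well-formed UTF-8 sequences of one to four bytes, each
   being the shortest possible encoding of its code point.  Surrogates
   and the upper limit of Unicode are deliberately not checked here.  */
bool
utf8_is_minimal (const unsigned char *bytes, size_t length)
{
  /* Smallest code point that genuinely needs 1, 2, 3 or 4 bytes.  */
  static const unsigned long minimum[] = { 0x0, 0x80, 0x800, 0x10000 };
  size_t index = 0;

  assert (bytes != NULL || length == 0);

  while (index < length)
    {
      unsigned char lead = bytes[index++];
      unsigned long value;
      size_t extra;

      if (lead < 0x80)
        {
          value = lead;
          extra = 0;
        }
      else if ((lead & 0xE0) == 0xC0)
        {
          value = lead & 0x1F;
          extra = 1;
        }
      else if ((lead & 0xF0) == 0xE0)
        {
          value = lead & 0x0F;
          extra = 2;
        }
      else if ((lead & 0xF8) == 0xF0)
        {
          value = lead & 0x07;
          extra = 3;
        }
      else
        return false;           /* stray continuation or invalid lead */

      if (length - index < extra)
        return false;           /* truncated sequence */

      for (size_t count = 0; count < extra; count++)
        {
          unsigned char next = bytes[index++];

          if ((next & 0xC0) != 0x80)
            return false;       /* not a continuation byte */
          value = (value << 6) | (next & 0x3F);
        }

      if (value < minimum[extra])
        return false;           /* overlong, so not minimal */
    }
  return true;
}
```

For instance, the two bytes C3 A9 (e-acute) pass, while C0 80, a two-byte spelling of NUL, is rejected as overlong.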

RECODE_SYSTEM_ERROR

The underlying system reported an error while the recoding was going on, likely an input/output error. (This error symbol is currently unused in the library.)

RECODE_USER_ERROR

The programmer or user requested something the recoding library is unable to provide, or used the API wrongly. (This error symbol is currently unused in the library.)

RECODE_INTERNAL_ERROR

Something really wrong, which should normally never happen, was detected within the recoding library. This might be due to genuine bugs in the library, or maybe due to un-initialised or overwritten arguments to the API. (This error symbol is currently unused in the library.)

RECODE_MAXIMUM_ERROR

This error code should never be returned; it is only used internally as a sentinel for the list of all possible error codes.

One should be able to set the error level threshold for returning failure at end of recoding, and also the threshold for immediate interruption. If many errors occur while the recoding proceeds, which are not severe enough to interrupt the recoding, then the most severe error is retained, while others are forgotten (8). So, in case of an error, the possible actions currently are:

See Task level, and particularly the description of the fields fail_level, abort_level and error_so_far, for more information about how errors are handled.


Footnotes

(7)

The minimality of an UTF-8 encoding is guaranteed on output, but currently, it is not checked on input.

(8)

Another approach would have been to define the level symbols as masks instead, to give masks to threshold setting routines, and to retain all errors. However, I have never met such a need in practice myself, and so I fear it would be overkill. On the other hand, it might be interesting to maintain counters about how many times each kind of error occurred.

