Library (The recode reference manual)

4 A recoding library

The program named recode is just an application of its recoding library. The recoding library is available separately for other C programs. A good way to acquire some familiarity with the recoding library is to get acquainted with the recode program itself.

To use the recoding library once it is installed, a C program needs to have a line:

#include <recode.h>

near its beginning, and the user should have ‘-lrecode’ on the linking call, so modules from the recoding library are found.

The library is still under development. As it stands, it contains four identifiable sets of routines: the outer level functions, the request level functions, the task level functions and the charset level functions. These are discussed in separate sections.

To use the recoding library effectively in most applications, it should rarely be necessary to study anything beyond the main initialisation function at outer level, and then, various functions at request level.

4.1 Outer level functions

The outer level functions mainly prepare the whole recoding library for use, or do actions which are unrelated to specific recodings. Here is an example of a program which does not really do anything useful.

#include <stdbool.h>
#include <stdlib.h>
#include <recode.h>

const char *program_name;

int
main (int argc, char *const *argv)
{
  program_name = argv[0];
  RECODE_OUTER outer = recode_new_outer (true);

  recode_delete_outer (outer);
  exit (0);
}

The header file <recode.h> declares an opaque RECODE_OUTER structure, which the programmer should use for allocating a variable in his program (let’s assume the programmer is a male, here, no prejudice intended). This ‘outer’ variable is given as a first argument to all outer level functions.

The <recode.h> header file uses the Boolean type set up by the system header file <stdbool.h>. But this header file is still fairly new in C standards, and likely does not exist everywhere. If your system does not offer this system header file yet, the proper compilation of the <recode.h> file could be guaranteed through the replacement of the inclusion line by:

typedef enum {false = 0, true = 1} bool;

People wanting wider portability, or Autoconf lovers, might arrange their configure.in for being able to write something more general, like:

#if STDC_HEADERS
# include <stdlib.h>
#endif

/* Some systems do not define EXIT_*, even with STDC_HEADERS.  */
#ifndef EXIT_SUCCESS
# define EXIT_SUCCESS 0
#endif
#ifndef EXIT_FAILURE
# define EXIT_FAILURE 1
#endif
/* The following test is to work around the gross typo in systems like Sony
   NEWS-OS Release 4.0C, whereby EXIT_FAILURE is defined to 0, not 1.  */
#if !EXIT_FAILURE
# undef EXIT_FAILURE
# define EXIT_FAILURE 1
#endif

#if HAVE_STDBOOL_H
# include <stdbool.h>
#else
typedef enum {false = 0, true = 1} bool;
#endif

#include <recode.h>

const char *program_name;

int
main (int argc, char *const *argv)
{
  program_name = argv[0];
  RECODE_OUTER outer = recode_new_outer (true);

  recode_delete_outer (outer);
  exit (EXIT_SUCCESS);
}

but we will not insist on such details in the examples to come.

Initialisation functions
```
RECODE_OUTER recode_new_outer (auto_abort);
bool recode_delete_outer (outer);
```
The recoding library absolutely needs to be initialised before being used, and recode_new_outer has to be called once, first. The function returns the outer it initialises, and accepts a Boolean argument telling whether or not the library should automatically issue diagnostics on standard error and abort the whole program on errors. When auto_abort is true, the library later conveniently issues diagnostics itself, and aborts the calling program on errors. This is merely a convenience: if this parameter were false, the calling program would always have to check the return value of all other calls to the recoding library functions and, when any error is detected, issue a diagnostic and abort processing itself.

Regardless of the setting of auto_abort, all recoding library functions return a success status. Most functions are geared for returning false for an error, and true if everything went fine. Functions returning structures or strings return NULL instead of the result, when the result cannot be produced. If auto_abort is selected, functions either return true, or do not return at all.

As in the example above, recode_new_outer is called only once in most cases. Calling recode_new_outer implies some overhead, so calling it more than once should preferably be avoided.

The termination function recode_delete_outer reclaims the memory allocated by recode_new_outer for a given outer variable. Calling recode_delete_outer prior to program termination is more aesthetic than useful, as all memory resources are automatically reclaimed when the program ends. You may spare this terminating call if you prefer.

The program_name declaration

As we just explained, the user may set the recode library so that, in case of error, it issues the diagnostic itself and aborts the whole processing. This capability may be quite convenient. When this feature is used, the aborting routine includes the name of the running program in the diagnostic. On the other hand, when this feature is not used, the library merely returns error codes, giving the library user fuller control over all this. This behaviour is more like what usual libraries do: they return codes and never abort. However, I would rather not force library users to necessarily check all return codes themselves, by leaving no other choice. In most simple applications, letting the library diagnose and abort is much easier, and quite welcome. It is precisely because both possibilities exist that the program_name variable is needed: it may be used by the library when the user sets it to diagnose itself.
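When auto_abort is false, every call has to be checked by the caller. The following is only a sketch of one way to keep that checking compact; the helper name checked is hypothetical and not part of the recoding library, and the value given to program_name is a placeholder.

```c
#include <stdbool.h>
#include <stdio.h>

/* The calling program defines this variable; "demo" is a placeholder
   for what would normally be taken from argv[0].  */
const char *program_name = "demo";

/* Hypothetical helper, not a library function: report a failed step on
   standard error, prefixed by the program name, and hand the status
   back unchanged so the caller can decide what to do next.  */
static bool
checked (bool success, const char *step)
{
  if (!success)
    fprintf (stderr, "%s: %s failed\n", program_name, step);
  return success;
}
```

A caller could then wrap each Boolean result, for instance the one returned by recode_scan_request, in checked (..., "scanning request") and exit with a failure status when it yields false.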

4.2 Request level functions

The request level functions are meant to cover most recoding needs programmers may have; they should provide all usual functionality. Their API is almost stable by now. To get started with request level functions, here is a full example of a program whose sole job is to filter ibmpc code on its standard input into latin1 code on its standard output.

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <recode.h>

const char *program_name;

int
main (int argc, char *const *argv)
{
  program_name = argv[0];
  RECODE_OUTER outer = recode_new_outer (true);
  RECODE_REQUEST request = recode_new_request (outer);
  bool success;

  recode_scan_request (request, "ibmpc..latin1");

  success = recode_file_to_file (request, stdin, stdout);

  recode_delete_request (request);
  recode_delete_outer (outer);

  exit (success ? 0 : 1);
}

The header file <recode.h> declares a RECODE_REQUEST structure, which the programmer should use for allocating a variable in his program. This request variable is given as a first argument to all request level functions, and in most cases, may be considered as opaque.

Initialisation functions
```
RECODE_REQUEST recode_new_request (outer);
bool recode_delete_request (request);
```
No request variable may be used in other request level functions of the recoding library before having been initialised by recode_new_request. There may be many such request variables, in which case, they are independent of one another and they all need to be initialised separately. To avoid memory leaks, a request variable should not be initialised a second time without calling recode_delete_request to “un-initialise” it.

As with recode_delete_outer, calling recode_delete_request prior to program termination, in the example above, may be left out.

Fields of struct recode_request

Here are the fields of a struct recode_request which may be meaningfully changed, once a request has been initialised by recode_new_request, but before it gets used. In practice, these fields seldom need to be changed. To access the fields, you need to include recodext.h instead of recode.h, in which case there also is a greater chance that you need to recompile your programs if a new version of the recoding library gets installed.

verbose_flag

This field is initially false. When set to true, the library will echo to stderr the sequence of elementary recoding steps needed to achieve the requested recoding.

diaeresis_char

This field is initially the ASCII value of a double quote ", but it may also be the ASCII value of a colon :. In the texte charset, some countries use double quotes to mark diaeresis, while other countries prefer colons. This field contains the diaeresis character for the texte charset.

make_header_flag

This field is initially false. When set to true, it indicates that the program is merely trying to produce a recoding table in source form rather than completing any actual recoding. In such a case, the optimisation of step sequence can be attempted much more aggressively. If the step sequence cannot be reduced to a single step, table production will fail.

diacritics_only

This field is initially false. For the HTML and LaTeX charsets, it is often convenient to recode the diacriticized characters only, while leaving alone other HTML code using ampersands or angular brackets, or LaTeX code using backslashes. Set the field to true for getting this behaviour. In the other charset, one can then edit text as well as HTML or LaTeX directives.

ascii_graphics

This field is initially false, and relates to characters 176 to 223 in the ibmpc charset, which are used to draw boxes. When set to true, while getting out of ibmpc, ASCII characters are selected so as to graphically approximate these boxes.

Study of request strings
```
bool recode_scan_request (request, "string");
```
The main role of a request variable is to describe a set of recoding transformations. Function recode_scan_request studies the given string, and stores an internal representation of it into request. Note that string may be a full-fledged recode request, possibly including surface specifications, intermediary charsets, sequences, aliases or abbreviations (see Requests).

The internal representation automatically receives some pre-conditioning and optimisation, so the request may then later be used many times to achieve many actual recodings. Calling recode_scan_request many times with the same string would not be efficient; it is better to keep many request variables instead.

Actual recoding jobs

Once the request variable holds the description of a recoding transformation, a few functions use it for achieving an actual recoding. Either input or output of a recoding may be a string, an in-memory buffer, or a file.

Functions with names like recode_input-type_to_output-type request an actual recoding, and are described below. It is easy to remember which arguments each function accepts, once some simple principles are grasped for each possible type. However, one of the recoding functions escapes these principles and is discussed separately, first.
```
char *recode_string (request, string);
```
The function recode_string recodes string according to request, and directly returns the resulting recoded string freshly allocated, or NULL if the recoding could not succeed for some reason. When this function is used, it is the responsibility of the programmer to ensure that the memory used by the returned string is later reclaimed.
```
bool recode_string_to_buffer (request,
  input_string,
  &output_buffer, &output_length, &output_allocated);
bool recode_string_to_file (request,
  input_string,
  output_file);
bool recode_buffer_to_buffer (request,
  input_buffer, input_length,
  &output_buffer, &output_length, &output_allocated);
bool recode_buffer_to_file (request,
  input_buffer, input_length,
  output_file);
bool recode_file_to_buffer (request,
  input_file,
  &output_buffer, &output_length, &output_allocated);
bool recode_file_to_file (request,
  input_file,
  output_file);
```
All these functions return a bool result, false meaning that the recoding was not successful, often because of reversibility issues. The name of each function indicates which type it reads and which type it produces. Let’s discuss these three types in turn.

string

A string is merely an in-memory buffer which is terminated by a NUL character (using as many bytes as needed), instead of being described by a byte length. For input, a pointer to the buffer is given through one argument.

It is notable that there are no to_string functions. Only one function recodes into a string, and it is recode_string, which has already been discussed separately, above.

buffer

A buffer is a sequence of bytes held in computer memory. For input, two arguments provide a pointer to the start of the buffer and its byte size. Note that for charsets using many bytes per character, the size is given in bytes, not in characters.

For output, three arguments provide the address of three variables, which will receive the buffer pointer, the used buffer size in bytes, and the allocated buffer size in bytes. If at the time of the call, the buffer pointer is NULL, then the allocated buffer size should also be zero, and the buffer will be allocated afresh by the recoding functions. However, if the buffer pointer is not NULL, it should be already allocated, and the allocated buffer size then gives its size. If the allocated size gets exceeded while the recoding proceeds, the buffer will be automatically reallocated bigger, probably elsewhere, and the allocated buffer size will be adjusted accordingly.

The second variable, giving the in-memory buffer size, will receive the exact byte size which was needed for the recoding. A NUL character is guaranteed at the end of the produced buffer, but is not counted in the byte size of the recoding. Beyond that NUL, there might be some extra space after the recoded data, extending to the allocated buffer size.

file

A file is a sequence of bytes held outside computer memory, but buffered through it. For input, one argument provides a pointer to a file already opened for read. The file is then read and recoded from its current position until the end of the file, effectively swallowing it in memory if the destination of the recoding is a buffer. For reading a file filtered through the recoding library, but only a little bit at a time, one should rather use recode_filter_open and recode_filter_close (these two functions are not yet available).

For output, one argument provides a pointer to a file already opened for write. The result of the recoding is written to that file starting at its current position.
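The three-variable output convention described for buffers can be illustrated without the library. The sketch below uses a hypothetical stand-in, upcase_to_buffer, which merely upper-cases ASCII letters instead of recoding; only its handling of the buffer pointer, the used size and the allocated size is meant to follow the description above, and it makes no claim about how the library itself is implemented.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a recode_*_to_buffer call.  It upper-cases
   ASCII letters rather than recoding, but treats the three output
   variables as the text describes: a NULL pointer with a zero allocated
   size means "allocate afresh", a buffer that is too small is grown
   (and may move), and a NUL is written after the data without being
   counted in the used size.  */
static bool
upcase_to_buffer (const char *input, size_t input_length,
                  char **output_buffer, size_t *output_length,
                  size_t *output_allocated)
{
  size_t needed = input_length + 1;   /* room for the trailing NUL */

  if (*output_buffer == NULL || *output_allocated < needed)
    {
      char *grown = realloc (*output_buffer, needed);

      if (grown == NULL)
        return false;
      *output_buffer = grown;
      *output_allocated = needed;
    }

  for (size_t i = 0; i < input_length; i++)
    (*output_buffer)[i] = (char) toupper ((unsigned char) input[i]);
  (*output_buffer)[input_length] = '\0';
  *output_length = input_length;      /* the NUL is not counted */
  return true;
}
```

A caller starts with a NULL pointer and two zero sizes, passes the addresses of all three variables on every call so the same buffer is reused or grown, and frees the buffer once it is no longer needed.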

The following special function is still subject to change:

void recode_format_table (request, language, "name");

and is not documented anymore for now.

4.3 Task level functions

The task level functions are used internally by the request level functions; they allow more explicit control over files and memory buffers holding input and output to recoding processes. The interface specification of task level functions is still subject to change a bit.

To get started with task level functions, here is a full example of a program whose sole job is to filter ibmpc code on its standard input into latin1 code on its standard output. That is, this program has the same goal as the one from the previous section, but does its things a bit differently.

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <recodext.h>

const char *program_name;

int
main (int argc, char *const *argv)
{
  program_name = argv[0];
  RECODE_OUTER outer = recode_new_outer (false);
  RECODE_REQUEST request = recode_new_request (outer);
  RECODE_TASK task;
  bool success;

  recode_scan_request (request, "ibmpc..latin1");

  task = recode_new_task (request);
  task->input.name = "";
  task->output.name = "";
  success = recode_perform_task (task);

  recode_delete_task (task);
  recode_delete_request (request);
  recode_delete_outer (outer);

  exit (success ? 0 : 1);
}

The header file <recode.h> declares a RECODE_TASK structure, which the programmer should use for allocating a variable in his program. This task variable is given as a first argument to all task level functions. The programmer ought to change and possibly consult a few fields in this structure, using special functions.

Initialisation functions
```
RECODE_TASK recode_new_task (request);
bool recode_delete_task (task);
```
No task variable may be used in other task level functions of the recoding library without having first been initialised with recode_new_task. There may be many such task variables, in which case, they are independent of one another and they all need to be initialised separately. To avoid memory leaks, a task variable should not be initialised a second time without calling recode_delete_task to “un-initialise” it. The function recode_new_task also accepts a request argument and associates the request to the task. In fact, a task is essentially a set of recoding transformations with the specification for its current input and its current output.

The request variable may be scanned before or after the call to recode_new_task; it does not matter so far. Immediately after initialisation, before further changes, the task variable associates request with empty in-memory buffers for both input and output. The output buffer will later get allocated automatically on the fly, as needed, by various task processors.

Even if a call to recode_delete_task is not strictly mandatory before ending the program, it is cleaner to always include it. Moreover, in some future version of the recoding library, it might become required.

Fields of struct recode_task

Here are the fields of a struct recode_task which may be meaningfully changed, once a task has been initialised by recode_new_task. In fact, fields are expected to change. Once again, to access the fields, you need to include recodext.h instead of recode.h, in which case there also is a greater chance that you need to recompile your programs if a new version of the recoding library gets installed.

request

The field request points to the current recoding request, but may be changed as needed between recoding calls, for example when there is a need to achieve the construction of a resulting text made up of many pieces, each being recoded differently.

input.name

input.file

If input.name is not NULL at start of a recoding, this is a request that a file by that name be first opened for reading and later automatically closed once the whole file has been read. If the file name is not NULL but an empty string, it means that standard input is to be used. The opened file pointer is then held into input.file.

If input.name is NULL and input.file is not, then input.file should point to a file already opened for read, which is meant to be recoded.

input.buffer

input.cursor

input.limit

When both input.name and input.file are NULL, three pointers describe an in-memory buffer containing the text to be recoded. The buffer extends from input.buffer to input.limit, yet the text to be recoded only extends from input.cursor to input.limit. In most situations, input.cursor starts with the value that input.buffer has. (Its value will internally advance as the recoding goes, until it reaches the value of input.limit.)

output.name

output.file

If output.name is not NULL at start of a recoding, this is a request that a file by that name be opened for write and later automatically closed after the recoding is done. If the file name is not NULL but an empty string, it means that standard output is to be used. The opened file pointer is then held into output.file. If several passes with intermediate files are needed to produce the recoding, the output.name file is opened only for the final pass.

If output.name is NULL and output.file is not, then output.file should point to a file already opened for write, which will receive the result of the recoding.

output.buffer

output.cursor

output.limit

When both output.name and output.file are NULL, three pointers describe an in-memory buffer meant to receive the text, once it is recoded. The buffer is already allocated from output.buffer to output.limit. In most situations, output.cursor starts with the value that output.buffer has. Once the recoding is done, output.cursor will point at the next free byte in the buffer, just after the recoded text, so another recoding could be called without changing any of these three pointers, for appending new information to it. The number of recoded bytes in the buffer is the difference between output.cursor and output.buffer.

Each time output.cursor reaches output.limit, the buffer is reallocated bigger, possibly at a different location in memory, always held up-to-date in output.buffer. It is still possible to call a task level function with no output buffer at all to start with, in which case all three fields should have NULL as a value. This is the situation immediately after a call to recode_new_task.

strategy

This field, which is of type enum recode_sequence_strategy, tells how various recoding steps (passes) will be interconnected. Its initial value is RECODE_STRATEGY_UNDECIDED, which is a constant defined in the header file <recodext.h>. Other possible values are:

RECODE_SEQUENCE_IN_MEMORY

Keep intermediate recodings in memory.

RECODE_SEQUENCE_WITH_FILES

Do not fork, use intermediate files.

RECODE_SEQUENCE_WITH_PIPE

Fork processes connected with pipe(2).

The best for now is to leave this field alone, and let the recoding library decide its strategy, as many combinations have not been tested yet.

byte_order_mark

This field, which is preset to true, indicates that a byte order mark is to be expected at the beginning of any canonical UCS-2 or UTF-16 text, and that such a byte order mark should be also produced for these charsets.

fail_level

This field, which is of type enum recode_error (see Errors), sets the error level at which task level functions should report a failure. If an error being detected is equal to or greater than fail_level, the function will eventually return false instead of true. The preset value for this field is RECODE_NOT_CANONICAL, which means that, if not reset to another value, the library will report failure on any error.

abort_level

This field, which is of type enum recode_error (see Errors), sets the error level at which task level functions should immediately interrupt their processing. If an error being detected is equal to or greater than abort_level, the function returns immediately, but the returned value (true or false) is still decided from the setting of fail_level, not abort_level. The preset value for this field is RECODE_MAXIMUM_ERROR, which means that, if not reset to another value, the library will never interrupt a recoding task.

error_so_far

This field, which is of type enum recode_error (see Errors), maintains the maximum error level met so far while the recoding task was proceeding. The preset value is RECODE_NO_ERROR.
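The relationship between the output.buffer, output.cursor and output.limit fields can be modelled in a few lines. This is a sketch under stated assumptions, not the library's code: struct output_span, span_append and span_used are hypothetical names, and the doubling growth policy is an arbitrary choice made for the illustration.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical mirror of the three output pointers; the real fields
   belong to the task structure declared by <recodext.h>.  */
struct output_span
{
  char *buffer;   /* start of the allocated area, or NULL */
  char *cursor;   /* next free byte */
  char *limit;    /* one past the end of the allocated area */
};

/* Append LENGTH bytes, growing the area when the data would pass the
   limit.  The area may move, so all three pointers are refreshed.
   Doubling the size is an assumption of this sketch.  */
static bool
span_append (struct output_span *span, const char *bytes, size_t length)
{
  size_t used = span->buffer ? (size_t) (span->cursor - span->buffer) : 0;
  size_t allocated = span->buffer ? (size_t) (span->limit - span->buffer) : 0;

  if (used + length > allocated)
    {
      size_t wanted = allocated ? allocated : 16;
      char *moved;

      while (wanted < used + length)
        wanted *= 2;
      moved = realloc (span->buffer, wanted);
      if (moved == NULL)
        return false;
      span->buffer = moved;
      span->cursor = moved + used;
      span->limit = moved + wanted;
    }
  if (length > 0)
    {
      memcpy (span->cursor, bytes, length);
      span->cursor += length;
    }
  return true;
}

/* Number of bytes produced so far: cursor minus buffer.  */
static size_t
span_used (const struct output_span *span)
{
  return span->buffer ? (size_t) (span->cursor - span->buffer) : 0;
}
```

Starting from three NULL pointers, successive calls to span_append accumulate data in one growing area, which matches the way a second recoding can append to the result of a first one without resetting the pointers.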

Task execution
```
recode_perform_task (task);
recode_filter_open (task, file);
recode_filter_close (task);
```
The function recode_perform_task reads as much input as possible, and recodes all of it on prescribed output, given a properly initialised task.

Functions recode_filter_open and recode_filter_close are only planned for now. They are meant to read input in piecemeal ways. Even if functionality already exists informally in the library, it has not been made available yet through such interface functions.

4.4 Charset level functions

Many functions are internal to the recoding library. Some of them have been made external and available, for the recode program had to retain all its previous functionality while being transformed into a mere application of the recoding library. These functions are not really documented here for the time being, as we hope that many of them will vanish over time. When this set of routines stabilises, it would be convenient to document them as an API for handling charset names and contents.

RECODE_CHARSET find_charset (name, cleaning-type);
bool list_all_charsets (charset);
bool list_concise_charset (charset, list-format);
bool list_full_charset (charset);

4.5 Handling errors

The recode program, while using the recode library, needs to control whether recoding problems are reported or not, and then reflect these in the exit status. The program should also instruct the library whether the recoding should be abruptly interrupted when an error is met (so sparing processing when it is known in advance that a wrong result would be discarded anyway), or if it should proceed nevertheless. Here is how the library groups errors into levels, listed here in order of increasing severity.

RECODE_NO_ERROR

No error was met on previous library calls.

RECODE_NOT_CANONICAL

The input text was using one of the many alternative codings for some phenomenon, but not the one recode would have canonically generated. So, if the reverse recoding is later attempted, it would produce a text having the same meaning as the original text, yet not being byte identical.

For example, a Base64 block in which end-of-lines appear elsewhere than at every 76 characters is not canonical. An e-circumflex in TeX which is coded as ‘\^{e}’ instead of ‘\^e’ is not canonical.

RECODE_AMBIGUOUS_OUTPUT

It has been discovered that if the reverse recoding was attempted on the text output by this recoding, we would not obtain the original text, only because an ambiguity was generated by accident in the output text. This ambiguity would then cause the wrong interpretation to be taken.

Here are a few examples. If the Latin-1 sequence ‘e^’ is converted to Easy French and back, the result will be interpreted as e-circumflex and so, will not reflect the intent of the original two characters. Recoding an IBM-PC text to Latin-1 and back, where the input text contained an isolated LF, will have a spurious CR inserted before the LF.

Currently, there are many cases in the library where the production of ambiguous output is not properly detected, as it is sometimes a difficult problem to accomplish this detection, or to do it speedily.

RECODE_UNTRANSLATABLE

One or more input characters could not be recoded, because there is just no representation for these characters in the output charset.

Here are a few examples. Non-strict mode often allows recode to compute on-the-fly mappings for unrepresentable characters, but strict mode prohibits such attribution of reversible translations: so strict mode might often trigger such an error. Most UCS-2 codes used to represent Asian characters cannot be expressed in various Latin charsets.

RECODE_INVALID_INPUT

The input text does not comply with the coding it is declared to hold. So, there is no way by which a reverse recoding would reproduce this text, because recode should never produce invalid output.

Here are a few examples. In strict mode, ASCII text is not allowed to contain characters with the eighth bit set. UTF-8 encodings ought to be minimal⁷.

RECODE_SYSTEM_ERROR

The underlying system reported an error while the recoding was going on, likely an input/output error. (This error symbol is currently unused in the library.)

RECODE_USER_ERROR

The programmer or user requested something the recoding library is unable to provide, or used the API wrongly. (This error symbol is currently unused in the library.)

RECODE_INTERNAL_ERROR

Something really wrong, which should normally never happen, was detected within the recoding library. This might be due to genuine bugs in the library, or maybe due to un-initialised or overwritten arguments to the API. (This error symbol is currently unused in the library.)

RECODE_MAXIMUM_ERROR

This error code should never be returned, it is only internally used as a sentinel for the list of all possible error codes.
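The minimality requirement mentioned under RECODE_INVALID_INPUT can be made concrete with a small check. The function below is a hypothetical helper written for this illustration, not a library routine; it only tests that a single UTF-8 sequence uses the fewest bytes possible for its code point, and deliberately ignores other validity rules such as surrogates or the upper code point limit.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: return true when the UTF-8 sequence starting at
   BYTES (LENGTH bytes available) is structurally well formed and uses
   the fewest bytes possible for the code point it encodes.  */
static bool
utf8_is_minimal (const unsigned char *bytes, size_t length)
{
  unsigned long code;
  size_t count, i;

  if (length == 0)
    return false;
  if (bytes[0] < 0x80)
    return true;                      /* one byte is always minimal */
  else if ((bytes[0] & 0xE0) == 0xC0)
    { count = 2; code = bytes[0] & 0x1F; }
  else if ((bytes[0] & 0xF0) == 0xE0)
    { count = 3; code = bytes[0] & 0x0F; }
  else if ((bytes[0] & 0xF8) == 0xF0)
    { count = 4; code = bytes[0] & 0x07; }
  else
    return false;                     /* not a valid lead byte */

  if (length < count)
    return false;
  for (i = 1; i < count; i++)
    {
      if ((bytes[i] & 0xC0) != 0x80)
        return false;                 /* not a continuation byte */
      code = (code << 6) | (bytes[i] & 0x3F);
    }

  /* Smallest code point that genuinely needs COUNT bytes.  */
  return code >= (count == 2 ? 0x80UL : count == 3 ? 0x800UL : 0x10000UL);
}
```

For instance, the two bytes C3 A9 (e-acute) pass the check, while C0 80, an over-long encoding of NUL, does not.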

One should be able to set the error level threshold for returning failure at end of recoding, and also the threshold for immediate interruption. If many errors occur while the recoding proceeds, which are not severe enough to interrupt the recoding, then the most severe error is retained, while others are forgotten⁸. So, in case of an error, the possible actions currently are:

do nothing and let go, returning success at end of recoding,
just let go for now, but return failure at end of recoding,
interrupt recoding right away and return failure now.

See Task level, and particularly the description of the fields fail_level, abort_level and error_so_far, for more information about how errors are handled.
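The interplay of the two thresholds and the retained error can be sketched as follows. The enum is a local mirror with hypothetical LEVEL_ names; it assumes that the numeric order of the real RECODE_ symbols follows the order of the listing above, which the text suggests but this sketch does not verify. The two functions are illustrations, not library routines.

```c
#include <stdbool.h>

/* Local mirror of the error levels, in increasing severity as listed
   above.  That the real symbols compare in this order is an assumption
   of this sketch.  */
enum level
{
  LEVEL_NO_ERROR, LEVEL_NOT_CANONICAL, LEVEL_AMBIGUOUS_OUTPUT,
  LEVEL_UNTRANSLATABLE, LEVEL_INVALID_INPUT, LEVEL_SYSTEM_ERROR,
  LEVEL_USER_ERROR, LEVEL_INTERNAL_ERROR, LEVEL_MAXIMUM_ERROR
};

struct thresholds
{
  enum level fail_level;     /* at or above: report failure at the end */
  enum level abort_level;    /* at or above: interrupt right away */
  enum level error_so_far;   /* most severe level met so far */
};

/* Record one detected error, keeping only the most severe one.
   Return true when processing should be interrupted immediately.  */
static bool
note_error (struct thresholds *t, enum level detected)
{
  if (detected > t->error_so_far)
    t->error_so_far = detected;
  return detected >= t->abort_level;
}

/* Final status: false once a recorded error has reached fail_level.  */
static bool
final_status (const struct thresholds *t)
{
  return t->error_so_far < t->fail_level;
}
```

With the preset values described for the task fields (fail_level at the NOT_CANONICAL level, abort_level at the MAXIMUM_ERROR level), note_error never asks for an interruption, and final_status turns false after the first error of any kind.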

Footnotes

(7)

The minimality of an UTF-8 encoding is guaranteed on output, but currently, it is not checked on input.

(8)

Another approach would have been to define the level symbols as masks instead, and to give masks to threshold setting routines, and to retain all errors—yet I never met myself such a need in practice, and so I fear it would be overkill. On the other hand, it might be interesting to maintain counters about how many times each kind of error occurred.
