The program named recode
is just an application of its recoding
library. The recoding library is available separately for other C
programs. A good way to acquire some familiarity with the recoding
library is to get acquainted with the recode
program itself.
To use the recoding library once it is installed, a C program needs to have a line:
#include <recode.h>
near its beginning, and the user should have ‘-lrecode’ on the linking call, so modules from the recoding library are found.
The library is still under development. As it stands, it contains four identifiable sets of routines: the outer level functions, the request level functions, the task level functions and the charset level functions. They are discussed in separate sections.
To use the recoding library effectively in most applications, it should rarely be necessary to study anything beyond the main initialisation function at the outer level and various functions at the request level.
• Outer level: Outer level functions
• Request level: Request level functions
• Task level: Task level functions
• Charset level: Charset level functions
• Errors: Handling errors
The outer level functions mainly prepare the whole recoding library for use, or perform actions which are unrelated to specific recodings. Here is an example of a program which does not really do anything useful.
#include <stdlib.h>
#include <stdbool.h>
#include <recode.h>

const char *program_name;

int
main (int argc, char *const *argv)
{
  program_name = argv[0];
  RECODE_OUTER outer = recode_new_outer (true);

  recode_delete_outer (outer);
  exit (0);
}
The header file <recode.h>
declares an opaque RECODE_OUTER
structure, which the programmer should use for allocating a variable in
his program (let’s assume the programmer is a male, here, no prejudice
intended). This ‘outer’ variable is given as a first argument to
all outer level functions.
The <recode.h> header file uses the Boolean type set up by the system header file <stdbool.h>. But this header file is still fairly new in C standards, and likely does not exist everywhere. If your system does not offer this system header file yet, the proper compilation of the <recode.h> file could be guaranteed through the replacement of the inclusion line by:
typedef enum {false = 0, true = 1} bool;
People wanting wider portability, or Autoconf lovers, might arrange their configure.in so that they can write something more general, like:
#if STDC_HEADERS
# include <stdlib.h>
#endif

/* Some systems do not define EXIT_*, even with STDC_HEADERS.  */
#ifndef EXIT_SUCCESS
# define EXIT_SUCCESS 0
#endif
#ifndef EXIT_FAILURE
# define EXIT_FAILURE 1
#endif
/* The following test is to work around the gross typo in systems like Sony
   NEWS-OS Release 4.0C, whereby EXIT_FAILURE is defined to 0, not 1.  */
#if !EXIT_FAILURE
# undef EXIT_FAILURE
# define EXIT_FAILURE 1
#endif

#if HAVE_STDBOOL_H
# include <stdbool.h>
#else
typedef enum {false = 0, true = 1} bool;
#endif

#include <recode.h>

const char *program_name;

int
main (int argc, char *const *argv)
{
  program_name = argv[0];
  RECODE_OUTER outer = recode_new_outer (true);

  recode_delete_outer (outer);
  exit (EXIT_SUCCESS);
}
but we will not insist on such details in the examples to come.
RECODE_OUTER recode_new_outer (auto_abort);
bool recode_delete_outer (outer);
The recoding library absolutely needs to be initialised before being used, so recode_new_outer has to be called once, first. The function returns the outer it initialises, and accepts a Boolean argument telling whether or not the library should automatically issue diagnostics on standard error and abort the whole program on errors. When auto_abort is true, the library later conveniently issues diagnostics itself, and aborts the calling program on errors. This is merely a convenience: if this parameter were false, the calling program would always have to check the return value of every other call to the recoding library functions and, when any error is detected, issue a diagnostic and abort processing itself.
Regardless of the setting of auto_abort, all recoding library
functions return a success status. Most functions are geared for returning
false
for an error, and true
if everything went fine.
Functions returning structures or strings return NULL
instead
of the result, when the result cannot be produced. If auto_abort
is selected, functions either return true
, or do not return at all.
As in the example above, recode_new_outer
is called only once in
most cases. Calling recode_new_outer
implies some overhead, so
calling it more than once should preferably be avoided.
The termination function recode_delete_outer
reclaims the memory
allocated by recode_new_outer
for a given outer variable.
Calling recode_delete_outer prior to program termination is more aesthetic than useful, as all memory resources are automatically reclaimed when the program ends. You may spare this terminating call if you prefer.
The program_name declaration
As we just explained, the user may set the recode library so that, in case of error, it issues the diagnostic itself and aborts the whole processing. This capability may be quite convenient. When this feature is used, the aborting routine includes the name of the running program in the diagnostic. On the other hand, when this feature is not used, the library merely returns error codes, giving the library user fuller control over all this. This behaviour is more like what usual libraries do: they return codes and never abort. However, I would rather not force library users to check all return codes themselves by leaving them no other choice. In most simple applications, letting the library diagnose and abort is much easier, and quite welcome. It is precisely because both possibilities exist that the program_name variable is needed: the library may use it when the user sets the library to issue diagnostics itself.
The request level functions are meant to cover most recoding needs
programmers may have; they should provide all usual functionality.
Their API is almost stable by now. To get started with request level functions, here is a full example of a program whose sole job is to filter ibmpc code on its standard input into latin1 code on its standard output.
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <recode.h>

const char *program_name;

int
main (int argc, char *const *argv)
{
  program_name = argv[0];
  RECODE_OUTER outer = recode_new_outer (true);
  RECODE_REQUEST request = recode_new_request (outer);
  bool success;

  recode_scan_request (request, "ibmpc..latin1");

  success = recode_file_to_file (request, stdin, stdout);

  recode_delete_request (request);
  recode_delete_outer (outer);

  exit (success ? 0 : 1);
}
The header file <recode.h>
declares a RECODE_REQUEST
structure,
which the programmer should use for allocating a variable in his program.
This request variable is given as a first argument to all request
level functions, and in most cases, may be considered as opaque.
RECODE_REQUEST recode_new_request (outer);
bool recode_delete_request (request);
No request variable may be used in other request level functions of the recoding library before having been initialised by recode_new_request. There may be many such request variables, in which case they are independent of one another and they all need to be initialised separately. To avoid memory leaks, a request variable should not be initialised a second time without calling recode_delete_request to “un-initialise” it.
As for recode_delete_outer, calling recode_delete_request prior to program termination, in the example above, may be left out.
struct recode_request
Here are the fields of a struct recode_request
which may be
meaningfully changed, once a request has been initialised by
recode_new_request
, but before it gets used. It is not very frequent,
in practice, that these fields need to be changed. To access the fields,
you need to include recodext.h instead of recode.h,
in which case there also is a greater chance that you need to recompile
your programs if a new version of the recoding library gets installed.
verbose_flag
This field is initially false
. When set to true
, the
library will echo to stderr the sequence of elementary recoding steps
needed to achieve the requested recoding.
diaeresis_char
This field is initially the ASCII value of a double quote ",
but it may also be the ASCII value of a colon :. In texte
charset, some countries use double quotes to mark diaeresis, while other
countries prefer colons. This field contains the diaeresis character
for the texte
charset.
make_header_flag
This field is initially false
. When set to true
, it
indicates that the program is merely trying to produce a recoding table in
source form rather than completing any actual recoding. In such a case,
the optimisation of step sequence can be attempted much more aggressively.
If the step sequence cannot be reduced to a single step, table production
will fail.
diacritics_only
This field is initially false. For the HTML and LaTeX charsets, it is often convenient to recode only the diacriticized characters, while leaving alone other HTML code using ampersands or angular brackets, or LaTeX code using backslashes. Set the field to true to get this behaviour. In the other charset, one can then edit text as well as HTML or LaTeX directives.
ascii_graphics
This field is initially false, and relates to characters 176 to 223 in the ibmpc charset, which are used to draw boxes. When set to true, while getting out of ibmpc, ASCII characters are selected so as to graphically approximate these boxes.
bool recode_scan_request (request, "string");
The main role of a request variable is to describe a set of
recoding transformations. Function recode_scan_request
studies
the given string, and stores an internal representation of it into
request. Note that string may be a full-fledged recode
request, possibly including surfaces specifications, intermediary
charsets, sequences, aliases or abbreviations (see Requests).
The internal representation automatically receives some pre-conditioning and optimisation, so the request may later be used many times to achieve many actual recodings. Calling recode_scan_request many times with the same string would be inefficient; it is better to keep several request variables instead.
Once the request variable holds the description of a recoding transformation, a few functions use it for achieving an actual recoding. Either input or output of a recoding may be a string, an in-memory buffer, or a file.
Functions with names like recode_input-type_to_output-type request an actual recoding, and are described below. It is easy to remember which arguments each function accepts, once a few simple principles have been grasped for each possible type. However, one of the recoding functions escapes these principles and is discussed separately, first.
char *recode_string (request, string);
The function recode_string
recodes string according
to request, and directly returns the resulting recoded string
freshly allocated, or NULL
if the recoding could not succeed for
some reason. When this function is used, it is the responsibility of
the programmer to ensure that the memory used by the returned string is
later reclaimed.
bool recode_string_to_buffer (request, input_string, &output_buffer, &output_length, &output_allocated);
bool recode_string_to_file (request, input_string, output_file);
bool recode_buffer_to_buffer (request, input_buffer, input_length, &output_buffer, &output_length, &output_allocated);
bool recode_buffer_to_file (request, input_buffer, input_length, output_file);
bool recode_file_to_buffer (request, input_file, &output_buffer, &output_length, &output_allocated);
bool recode_file_to_file (request, input_file, output_file);
All these functions return a bool result, false meaning that the recoding was not successful, often because of reversibility issues. The name of each function indicates which type it reads and which type it produces. Let’s discuss these three types in turn.
A string is merely an in-memory buffer which is terminated by a NUL
character (using as many bytes as needed), instead of being described
by a byte length. For input, a pointer to the buffer is given through
one argument.
It is notable that there are no to_string functions. Only one function recodes into a string, and it is recode_string, which has already been discussed separately, above.
A buffer is a sequence of bytes held in computer memory. For input, two arguments provide a pointer to the start of the buffer and its byte size. Note that for charsets using many bytes per character, the size is given in bytes, not in characters.
For output, three arguments provide the address of three variables, which
will receive the buffer pointer, the used buffer size in bytes, and the
allocated buffer size in bytes. If at the time of the call, the buffer
pointer is NULL
, then the allocated buffer size should also be zero,
and the buffer will be allocated afresh by the recoding functions. However,
if the buffer pointer is not NULL
, it should be already allocated,
the allocated buffer size then gives its size. If the allocated size
gets exceeded while the recoding goes, the buffer will be automatically
reallocated bigger, probably elsewhere, and the allocated buffer size will
be adjusted accordingly.
The second variable, giving the in-memory buffer size, will receive the
exact byte size which was needed for the recoding. A NUL
character
is guaranteed at the end of the produced buffer, but is not counted in the
byte size of the recoding. Beyond that NUL
, there might be some
extra space after the recoded data, extending to the allocated buffer size.
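The output convention just described can be sketched independently of the library. In the stand-alone model below, store_output is a hypothetical stand-in for a recoding function (it merely copies its input and is not part of the recode API); it only illustrates how the three output variables are expected to behave: a NULL buffer paired with a zero allocated size, growth on demand, and a trailing NUL that is not counted in the length.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a recoding function: copies SIZE bytes of DATA
   into the output triple.  Returns 1 on success, 0 if memory ran out.  */
int
store_output (const char *data, size_t size,
              char **buffer, size_t *length, size_t *allocated)
{
  /* A NULL buffer must come with a zero allocated size.  */
  assert (*buffer != NULL || *allocated == 0);

  if (size + 1 > *allocated)
    {
      /* Grow the buffer; realloc (NULL, n) behaves like malloc (n).  */
      char *bigger = realloc (*buffer, size + 1);

      if (bigger == NULL)
        return 0;
      *buffer = bigger;         /* the buffer may have moved */
      *allocated = size + 1;
    }

  memcpy (*buffer, data, size);
  (*buffer)[size] = '\0';       /* guaranteed NUL, not counted in *length */
  *length = size;
  return 1;
}
```

A caller of this model would start with a NULL buffer and zero for both sizes, pass the three addresses on each call so the same buffer gets reused, and release the buffer with free once done.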
A file is a sequence of bytes held outside computer memory, but
buffered through it. For input, one argument provides a pointer to a
file already opened for read. The file is then read and recoded from its
current position until the end of the file, effectively swallowing it in
memory if the destination of the recoding is a buffer. For reading a file
filtered through the recoding library, but only a little bit at a time, one
should rather use recode_filter_open
and recode_filter_close
(these two functions are not yet available).
For output, one argument provides a pointer to a file already opened for write. The result of the recoding is written to that file starting at its current position.
The following special function is still subject to change:
void recode_format_table (request, language, "name");
and is no longer documented for now.
The task level functions are used internally by the request level functions; they allow more explicit control over the files and memory buffers holding input and output to recoding processes. The interface specification of task level functions is still subject to some change.
To get started with task level functions, here is a full example of a program whose sole job is to filter ibmpc code on its standard input into latin1 code on its standard output. That is, this program has the same goal as the one from the previous section, but goes about it a bit differently.
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <recodext.h>

const char *program_name;

int
main (int argc, char *const *argv)
{
  program_name = argv[0];
  RECODE_OUTER outer = recode_new_outer (false);
  RECODE_REQUEST request = recode_new_request (outer);
  RECODE_TASK task;
  bool success;

  recode_scan_request (request, "ibmpc..latin1");

  task = recode_new_task (request);
  /* An empty file name selects standard input or standard output.  */
  task->input.name = "";
  task->output.name = "";

  success = recode_perform_task (task);

  recode_delete_task (task);
  recode_delete_request (request);
  recode_delete_outer (outer);

  exit (success ? 0 : 1);
}
The header file <recode.h>
declares a RECODE_TASK
structure, which the programmer should use for allocating a variable in
his program. This task
variable is given as a first argument to
all task level functions. The programmer ought to change and possibly
consult a few fields in this structure, using special functions.
RECODE_TASK recode_new_task (request);
bool recode_delete_task (task);
No task variable may be used in other task level functions of the recoding library without having first been initialised with recode_new_task. There may be many such task variables, in which case they are independent of one another and they all need to be initialised separately. To avoid memory leaks, a task variable should not be initialised a second time without calling recode_delete_task to “un-initialise” it. The function recode_new_task accepts a request argument and associates that request with the task. In fact, a task is essentially a set of recoding transformations together with the specification of its current input and its current output.
The request variable may be scanned before or after the call to recode_new_task; so far, it does not matter. Immediately after initialisation, before further changes, the task variable associates request with empty in-memory buffers for both input and output. The output buffer will later get allocated automatically on the fly, as needed, by various task processors.
Even if a call to recode_delete_task
is not strictly mandatory
before ending the program, it is cleaner to always include it. Moreover,
in some future version of the recoding library, it might become required.
struct recode_task
Here are the fields of a struct recode_task which may be meaningfully changed, once a task has been initialised by recode_new_task. In fact, fields are expected to change. Once again, to access the fields,
you need to include recodext.h instead of recode.h,
in which case there also is a greater chance that you need to recompile
your programs if a new version of the recoding library gets installed.
request
The field request
points to the current recoding request, but may
be changed as needed between recoding calls, for example when there is
a need to achieve the construction of a resulting text made up of many
pieces, each being recoded differently.
input.name
input.file
If input.name
is not NULL
at start of a recoding, this is
a request that a file by that name be first opened for reading and later
automatically closed once the whole file has been read. If the file name is
not NULL
but an empty string, it means that standard input is to
be used. The opened file pointer is then held into input.file
.
If input.name is NULL and input.file is not, then input.file should point to a file already opened for read, which is meant to be recoded.
input.buffer
input.cursor
input.limit
When both input.name
and input.file
are NULL
, three
pointers describe an in-memory buffer containing the text to be recoded.
The buffer extends from input.buffer
to input.limit
,
yet the text to be recoded only extends from input.cursor
to
input.limit
. In most situations, input.cursor
starts with
the value that input.buffer
has. (Its value will internally advance
as the recoding goes, until it reaches the value of input.limit
.)
output.name
output.file
If output.name
is not NULL
at start of a recoding, this
is a request that a file by that name be opened for write and later
automatically closed after the recoding is done. If the file name is
not NULL
but an empty string, it means that standard output is to
be used. The opened file pointer is then held into output.file
.
If several passes with intermediate files are needed to produce the
recoding, the output.name
file is opened only for the final pass.
If output.name
is NULL
and output.file
is not, then
output.file
should point to a file already opened for write, which
will receive the result of the recoding.
output.buffer
output.cursor
output.limit
When both output.name
and output.file
are NULL
, three
pointers describe an in-memory buffer meant to receive the text, once it
is recoded. The buffer is already allocated from output.buffer
to output.limit
. In most situations, output.cursor
starts
with the value that output.buffer
has. Once the recoding is done,
output.cursor
will point at the next free byte in the buffer,
just after the recoded text, so another recoding could be called without
changing any of these three pointers, for appending new information to it.
The number of recoded bytes in the buffer is the difference between
output.cursor
and output.buffer
.
Each time output.cursor
reaches output.limit
, the buffer
is reallocated bigger, possibly at a different location in memory, always
held up-to-date in output.buffer
. It is still possible to call a
task level function with no output buffer at all to start with, in which
case all three fields should have NULL
as a value. This is the
situation immediately after a call to recode_new_task
.
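The behaviour of the three output pointers can be modelled in a few lines. The sketch below is a stand-alone illustration, not the library's own code: struct demo_output, demo_put_byte and demo_output_size are hypothetical names, and the doubling growth policy is an assumption (the text above only says the buffer is reallocated bigger).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-alone model of the three output pointers (hypothetical type).  */
struct demo_output
{
  char *buffer;                 /* start of the allocated area, or NULL */
  char *cursor;                 /* next free byte */
  char *limit;                  /* one past the end of the allocated area */
};

/* Append one byte, growing the buffer whenever cursor has reached limit.
   Returns 1 on success, 0 if memory ran out.  */
int
demo_put_byte (struct demo_output *out, char byte)
{
  assert (out != NULL);

  if (out->cursor == out->limit)
    {
      size_t used = out->buffer ? (size_t) (out->cursor - out->buffer) : 0;
      size_t size = out->buffer ? (size_t) (out->limit - out->buffer) : 0;
      size_t bigger = size ? 2 * size : 16;     /* assumed growth policy */
      char *moved = realloc (out->buffer, bigger);

      if (moved == NULL)
        return 0;
      /* The block may have moved: recompute all three pointers.  */
      out->buffer = moved;
      out->cursor = moved + used;
      out->limit = moved + bigger;
    }

  *out->cursor++ = byte;
  return 1;
}

/* Number of bytes produced so far: cursor minus buffer.  */
size_t
demo_output_size (const struct demo_output *out)
{
  return out->buffer ? (size_t) (out->cursor - out->buffer) : 0;
}
```

Starting from three NULL pointers mirrors the state described for a freshly created task, and successive appends leave cursor just after the last byte written, so later output simply continues from there.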
strategy
This field, which is of type enum recode_sequence_strategy
, tells
how various recoding steps (passes) will be interconnected. Its initial
value is RECODE_STRATEGY_UNDECIDED
, which is a constant defined in
the header file <recodext.h>. Other possible values are:
RECODE_SEQUENCE_IN_MEMORY
Keep intermediate recodings in memory.
RECODE_SEQUENCE_WITH_FILES
Do not fork, use intermediate files.
RECODE_SEQUENCE_WITH_PIPE
Fork processes connected with pipe(2)
.
The best for now is to leave this field alone, and let the recoding library decide its strategy, as many combinations have not been tested yet.
byte_order_mark
This field, which is preset to true
, indicates that a byte order
mark is to be expected at the beginning of any canonical UCS-2
or UTF-16
text, and that such a byte order mark should be also
produced for these charsets.
fail_level
This field, which is of type enum recode_error
(see Errors),
sets the error level at which task level functions should report a failure.
If an error being detected is equal to or greater than fail_level, the function will eventually return false instead of true. The preset value for this field is RECODE_NOT_CANONICAL, which means that unless reset to another value, the library will report failure on any error.
abort_level
This field, which is of type enum recode_error
(see Errors), sets
the error level at which task level functions should immediately interrupt
their processing. If an error being detected is equal to or greater than abort_level, the function returns immediately, but the returned value (true or false) is still decided from the setting of fail_level, not abort_level. The preset value for this field is RECODE_MAXIMUM_ERROR, which means that unless reset to another value, the library will never interrupt a recoding task.
error_so_far
This field, which is of type enum recode_error
(see Errors),
maintains the maximum error level met so far while the recoding task
was proceeding. The preset value is RECODE_NO_ERROR
.
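The interplay of fail_level, abort_level and error_so_far may be easier to follow as code. The following is a stand-alone model written for this explanation, not an excerpt from the library: the DEMO_ symbols are a local copy of the error levels in the order given in the Errors section, and struct demo_task, demo_task_init, demo_note_error and demo_task_succeeded are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Local copy of the error levels, in increasing order of severity.  */
enum demo_error
{
  DEMO_NO_ERROR,
  DEMO_NOT_CANONICAL,
  DEMO_AMBIGUOUS_OUTPUT,
  DEMO_UNTRANSLATABLE,
  DEMO_INVALID_INPUT,
  DEMO_SYSTEM_ERROR,
  DEMO_USER_ERROR,
  DEMO_INTERNAL_ERROR,
  DEMO_MAXIMUM_ERROR            /* sentinel, never reported */
};

struct demo_task
{
  enum demo_error fail_level;
  enum demo_error abort_level;
  enum demo_error error_so_far;
};

/* Preset values, as described for the three fields.  */
void
demo_task_init (struct demo_task *task)
{
  task->fail_level = DEMO_NOT_CANONICAL;
  task->abort_level = DEMO_MAXIMUM_ERROR;
  task->error_so_far = DEMO_NO_ERROR;
}

/* Record one error; return true when processing should stop at once.  */
bool
demo_note_error (struct demo_task *task, enum demo_error error)
{
  assert (error < DEMO_MAXIMUM_ERROR);
  if (error > task->error_so_far)
    task->error_so_far = error; /* only the most severe error is kept */
  return error >= task->abort_level;
}

/* Final status: failure once the worst error has reached fail_level.  */
bool
demo_task_succeeded (const struct demo_task *task)
{
  return task->error_so_far < task->fail_level;
}
```

With the preset values, any error makes the final status false but nothing interrupts the processing. Raising fail_level tolerates the milder errors, while lowering abort_level makes the more severe ones stop the work early.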
bool recode_perform_task (task);
recode_filter_open (task, file);
recode_filter_close (task);
The function recode_perform_task reads as much input as possible, and recodes all of it on the prescribed output, given a properly initialised task.
Functions recode_filter_open
and recode_filter_close
are
only planned for now. They are meant to read input in piecemeal ways.
Even if functionality already exists informally in the library, it has
not been made available yet through such interface functions.
Many functions are internal to the recoding library. Some of them
have been made external and available, for the recode
program
had to retain all its previous functionality while being transformed
into a mere application of the recoding library. These functions are
not really documented here for the time being, as we hope that many of
them will vanish over time. Once this set of routines stabilises, it will be convenient to document them as an API for handling charset names and contents.
RECODE_CHARSET find_charset (name, cleaning-type);
bool list_all_charsets (charset);
bool list_concise_charset (charset, list-format);
bool list_full_charset (charset);
The recode
program, while using the recode
library, needs to
control whether recoding problems are reported or not, and then reflect
these in the exit status. The program should also instruct the library
whether the recoding should be abruptly interrupted when an error is
met (so sparing processing when it is known in advance that a wrong
result would be discarded anyway), or if it should proceed nevertheless.
Here is how the library groups errors into levels, listed here in order
of increasing severity.
RECODE_NO_ERROR
No error was met on previous library calls.
RECODE_NOT_CANONICAL
The input text was using one of the many alternative codings for some
phenomenon, but not the one recode
would have canonically generated.
So, if the reverse recoding is later attempted, it would produce a text
having the same meaning as the original text, yet not being byte
identical.
For example, a Base64 block in which end-of-lines appear elsewhere than at every 76 characters is not canonical. An e-circumflex in TeX which is coded as ‘\^{e}’ instead of ‘\^e’ is not canonical.
RECODE_AMBIGUOUS_OUTPUT
It has been discovered that if the reverse recoding was attempted on the text output by this recoding, we would not obtain the original text, only because an ambiguity was generated by accident in the output text. This ambiguity would then cause the wrong interpretation to be taken.
Here are a few examples. If the Latin-1
sequence ‘e^’
is converted to Easy French and back, the result will be interpreted
as e-circumflex and so, will not reflect the intent of the original two
characters. Recoding an IBM-PC
text to Latin-1
and back,
where the input text contained an isolated LF, will have a spurious
CR inserted before the LF.
Currently, there are many cases in the library where the production of ambiguous output is not properly detected, as it is sometimes a difficult problem to accomplish this detection, or to do it speedily.
RECODE_UNTRANSLATABLE
One or more input characters could not be recoded, because there is just no representation for them in the output charset.
Here are a few examples. Non-strict mode often allows recode
to
compute on-the-fly mappings for unrepresentable characters, but strict
mode prohibits such attribution of reversible translations: so strict
mode might often trigger such an error. Most UCS-2
codes used to
represent Asian characters cannot be expressed in various Latin charsets.
RECODE_INVALID_INPUT
The input text does not comply with the coding it is declared to hold. So,
there is no way by which a reverse recoding would reproduce this text,
because recode
should never produce invalid output.
Here are a few examples. In strict mode, ASCII text is not allowed to contain characters with the eighth bit set. UTF-8 encodings ought to be minimal (7).
RECODE_SYSTEM_ERROR
The underlying system reported an error while the recoding was going on, likely an input/output error. (This error symbol is currently unused in the library.)
RECODE_USER_ERROR
The programmer or user requested something the recoding library is unable to provide, or used the API wrongly. (This error symbol is currently unused in the library.)
RECODE_INTERNAL_ERROR
Something really wrong, which should normally never happen, was detected within the recoding library. This might be due to genuine bugs in the library, or maybe due to un-initialised or overwritten arguments to the API. (This error symbol is currently unused in the library.)
RECODE_MAXIMUM_ERROR
This error code should never be returned, it is only internally used as a sentinel for the list of all possible error codes.
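To make the UTF-8 example given under RECODE_INVALID_INPUT concrete, here is a small stand-alone check for non-minimal (“overlong”) sequences. It is an illustration only: demo_utf8_is_overlong is a hypothetical helper, not a library function, and it tests minimality alone rather than full UTF-8 validity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true when the UTF-8 sequence starting at S, with LENGTH bytes
   available, uses more bytes than its value requires.  */
bool
demo_utf8_is_overlong (const unsigned char *s, size_t length)
{
  assert (s != NULL);

  if (length < 2)
    return false;               /* a single byte is never overlong */
  if (s[0] == 0xC0 || s[0] == 0xC1)
    return true;                /* two bytes for a value below 0x80 */
  if (s[0] == 0xE0 && (s[1] & 0xE0) == 0x80)
    return true;                /* three bytes for a value below 0x800 */
  if (s[0] == 0xF0 && (s[1] & 0xF0) == 0x80)
    return true;                /* four bytes for a value below 0x10000 */
  return false;
}
```

For instance, the two bytes 0xC0 0x80 encode the value zero, which fits in a single byte, so the helper reports them as overlong; the two bytes 0xC3 0xA9 (e-acute) are minimal.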
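One should be able to set the error level threshold for returning failure at end of recoding, and also the threshold for immediate interruption. If many errors occur while the recoding proceeds, none of them severe enough to interrupt the recoding, then the most severe error is retained, while others are forgotten (8). So, in case of an error, the possible actions currently are:
• do nothing and let go, returning success at the end of the recoding (the error is below fail_level);
• let go for now, but return failure at the end of the recoding (the error reaches fail_level but not abort_level);
• interrupt the recoding right away (the error reaches abort_level), the returned value still being decided by fail_level.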
See Task level, and particularly the description of the fields
fail_level
, abort_level
and error_so_far
, for more
information about how errors are handled.
(7) The minimality of a UTF-8 encoding is guaranteed on output, but currently, it is not checked on input.
(8) Another approach would have been to define the level symbols as masks instead, to give masks to threshold setting routines, and to retain all errors; yet I never met such a need myself in practice, and so I fear it would be overkill. On the other hand, it might be interesting to maintain counters about how many times each kind of error occurred.