
9 Some IBM or Microsoft charsets

The recode program provides various IBM or Microsoft code pages (see Tabular). An easy way to list them all from recode itself is the command:

recode -l | egrep -i '(CP|IBM)[0-9]'

See also the few special charsets presented in the following sections.



9.1 EBCDIC code

This charset is IBM’s Extended Binary Coded Decimal Interchange Code. It is an eight-bit code. The following three variants were implemented in recode independently of RFC 1345:

EBCDIC

In recode, the us..ebcdic conversion is identical to the ‘dd conv=ebcdic’ conversion, and the ebcdic..us conversion is identical to the ‘dd conv=ascii’ conversion. This charset also represents the way Control Data Corporation relates EBCDIC to 8-bit ASCII.

EBCDIC-CCC

In recode, the us..ebcdic-ccc or ebcdic-ccc..us conversions represent the way Concurrent Computer Corporation (formerly Perkin Elmer) relates EBCDIC to 8-bit ASCII.

EBCDIC-IBM

In recode, the us..ebcdic-ibm conversion is almost identical to the GNU ‘dd conv=ibm’ conversion. Given the exact ‘dd conv=ibm’ conversion table, recode once said:

Codes  91 and 213 both recode to 173
Codes  93 and 229 both recode to 189
No character recodes to  74
No character recodes to 106

So I arbitrarily chose to recode 213 by 74 and 229 by 106. This makes the EBCDIC-IBM recoding reversible, but this is not necessarily the best correction. In any case, I think that GNU dd should be amended. dd and recode should ideally agree on the same correction. So, this table might change once again.
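The kind of collision described for EBCDIC-IBM can be looked for mechanically. The following is only a sketch, not the method recode itself uses: the helper names find_collisions and dd_ibm_table are made up for this illustration, and the script assumes a dd that supports ‘conv=ibm’ plus the usual od, tr, grep, awk and sort utilities. Whether your dd still shows the collisions listed above depends on its version.

```shell
# Print every output code that more than one input code maps to.
# Input: one "input_code output_code" pair per line, in decimal.
find_collisions() {
  awk '{ seen[$2] = seen[$2] " " $1; count[$2]++ }
       END { for (out in count) if (count[out] > 1)
               printf "Codes%s all recode to %s\n", seen[out], out }' | sort
}

# Dump the 256-entry table applied by "dd conv=ibm" as "input output" pairs.
dd_ibm_table() {
  i=0
  while [ "$i" -lt 256 ]; do
    # Emit the single byte whose value is $i, using an octal escape.
    printf "\\$(printf '%03o' "$i")"
    i=$((i + 1))
  done | dd conv=ibm 2>/dev/null | od -An -v -tu1 |
    tr -s ' ' '\n' | grep . | awk '{ print NR - 1, $1 }'
}

dd_ibm_table | find_collisions
```

An empty result means the table is reversible as far as this check can tell. The same find_collisions filter can be fed any other table written as decimal pairs.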

RFC 1345 brings into recode 15 other EBCDIC charsets, and 21 other charsets having EBCDIC in at least one of their alias names. You can get a list of all these by executing:

recode -l | grep -i ebcdic

Note that recode may convert a pure stream of EBCDIC characters, but it does not know how to handle the binary data that is sometimes placed between records to delimit them and build physical blocks. If ends of lines are not marked, a fixed record size may produce something readable, but VB or VBS blocking is likely to yield some garbage in the converted results.
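For the fixed record size case, the record boundaries have to be restored by some other tool. As a rough sketch, here is how dd (used here only because this section already compares recode against it) can rebuild lines from fixed-size records; the 4-byte record size and the sample data are assumptions chosen to keep the example small. Nothing here handles VB or VBS blocking.

```shell
# Hypothetical sample: two text lines packed as 4-byte, space-padded EBCDIC
# records with no line terminators (built here with dd itself).
packed=$(mktemp)
printf 'AB\nCDE\n' | dd conv=ebcdic,block cbs=4 of="$packed" 2>/dev/null

# Raw bytes of the packed file, for inspection.
od -An -v -tx1 "$packed"

# Convert back to ASCII and turn each 4-byte record into a line:
# unblock strips trailing spaces from each record and appends a newline.
decoded=$(dd if="$packed" conv=ascii,unblock cbs=4 2>/dev/null)
printf '%s\n' "$decoded"

rm -f "$packed"
```

With real data, cbs must match the actual record length of the file, which has to be known in advance.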



9.2 IBM’s PC code

This charset is available in recode under the name IBM-PC, with dos, MSDOS and pc as acceptable aliases. The shortest way of specifying it in recode is pc.

The charset is aimed at IBM PC microcomputers and compatibles. It is an eight-bit code. This charset is fairly old in recode; its tables were produced a long while ago by mere inspection of a printed chart of the IBM-PC codes and glyphs.

It has CR-LF as its implied surface. This means that, if the original end of lines have to be preserved while going out of IBM-PC, they should currently be added back through the usage of a surface on the other charset, or better, just never removed. Here are examples for both cases:

recode pc..l2/cl < input > output
recode pc/..l2 < input > output
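To see what each form does to the line endings, one can count the carriage returns before and after conversion. This is a small sketch under assumptions: count_cr is a helper invented for this illustration, the sample text is made up, and the recode calls only run when recode is found on the PATH.

```shell
# Count carriage-return bytes on standard input.
count_cr() {
  tr -cd '\r' | wc -c | tr -d ' '
}

# Hypothetical sample: two CR-LF terminated lines, as a DOS file would hold.
sample=$(mktemp)
printf 'one\r\ntwo\r\n' > "$sample"

echo "CR bytes in the input: $(count_cr < "$sample")"

if command -v recode >/dev/null 2>&1; then
  # Plain conversion: the implied CR-LF surface is removed on the way out.
  echo "pc..l2:    $(recode pc..l2 < "$sample" | count_cr)"
  # The two forms shown above, which keep the original end of lines.
  echo "pc..l2/cl: $(recode pc..l2/cl < "$sample" | count_cr)"
  echo "pc/..l2:   $(recode pc/..l2 < "$sample" | count_cr)"
fi

rm -f "$sample"
```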

RFC 1345 brings into recode 44 ‘IBM’ charsets or code pages, and also 8 other code pages. You can get a list of all of these by executing (11):

recode -l | egrep -i '(CP|IBM)[0-9]'

All charsets or aliases beginning with the letters ‘CP’ or ‘IBM’ also have CR-LF as their implied surface. The same is true for a purely numeric alias in the same family. For example, all of 819, CP819 and IBM819 imply CR-LF as a surface. Note that ISO-8859-1 does not imply a surface, even though it shares the same tabular data as 819.

There are a few discrepancies between this IBM-PC charset and the very similar RFC 1345 charset ibm437, which have not been analysed yet, so the charsets are being kept separate for now. This might change in the future, and the IBM-PC charset might disappear. Wizards would be interested in comparing the output of these two commands:

recode -vh IBM-PC..Latin-1
recode -vh IBM437..Latin-1

The first command uses the charset that predates the introduction of RFC 1345. The two commands give different recodings. These differences are annoying; the fuzziness will have to be explained and settled one day.
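One possible way to make that comparison is to save both tables and run diff on them. The sketch below assumes recode is installed and that ‘-h’ writes the generated table to standard output; compare_tables is a helper name invented here. Lines that merely name the charsets may differ as well, so only the table entries are of real interest.

```shell
# Write the table for each charset to a temporary file, then diff the two.
compare_tables() {
  a=$(mktemp) && b=$(mktemp) || return 2
  recode -vh "$1..Latin-1" < /dev/null > "$a" 2>/dev/null
  recode -vh "$2..Latin-1" < /dev/null > "$b" 2>/dev/null
  diff "$a" "$b"
  status=$?
  rm -f "$a" "$b"
  return "$status"
}

if command -v recode >/dev/null 2>&1; then
  # diff exits with 1 when the tables differ; that is expected here.
  compare_tables IBM-PC IBM437 || true
else
  echo 'recode is not installed; nothing to compare' >&2
fi
```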



9.3 Unisys’ Icon code

This charset is available in recode under the name Icon-QNX, with QNX as an acceptable alias.

The file uses Unisys’ Icon way of representing diacritics with code 25 escape sequences, under the QNX system. This is a seven-bit code, even if eight-bit codes can flow through as part of the IBM-PC charset.


Footnotes

(11)

On DOS/Windows, stock shells do not know that apostrophes quote special characters like |, so one needs to use double quotes instead of apostrophes.

