Encodings in man-db
I’ve spent some quality upstream time lately with man-db. Specifically, I’ve been upgrading its locale support. I recently published a pre-release, man-db 2.5.0-pre2, mainly for translators, but other people may be interested in having a look at it as well. I hope to release 2.5.0 quite soon so that all of this can land in Debian.
Firstly, man-db now supports creating and using databases for per-locale hierarchies of manual pages, not just English. This means that apropos and whatis can now display information about localised manual pages.
Secondly, I’ve been working on the transition to UTF-8 manual pages. Modulo some hacks, groff can’t yet deal with Unicode input: it reserves some possible input characters for its internal use, which makes full 32-bit input difficult to do properly until that’s fixed. However, with a few exceptions, manual pages generally just need the subset of Unicode that corresponds to their language’s usual legacy character set. For now, then, it’s good enough to recode on the fly from UTF-8 to some appropriate 8-bit character set and use groff’s support for that.
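To make the recoding step concrete, here is a minimal Python sketch. It is not man-db’s code (man-db is written in C); the function name recode_for_groff and the language-to-charset table are assumptions made up for illustration, and the real choice of charset may well differ.

```python
# Hypothetical mapping from language code to a legacy 8-bit charset.
# Illustrative only; this is not man-db's real table.
LEGACY_CHARSETS = {
    "de": "iso-8859-1",
    "pl": "iso-8859-2",
    "ru": "koi8-r",
}


def recode_for_groff(utf8_bytes: bytes, lang: str) -> bytes:
    """Recode a UTF-8 page to the 8-bit charset assumed for `lang`."""
    # Assumption: unknown languages fall back to ISO-8859-1.
    charset = LEGACY_CHARSETS.get(lang, "iso-8859-1")
    text = utf8_bytes.decode("utf-8")
    # Characters outside the target charset cannot be represented;
    # in this sketch they are replaced with "?".
    return text.encode(charset, errors="replace")
```

For example, recode_for_groff("Größe".encode("utf-8"), "de") yields the ISO-8859-1 bytes for that word, while a character with no ISO-8859-1 equivalent is replaced rather than passed through.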
man-db has actually supported doing this kind of thing for a while, but it’s
been difficult to use since it only applies to /usr/share/man/ll_CC.UTF-8/
directories, while manual pages usually aren’t country-specific. So, man-db
2.5.0 supports using /usr/share/man/ll.UTF-8/ instead, which is a bit more
appropriate. Also, following a discussion with Adam Borowski,
man-db can now try decoding manual pages as UTF-8 and fall back to 8-bit
encodings even in directories without an explicit encoding tag; if this
fails for some reason, you can put a '\" -*- coding: UTF-8 -*- line at the
top of the page.
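The try-UTF-8-then-fall-back idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not man-db’s implementation: the function name decode_page, the default legacy charset, and the choice to check the coding line before guessing are all mine.

```python
import re

# Matches a first-line declaration such as: '\" -*- coding: UTF-8 -*-
CODING_RE = re.compile(rb"""^'\\"\s*-\*-\s*coding:\s*([\w.-]+)\s*-\*-""")


def decode_page(raw: bytes, legacy: str = "iso-8859-1"):
    """Return (text, encoding_used) for the raw bytes of a manual page."""
    first_line = raw.split(b"\n", 1)[0]
    match = CODING_RE.match(first_line)
    if match:
        # An explicit declaration wins (an assumption of this sketch).
        encoding = match.group(1).decode("ascii")
        return raw.decode(encoding), encoding
    try:
        # Strict decoding: any invalid byte sequence raises an error.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume the language's legacy charset.
        return raw.decode(legacy), legacy
```

The fallback works because strict UTF-8 decoding rejects most text written in a legacy 8-bit encoding, so a successful decode is a fairly strong hint that the page really is UTF-8.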
I’m still debating whether Debian policy should recommend installing UTF-8
manual pages in /usr/share/man/ll.UTF-8/ or just in /usr/share/man/ll/.
Initially I was very strongly in favour of an encoding declaration, but now
that man-db can do a pretty good job of guesswork I’m coming round to Adam
Borowski’s position that people should be able to forget about character
sets with UTF-8. Opinions here would be welcome. One thing I haven’t moved
on is that any design that assumes that the encoding of manual pages on the
filesystem has anything to do with the user’s locale is demonstrably
incorrect and broken; I’m not going to use LC_CTYPE for anything except
output. However, maybe “UTF-8 or the usual legacy encoding provided that
the latter is not typically confused for the former” is a good enough
specification, and that still has the desirable property of not requiring a
flag day. I’ll try to come down from the fence before unleashing this code
on the world.
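The “not typically confused for the former” clause can be illustrated with a small Python check. This is only a sketch; the helper name is_valid_utf8 and the sample strings are my own.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Report whether `data` decodes cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Typical ISO-8859-1 text: accented letters are isolated high bytes,
# which do not form valid UTF-8 sequences.
print(is_valid_utf8("Größe".encode("iso-8859-1")))  # False

# Contrived ISO-8859-1 text: the bytes 0xC3 0xA9 are also UTF-8 for "é",
# so this input would be misread as UTF-8.
print(is_valid_utf8("Ã©".encode("iso-8859-1")))  # True
```

Ordinary legacy-encoded prose falls into the first case; the second needs an unusual character pair, which is why guessing tends to work in practice.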