Colin Watson's blog
Encodings in man-db
I've spent some quality upstream time lately with man-db. Specifically,
I've been upgrading its locale support. I recently published a pre-release,
man-db 2.5.0-pre2, mainly for translators, but other people may be
interested in having a look at it as well. I hope to release 2.5.0 quite
soon so that all of this can land in Debian.
Firstly, man-db now supports creating and using databases for per-locale
hierarchies of manual pages, not just English. This means that
apropos and whatis can now display information about localised manual
pages.
Secondly, I've been working on the transition to UTF-8 manual pages.
Modulo some hacks, groff can't yet deal with Unicode input: some possible
input characters are reserved for its internal use, which makes full 32-bit
input difficult to do properly until that's fixed. However, with a few
exceptions, manual pages generally just need the subset of Unicode that
corresponds to their language's usual legacy character set. For now, then,
it's good enough to recode on the fly from UTF-8 to some appropriate 8-bit
character set and use groff's support for that.
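To make the recoding idea concrete, here is a minimal sketch in Python. It is not man-db's actual code (man-db is written in C), and the language-to-charset table, the function name and the choice to replace unrepresentable characters with '?' are all assumptions made for illustration.

```python
# Hypothetical table mapping a manual page language to a legacy 8-bit
# charset. Illustrative only; this is not man-db's real table.
LEGACY_CHARSET = {
    "de": "iso-8859-1",
    "pl": "iso-8859-2",
    "ru": "koi8-r",
}


def recode_for_groff(page_bytes: bytes, lang: str) -> bytes:
    """Recode a UTF-8 page into the legacy charset assumed for `lang`."""
    # Assumption: unknown languages fall back to ISO-8859-1.
    charset = LEGACY_CHARSET.get(lang, "iso-8859-1")
    text = page_bytes.decode("utf-8")
    # Characters outside the target charset cannot be represented;
    # this sketch substitutes '?' rather than failing.
    return text.encode(charset, errors="replace")
```

The recoded bytes would then be handed to groff with the matching 8-bit input encoding selected. A page that needs characters outside its language's legacy set is one of the "few exceptions" mentioned above, and it loses those characters in this scheme.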
man-db has actually supported doing this kind of thing for a while, but
it's been difficult to use since it only applies to
/usr/share/man/ll_CC.UTF-8/ directories, while manual pages
usually aren't country-specific. So, man-db 2.5.0 supports using
/usr/share/man/ll.UTF-8/ instead, which is a bit more
appropriate. Also, following a discussion with Adam Borowski, man-db can
now try decoding manual pages
as UTF-8 and fall back to 8-bit encodings even in directories without an
explicit encoding tag; if this fails for some reason, you can put a
'\" -*- coding: UTF-8 -*- line at the top of the page.
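The guessing logic can be sketched roughly as follows. Again, this is an illustrative Python sketch rather than man-db's implementation: the function name, the regular expression and the ISO-8859-1 default are assumptions, and the real code may check things in a different order.

```python
import re

# Matches a first line of the form:  '\" -*- coding: UTF-8 -*-
# The exact pattern is an assumption made for this sketch.
CODING_TAG = re.compile(
    rb"""^'\\"\s*-\*-\s*coding:\s*([A-Za-z0-9._-]+)\s*-\*-"""
)


def guess_page_encoding(page_bytes: bytes, legacy: str = "iso-8859-1") -> str:
    """Guess a page's encoding: explicit tag, then UTF-8, then legacy."""
    first_line = page_bytes.split(b"\n", 1)[0]
    match = CODING_TAG.match(first_line)
    if match:
        # An explicit declaration overrides any guesswork.
        return match.group(1).decode("ascii")
    try:
        # Strict decoding: any invalid byte sequence raises an error.
        page_bytes.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume the language's legacy charset.
        return legacy
```

This works because text in most legacy 8-bit encodings that contains any non-ASCII characters is very unlikely to be valid UTF-8 by accident, and pure ASCII pages decode identically either way.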
I'm still debating whether Debian policy should recommend installing
UTF-8 manual pages in
/usr/share/man/ll.UTF-8/ or just in
/usr/share/man/ll/. Initially I was very strongly in favour of
an encoding declaration, but now that man-db can do a pretty good job of
guesswork I'm coming round to Adam Borowski's position that people should be
able to forget about character sets with UTF-8. Opinions here would be
welcome. One thing I haven't moved on is that any design that assumes that
the encoding of manual pages on the filesystem has anything to do with the
user's locale is demonstrably incorrect and broken; I'm not going to use
LC_CTYPE for anything except output. However, maybe "UTF-8 or
the usual legacy encoding provided that the latter is not typically confused
for the former" is a good enough specification, and that still has the
desirable property of not requiring a flag day. I'll try to come down from
the fence before unleashing this code on the world.