Encodings in man-db

I’ve spent some quality upstream time lately with man-db. Specifically, I’ve been upgrading its locale support. I recently published a pre-release, man-db 2.5.0-pre2, mainly for translators, but other people may be interested in having a look at it as well. I hope to release 2.5.0 quite soon so that all of this can land in Debian.

Firstly, man-db now supports creating and using databases for per-locale hierarchies of manual pages, not just English. This means that apropos and whatis can now display information about localised manual pages.

Secondly, I’ve been working on the transition to UTF-8 manual pages. Now, modulo some hacks, groff can’t yet deal with Unicode input: some possible input characters are reserved for its internal use, which makes full 32-bit input difficult to do properly until that’s fixed. However, with a few exceptions, manual pages generally just need the subset of Unicode that corresponds to their language’s usual legacy character set, so for now it’s good enough to recode on the fly from UTF-8 to some appropriate 8-bit character set and use groff’s support for that.
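To make the recoding step concrete, here is a minimal Python sketch of the idea. It is not man-db's code: the LEGACY_CHARSET table, its entries, and the recode_for_groff function are all hypothetical names chosen for illustration, and the fallback to ISO-8859-1 for unlisted languages is an assumption.

```python
# Hypothetical table mapping a language code to its usual legacy 8-bit
# character set (Python codec names). The entries are illustrative only.
LEGACY_CHARSET = {
    "de": "iso-8859-1",
    "fr": "iso-8859-1",
    "pl": "iso-8859-2",
    "ru": "koi8-r",
}


def recode_for_groff(page_bytes: bytes, lang: str) -> bytes:
    """Recode a UTF-8 manual page to the legacy charset assumed for lang."""
    # Assumption: fall back to ISO-8859-1 when the language is not listed.
    charset = LEGACY_CHARSET.get(lang, "iso-8859-1")
    text = page_bytes.decode("utf-8")
    # Characters outside the target charset cannot be represented in
    # 8 bits; this sketch simply replaces each of them with "?".
    return text.encode(charset, errors="replace")
```

For example, a German page containing "Ü" comes out as the single Latin-1 byte 0xDC, which an 8-bit-only formatter can then handle.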

man-db has actually supported doing this kind of thing for a while, but it’s been difficult to use since it only applies to /usr/share/man/ll_CC.UTF-8/ directories, while manual pages usually aren’t country-specific. So, man-db 2.5.0 supports using /usr/share/man/ll.UTF-8/ instead, which is a bit more appropriate. Also, following a discussion with Adam Borowski, man-db can now try decoding manual pages as UTF-8 and fall back to 8-bit encodings even in directories without an explicit encoding tag; if this guesswork fails for some reason, you can declare the encoding explicitly by putting a line reading '\" -*- coding: UTF-8 -*- at the top of the page.
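The guessing strategy described above can be sketched in a few lines of Python. Again, this is only an illustration under stated assumptions, not man-db's implementation: decode_page, ROFF_COMMENT, CODING_RE, and the default legacy charset are hypothetical, and the pattern used to recognise the coding line is a guess at a reasonable syntax rather than the exact rule man-db applies.

```python
import re

# The three bytes that open a roff comment line: apostrophe, backslash, quote.
ROFF_COMMENT = b"'\\\""

# Hypothetical pattern for an Emacs-style "-*- coding: NAME -*-" tag.
CODING_RE = re.compile(rb"-\*-.*?coding:\s*([A-Za-z0-9._-]+)")


def decode_page(data: bytes, legacy_charset: str = "iso-8859-1"):
    """Return (text, encoding_used) for the raw bytes of a manual page."""
    first_line = data.split(b"\n", 1)[0]
    if first_line.startswith(ROFF_COMMENT):
        match = CODING_RE.search(first_line)
        if match:
            # An explicit declaration wins over any guesswork. An unknown
            # codec name raises LookupError; the sketch does not handle it.
            encoding = match.group(1).decode("ascii")
            return data.decode(encoding), encoding
    try:
        # Strict decoding: any byte sequence that is not valid UTF-8 fails.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Fall back to the legacy 8-bit charset assumed for this hierarchy.
        return data.decode(legacy_charset), legacy_charset
```

The fallback works because text in a typical legacy 8-bit encoding rarely happens to be valid UTF-8: the Latin-1 bytes for "naïve", for instance, contain 0xEF followed by an ASCII letter, which a strict UTF-8 decoder rejects.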

I’m still debating whether Debian policy should recommend installing UTF-8 manual pages in /usr/share/man/ll.UTF-8/ or just in /usr/share/man/ll/. Initially I was very strongly in favour of an encoding declaration, but now that man-db can do a pretty good job of guesswork I’m coming round to Adam Borowski’s position that people should be able to forget about character sets with UTF-8. Opinions here would be welcome. One thing I haven’t moved on is that any design that assumes that the encoding of manual pages on the filesystem has anything to do with the user’s locale is demonstrably incorrect and broken; I’m not going to use LC_CTYPE for anything except output. However, maybe “UTF-8 or the usual legacy encoding provided that the latter is not typically confused for the former” is a good enough specification, and that still has the desirable property of not requiring a flag day. I’ll try to come down from the fence before unleashing this code on the world.