PuTTY semi-bug dbcs-breakage

summary: Double-byte character set (CJK, &c) display is broken
class: semi-bug: This might or might not be a bug, depending on your precise definition of what a bug is.
difficulty: tricky: Needs many tuits.
priority: historic: This is an old bug report that we think is either fixed without noticing, or confined to old systems, or too vague.

We've had a report that Korean display got a lot worse in the snapshots between r5002 and r5003 <000001c513c9$e63ae1e0$aa000059@ktd>. Quoth Simon:

Looking at the code, I think I can see why this is happening. This is to do with RDB's idea that when the user selects `use font encoding' and a font with a DBCS encoding, the terminal code should simply store the individual bytes in individual character cells and rely on do_text() being passed a string of these so that TextOut() can reconstitute pairs of DBCS bytes into double-width characters.
As far as I can tell, terminal.c does not mark the first byte of a DBCS character stored in this way. Therefore, the mechanism is fundamentally dependent on a do_text() run happening to begin at the correct point mod 2! Hence the comment in the mail referenced above, which said that there was already some breakage when the cursor moved over a double-byte character - the half-character under the cursor cannot be properly redrawn. Owing to font-overflow, though, when you move the cursor over a double-byte character we now redraw a lot of text to the right of that as well, and if the cursor is on the first half of the character then this is bound to be incorrect mod 2; so the problem shows up a lot more readily. I'd bet that the same breakage could have been seen in previous versions if the window was covered and re-exposed when the cursor was in a problem position.
A real fix for this would involve implementing proper DBCS support, by detecting DBCS lead bytes in the terminal.c input data stream and storing both bytes in the same character cell using the existing UCSWIDE mechanism. I have occasionally wondered about doing this: I envisage that we would co-opt the top half of the unsigned long space (never used by any flavour of Unicode/UCS ever) to provide more than enough fake character encodings for the purpose.
Of course, if we were going to support DBCSes in terminal.c it would also be good to be able to support them properly, by translating them to Unicode on input.
Summary: I think this has always been broken, and now it's merely more obviously broken. I regret the effect on CJK users who had found the previous behaviour worked just about well enough, but I don't think a hurried fix is in the general interest.
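
To illustrate the analysis quoted above, here is a minimal sketch in C. It is not PuTTY's code: the names `is_lead_byte`, `pair_from`, `store_packed` and `DBCS_FLAG` are hypothetical, the lead-byte range 0x81..0xFE is an assumption (real code would ask the code page, for instance via the Win32 `IsDBCSLeadByteEx` call), and the byte values in the usage note are chosen purely for illustration. `pair_from` mimics a renderer that pairs bytes starting from wherever a redraw run happens to begin, which is the byte-per-cell scheme described above. `store_packed` sketches the proposed alternative, in which lead bytes are detected on input and both bytes are stored in a single cell value taken from the top half of a 32-bit range.

```c
#include <stddef.h>

/* Hypothetical marker bit: values with this bit set lie in the top half
 * of a 32-bit range, which no Unicode code point occupies. */
#define DBCS_FLAG 0x80000000UL

/* Assumption: lead bytes lie in 0x81..0xFE. The true range depends on
 * the code page in use. */
static int is_lead_byte(unsigned char c)
{
    return c >= 0x81 && c <= 0xFE;
}

/* Byte-per-cell scheme: pair bytes starting at cell `start`, as a
 * renderer handed a run of raw bytes would. Records the cell index at
 * which each pair begins and returns the number of pairs found. */
static size_t pair_from(const unsigned char *cells, size_t len,
                        size_t start, size_t *pair_starts)
{
    size_t n = 0, i = start;
    while (i < len) {
        if (is_lead_byte(cells[i]) && i + 1 < len) {
            pair_starts[n++] = i;
            i += 2;             /* consume lead and trail byte */
        } else {
            i++;                /* single-byte character */
        }
    }
    return n;
}

/* Proposed scheme: detect lead bytes in the input stream and pack both
 * bytes of a double-byte character into one cell, flagged so it cannot
 * be mistaken for a Unicode value. Returns the number of cells written. */
static size_t store_packed(const unsigned char *in, size_t len,
                           unsigned long *cells)
{
    size_t n = 0, i = 0;
    while (i < len) {
        if (is_lead_byte(in[i]) && i + 1 < len) {
            cells[n++] = DBCS_FLAG
                       | ((unsigned long)in[i] << 8)
                       | (unsigned long)in[i + 1];
            i += 2;
        } else {
            cells[n++] = in[i++];
        }
    }
    return n;
}
```

For the six bytes 'A', 0xC7, 0xD1, 0xB1, 0xDB, 'B', calling `pair_from` with `start` 0 finds pairs beginning at cells 1 and 3, which is the intended grouping. With `start` 2 (a redraw beginning on the second half of the first double-byte character) it finds pairs beginning at cells 2 and 4, so every character to the right is drawn wrongly. `store_packed` turns the same input into four cells, and a redraw can then begin at any cell without ambiguity.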

UTF-8 mode should work reasonably well, so the workaround is to use UTF-8 where possible (perhaps by running the session through something such as luit or screen).

SGT, 2024-11-17: classifying this bug as historic. We've had no actual complaints about this for a long time, and these days UTF-8 has mostly taken over from the Windows-native DBCS strategy.



(last revision of this bug record was at 2024-11-17 14:53:03 +0000)