Thread: psql display of Unicode combining characters in 8.2
psql's display of Unicode combining characters appears to have changed
in 8.2.  For example, I'd expect <U+006E LATIN SMALL LETTER N, U+0303
COMBINING TILDE> to display the same as the precomposed <U+00F1 LATIN
SMALL LETTER N WITH TILDE>.  With 8.1's psql they do, but with 8.2's
psql this sequence displays as:

SELECT E'n\314\203';  -- \314\203 = UTF-8 encoding of U+0303
 ?column? 
----------
 n\u0303
(1 row)

(I'm testing with both server and client using UTF-8.)

This excerpt from pg_wcsformat() in mbprint.c looks responsible:

    else if (w <= 0)            /* Non-ascii control char */
    {
        if (encoding == PG_UTF8)
            sprintf((char *) ptr, "\\u%04X", utf2ucs(pwcs));

This might be the relevant commit:

http://archives.postgresql.org/pgsql-committers/2006-02/msg00089.php

Should the code distinguish between combining characters and zero-width
control characters so the former display correctly?

-- 
Michael Fuhr
On Sat, Dec 09, 2006 at 10:50:05PM -0700, Michael Fuhr wrote:
> psql's display of Unicode combining characters appears to have
> changed in 8.2.

I forgot to mention that this change is in aligned output; unaligned
output prints sequences with combining characters as I'd expect:

test=> SELECT E'n\314\203';
 ?column? 
----------
 n\u0303
(1 row)

test=> \a
Output format is unaligned.
test=> SELECT E'n\314\203';
?column?
ñ
(1 row)

-- 
Michael Fuhr
On Sat, Dec 09, 2006 at 10:50:05PM -0700, Michael Fuhr wrote:
> Should the code distinguish between combining characters and
> zero-width control characters so the former display correctly?

Probably, any idea how to tell the difference?

Have a nice day,
-- 
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability
> to litigate.
Martijn van Oosterhout <kleptog@svana.org> writes:
> On Sat, Dec 09, 2006 at 10:50:05PM -0700, Michael Fuhr wrote:
>> Should the code distinguish between combining characters and
>> zero-width control characters so the former display correctly?

> Probably, any idea how to tell the difference?

I'm no expert, but isn't there a specific range of Unicode code points
defined for combining characters?

			regards, tom lane
On Sun, Dec 10, 2006 at 12:30:12PM -0500, Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
>> On Sat, Dec 09, 2006 at 10:50:05PM -0700, Michael Fuhr wrote:
>>> Should the code distinguish between combining characters and
>>> zero-width control characters so the former display correctly?
>>
>> Probably, any idea how to tell the difference?
>
> I'm no expert, but isn't there a specific range of Unicode code points
> defined for combining characters?

Yes, several, with others scattered about.  Could we use the general
category (Mn = Mark, nonspacing; Me = Mark, enclosing)?  ucs_wcwidth()
in src/backend/utils/mb/wchar.c already contains some of that knowledge,
doesn't it?  The combining[] list looks incomplete but otherwise close
to what we'd need.

-- 
Michael Fuhr
Michael Fuhr <mike@fuhr.org> writes:
> On Sun, Dec 10, 2006 at 12:30:12PM -0500, Tom Lane wrote:
>> Martijn van Oosterhout <kleptog@svana.org> writes:
>>> On Sat, Dec 09, 2006 at 10:50:05PM -0700, Michael Fuhr wrote:
>>>> Should the code distinguish between combining characters and
>>>> zero-width control characters so the former display correctly?

>>> Probably, any idea how to tell the difference?

>> I'm no expert, but isn't there a specific range of Unicode code points
>> defined for combining characters?

> Yes, several, with others scattered about.

What about the other way around: use the \u output convention only for
things we can specifically identify as control chars, and assume that
anything else with zero width is a combining char?  Is there anything
other than 0-31 and 128-159 that should really get the \u treatment?

			regards, tom lane
I wrote:
> What about the other way around: use the \u output convention only for
> things we can specifically identify as control chars, and assume that
> anything else with zero width is a combining char?  Is there anything
> other than 0-31 and 128-159 that should really get the \u treatment?

Actually, looking at the comments for ucs_wcwidth() in wchar.c, it seems
that this is already accounted for in the "dsplen" output: characters
for which -1 is returned are control characters, characters for which
0 is returned should be printed as-is and counted as zero width.  So the
bug is just that pg_wcsformat conflates the two cases.

			regards, tom lane
I wrote:
> Actually, looking at the comments for ucs_wcwidth() in wchar.c, it seems
> that this is already accounted for in the "dsplen" output: characters
> for which -1 is returned are control characters, characters for which
> 0 is returned should be printed as-is and counted as zero width.  So the
> bug is just that pg_wcsformat conflates the two cases.

I've applied the attached patch to fix this, but not being much of a
user of languages that have combining characters, I can't test it very
well.  Please check out the behavior and see if you like it.

			regards, tom lane
Attachment
On Wed, Dec 27, 2006 at 02:49:41PM -0500, Tom Lane wrote:
> I've applied the attached patch to fix this, but not being much of a
> user of languages that have combining characters, I can't test it very
> well.  Please check out the behavior and see if you like it.

Looks good so far.  I've tested languages like Vietnamese (Latin script
with lots of diacritics), polytonic Greek, and pointed Hebrew, with text
normalized to both NFC and NFD.  Before the patch the NFD text had lots
of \u escapes; after the patch it looks identical to the NFC text aside
from a few minor differences in the rendered glyphs, which tells me that
I am indeed receiving the decomposed sequences.  Thanks!

-- 
Michael Fuhr