Thread: psql display of Unicode combining characters in 8.2

psql display of Unicode combining characters in 8.2

From
Michael Fuhr
Date:
psql's display of Unicode combining characters appears to have
changed in 8.2.  For example, I'd expect <U+006E LATIN SMALL LETTER N,
U+0303 COMBINING TILDE> to display the same as the precomposed
<U+00F1 LATIN SMALL LETTER N WITH TILDE>.  With 8.1's psql they do,
but with 8.2's psql this sequence displays as:

SELECT E'n\314\203';  -- \314\203 = UTF-8 encoding of U+0303
 ?column?
----------
 n\u0303
(1 row)

(I'm testing with both server and client using UTF-8.)
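
For readers checking the octal escape above, here is a minimal sketch in C of how U+0303 ends up as the two bytes \314\203 in UTF-8. The helper name utf8_encode is hypothetical and is not taken from the PostgreSQL source; it only illustrates the standard UTF-8 bit layout.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the PostgreSQL tree): encode one Unicode
 * code point as UTF-8 into buf, which must have room for 4 bytes.
 * Returns the number of bytes written, or 0 if cp cannot be encoded.
 */
size_t
utf8_encode(uint32_t cp, unsigned char *buf)
{
    if (cp < 0x80)
    {
        buf[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800)
    {
        /* 110xxxxx 10xxxxxx: U+0303 lands here as 0xCC 0x83 (octal 314 203) */
        buf[0] = (unsigned char) (0xC0 | (cp >> 6));
        buf[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* surrogates are not valid scalar values */
    if (cp < 0x10000)
    {
        buf[0] = (unsigned char) (0xE0 | (cp >> 12));
        buf[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF)
    {
        buf[0] = (unsigned char) (0xF0 | (cp >> 18));
        buf[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                   /* beyond the Unicode range */
}
```

Calling utf8_encode(0x0303, buf) fills buf with 0xCC 0x83, which is \314\203 in octal; the precomposed U+00F1 encodes as 0xC3 0xB1.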

This excerpt from pg_wcsformat() in mbprint.c looks responsible:
    else if (w <= 0)        /* Non-ascii control char */
    {
        if (encoding == PG_UTF8)
            sprintf((char *) ptr, "\\u%04X", utf2ucs(pwcs));
 

This might be the relevant commit:

http://archives.postgresql.org/pgsql-committers/2006-02/msg00089.php

Should the code distinguish between combining characters and
zero-width control characters so the former display correctly?

-- 
Michael Fuhr


Re: psql display of Unicode combining characters in 8.2

From
Michael Fuhr
Date:
On Sat, Dec 09, 2006 at 10:50:05PM -0700, Michael Fuhr wrote:
> psql's display of Unicode combining characters appears to have
> changed in 8.2.

I forgot to mention that this change is in aligned output; unaligned
output prints sequences with combining characters as I'd expect:

test=> SELECT E'n\314\203';
 ?column?
----------
 n\u0303
(1 row)

test=> \a
Output format is unaligned.
test=> SELECT E'n\314\203';
?column?
ñ
(1 row)

-- 
Michael Fuhr


Re: psql display of Unicode combining characters in 8.2

From
Martijn van Oosterhout
Date:
On Sat, Dec 09, 2006 at 10:50:05PM -0700, Michael Fuhr wrote:
> Should the code distinguish between combining characters and
> zero-width control characters so the former display correctly?

Probably, any idea how to tell the difference?

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: psql display of Unicode combining characters in 8.2

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> On Sat, Dec 09, 2006 at 10:50:05PM -0700, Michael Fuhr wrote:
>> Should the code distinguish between combining characters and
>> zero-width control characters so the former display correctly?

> Probably, any idea how to tell the difference?

I'm no expert, but isn't there a specific range of Unicode code points
defined for combining characters?
        regards, tom lane


Re: psql display of Unicode combining characters in 8.2

From
Michael Fuhr
Date:
On Sun, Dec 10, 2006 at 12:30:12PM -0500, Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > On Sat, Dec 09, 2006 at 10:50:05PM -0700, Michael Fuhr wrote:
> >> Should the code distinguish between combining characters and
> >> zero-width control characters so the former display correctly?
> 
> > Probably, any idea how to tell the difference?
> 
> I'm no expert, but isn't there a specific range of Unicode code points
> defined for combining characters?

Yes, several, with others scattered about.  Could we use the general
category (Mn = Mark, nonspacing; Me = Mark, enclosing)?  ucs_wcwidth()
in src/backend/utils/mb/wchar.c already contains some of that
knowledge, doesn't it?  The combining[] list looks incomplete but
otherwise close to what we'd need.
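
To illustrate the kind of lookup being discussed, here is a rough sketch in C of a binary search over a sorted table of code point ranges. The names (struct interval, combining_excerpt, is_combining) are made up for this example, and the three ranges are a deliberately tiny excerpt chosen for illustration, not a copy of the combining[] table in wchar.c; a usable table would need many more entries.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types and names, for illustration only. */
struct interval
{
    uint32_t    first;
    uint32_t    last;
};

/* Sorted, non-overlapping ranges; a small excerpt, NOT a complete list. */
static const struct interval combining_excerpt[] = {
    {0x0300, 0x036F},           /* Combining Diacritical Marks */
    {0x0591, 0x05BD},           /* Hebrew accents and points (part) */
    {0x20D0, 0x20DC},           /* Combining marks for symbols (part) */
};

/* Return 1 if ucs falls inside one of the ranges above, else 0. */
int
is_combining(uint32_t ucs)
{
    size_t      lo = 0;
    size_t      hi = sizeof(combining_excerpt) / sizeof(combining_excerpt[0]);

    while (lo < hi)
    {
        size_t      mid = lo + (hi - lo) / 2;

        if (ucs > combining_excerpt[mid].last)
            lo = mid + 1;       /* target is above this range */
        else if (ucs < combining_excerpt[mid].first)
            hi = mid;           /* target is below this range */
        else
            return 1;           /* first <= ucs <= last */
    }
    return 0;
}
```

With this excerpt, is_combining(0x0303) is true and is_combining(0x006E) is false. Whether such a table should be derived from the Mn and Me general categories is exactly the open question above.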

-- 
Michael Fuhr


Re: psql display of Unicode combining characters in 8.2

From
Tom Lane
Date:
Michael Fuhr <mike@fuhr.org> writes:
> On Sun, Dec 10, 2006 at 12:30:12PM -0500, Tom Lane wrote:
>> Martijn van Oosterhout <kleptog@svana.org> writes:
>>> On Sat, Dec 09, 2006 at 10:50:05PM -0700, Michael Fuhr wrote:
>>>> Should the code distinguish between combining characters and
>>>> zero-width control characters so the former display correctly?
>> 
>>> Probably, any idea how to tell the difference?
>> 
>> I'm no expert, but isn't there a specific range of Unicode code points
>> defined for combining characters?

> Yes, several, with others scattered about.

What about the other way around: use the \u output convention only for
things we can specifically identify as control chars, and assume that
anything else with zero width is a combining char?  Is there anything
other than 0-31 and 128-159 that should really get the \u treatment?
        regards, tom lane


Re: psql display of Unicode combining characters in 8.2

From
Tom Lane
Date:
I wrote:
> What about the other way around: use the \u output convention only for
> things we can specifically identify as control chars, and assume that
> anything else with zero width is a combining char?  Is there anything
> other than 0-31 and 128-159 that should really get the \u treatment?

Actually, looking at the comments for ucs_wcwidth() in wchar.c, it seems
that this is already accounted for in the "dsplen" output: characters
for which -1 is returned are control characters, characters for which
0 is returned should be printed as-is and counted as zero width.  So the
bug is just that pg_wcsformat conflates the two cases.
        regards, tom lane
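
A small sketch in C may make the distinction concrete. It is not the psql code: sketch_width is a simplified stand-in for the width function (control ranges taken from the 0-31 and 128-159 question earlier in the thread, one combining block only, no wide characters), and the \uXXXX escape is assumed to occupy six columns. The conflate flag switches between treating width 0 like a control character (the behavior reported for 8.2) and printing it as-is with zero width.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Simplified stand-in for a width function (an assumption, not the real
 * ucs_wcwidth): -1 for control characters, 0 for one block of combining
 * marks, 1 for everything else.
 */
int
sketch_width(uint32_t ucs)
{
    if (ucs < 0x20 || (ucs >= 0x80 && ucs < 0xA0))
        return -1;              /* control character */
    if (ucs >= 0x0300 && ucs <= 0x036F)
        return 0;               /* combining mark: zero width */
    return 1;
}

/*
 * Count the columns a string of code points would occupy in aligned
 * output.  If conflate is nonzero, width 0 is handled like -1, i.e. the
 * "w <= 0" test; otherwise only -1 gets the escape.
 */
int
sketch_display_columns(const uint32_t *s, size_t n, int conflate)
{
    int         cols = 0;
    size_t      i;

    for (i = 0; i < n; i++)
    {
        int         w = sketch_width(s[i]);

        if (w < 0 || (conflate && w == 0))
            cols += 6;          /* shown as \uXXXX (assumed 6 columns) */
        else
            cols += w;          /* printed as-is; combining marks add 0 */
    }
    return cols;
}
```

For the sequence <U+006E, U+0303>, the conflating variant yields 7 columns (matching "n\u0303"), while the separated variant yields 1 column, the same as the precomposed U+00F1.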


Re: psql display of Unicode combining characters in 8.2

From
Tom Lane
Date:
I wrote:
> Actually, looking at the comments for ucs_wcwidth() in wchar.c, it seems
> that this is already accounted for in the "dsplen" output: characters
> for which -1 is returned are control characters, characters for which
> 0 is returned should be printed as-is and counted as zero width.  So the
> bug is just that pg_wcsformat conflates the two cases.

I've applied the attached patch to fix this, but not being much of a
user of languages that have combining characters, I can't test it very
well.  Please check out the behavior and see if you like it.

            regards, tom lane



Re: psql display of Unicode combining characters in 8.2

From
Michael Fuhr
Date:
On Wed, Dec 27, 2006 at 02:49:41PM -0500, Tom Lane wrote:
> I've applied the attached patch to fix this, but not being much of a
> user of languages that have combining characters, I can't test it very
> well.  Please check out the behavior and see if you like it.

Looks good so far.  I've tested languages like Vietnamese (Latin
script with lots of diacritics), polytonic Greek, and pointed Hebrew,
with text normalized to both NFC and NFD.  Before the patch the NFD
text had lots of \u escapes; after the patch it looks identical to
the NFC text aside from a few minor differences in the rendered
glyphs, which tells me that I am indeed receiving the decomposed
sequences.

Thanks!

-- 
Michael Fuhr