Thread: Update list of combining characters

Update list of combining characters

From

Peter Eisentraut

Date:

04 June 2019, 20:58:46

In src/backend/utils/mb/wchar.c, function ucs_wcwidth(), there is a list
of Unicode combining characters, so that those can be ignored for
computing the display length of a Unicode string.  It seems to me that
that list is either outdated or plain incorrect.

For example, the list starts with

    {0x0300, 0x034E}, {0x0360, 0x0362}, {0x0483, 0x0486},

Let's look at the characters around the first "gap":

(https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt)

034C;COMBINING ALMOST EQUAL TO ABOVE;Mn;230;NSM;;;;;N;;;;;
034D;COMBINING LEFT RIGHT ARROW BELOW;Mn;220;NSM;;;;;N;;;;;
034E;COMBINING UPWARDS ARROW BELOW;Mn;220;NSM;;;;;N;;;;;
034F;COMBINING GRAPHEME JOINER;Mn;0;NSM;;;;;N;;;;;
0350;COMBINING RIGHT ARROWHEAD ABOVE;Mn;230;NSM;;;;;N;;;;;
0351;COMBINING LEFT HALF RING ABOVE;Mn;230;NSM;;;;;N;;;;;

These are all in the "Mn" category, so they should all be treated the
same here.  Indeed, psql doesn't compute the width of some of them
correctly:

postgres=> select u&'|oo\034Coo|';
+----------+
| ?column? |
+----------+
| |oXoo|   |
+----------+

postgres=> select u&'|oo\0350oo|';
+----------+
| ?column? |
+----------+
| |oXoo|  |
+----------+

(I have replaced the combined character with X above so that the mail
client rendering doesn't add another layer of uncertainty to this issue.
The point is that the box is off in the second example.)
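
To see why the second box is off, here is a minimal Python sketch of a range-table lookup. It is not the code in wchar.c: it assumes a simplified width rule (0 for a listed combining character, 1 for anything else) and ignores whatever else ucs_wcwidth() does. OLD_RANGES holds only the three table entries quoted above; the names is_combining and display_width are made up for illustration.

```python
from bisect import bisect_right

# First three entries of the old table, as quoted in this thread.
OLD_RANGES = [(0x0300, 0x034E), (0x0360, 0x0362), (0x0483, 0x0486)]


def is_combining(cp, ranges):
    """Return True if code point cp lies in one of the sorted, inclusive ranges."""
    starts = [first for first, _ in ranges]
    i = bisect_right(starts, cp) - 1  # last range starting at or before cp
    return i >= 0 and cp <= ranges[i][1]


def display_width(s, ranges):
    """Simplified width (an assumption): 0 for combining characters, else 1."""
    return sum(0 if is_combining(ord(ch), ranges) else 1 for ch in s)


print(display_width("|oo\u034Coo|", OLD_RANGES))  # U+034C is inside the first range
print(display_width("|oo\u0350oo|", OLD_RANGES))  # U+0350 falls into the gap
```

With these three entries, U+0350 is counted as an ordinary character, so the second string comes out one column wider than the first, which matches the misaligned box.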

AFAICT, these Unicode definitions haven't changed since that list was
put in originally around 2006, so I wonder what's going on there.

I have written a script that recomputes that list from the current
Unicode data.  Patch and script are attached.  This makes the cases
above all render correctly.  (This should eventually get better build
system integration.)
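
The attached script is not reproduced here. As a rough illustration of the idea only, the following Python sketch merges consecutive code points of a chosen category into inclusive ranges. It assumes the input lines are in UnicodeData.txt format and sorted by code point, and that only the "Mn" category is wanted; both the function name combining_ranges and the categories parameter are hypothetical. The sample lines are the ones quoted in this thread.

```python
# Sample lines in UnicodeData.txt format, taken from this thread.
SAMPLE = """\
034C;COMBINING ALMOST EQUAL TO ABOVE;Mn;230;NSM;;;;;N;;;;;
034D;COMBINING LEFT RIGHT ARROW BELOW;Mn;220;NSM;;;;;N;;;;;
034E;COMBINING UPWARDS ARROW BELOW;Mn;220;NSM;;;;;N;;;;;
034F;COMBINING GRAPHEME JOINER;Mn;0;NSM;;;;;N;;;;;
0350;COMBINING RIGHT ARROWHEAD ABOVE;Mn;230;NSM;;;;;N;;;;;
0351;COMBINING LEFT HALF RING ABOVE;Mn;230;NSM;;;;;N;;;;;
0BC0;TAMIL VOWEL SIGN II;Mn;0;NSM;;;;;N;;;;;
0BC1;TAMIL VOWEL SIGN U;Mc;0;L;;;;;N;;;;;
0BCD;TAMIL SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;
0BD0;TAMIL OM;Lo;0;L;;;;;N;;;;;
""".splitlines()


def combining_ranges(lines, categories=("Mn",)):
    """Build sorted inclusive (first, last) ranges from UnicodeData.txt lines."""
    ranges = []
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 3 or fields[2] not in categories:
            continue  # blank line or a category we do not want
        cp = int(fields[0], 16)
        if ranges and ranges[-1][1] == cp - 1:
            # Extend the current range only when cp is directly adjacent.
            ranges[-1] = (ranges[-1][0], cp)
        else:
            ranges.append((cp, cp))
    return ranges


# Print the ranges in the same style as the C table.
print(", ".join("{0x%04X, 0x%04X}" % r for r in combining_ranges(SAMPLE)))
```

On the full file the first range would start earlier than 0x034C; the sample only covers the lines quoted here.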

Thoughts?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Update list of combining characters

From

Peter Eisentraut

Date:

13 June 2019, 07:16:29

On 2019-06-04 22:58, Peter Eisentraut wrote:
> AFAICT, these Unicode definitions haven't changed since that list was
> put in originally around 2006, so I wonder what's going on there.
> 
> I have written a script that recomputes that list from the current
> Unicode data.  Patch and script are attached.  This makes the cases
> above all render correctly.  (This should eventually get better build
> system integration.)

Any thoughts about applying this as

a) a bug fix with backpatching
b) just to master
c) wait for PG13
d) it's all wrong?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Update list of combining characters

From

Tom Lane

Date:

13 June 2019, 13:33:37

Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> Any thoughts about applying this as

> a) a bug fix with backpatching
> b) just to master
> c) wait for PG13
> d) it's all wrong?

Well, it's a behavioral change, and we've not gotten field complaints,
so I'm about -0.1 on back-patching.  No objection to apply to master
though.

            regards, tom lane

Re: Update list of combining characters

From

Alvaro Herrera

Date:

13 June 2019, 13:52:21

I think there's an off-by-one bug in your script.  I picked one value at
random to verify -- 0x0BC0.  Old:

> -        {0x0BC0, 0x0BC0}, {0x0BCD, 0x0BCD}, {0x0C3E, 0x0C40},

New:

> +        {0x0BC0, 0x0BC1}, {0x0BCD, 0x0BD0}, {0x0C00, 0x0C01},

The UCD file has:

0BC0;TAMIL VOWEL SIGN II;Mn;0;NSM;;;;;N;;;;;
0BC1;TAMIL VOWEL SIGN U;Mc;0;L;;;;;N;;;;;

0BCD;TAMIL SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;
0BD0;TAMIL OM;Lo;0;L;;;;;N;;;;;

So it appears that including 0x0BC1 and 0x0BD0 is a mistake.
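
A check of this kind can be automated. The Python sketch below is a hypothetical helper, not part of the patch: it walks every code point in a generated table and reports those whose UnicodeData.txt category is not the wanted one. It assumes "Mn" is the only wanted category and skips code points that do not appear in the data at all.

```python
# UCD lines quoted in this message.
SAMPLE = """\
0BC0;TAMIL VOWEL SIGN II;Mn;0;NSM;;;;;N;;;;;
0BC1;TAMIL VOWEL SIGN U;Mc;0;L;;;;;N;;;;;
0BCD;TAMIL SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;
0BD0;TAMIL OM;Lo;0;L;;;;;N;;;;;
""".splitlines()


def misplaced(ranges, lines, categories=("Mn",)):
    """Return (code point, category) pairs inside ranges with an unwanted category."""
    cats = {}
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) >= 3:
            cats[int(fields[0], 16)] = fields[2]
    bad = []
    for first, last in ranges:
        for cp in range(first, last + 1):
            cat = cats.get(cp)
            # Code points absent from the data are skipped, not flagged.
            if cat is not None and cat not in categories:
                bad.append((cp, cat))
    return bad


old = [(0x0BC0, 0x0BC0), (0x0BCD, 0x0BCD)]
new = [(0x0BC0, 0x0BC1), (0x0BCD, 0x0BD0)]
print(misplaced(old, SAMPLE))
print([("0x%04X" % cp, cat) for cp, cat in misplaced(new, SAMPLE)])
```

Against the four lines above, the old entries produce no findings, while the new entries flag 0x0BC1 (Mc) and 0x0BD0 (Lo).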

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Update list of combining characters

From

Peter Eisentraut

Date:

14 June 2019, 09:36:02

On 2019-06-13 15:52, Alvaro Herrera wrote:
> I think there's an off-by-one bug in your script.

Indeed.  Here is an updated script and patch.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Update list of combining characters

From

Peter Eisentraut

Date:

19 June 2019, 19:39:38

On 2019-06-14 11:36, Peter Eisentraut wrote:
> On 2019-06-13 15:52, Alvaro Herrera wrote:
>> I think there's an off-by-one bug in your script.
> 
> Indeed.  Here is an updated script and patch.

committed (to master)

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Update list of combining characters

From

Tom Lane

Date:

19 June 2019, 19:55:46

Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>> Indeed.  Here is an updated script and patch.

> committed (to master)

Cool, but should we also put your recalculation script into git, to help
the next time we decide that we need to update this list?  It's
demonstrated to be nontrivial to get it right ;-)

            regards, tom lane

Re: Update list of combining characters

From

Peter Eisentraut

Date:

24 June 2019, 20:58:34

On 2019-06-19 21:55, Tom Lane wrote:
> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>>> Indeed.  Here is an updated script and patch.
> 
>> committed (to master)
> 
> Cool, but should we also put your recalculation script into git, to help
> the next time we decide that we need to update this list?  It's
> demonstrated to be nontrivial to get it right ;-)

For PG12, having the script in the archives is sufficient, I think.  Per
thread "more Unicode data updates", we should come up with a method that
updates all (currently three) places where Unicode data is applied,
which would involve some larger restructuring, probably.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services