Thread: Update list of combining characters
In src/backend/utils/mb/wchar.c, function ucs_wcwidth(), there is a list of Unicode combining characters, so that those can be ignored when computing the display length of a Unicode string. It seems to me that the list is either outdated or plain incorrect.

For example, the list starts with

{0x0300, 0x034E}, {0x0360, 0x0362}, {0x0483, 0x0486},

Let's look at the characters around the first "gap" (https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt):

034C;COMBINING ALMOST EQUAL TO ABOVE;Mn;230;NSM;;;;;N;;;;;
034D;COMBINING LEFT RIGHT ARROW BELOW;Mn;220;NSM;;;;;N;;;;;
034E;COMBINING UPWARDS ARROW BELOW;Mn;220;NSM;;;;;N;;;;;
034F;COMBINING GRAPHEME JOINER;Mn;0;NSM;;;;;N;;;;;
0350;COMBINING RIGHT ARROWHEAD ABOVE;Mn;230;NSM;;;;;N;;;;;
0351;COMBINING LEFT HALF RING ABOVE;Mn;230;NSM;;;;;N;;;;;

These are all in the "Mn" category, so they should all be treated the same here.

Indeed, psql doesn't compute the width of some of them correctly:

postgres=> select u&'|oo\034Coo|';
+----------+
| ?column? |
+----------+
| |oXoo|   |
+----------+

postgres=> select u&'|oo\0350oo|';
+----------+
| ?column? |
+----------+
| |oXoo|  |
+----------+

(I have replaced the combined character with X above so that the mail client rendering doesn't add another layer of uncertainty to this issue. The point is that the box is off in the second example.)

AFAICT, these Unicode definitions haven't changed since that list was put in originally around 2006, so I wonder what's going on there.

I have written a script that recomputes that list from the current Unicode data. Patch and script are attached. This makes the above cases all render correctly. (This should eventually get better build system integration.)

Thoughts?

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
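To make the width problem from the first message concrete, here is a minimal Python sketch of an interval-table lookup. It is not the code in wchar.c: the helper names (is_combining, display_width) are hypothetical, every non-combining character is simply counted as one column, and NEW_TABLE is an assumed corrected first interval rather than the contents of the attached patch. Only OLD_TABLE is taken from the message.

```python
from bisect import bisect_right

# First three intervals of the old list, as quoted in the message.
OLD_TABLE = [(0x0300, 0x034E), (0x0360, 0x0362), (0x0483, 0x0486)]

# Assumption for illustration only: a corrected first interval that covers
# the whole Combining Diacritical Marks block, so the "gap" is closed.
NEW_TABLE = [(0x0300, 0x036F), (0x0483, 0x0486)]


def is_combining(code, table):
    """Binary search over sorted, non-overlapping, inclusive intervals."""
    # Index of the last interval whose start is <= code (or -1 if none).
    i = bisect_right(table, (code, float("inf"))) - 1
    return i >= 0 and table[i][0] <= code <= table[i][1]


def display_width(text, table):
    """Simplified width: combining characters take 0 columns, all others 1."""
    return sum(0 if is_combining(ord(ch), table) else 1 for ch in text)


ok = "|oo\u034Coo|"   # U+034C is inside the old first interval
off = "|oo\u0350oo|"  # U+0350 falls into the gap of the old list

print(display_width(ok, OLD_TABLE))   # 6
print(display_width(off, OLD_TABLE))  # 7 (one column too many)
print(display_width(off, NEW_TABLE))  # 6
```

With the old intervals, U+0350 is counted as a full column, which would be consistent with psql padding the second result one space short.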
On 2019-06-04 22:58, Peter Eisentraut wrote:
> AFAICT, these Unicode definitions haven't changed since that list was
> put in originally around 2006, so I wonder what's going on there.
>
> I have written a script that recomputes that list from the current
> Unicode data. Patch and script are attached. This makes those above
> cases all render correctly. (This should eventually get better build
> system integration.)

Any thoughts about applying this as

a) a bug fix with backpatching
b) just to master
c) wait for PG13
d) it's all wrong?

--
Peter Eisentraut
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> Any thoughts about applying this as
> a) a bug fix with backpatching
> b) just to master
> c) wait for PG13
> d) it's all wrong?

Well, it's a behavioral change, and we've not gotten field complaints, so I'm about -0.1 on back-patching. No objection to applying it to master, though.

regards, tom lane
I think there's an off-by-one bug in your script. I picked one value at random to verify -- 0x0BC0.

Old:

> - {0x0BC0, 0x0BC0}, {0x0BCD, 0x0BCD}, {0x0C3E, 0x0C40},

New:

> + {0x0BC0, 0x0BC1}, {0x0BCD, 0x0BD0}, {0x0C00, 0x0C01},

The UCD file has:

0BC0;TAMIL VOWEL SIGN II;Mn;0;NSM;;;;;N;;;;;
0BC1;TAMIL VOWEL SIGN U;Mc;0;L;;;;;N;;;;;
0BCD;TAMIL SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;
0BD0;TAMIL OM;Lo;0;L;;;;;N;;;;;

So it appears that including 0x0BC1 and 0x0BD0 is a mistake in both cases: neither is in the "Mn" category.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
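The script itself is only attached, not shown inline, so the following Python sketch is a hypothetical reconstruction of how such an off-by-one could arise, not the actual bug. It assumes ranges are built from "Mn" entries only, and the function names (parse, ranges_buggy, ranges_fixed) are invented for illustration. The buggy variant closes each run at the code point of the first non-matching line; the fixed variant closes it at the last matching code point and also splits runs across unassigned gaps.

```python
# The four UnicodeData.txt lines quoted in the message above.
TAMIL = [
    "0BC0;TAMIL VOWEL SIGN II;Mn;0;NSM;;;;;N;;;;;",
    "0BC1;TAMIL VOWEL SIGN U;Mc;0;L;;;;;N;;;;;",
    "0BCD;TAMIL SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;",
    "0BD0;TAMIL OM;Lo;0;L;;;;;N;;;;;",
]


def parse(line):
    """Return (code point, general category) for one UnicodeData.txt line."""
    fields = line.split(";")
    return int(fields[0], 16), fields[2]


def ranges_buggy(lines):
    """Hypothetical bug: a run is closed at the first NON-matching line."""
    ranges, start, last = [], None, None
    for code, cat in map(parse, lines):
        if cat == "Mn":
            if start is None:
                start = code
        elif start is not None:
            ranges.append((start, code))  # wrong: `code` is not Mn
            start = None
        last = code
    if start is not None:
        ranges.append((start, last))
    return ranges


def ranges_fixed(lines):
    """Close each run at the last Mn code point actually seen."""
    ranges, start, prev = [], None, None
    for code, cat in map(parse, lines):
        is_mn = cat == "Mn"
        # Close the open run if this line is not Mn or not adjacent to it.
        if start is not None and (not is_mn or code != prev + 1):
            ranges.append((start, prev))
            start = None
        if is_mn:
            if start is None:
                start = code
            prev = code
    if start is not None:
        ranges.append((start, prev))
    return ranges


def fmt(ranges):
    return ", ".join("{0x%04X, 0x%04X}" % r for r in ranges)


print(fmt(ranges_buggy(TAMIL)))  # {0x0BC0, 0x0BC1}, {0x0BCD, 0x0BD0}
print(fmt(ranges_fixed(TAMIL)))  # {0x0BC0, 0x0BC0}, {0x0BCD, 0x0BCD}
```

On these four lines the buggy variant reproduces the "New" intervals quoted above, and the fixed variant reproduces the "Old" ones, which agree with the UCD data.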
On 2019-06-13 15:52, Alvaro Herrera wrote:
> I think there's an off-by-one bug in your script.

Indeed. Here is an updated script and patch.

--
Peter Eisentraut
Attachment
On 2019-06-14 11:36, Peter Eisentraut wrote:
> On 2019-06-13 15:52, Alvaro Herrera wrote:
>> I think there's an off-by-one bug in your script.
>
> Indeed. Here is an updated script and patch.

committed (to master)

--
Peter Eisentraut
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>> Indeed. Here is an updated script and patch.

> committed (to master)

Cool, but should we also put your recalculation script into git, to help the next time we decide that we need to update this list? It's demonstrated to be nontrivial to get it right ;-)

regards, tom lane
On 2019-06-19 21:55, Tom Lane wrote:
> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>>> Indeed. Here is an updated script and patch.
>
>> committed (to master)
>
> Cool, but should we also put your recalculation script into git, to help
> the next time we decide that we need to update this list? It's
> demonstrated to be nontrivial to get it right ;-)

For PG12, having the script in the archives is sufficient, I think. Per thread "more Unicode data updates", we should come up with a method that updates all (currently three) places where Unicode data is applied, which would involve some larger restructuring, probably.

--
Peter Eisentraut