Re: badly calculated width of emoji in psql - Mailing list pgsql-hackers

From Jacob Champion
Subject Re: badly calculated width of emoji in psql
Date
Msg-id c2a556a0158b642d6c597508d2a47059091e4d56.camel@vmware.com
In response to Re: badly calculated width of emoji in psql  (Laurenz Albe <laurenz.albe@cybertec.at>)
Responses Re: badly calculated width of emoji in psql  (Jacob Champion <pchampion@vmware.com>)
List pgsql-hackers
On Mon, 2021-07-19 at 13:13 +0200, Laurenz Albe wrote:
> On Mon, 2021-07-19 at 16:46 +0900, Michael Paquier wrote:
> > > In your opinion, would the current one-line patch proposal make things
> > > strictly better than they are today, or would it have mixed results?
> > > I'm wondering how to help this patch move forward for the current
> > > commitfest, or if we should maybe return with feedback for now.
> > 
> > Based on the following list, it seems to me that [u+1f300,u+0x1faff]
> > won't capture everything, like the country flags:
> > http://www.unicode.org/emoji/charts/full-emoji-list.html

On my machine, the regional indicator codes take up one column each
(even though they display as a wide uppercase letter), so making them
wide would break alignment. This seems to match up with Unicode
Standard Annex #11 [1]:

ED4. East Asian Wide (W): All other characters that are always
     wide. [...] This category includes characters that have explicit
     halfwidth counterparts, along with characters that have the [UTS51]
     property Emoji_Presentation, with the exception of characters that
     have the [UCD] property Regional_Indicator

So for whatever reason, those indicator codes aren't considered East
Asian Wide by Unicode (and therefore glibc), even though they are
Emoji_Presentation. And glibc appears to be using East Asian Wide as
the flag for a 2-column character.
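
A quick way to check the property values, as a sketch in Python rather than
against glibc itself: the standard unicodedata module reports East Asian
Width from whatever Unicode version the interpreter ships with, which may
differ from the system's glibc. The helper name east_asian_width_of is made
up for this illustration.

```python
import unicodedata

def east_asian_width_of(codepoint: int) -> str:
    """Return the East Asian Width class ('W', 'F', 'N', ...) of a codepoint."""
    return unicodedata.east_asian_width(chr(codepoint))

# Regional indicator symbols occupy U+1F1E6..U+1F1FF.
ri_classes = {east_asian_width_of(cp) for cp in range(0x1F1E6, 0x1F1FF + 1)}

# U+1F600 GRINNING FACE serves as a baseline emoji for comparison.
grinning = east_asian_width_of(0x1F600)

print("Unicode version:", unicodedata.unidata_version)
print("regional indicator classes:", sorted(ri_classes))
print("U+1F600 class:", grinning)
```

If the regional indicators report neither W nor F while U+1F600 reports W,
that agrees with the ED4 exception quoted above.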

glibc 2.31 is based on Unicode 12.1, I think. So if Postgres is built
against a Unicode database that's different from the system's,
obviously you'll see odd results no matter what we do here.

And _all_ of that completely ignores the actual country-flag-combining
behavior, which my terminal doesn't do and I assume would be part of a
separate conversation entirely, along with things like ZWJ sequences.

> That could be adapted; the question is if the approach as such is
> desirable or not.  This is necessarily a moving target, at the rate
> that emojis are created and added to Unicode.

Sure. We already have code in the tree that deals with that moving
target, though, by parsing apart pieces of the Unicode database. So the
added maintenance cost should be pretty low.
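
To illustrate what that kind of parsing could look like for width data, here
is a minimal Python sketch, not the in-tree code. It assumes input lines in
the EastAsianWidth.txt layout ("first..last;class # comment"). SAMPLE is a
hand-written excerpt in that layout, and wide_ranges is a hypothetical
helper; real data should come from the Unicode Character Database.

```python
from typing import Iterable, List, Tuple

# Hand-written sample in the EastAsianWidth.txt layout (not the full file).
SAMPLE = """\
# first..last;class  # comment
1F1E6..1F1FF;N   # regional indicators
1F300..1F320;W
1F321..1F32C;N
231A..231B;W
3000;F
"""

def wide_ranges(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Collect (first, last) codepoint ranges whose class is W or F."""
    ranges = []
    for line in lines:
        data = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if ";" not in data:
            continue
        cps, cls = (field.strip() for field in data.split(";", 1))
        if cls not in ("W", "F"):
            continue
        first, _, last = cps.partition("..")   # a single codepoint has no ".."
        lo = int(first, 16)
        hi = int(last, 16) if last else lo
        ranges.append((lo, hi))
    return sorted(ranges)

for lo, hi in wide_ranges(SAMPLE.splitlines()):
    print(f"{{0x{lo:04X}, 0x{hi:04X}}},")
```

The printed lines mimic the shape of an interval table that a width lookup
could search; regenerating it for a new Unicode release would be a matter of
rerunning the script on the new file.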

> My personal feeling is that something simple and perhaps imperfect
> as my one-liner that may miss some corner cases would be ok, but
> anything that saps more performance or is complicated would not
> be worth the effort.

Another data point: on my machine (Ubuntu 20.04, glibc 2.31) that
additional range not only misses a large number of emoji (e.g. in the
2xxx codepoint range), it incorrectly treats some narrow codepoints as
wide (e.g. many in the 1F32x range have Emoji_Presentation set to
false).
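
Both kinds of mismatch can be reproduced with a short Python sketch. It
compares the proposed range from upthread (taken here as U+1F300..U+1FAFF)
with the East Asian Width property from Python's unicodedata module. Python
does not expose Emoji_Presentation, so East Asian Width stands in for it, and
the result depends on the interpreter's Unicode version. The function names
are made up for this example.

```python
import unicodedata

# Range from the one-line proposal, as quoted upthread.
PROPOSED_LO, PROPOSED_HI = 0x1F300, 0x1FAFF

def wide_by_range(cp: int) -> bool:
    """Width-2 decision under the single-range proposal."""
    return PROPOSED_LO <= cp <= PROPOSED_HI

def wide_by_eaw(cp: int) -> bool:
    """Width-2 decision from East Asian Width (Wide or Fullwidth)."""
    return unicodedata.east_asian_width(chr(cp)) in ("W", "F")

# U+231A WATCH, U+1F321 THERMOMETER, U+1F600 GRINNING FACE
for cp in (0x231A, 0x1F321, 0x1F600):
    print(f"U+{cp:04X} range={wide_by_range(cp)} eaw={wide_by_eaw(cp)}")
```

U+231A is the kind of 2xxx emoji the range misses, and U+1F321 is one of the
narrow 1F32x codepoints the range would widen.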

I note that the doc comment for ucs_wcwidth()...

>  *      - Spacing characters in the East Asian Wide (W) or East Asian
>  *        FullWidth (F) category as defined in Unicode Technical
>  *        Report #11 have a column width of 2.

...doesn't match reality anymore. The East Asian width handling was
last updated in 2006, it looks like? So I wonder whether fixing the
code to match the comment would not only fix the emoji problem but also
fix the width of a bunch of other non-emoji characters.

--Jacob

[1] http://www.unicode.org/reports/tr11/
