Re: badly calculated width of emoji in psql - Mailing list pgsql-hackers

From Jacob Champion
Subject Re: badly calculated width of emoji in psql
Date
Msg-id 9e3c847108bac6041f50e86d08d8835f5cf7cd78.camel@vmware.com
Whole thread Raw
In response to Re: badly calculated width of emoji in psql  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses Re: badly calculated width of emoji in psql  (Pavel Stehule <pavel.stehule@gmail.com>)
Re: badly calculated width of emoji in psql  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
On Mon, 2021-04-05 at 14:07 +0900, Kyotaro Horiguchi wrote:
> At Fri, 2 Apr 2021 11:51:26 +0200, Pavel Stehule <pavel.stehule@gmail.com> wrote in 
> > with this patch, the formatting is correct
> 
> I think the hardest point of this issue is that we don't have a
> reasonable authoritative source that determines character width. And
> that the presentation is heavily dependent on environment.

> Unicode 9 and/or 10 defines the character properties "Emoji" and
> "Emoji_Presentation", and tr51[1] says that
> 
> > Emoji are generally presented with a square aspect ratio, which
> > presents a problem for flags.
> ...
> > Current practice is for emoji to have a square aspect ratio, deriving
> > from their origin in Japanese. For interoperability, it is recommended
> > that this practice be continued with current and future emoji. They
> > will typically have about the same vertical placement and advance
> > width as CJK ideographs. For example:
> 
> Ok, even putting aside flags, the first table in [2] asserts that "#",
> "*", "0-9" are emoji characters.  But we and I think no-one never
> present them in two-columns.  And the table has many mysterious holes
> I haven't looked into.

I think that's why Emoji_Presentation is false for those characters --
they _could_ be presented as emoji if the UI should choose to do so, or
if an emoji presentation selector is used, but by default a text
presentation would be expected.

> We could Emoji_Presentation=yes for the purpose, but for example,
> U+23E9(BLACK RIGHT-POINTING DOUBLE TRIANGLE) has the property
> Emoji_Presentation=yes but U+23E9(BLACK RIGHT-POINTING DOUBLE TRIANGLE
> WITH VERTICAL BAR) does not for a reason uncertaion to me.  It doesn't
> look like other than some kind of mistake.

That is strange.

> About environment, for example, U+23E9 is an emoji, and
> Emoji_Presentation=yes, but it is shown in one column on my
> xterm. (I'm not sure what font am I using..)

I would guess that's the key issue here. If we choose a particular
width for emoji characters, is there anything keeping a terminal's font
from doing something different anyway?

Furthermore, if the stream contains an emoji presentation selector
after a code point that would normally be text, shouldn't we change
that glyph to have an emoji "expected width"?

I'm wondering if the most correct solution would be to have the user
tell the client what width to use, using .psqlrc or something.

> A possible compromise is that we treat all Emoji=yes characters
> excluding ASCII characters as double-width and manually merge the
> fragmented regions into reasonably larger chunks.

We could also keep the fragments as-is and generate a full interval
table, like common/unicode_combining_table.h. It looks like there's
roughly double the number of emoji intervals as combining intervals, so
hopefully adding a second binary search wouldn't be noticeably slower.

--

In your opinion, would the current one-line patch proposal make things
strictly better than they are today, or would it have mixed results?
I'm wondering how to help this patch move forward for the current
commitfest, or if we should maybe return with feedback for now.

--Jacob

pgsql-hackers by date:

Previous
From: Zhihong Yu
Date:
Subject: Re: Numeric x^y for negative x
Next
From: David Christensen
Date:
Subject: Re: DELETE CASCADE