Re: badly calculated width of emoji in psql - Mailing list pgsql-hackers

From John Naylor
Subject Re: badly calculated width of emoji in psql
Date
Msg-id CAFBsxsGOCpzV7c-f3a8ADsA1n4uZ=8puCctQp+x7W0vgkv=w+g@mail.gmail.com
Whole thread Raw
In response to Re: badly calculated width of emoji in psql  (Jacob Champion <pchampion@vmware.com>)
Responses Re: badly calculated width of emoji in psql
Re: badly calculated width of emoji in psql
List pgsql-hackers
On Thu, Aug 19, 2021 at 8:05 PM Jacob Champion <pchampion@vmware.com> wrote:
>
> On Thu, 2021-08-19 at 13:49 -0400, John Naylor wrote:
> > I had a couple further thoughts:
> >
> > 1. The coding historically used normal comparison and branching for
> > everything, but recently it only does that to detect control
> > characters, and then goes through a binary search (and with this
> > patch, two binary searches) for everything else. Although the
> > performance regression of the current patch seems negligible,
>
> If I'm reading the code correctly, ASCII characters don't go through
> the binary searches; they're already short-circuited at the beginning
> of mbbisearch(). On my machine that's enough for the patch to be a
> performance _improvement_ for ASCII, not a regression.

I had assumed that there would be a function call, but looking at the assembly, it's inlined, so you're right.

> Should be && instead of ||, I think.

Yes, you're quite right. Clearly I didn't test it. :-) But given the previous, I won't pursue this further.

> > 2. As written, the patch adds a script that's very close to an
> > existing one, and emits a new file that has the same type of contents
> > as an existing one, both of which are #included in one place. I
> > wonder if we should consider having just one script that ingests both
> > files and emits one file. All we need is for mbinterval to encode the
> > character width, but we can probably do that with a bitfield like the
> > normprops table to save space. Then, we only do one binary search.
> > Opinions?
>
> I guess it just depends on what the end result looks/performs like.
> We'd save seven hops or so in the worst case?

Something like that. Attached is what I had in mind (using real patches to see what the CF bot thinks):

0001 is a simple renaming
0002 puts the char width inside the mbinterval so we can put arbitrary values there
0003 is Jacob's patch adjusted to use the same binary search as for combining characters
0004 removes the ancient limit on combining characters, so the following works now:

SELECT U&'\+0102E1\+0102E0';
+----------+
| ?column? |
+----------+
| 𐋡𐋠       |
+----------+
(1 row)

I think the adjustments to 0003 result in a cleaner and more extensible design, but a case could be made otherwise. The former combining table script is a bit more complex than the sum of its former self and Jacob's proposed new script, but just slightly.

Also, I checked the behavior of this comment that I proposed to remove upthread:

- *  - Other format characters (general category code Cf in the Unicode
- * database) and ZERO WIDTH SPACE (U+200B) have a column width of 0.

We don't handle the latter in our current setup:

SELECT U&'foo\200Bbar';
+----------+
| ?column? |
+----------+
| foobar  |
+----------+
(1 row)

Not sure if we should do anything about this. It was an explicit exception years ago in our vendored manual table, but is not labeled as such in the official Unicode files.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: archive status ".ready" files may be created too early
Next
From: Robert Haas
Date:
Subject: Re: The Free Space Map: Problems and Opportunities