Re: badly calculated width of emoji in psql - Mailing list pgsql-hackers

From John Naylor
Subject Re: badly calculated width of emoji in psql
Date
Msg-id CAFBsxsEX8+K4fvvUEJYTqbhpj9ENenWwYm9SrMXZ1P2cPG8P+Q@mail.gmail.com
Whole thread Raw
In response to Re: badly calculated width of emoji in psql  (John Naylor <john.naylor@enterprisedb.com>)
Responses Re: badly calculated width of emoji in psql  (Jacob Champion <pchampion@vmware.com>)
List pgsql-hackers
I wrote:

> On Sun, Aug 15, 2021 at 10:45 PM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Thu, Aug 12, 2021 at 05:13:31PM -0400, John Naylor wrote:
> > > I'm a bit concerned about the build dependencies not working right, but
> > > it's not clear it's even due to the patch. I'll spend some time
> > > investigating next week.
> >
> > How large do libpgcommon deliverables get with this patch?  Skimming
> > through the patch, that just looks like a couple of bytes, still.
>
> More like a couple thousand bytes. That's because the width of mbinterval doubled. If this is not desirable, we could teach the scripts to adjust the width of the interval type depending on the largest character they saw.

For src/common/libpgcommon.a, in a non-assert / non-debug build:
master: 254912
patch: 256432

And if I go further and remove the limit on the largest character in the combining table, I get 257248, which is still a relatively small difference.

I had a couple further thoughts:

1. The coding historically used normal comparison and branching for everything, but recently it only does that to detect control characters, and then goes through a binary search (and with this patch, two binary searches) for everything else. Although the performance regression of the current patch seems negligible, we could use almost the same branches to fast-path printable ascii text, like this:

+       /* fast path for printable ASCII characters */
+       if (ucs >= 0x20 || ucs < 0x7f)
+               return 1;
+
        /* test for 8-bit control characters */
        if (ucs == 0)
                return 0;

-       if (ucs < 0x20 || (ucs >= 0x7f && ucs < 0xa0) || ucs > 0x0010ffff)
+       if (ucs < 0xa0 || ucs > 0x0010ffff)
                return -1;

2. As written, the patch adds a script that's very close to an existing one, and emits a new file that has the same type of contents as an existing one, both of which are #included in one place. I wonder if we should consider having just one script that ingests both files and emits one file. All we need is for mbinterval to encode the character width, but we can probably do that with a bitfield like the normprops table to save space. Then, we only do one binary search. Opinions?

--
John Naylor
EDB: http://www.enterprisedb.com

pgsql-hackers by date:

Previous
From: Mark Dilger
Date:
Subject: Re: reporting TID/table with corruption error
Next
From: Peter Geoghegan
Date:
Subject: Re: reporting TID/table with corruption error