Re: Built-in CTYPE provider - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: Built-in CTYPE provider
Date
Msg-id 163f4e2190cdf67f67016044e503c5004547e5a9.camel@j-davis.com
In response to Re: Built-in CTYPE provider  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: Built-in CTYPE provider
List pgsql-hackers
On Thu, 2024-02-29 at 21:05 -0800, Jeff Davis wrote:
> Attached v19 which addresses this issue.

I pushed the doc patch.

Attached v20. I am going to start pushing some other patches. v20-0001
(property tables) and v20-0003 (catalog iculocale -> locale) have been
stable for a while so are likely to go in soon. v20-0002 (case mapping)
also feels close to me, but it went through significant changes to
support full case mapping and titlecasing, so I'll see if there are
more comments.

Changes in v20:

 * For titlecasing with the builtin "C.UTF-8" locale, do not perform
   word break adjustment, so it matches libc's "C.UTF-8" titlecasing
   behavior more closely.

 * Add optimized table for ASCII code points when determining
   categories and properties (this was already done for the case mapping
   table).

 * Add a small patch to make UTF-8 functions inline, which speeds
   things up substantially.
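
To illustrate the ASCII optimization in the second item, here is a
minimal sketch of the general technique, not the code in the patch.
Every name below (cp_range, ascii_alpha, alpha_ranges, range_search,
is_alpha) is hypothetical, and the range table is a small illustrative
subset rather than the real Unicode data. Code points below 128 are
answered with a direct array index; everything else falls back to a
binary search over sorted ranges.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; not the identifiers used in the patch. */
typedef struct
{
    uint32_t first;             /* first code point in range (inclusive) */
    uint32_t last;              /* last code point in range (inclusive) */
} cp_range;

/* Direct-indexed table for ASCII: 1 for A-Z and a-z, 0 otherwise. */
static const unsigned char ascii_alpha[128] = {
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,    /* 0x00-0x0F */
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,    /* 0x10-0x1F */
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,    /* 0x20-0x2F */
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,    /* 0x30-0x3F */
    0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,    /* '@', 'A'-'O' */
    1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,    /* 'P'-'Z', then punctuation */
    0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,    /* '`', 'a'-'o' */
    1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0     /* 'p'-'z', then punctuation, DEL */
};

/* Illustrative, incomplete subset of non-ASCII alphabetic ranges (sorted). */
static const cp_range alpha_ranges[] = {
    {0x00C0, 0x00D6},
    {0x00D8, 0x00F6},
    {0x00F8, 0x02C1},
    {0x0370, 0x0373}
};

/* Binary search over sorted, non-overlapping inclusive ranges. */
static bool
range_search(uint32_t cp, const cp_range *tbl, size_t n)
{
    size_t lo = 0;
    size_t hi = n;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (cp < tbl[mid].first)
            hi = mid;
        else if (cp > tbl[mid].last)
            lo = mid + 1;
        else
            return true;
    }
    return false;
}

/* ASCII fast path first; slower range search only for cp >= 128. */
static inline bool
is_alpha(uint32_t cp)
{
    if (cp < 128)
        return ascii_alpha[cp] != 0;
    return range_search(cp, alpha_ranges,
                        sizeof(alpha_ranges) / sizeof(alpha_ranges[0]));
}
```

The point of the design is that ASCII-heavy input never reaches the
search loop at all, which is consistent with the gap between the
ASCII-only and non-ASCII numbers below.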

Performance:

ASCII-only data:

                       lower    initcap    upper

  "C" (libc)            2426       3326     2341
  pg_c_utf8             2890       6570     2825
  pg_unicode_fast       2929       7140     2893
  "C.utf8" (libc)       5410       7810     5397
  "en-US-x-icu"         8320      65732     9367

Including non-ASCII data:

                       lower    initcap    upper

  "C" (libc)            2630       4677     2548
  pg_c_utf8             5471      10682     5431
  pg_unicode_fast       5582      12023     5587
  "C.utf8" (libc)       8126      11834     8106
  "en-US-x-icu"        14473      73655    15112


The new builtin collations finish ahead of everything except libc "C".
The one exception is titlecasing non-ASCII data, where pg_unicode_fast
is marginally slower than libc "C.UTF-8" (12023 vs. 11834), which is
likely due to the word break adjustment semantics.

I suspect the inlined UTF-8 functions also speed up a few other areas,
but I didn't measure.
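
For readers who want a picture of what an inlined UTF-8 function looks
like, here is a rough sketch. The function name utf8_to_codepoint is
hypothetical and this is not the code from the patch. It assumes the
input points at a valid, complete UTF-8 sequence and does no validation
beyond checking the lead byte. Declaring it static inline in a header
lets the compiler fold the decode into per-character loops instead of
emitting a function call for each character.

```c
#include <stdint.h>

/*
 * Hypothetical helper: decode one UTF-8 sequence to a code point.
 * Assumes the caller has already verified the input is valid UTF-8.
 */
static inline uint32_t
utf8_to_codepoint(const unsigned char *s)
{
    if ((s[0] & 0x80) == 0)
        return s[0];                            /* 1 byte: 0xxxxxxx */
    if ((s[0] & 0xE0) == 0xC0)
        return ((uint32_t) (s[0] & 0x1F) << 6) |
               (uint32_t) (s[1] & 0x3F);        /* 2 bytes: 110xxxxx */
    if ((s[0] & 0xF0) == 0xE0)
        return ((uint32_t) (s[0] & 0x0F) << 12) |
               ((uint32_t) (s[1] & 0x3F) << 6) |
               (uint32_t) (s[2] & 0x3F);        /* 3 bytes: 1110xxxx */
    if ((s[0] & 0xF8) == 0xF0)
        return ((uint32_t) (s[0] & 0x07) << 18) |
               ((uint32_t) (s[1] & 0x3F) << 12) |
               ((uint32_t) (s[2] & 0x3F) << 6) |
               (uint32_t) (s[3] & 0x3F);        /* 4 bytes: 11110xxx */
    return 0xFFFFFFFF;                          /* not a valid lead byte */
}
```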

Regards,
    Jeff Davis

