On Thu, 2024-02-29 at 21:05 -0800, Jeff Davis wrote:
> Attached v19 which addresses this issue.
I pushed the doc patch.
Attached v20. I am going to start pushing some other patches. v20-0001
(property tables) and v20-0003 (catalog iculocale -> locale) have been
stable for a while so are likely to go in soon. v20-0002 (case mapping)
also feels close to me, but it went through significant changes to
support full case mapping and titlecasing, so I'll see if there are
more comments.
Changes in v20:

  * For titlecasing with the builtin "C.UTF-8" locale, do not perform
    word break adjustment, so it matches libc's "C.UTF-8" titlecasing
    behavior more closely.
  * Add optimized table for ASCII code points when determining
    categories and properties (this was already done for the case
    mapping table).
  * Add a small patch to make UTF-8 functions inline, which speeds
    things up substantially.
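
To illustrate the ASCII fast path for property lookups, here is a minimal sketch. It is not the code from the patch: the `sk_` names, the property bits, and the tiny Latin-1 range table are all made up for illustration, and a real implementation would presumably use a generated constant table rather than filling the ASCII array at first use. The idea is only that code points below 128 are answered by a direct array index, and everything else falls through to a binary search over sorted ranges.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical property bits for this sketch; not the patch's names. */
#define SK_ALPHA 0x01
#define SK_UPPER 0x02
#define SK_LOWER 0x04
#define SK_DIGIT 0x08

typedef struct
{
	uint32_t	first;			/* first code point in range, inclusive */
	uint32_t	last;			/* last code point in range, inclusive */
	uint8_t		props;			/* property bits shared by the range */
} sk_range;

/* Illustrative slow-path table (sorted): only a few Latin-1 letters. */
static const sk_range sk_ranges[] = {
	{0x00C0, 0x00D6, SK_ALPHA | SK_UPPER},
	{0x00D8, 0x00DE, SK_ALPHA | SK_UPPER},
	{0x00DF, 0x00F6, SK_ALPHA | SK_LOWER},
	{0x00F8, 0x00FF, SK_ALPHA | SK_LOWER},
};

/* Fast-path table, one entry per ASCII code point. */
static uint8_t sk_ascii[128];
static bool sk_ascii_ready = false;

/* Fill the ASCII table once; a generated const table would avoid this. */
static void
sk_init_ascii(void)
{
	int			c;

	for (c = '0'; c <= '9'; c++)
		sk_ascii[c] = SK_DIGIT;
	for (c = 'A'; c <= 'Z'; c++)
		sk_ascii[c] = SK_ALPHA | SK_UPPER;
	for (c = 'a'; c <= 'z'; c++)
		sk_ascii[c] = SK_ALPHA | SK_LOWER;
	sk_ascii_ready = true;
}

/* Return the property bits for a code point (0 if not in any table). */
uint8_t
sk_props(uint32_t cp)
{
	size_t		lo = 0;
	size_t		hi = sizeof(sk_ranges) / sizeof(sk_ranges[0]);

	assert(cp <= 0x10FFFF);

	/* Fast path: direct index, no search. */
	if (cp < 128)
	{
		if (!sk_ascii_ready)
			sk_init_ascii();
		return sk_ascii[cp];
	}

	/* Slow path: binary search over the half-open interval [lo, hi). */
	while (lo < hi)
	{
		size_t		mid = lo + (hi - lo) / 2;

		if (cp < sk_ranges[mid].first)
			hi = mid;
		else if (cp > sk_ranges[mid].last)
			lo = mid + 1;
		else
			return sk_ranges[mid].props;
	}
	return 0;
}

bool
sk_isupper(uint32_t cp)
{
	return (sk_props(cp) & SK_UPPER) != 0;
}
```

With this layout, ASCII-heavy input never reaches the binary search, which fits the smaller gap to "C" in the ASCII-only numbers.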
Performance:

ASCII-only data:

                      lower   initcap    upper
  "C" (libc)           2426      3326     2341
  pg_c_utf8            2890      6570     2825
  pg_unicode_fast      2929      7140     2893
  "C.utf8" (libc)      5410      7810     5397
  "en-US-x-icu"        8320     65732     9367

Including non-ASCII data:

                      lower   initcap    upper
  "C" (libc)           2630      4677     2548
  pg_c_utf8            5471     10682     5431
  pg_unicode_fast      5582     12023     5587
  "C.utf8" (libc)      8126     11834     8106
  "en-US-x-icu"       14473     73655    15112

The new builtin collations finish ahead of everything except "C", with
one exception: pg_unicode_fast is marginally slower than libc "C.UTF-8"
at titlecasing non-ASCII data (12023 vs. 11834), which is likely due to
the word break adjustment semantics.
I suspect the inlined UTF-8 functions also speed up a few other areas,
but I didn't measure.
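
For anyone wondering why inlining matters here: the decoder is called once per character in these loops, so the call overhead is comparable to the work itself. A rough sketch of the shape follows, assuming valid UTF-8 input. The `sk_` functions are hypothetical stand-ins, not the actual PostgreSQL functions touched by the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte length of a UTF-8 sequence, judged from its first byte. */
static inline int
sk_utf8_len(unsigned char first)
{
	if ((first & 0x80) == 0)
		return 1;
	if ((first & 0xE0) == 0xC0)
		return 2;
	if ((first & 0xF0) == 0xE0)
		return 3;
	return 4;
}

/* Decode one code point; assumes s points at a valid UTF-8 sequence. */
static inline uint32_t
sk_utf8_to_cp(const unsigned char *s)
{
	if ((s[0] & 0x80) == 0)
		return s[0];
	if ((s[0] & 0xE0) == 0xC0)
		return ((uint32_t) (s[0] & 0x1F) << 6) |
			(uint32_t) (s[1] & 0x3F);
	if ((s[0] & 0xF0) == 0xE0)
		return ((uint32_t) (s[0] & 0x0F) << 12) |
			((uint32_t) (s[1] & 0x3F) << 6) |
			(uint32_t) (s[2] & 0x3F);
	return ((uint32_t) (s[0] & 0x07) << 18) |
		((uint32_t) (s[1] & 0x3F) << 12) |
		((uint32_t) (s[2] & 0x3F) << 6) |
		(uint32_t) (s[3] & 0x3F);
}

/*
 * Example per-character loop: count code points outside ASCII.
 * Because the helpers above are static inline, the compiler can fold
 * them into this loop instead of making two calls per character.
 */
size_t
sk_count_non_ascii(const unsigned char *str, size_t len)
{
	size_t		count = 0;
	size_t		i = 0;

	while (i < len)
	{
		int			n = sk_utf8_len(str[i]);

		/* Stop rather than read past the end of a truncated sequence. */
		if (i + (size_t) n > len)
			break;
		if (sk_utf8_to_cp(str + i) >= 128)
			count++;
		i += (size_t) n;
	}
	return count;
}
```

Any other per-character loop built on the same helpers would benefit in the same way, which is why I'd expect gains outside case mapping too.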
Regards,
Jeff Davis