Re: Speed up ICU case conversion by using ucasemap_utf8To*() - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Speed up ICU case conversion by using ucasemap_utf8To*()
Msg-id sfxay6t5adv57wdhp2idqi4pyuftbiqbilgbt2643dgymt2mmu@46uxujkk3un7
In response to Re: Speed up ICU case conversion by using ucasemap_utf8To*()  (vignesh C <vignesh21@gmail.com>)
Responses Re: Speed up ICU case conversion by using ucasemap_utf8To*()
List pgsql-hackers
On 2025-03-17 12:16:11 +0530, vignesh C wrote:
> On Fri, 20 Dec 2024 at 10:50, Andreas Karlsson <andreas@proxel.se> wrote:
> >
> > Hi,
> >
> > Jeff pointed out to me that the case conversion functions in ICU have
> > UTF-8 specific versions which means we can call those directly if the
> > database encoding is UTF-8 and skip having to convert to and from UChar.
> >
> > Since most people today run their databases in UTF-8 I think this
> > optimization is worth it and when measuring on short to medium length
> > strings I got a 15-20% speed up. It is still slower than glibc in my
> > benchmarks but the gap is smaller now.
> >
> > SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE
> > "sv-SE-x-icu") FROM generate_series(1, 1000000) i);
> >
> > master:  ~540 ms
> > Patched: ~460 ms
> > glibc:   ~410 ms
> >
> > I have also attached a clean up patch for the non-UTF-8 code paths. I
> > thought about doing the same for the new UTF-8 code paths but it turned
> > out to be a bit messy due to different function signatures for
> > ucasemap_utf8ToUpper() and ucasemap_utf8ToLower() vs ucasemap_utf8ToTitle().
> 
> I noticed that Jeff's comments from [1] have not yet been addressed, so I
> have changed the commitfest entry status to "Waiting on Author".
> Please address them and update it to "Needs Review".
> [1] - https://www.postgresql.org/message-id/72c7c2b5848da44caddfe0f20f6c7ebc7c0c6e60.camel@j-davis.com

It's also worth noting that this patch hasn't been building for quite a while
(at least not since 2025-01-29):

https://cirrus-ci.com/task/5621435164524544?logs=build#L1228
[17:17:51.214] ld: error: undefined symbol: icu_convert_case
[17:17:51.214] >>> referenced by pg_locale_icu.c:484 (../src/backend/utils/adt/pg_locale_icu.c:484)
[17:17:51.214] >>>               src/backend/postgres_lib.a.p/utils_adt_pg_locale_icu.c.o:(strfold_icu)
[17:17:51.214] cc: error: linker command failed with exit code 1 (use -v to see invocation)

I think we can mark this as returned-with-feedback for now?

Greetings,

Andres Freund


