Re: Use CASEFOLD() internally rather than LOWER() - Mailing list pgsql-hackers

From Mark Dilger
Subject Re: Use CASEFOLD() internally rather than LOWER()
Date
Msg-id CAHgHdKtb2jD+DaTJU+3jnQRZ9hEXSDcPCR8DCCzZTTVeo4jQcA@mail.gmail.com
In response to Re: Use CASEFOLD() internally rather than LOWER()  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: Use CASEFOLD() internally rather than LOWER()
List pgsql-hackers


On Tue, Mar 24, 2026 at 4:07 PM Jeff Davis <pgsql@j-davis.com> wrote:
On Sat, 2026-03-21 at 20:14 -0700, Mark Dilger wrote:
> After v2-0001, ILIKE uses str_casefold() for matching, but pg_trgm still
> uses str_tolower() for trigram extraction (trgm_op.c:352 and :948).
> With builtin collations, these produce different results.

Interesting, thank you. As stated in the original message, I was unsure
about changing pg_trgm without adjusting the regex logic, also:

https://www.postgresql.org/message-id/64d7949bad90545f981ac7513fb0b4954daca2c9.camel@j-davis.com

do you have a suggestion about an easy way to do that, or should we
revisit in the next cycle?

pg_trgm appears to be lossy, with recheck logic. I would think you just need to make the index return a superset of what the regex would match, and then let the recheck prune that down. My concern is pg_trgm returning fewer than all the matching rows, so that after recheck you get fewer results than a seqscan would have returned. Would switching to casefold be strictly broader than regex? If so, you would just need to convert pg_trgm to use casefold and then rely on the recheck machinery.
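
To make the concern concrete, here is a toy Python sketch of a lossy trigram filter followed by a recheck. It is not pg_trgm's actual algorithm (no word padding, no GIN/GiST), and the names trigrams, index_candidates, recheck, seqscan, and index_scan are hypothetical. Python's str.lower() and str.casefold() only stand in for str_tolower() and str_casefold(); what PostgreSQL produces depends on the collation and provider. The point is just that when the filter and the matcher normalize differently, the filter can drop a row before recheck ever sees it.

```python
# Toy model of a lossy trigram index with recheck. All names here are
# hypothetical and this is NOT pg_trgm's real extraction logic.

def trigrams(text, normalize):
    """Return the set of 3-character substrings of the normalized text."""
    s = normalize(text)
    return {s[i:i + 3] for i in range(len(s) - 2)}

def index_candidates(rows, query, normalize):
    """Lossy filter: keep rows whose trigram set contains every query trigram."""
    wanted = trigrams(query, normalize)
    return [r for r in rows if wanted <= trigrams(r, normalize)]

def recheck(row, query):
    """Exact test, assumed here to be a casefold-based substring match."""
    return query.casefold() in row.casefold()

def seqscan(rows, query):
    """Reference answer: apply the exact test to every row."""
    return [r for r in rows if recheck(r, query)]

def index_scan(rows, query, normalize):
    """Filter with the trigram index, then prune false positives via recheck."""
    return [r for r in index_candidates(rows, query, normalize) if recheck(r, query)]

rows = ["STRASSE", "Strasbourg", "street"]
query = "straße"

print(seqscan(rows, query))                   # ['STRASSE']
print(index_scan(rows, query, str.lower))     # []  (row lost before recheck)
print(index_scan(rows, query, str.casefold))  # ['STRASSE']
```

With lower() the query keeps "ß" and yields the trigrams "raß" and "aße", which the stored row never contains, so the index silently returns fewer rows than the seqscan. Recheck can only remove candidates; it cannot bring back a row the filter already excluded.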

Sorry if this misses something discussed upthread. I'm assuming here that you don't mind that such a change would necessitate a REINDEX.

--

Mark Dilger
