Re: Add CASEFOLD() function. - Mailing list pgsql-hackers
From | Thom Brown |
---|---|
Subject | Re: Add CASEFOLD() function. |
Date | |
Msg-id | CAA-aLv7KLoT9yCdiJwRP9PeL_4yNTzQ3T8WJLbTtX=Ld45UOpg@mail.gmail.com |
In response to | Re: Add CASEFOLD() function. (Ian Lawrence Barwick <barwick@gmail.com>) |
Responses |
Re: Add CASEFOLD() function.
|
List | pgsql-hackers |
On Thu, 19 Jun 2025, 03:53 Jeff Davis, <pgsql@j-davis.com> wrote:
On Wed, 2025-06-18 at 19:09 +0200, Vik Fearing wrote:
> I don't know. I am just pointing out what the Standard says. I think
> we should either comply, or say that we don't do it for LOWER and UPPER
> so let's keep things implementation-consistent.
For the standard, I see two potential philosophies:
I. CASEFOLD() is another variant of LOWER()/UPPER(), and it should
preserve NFC in the same way.
II. CASEFOLD() is not like LOWER()/UPPER(); it returns a semi-opaque
text value that is useful for caseless matching, but should not
ordinarily be used for display or sent to the application (those things
would be allowed, just not encouraged). For normalization, either:
(A) Follow Unicode Default Caseless Matching (16.0 3.13.5 D144), and
don't require any kind of normalization; or
(B) Follow Unicode Canonical Caseless Matching (D145), and require
that the input and output are normalized appropriately, but leave the
precise normal form as implementation-defined.
The current implementation could either be seen as philosophy (I) where
we've chosen to ignore the normalization part for the sake of
consistency with LOWER()/UPPER(); or it could be seen as philosophy
(II)(A).
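To make the difference between (A) and (B) concrete, here is a minimal Python sketch of the two Unicode definitions. It is only an illustration: Python's str.casefold() stands in for toCasefold, the function names are made up for this example, and none of it reflects how PostgreSQL implements CASEFOLD().

```python
import unicodedata


def default_caseless_match(x: str, y: str) -> bool:
    # D144: compare the case-folded strings directly, with no normalization.
    return x.casefold() == y.casefold()


def canonical_caseless_match(x: str, y: str) -> bool:
    # D145: NFD(toCasefold(NFD(X))) == NFD(toCasefold(NFD(Y)))
    def key(s: str) -> str:
        nfd = unicodedata.normalize("NFD", s)
        return unicodedata.normalize("NFD", nfd.casefold())

    return key(x) == key(y)


precomposed = "\u00c5"   # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "a\u030a"   # "a" followed by COMBINING RING ABOVE

print(default_caseless_match(precomposed, decomposed))    # False
print(canonical_caseless_match(precomposed, decomposed))  # True
```

The two inputs are canonically equivalent apart from case, so only the definition that normalizes treats them as a match.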
> How much does it cost to check for NFC? I honestly don't know the
> answer to that question, but that is the only case where we need to
> maintain normalization.
I attached a very rough patch and ran a very simple test on strings
averaging 36 bytes in length, all already in NFC and the result is also
NFC. Before the patch, doing a CASEFOLD() on 10M tuples took about 3
seconds, afterward about 8.
There's a patch to optimize some of the normalization paths, which I
haven't had a chance to review yet. So those numbers might come down.
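As a rough illustration of why an NFC check is needed at all, the following Python sketch shows case folding turning an NFC input into a non-NFC result, and a conditional renormalization of the kind discussed above. It assumes Python 3.8+ for unicodedata.is_normalized(); the helper name is hypothetical, and this is not the attached patch.

```python
import unicodedata


def casefold_preserving_nfc(s: str) -> str:
    # Hypothetical helper: renormalize the folded text only when the
    # input was already NFC and folding broke that property.
    folded = s.casefold()
    if unicodedata.is_normalized("NFC", s) and not unicodedata.is_normalized(
        "NFC", folded
    ):
        folded = unicodedata.normalize("NFC", folded)
    return folded


s = "\u01f0"  # LATIN SMALL LETTER J WITH CARON, already in NFC

# Full case folding maps U+01F0 to "j" + U+030C, which NFC would recompose.
print(unicodedata.is_normalized("NFC", s))
print(unicodedata.is_normalized("NFC", s.casefold()))
print(unicodedata.is_normalized("NFC", casefold_preserving_nfc(s)))
```

The extra is_normalized() calls are the kind of per-string work that the timing above is measuring, although the numbers for a Python sketch say nothing about the C implementation.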
>
> It's not unconditionally, it's only if the input was NFC.
Optimizing the case where the input is _not_ NFC seems strange to me.
If we are normalizing the output, I'd say we should just make the
output always NFC. Being more strict, this seems likely to comply with
the eventual standard.
Additionally, if we are normalizing the output, then we should also do
the input fixup for U+0345, which would make the result usable for
Canonical Caseless Matching. Again, this seems likely to comply with
the eventual standard.
So I only see two reasonable implementations:
1. The current CASEFOLD() implementation.
2. Do the input fixup for U+0345 and unconditionally normalize the
output in NFC.
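Option 2 can be sketched in Python as follows. This is an approximation under stated assumptions: running NFD over the whole input is used here as a simple stand-in for a targeted U+0345 fixup, str.casefold() stands in for the folding step, and the function name is invented for the example.

```python
import unicodedata


def nfc_casefold(s: str) -> str:
    # Assumption: full NFD of the input replaces the narrower U+0345 fixup.
    # Canonical reordering moves U+0345 (ccc 240) after other combining
    # marks before it is folded to U+03B9, which is a starter.
    fixed = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFC", fixed.casefold())


# Two canonically equivalent spellings: alpha + ypogegrammeni + psili,
# and alpha + psili + ypogegrammeni.
x = "\u03b1\u0345\u0313"
y = "\u03b1\u0313\u0345"

print(x.casefold() == y.casefold())          # False: plain folding diverges
print(nfc_casefold(x) == nfc_casefold(y))    # True: fixup plus NFC agrees
```

Without the input step, U+0345 folds to a base letter in a different position in each string, and no later normalization can reconcile the two results.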
If there's a case to be made for both implementations, we could also
consider having two functions, say, CASEFOLD() for #1 and NCASEFOLD()
for #2. I'm not sure whether we'd want to standardize one or both of
those functions.
And if you think there's likely to be a collision with the standard
that's hard to anticipate and fix now, then we should consider
reverting CASEFOLD() for 18 and wait for more progress on the
standardization. What's the likelihood that the name changes or
something like that?
Late to the party, but is there an argument for porting this to the citext type, or for supplementing the extension with an additional type ("cftext"? *shrug*)? citext currently uses lower(), so our current recommendation for handling all Unicode characters is to use nondeterministic collations.
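For anyone wondering where lowercasing and case folding diverge, a small Python illustration follows. It shows the Unicode operations only, using Python's str.lower() and str.casefold(); it is not a demonstration of citext or of PostgreSQL's lower().

```python
a = "Straße"
b = "STRASSE"

# Lowercasing keeps U+00DF as is, so the two spellings stay different.
print(a.lower() == b.lower())        # False

# Full case folding maps U+00DF to "ss", so they compare equal.
print(a.casefold() == b.casefold())  # True
```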
Thom