Re: Add CASEFOLD() function. - Mailing list pgsql-hackers

From: Jeff Davis
Subject: Re: Add CASEFOLD() function.
Msg-id: 898752524b1ad658c5fdae789f823a0dc52e6171.camel@j-davis.com
In response to: Re: Add CASEFOLD() function. (Peter Eisentraut <peter@eisentraut.org>)
List: pgsql-hackers

On Thu, 2024-12-19 at 17:18 +0100, Peter Eisentraut wrote:
> Can you explain this in further detail?  I don't quite follow why this
> would be required.

I am unsure now.

My initial reasoning was based on the idea that users would want to use
CASEFOLD(t) in a unique expression index as an improvement over
LOWER(t). And if you do that, you'd be surprised if two canonically
equivalent strings both ended up in the index. I don't think that's a
huge problem, because in other contexts we leave it up to the user to
keep things normalized consistently, and a CHECK(t IS NFC NORMALIZED)
is a good way to do that.

But there's a problem: full case folding doesn't preserve the normal
form, so even if the input is NFC normalized, the output might not be.
If we solve this problem, then we can just say that CASEFOLD()
preserves the normal form, consistent with how the spec defines
LOWER()/UPPER(), and I think that would be the best outcome.
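
To make the problem concrete, here is a small sketch in Python rather
than SQL. It assumes that Python's str.casefold() applies Unicode full
case folding and uses the standard unicodedata module for
normalization; the helper name is_nfc is made up for illustration.

```python
import unicodedata


def is_nfc(s: str) -> bool:
    """True if s is already in Normalization Form C."""
    return unicodedata.normalize("NFC", s) == s


# U+01F0 LATIN SMALL LETTER J WITH CARON is a single precomposed code point.
original = "\u01f0"
folded = original.casefold()

print(is_nfc(original))                 # is the input in NFC?
print([hex(ord(c)) for c in folded])    # code points after folding
print(is_nfc(folded))                   # is the folded result still in NFC?
```

Full case folding maps U+01F0 to "j" followed by U+030C COMBINING
CARON, which NFC would recompose into U+01F0, so the folded string is
no longer in NFC even though the input was.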

I'm not sure if that problem is solvable, though: if the input string
is in both NFC and NFD (plain ASCII, for example), how do we know which
normal form to preserve?
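
The ambiguity can be sketched the same way, again in Python and again
assuming str.casefold() performs full case folding. The helper name
forms is hypothetical. It reports which of NFC and NFD a string
already satisfies.

```python
import unicodedata


def forms(s: str) -> set:
    """Return the subset of {"NFC", "NFD"} that s already satisfies."""
    return {f for f in ("NFC", "NFD") if unicodedata.normalize(f, s) == s}


ascii_only = "strasse"        # pure ASCII
upper_j_caron = "J\u030c"     # "J" + U+030C; no precomposed capital exists

print(forms(ascii_only))                 # which forms does ASCII satisfy?
print(forms(upper_j_caron))              # forms of the input
print(forms(upper_j_caron.casefold()))   # forms of the folded output
```

The second input is in both normal forms, but its folded value
("j" + U+030C) is only in NFD, because NFC would compose it into
U+01F0. Nothing in the input says which of the two forms the caller
meant.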

We could tell users to use an expression index on
NORMALIZE(CASEFOLD(t)) instead, but that feels like inefficient
boilerplate.
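
For illustration only, the following Python sketch mimics what such an
expression index would compute. index_key is a hypothetical stand-in
for NORMALIZE(CASEFOLD(t)) using NFC, and it assumes str.casefold()
applies full case folding; it is not how the server would implement
either function.

```python
import unicodedata


def index_key(t: str) -> str:
    """Hypothetical stand-in for NORMALIZE(CASEFOLD(t)), using NFC."""
    return unicodedata.normalize("NFC", t.casefold())


a = "\u0390"         # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
b = "\u03aa\u0301"   # GREEK CAPITAL LETTER IOTA WITH DIALYTIKA + COMBINING ACUTE

# Both inputs would pass a CHECK (t IS NFC NORMALIZED) constraint.
both_nfc = all(unicodedata.normalize("NFC", s) == s for s in (a, b))

print(both_nfc)
print(a.casefold() == b.casefold())    # folding alone: do the keys match?
print(index_key(a) == index_key(b))    # folding, then normalizing
```

Both inputs are NFC, yet folding alone yields two different (though
canonically equivalent) code point sequences, so a unique index on the
folded value would accept both rows. Normalizing after folding makes
the keys identical.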

>
> Another might be that's not entirely clear how this should work in
> encodings other than UTF-8.  For example, the normalized string might
> not be representable in the encoding.

That's a good point.
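
A rough sketch of the encoding concern, using Python's "latin-1" codec
as a stand-in for a LATIN1 database encoding. That stand-in is an
assumption made for illustration, and the helper name fits_latin1 is
hypothetical.

```python
import unicodedata


def fits_latin1(s: str) -> bool:
    """True if every character of s can be encoded as ISO 8859-1."""
    try:
        s.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


e_acute = "\u00e9"                                   # precomposed e-acute
decomposed = unicodedata.normalize("NFD", e_acute)   # "e" + U+0301

micro = "\u00b5"                  # MICRO SIGN
folded_micro = micro.casefold()   # U+03BC GREEK SMALL LETTER MU

print(fits_latin1(e_acute), fits_latin1(decomposed))
print(fits_latin1(micro), fits_latin1(folded_micro))
```

In both pairs the input is representable in ISO 8859-1 but the result
is not: NFD introduces U+0301, and case folding the micro sign yields
a Greek letter outside that character set.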

Regards,
    Jeff Davis



