Re: Add CASEFOLD() function. - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: Add CASEFOLD() function.
Msg-id a17da3ebe17101ef330434237d2fef1cd1d0f30e.camel@j-davis.com
In response to Re: Add CASEFOLD() function.  (Jeff Davis <pgsql@j-davis.com>)
List pgsql-hackers
On Thu, 2024-12-19 at 09:51 -0800, Jeff Davis wrote:
> But there's a problem: full case folding doesn't preserve the normal
> form, so even if the input is NFC normalized, the output might not
> be.
> If we solve this problem, then we can just say that CASEFOLD()
> preserves the normal form, consistently with how the spec defines
> LOWER()/UPPER(), and I think that would be the best outcome.
>
> I'm not sure if that problem is solvable, though, because what if the
> input string is in both NFC and NFD, how do we know which normal form
> to preserve?

The options, as I see them, are:

1. Normalize the output (either by using an extra parameter or just
always normalizing to NFC). As you said, the problem here is that the
encoding might not work with normalization. One solution might be that
CASEFOLD() only works in UTF8, like NORMALIZE().

2. Try to preserve normalization as long as the encoding supports it.
The problem here is that we don't know what normal form to preserve,
because the input string might be in both NFC and NFD. We could
document that it preserves NFC form iff the input is NFC.

3. Allow CASEFOLD() to break the normal form of the input string. The
problem here is that the user may be surprised that the output is not
normalized even when all of their data is normalized. It's not clear to
me whether the result still works for caseless matching -- it might, as
long as the strings being compared are in a consistent form, even if
not normalized.
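
To make the problem concrete, here is a small sketch in Python rather than
PostgreSQL code. It assumes that Python's str.casefold() (Unicode full case
folding) and the standard unicodedata module are a fair stand-in for what
CASEFOLD() would do; the helper name is_form is made up for illustration.
It shows an NFC input whose folded form is no longer NFC, and an input that
is in both NFC and NFD at once.

```python
import unicodedata


def is_form(form: str, s: str) -> bool:
    """Return True if s is already in the given normal form (hypothetical helper)."""
    return unicodedata.normalize(form, s) == s


# U+01F0 LATIN SMALL LETTER J WITH CARON: one precomposed code point, valid NFC.
nfc_input = "\u01f0"

# Full case folding maps it to "j" followed by U+030C COMBINING CARON.
folded = nfc_input.casefold()

# The folded string is decomposed, so it is no longer in NFC.
input_is_nfc = is_form("NFC", nfc_input)
folded_is_nfc = is_form("NFC", folded)

# A plain ASCII string is unchanged by both NFC and NFD, so the input alone
# does not say which normal form the caller wants preserved.
ambiguous = "abc"
ambiguous_is_both = is_form("NFC", ambiguous) and is_form("NFD", ambiguous)
```

Here input_is_nfc is true and folded_is_nfc is false, which is the surprise
described in option 3; ambiguous_is_both is true, which is the ambiguity
described in option 2.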

Out of those I think #1 is the most appealing. Most users, and
especially users that care about these edge cases enough to use Full
Case Folding, are almost certainly going to be on UTF8 anyway. It's
also the most user-friendly.
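
A rough sketch of what option #1 amounts to, again in Python and only as an
illustration under the same assumptions: fold first, then always normalize
the result to NFC. The name casefold_nfc is hypothetical, and the
UTF8-only restriction is not modeled here because Python strings are
already Unicode.

```python
import unicodedata


def casefold_nfc(s: str) -> str:
    """Hypothetical helper: apply full case folding, then normalize to NFC."""
    return unicodedata.normalize("NFC", s.casefold())


a = "\u01f0"   # precomposed small j with caron
b = "J\u030c"  # capital J followed by U+030C COMBINING CARON

# Both spellings produce the same key, and that key is in NFC.
key_a = casefold_nfc(a)
key_b = casefold_nfc(b)
keys_match = key_a == key_b
key_is_nfc = unicodedata.normalize("NFC", key_a) == key_a
```

With this approach a user whose data is all NFC gets NFC back, at the cost
of an extra normalization pass on every call.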

Regards,
    Jeff Davis



