Thread: Re: Add CASEFOLD() function.

Re: Add CASEFOLD() function.

From
Ian Lawrence Barwick
Date:
Hi

2024年12月12日(木) 18:00 Jeff Davis <pgsql@j-davis.com>:
>
> Unicode case folding is a way to convert a string to a canonical case
> for the purpose of case-insensitive matching.
>
> Users have long used LOWER() for that purpose, but there are a few edge
> case problems:
>
> * Some characters have more than two cased forms, such as "Σ" (U+03A3),
> which can be lowercased as "σ" (U+03C3) or "ς" (U+03C2). The CASEFOLD()
> function converts all cased forms of the character to "σ".
>
> * The character "İ" (U+0130, capital I with dot) is lowercased to "i",
> which can be a problem in locales that don't expect that.
>
> * If new lower case characters are added to Unicode, the results of
> LOWER() may change.
>
> The CASEFOLD() function solves these problems.
>
> Patch attached.

I took a quick look at this as it sounds useful for the described issue,
and it seems to work as advertised, except the function is named "FOLDCASE()"
in the patch, so I'm wondering which is intended? A quick search indicates
there are no functions of either name in other databases; Python has a
"casefold()"
function [1] and PHP a "foldCase()" function [2], so it doesn't seem there's a
de-facto standard for this.

[1] https://docs.python.org/3/library/stdtypes.html#str.casefold
[2] https://www.php.net/manual/en/intlchar.foldcase.php

Regards


Ian Barwick



Re: Add CASEFOLD() function.

From
Jeff Davis
Date:
On Thu, 2024-12-12 at 21:52 +0900, Ian Lawrence Barwick wrote:
> and it seems to work as advertised, except the function is named
> "FOLDCASE()"
> in the patch, so I'm wondering which is intended?

Thank you for looking into this, I went back and forth on the name, and
mistyped it a few times.

ICU seems to use "foldcase":

https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ucasemap_8h.html

and it seems to be slightly more proper grammatically (with "fold"
being the verb). However, "case folding" is common terminology in
Postgres and Unicode, so "casefold" can be seen as the verb instead.

I don't have a strong opinion here so I will just go with whatever
seems like the popular choice.

Regards,
    Jeff Davis




Re: Add CASEFOLD() function.

From
Joe Conway
Date:
On 12/12/24 13:30, Jeff Davis wrote:
> On Thu, 2024-12-12 at 21:52 +0900, Ian Lawrence Barwick wrote:
>> and it seems to work as advertised, except the function is named
>> "FOLDCASE()"
>> in the patch, so I'm wondering which is intended?
> 
> Thank you for looking into this, I went back and forth on the name, and
> mistyped it a few times.
> 
> ICU seems to use "foldcase":
> 
> https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ucasemap_8h.html
> 
> and it seems to be slightly more proper grammatically (with "fold"
> being the verb). However, "case folding" is common terminology in
> Postgres and Unicode, so "casefold" can be seen as the verb instead.
> 
> I don't have a strong opinion here so I will just go with whatever
> seems like the popular choice.


FWIW I prefer casefold()

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Add CASEFOLD() function.

From
Joe Conway
Date:
On 12/16/24 12:49, Jeff Davis wrote:
> One question I have is whether we want this function to normalize the
> output.
> 
> I believe most usecases would want the output normalized, because
> normalization differences (e.g. "a" U+0061 followed by "combining
> acute" U+0301 vs "a with acute" U+00E1) are more minor than differences
> in case.
> 
> Of course, a user could wrap it with the normalize() function, but
> that's verbose and easy to forget. I'm also not sure that it can be
> made as fast as a combined function that does both.

Perhaps a one arg version that always casefolds and a two arg version 
that accepts nfc, nfd, none (or something similar)?

> And a follow-up question: if it does normalize, the second parameter
> would be the requested normal form. But to accept the keyword forms
> (NFC, NFD in gram.y) rather than the string forms ('NFC', 'NFD') then
> we'd need to also need to add CASEFOLD to gram.y (like NORMALIZE). Is
> that a reasonable thing to do?

SQL 2023 seems to include the NORMALIZE syntax, but the only case 
folding considered is UPPER and LOWER. As such, I think it ought to be a 
function but not part of the grammar.

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Add CASEFOLD() function.

From
Jeff Davis
Date:
On Mon, 2024-12-16 at 16:27 -0500, Joe Conway wrote:
>
> SQL 2023 seems to include the NORMALIZE syntax, but the only case
> folding considered is UPPER and LOWER. As such, I think it ought to
> be a
> function but not part of the grammar.

Should the standard support something like the Unicode idea of case
folding? If so, do we need to be careful of conflicts?

Regards,
    Jeff Davis




Re: Add CASEFOLD() function.

From
Peter Eisentraut
Date:
On 16.12.24 18:49, Jeff Davis wrote:
> One question I have is whether we want this function to normalize the
> output.
> 
> I believe most usecases would want the output normalized, because
> normalization differences (e.g. "a" U+0061 followed by "combining
> acute" U+0301 vs "a with acute" U+00E1) are more minor than differences
> in case.

Can you explain this in further detail?  I don't quite follow why this 
would be required.

> Of course, a user could wrap it with the normalize() function, but
> that's verbose and easy to forget. I'm also not sure that it can be
> made as fast as a combined function that does both.
> 
> And a follow-up question: if it does normalize, the second parameter
> would be the requested normal form. But to accept the keyword forms
> (NFC, NFD in gram.y) rather than the string forms ('NFC', 'NFD') then
> we'd need to also need to add CASEFOLD to gram.y (like NORMALIZE). Is
> that a reasonable thing to do?

That's maybe one reason to keep it separate.

Another might be that's not entirely clear how this should work in 
encodings other than UTF-8.  For example, the normalized string might 
not be representable in the encoding.




Re: Add CASEFOLD() function.

From
Jeff Davis
Date:
On Thu, 2024-12-19 at 17:18 +0100, Peter Eisentraut wrote:
> Can you explain this in further detail?  I don't quite follow why
> this
> would be required.

I am unsure now.

My initial reasoning was based on the idea that users would want to use
CASEFOLD(t) in a unique expression index as an improvement over
LOWER(t). And if you do that, you'd be surprised if some equivalent
strings ended up in the index. I don't think that's a huge problem,
because in other contexts we leave it up to the user to keep things
normalized consistently, and a CHECK(t IS NFC NORMALIZED) is a good way
to do that.

But there's a problem: full case folding doesn't preserve the normal
form, so even if the input is NFC normalized, the output might not be.
If we solve this problem, then we can just say that CASEFOLD()
preserves the normal form, consistently with how the spec defines
LOWER()/UPPER(), and I think that would be the best outcome.

I'm not sure if that problem is solvable, though, because what if the
input string is in both NFC and NFD, how do we know which normal form
to preserve?

We could tell users to use an expression index on
NORMALIZE(CASEFOLD(t)) instead, but that feels like inefficient
boilerplate.

>
> Another might be that's not entirely clear how this should work in
> encodings other than UTF-8.  For example, the normalized string might
> not be representable in the encoding.

That's a good point.

Regards,
    Jeff Davis




Re: Add CASEFOLD() function.

From
Jeff Davis
Date:
On Thu, 2024-12-19 at 09:51 -0800, Jeff Davis wrote:
> But there's a problem: full case folding doesn't preserve the normal
> form, so even if the input is NFC normalized, the output might not
> be.
> If we solve this problem, then we can just say that CASEFOLD()
> preserves the normal form, consistently with how the spec defines
> LOWER()/UPPER(), and I think that would be the best outcome.
>
> I'm not sure if that problem is solvable, though, because what if the
> input string is in both NFC and NFD, how do we know which normal form
> to preserve?

The options as I see it are:

1. Normalize the output (either by using an extra parameter or just
always normalizing to NFC). As you said, the problem here is that the
encoding might not work with normalization. One solution might be that
CASEFOLD() only works in UTF8, like NORMALIZE().

2. Try to preserve normalization as long as the encoding supports it.
The problem here is that we don't know what normal form to preserve,
because the input string might be in both NFC and NFD. We could
document that it preserves NFC form iff the input is NFC.

3. Allow CASEFOLD() to break the normal form of the input string. The
problem here is that the user may be surprised that the output is not
normalized even when all of their data is normalized. It's not clear to
me whether it still works for caseless matching -- it might if the
string is in a consistent form, even if not normalized.

Out of those I think #1 is the most appealing. Most users, and
especially users that care about these edge cases enough to use Full
Case Folding, are almost certainly going to be on UTF8 anyway. It's
also the most user-friendly.

Regards,
    Jeff Davis




Re: Add CASEFOLD() function.

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> v6 attached. I plan to commit this soon.

The documentation for this function is giving the PDF docs build
indigestion:

[WARN] FOUserAgent - Glyph "?" (0x3a3, Sigma) not available in font "Courier".
[WARN] FOUserAgent - Glyph "?" (0x3c3, sigma) not available in font "Courier".
[WARN] FOUserAgent - Glyph "?" (0x3c2, sigma1) not available in font "Courier".

Found characters that cannot be output in the PDF document;  see README.non-ASCII

Not sure about a good workaround for this.  Are there any characters
within LATIN-1 that have interesting case-folding behavior?

            regards, tom lane



Re: Add CASEFOLD() function.

From
Jeff Davis
Date:
On Sat, 2025-01-25 at 00:00 -0500, Tom Lane wrote:
> Found characters that cannot be output in the PDF document;  see
> README.non-ASCII

Thank you, fixed.

> Not sure about a good workaround for this.  Are there any characters
> within LATIN-1 that have interesting case-folding behavior?

I just removed that example. There's already another example using ß
(U+00DF), though that only applies to PG_UNICODE_FAST (the new
collation that performs full case mapping and now full case folding).

Regards,
    Jeff Davis