Thread: Does UCS_BASIC have the right CTYPE?
UCS_BASIC is defined in the standard as a collation based on comparing the code point values, and in UTF8 that is satisfied with memcmp(), so the collation locale for UCS_BASIC in Postgres is simply "C".

But what should the result of UPPER('á' COLLATE UCS_BASIC) be? In Postgres, the answer is 'á', but intuitively, one could reasonably expect the answer to be 'Á'.

Reading the standard, it seems that LOWER()/UPPER() are defined in terms of the Unicode General Category (Section 4.2, "<fold> is a pair of functions..."). It is somewhat ambiguous about the case mappings, but I could guess that it means the Default Case Algorithm[1].

That seems to suggest the standard answer should be 'Á' regardless of any COLLATE clause (though I could be misreading). I'm a bit confused by that... what's the standard-compatible way to specify the locale for UPPER()/LOWER()? If there is none, then it makes sense that Postgres overloads the COLLATE clause for that purpose so that users can use a different locale if they want.

But given that UCS_BASIC is a collation specified in the standard, shouldn't it have ctype behavior that's as close to the standard as possible?

Regards,
Jeff Davis

[1] https://www.unicode.org/versions/Unicode15.1.0/ch03.pdf#G33992
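A minimal sketch of the two observations in this message, written in Python rather than SQL or C. It checks that byte-wise comparison of UTF-8 (what memcmp() sees) agrees with code point order, and it contrasts an ASCII-only uppercase with a Unicode-aware one. The helper `ascii_upper` is a hypothetical stand-in for "C" ctype semantics, not Postgres code.

```python
# Sketch only: illustrates the claims above; none of this is Postgres code.

words = ["z", "á", "a", "Á", "€", "😀", "B"]

# Python compares str values code point by code point.
by_code_point = sorted(words)

# Comparing the encoded bytes is what memcmp() would do on UTF-8 strings.
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))

# UTF-8 preserves code point order under byte-wise comparison.
assert by_code_point == by_utf8_bytes


def ascii_upper(s: str) -> str:
    """Hypothetical stand-in for "C" ctype: only a-z are mapped."""
    return "".join(chr(ord(c) - 32) if "a" <= c <= "z" else c for c in s)


print(ascii_upper("á"))  # the non-ASCII letter is left unchanged
print("á".upper())       # Unicode-aware mapping uppercases it
```

The first half is why "C" is a sufficient collation locale for UCS_BASIC in a UTF8 database; the second half is the ctype gap the message asks about.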
On 25.10.23 20:32, Jeff Davis wrote:
> But what should the result of UPPER('á' COLLATE UCS_BASIC) be? In
> Postgres, the answer is 'á', but intuitively, one could reasonably
> expect the answer to be 'Á'.

I think that's right. But what would you put into ctype to make that happen?

> That seems to suggest the standard answer should be 'Á' regardless of
> any COLLATE clause (though I could be misreading). I'm a bit confused
> by that... what's the standard-compatible way to specify the locale for
> UPPER()/LOWER()? If there is none, then it makes sense that Postgres
> overloads the COLLATE clause for that purpose so that users can use a
> different locale if they want.

The standard doesn't have the notion of locale-dependent case conversion.
On Thu, 2023-10-26 at 16:49 +0200, Peter Eisentraut wrote:
> On 25.10.23 20:32, Jeff Davis wrote:
> > But what should the result of UPPER('á' COLLATE UCS_BASIC) be? In
> > Postgres, the answer is 'á', but intuitively, one could reasonably
> > expect the answer to be 'Á'.
>
> I think that's right. But what would you put into ctype to make that
> happen?

It looks like using Unicode files to implement upper()/lower()/initcap() behavior would not be very hard. The Default Case Algorithm only needs a simple mapping for toUppercase() and toLowercase().

Our initcap() is not defined in the standard, and we document that it only differentiates between alphanumeric and non-alphanumeric characters, so we could get that behavior pretty easily as well. If we wanted to do it the Unicode way instead, we can follow the toTitlecase() part of the Default Case Algorithm, which is based on word breaks and would require another lookup table for that.

I've already posted a patch that includes a lookup table for the General Category, so that could be used for the rest of the ctype behavior.

Doing ctype based on built-in Unicode tables would be a good use case for the "builtin" provider that I had previously proposed.

> The standard doesn't have the notion of locale-dependent case
> conversion.

That explains it. Interesting.

Regards,
Jeff Davis
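To make the "simple mapping" idea concrete, here is a rough Python sketch of one-to-one case mapping plus a General Category lookup. It is only an approximation: Python's `str.upper()`/`str.lower()` apply the full mappings, so the hypothetical helpers below keep a character unchanged whenever the full mapping expands to more than one code point. That differs from the simple mappings in UnicodeData.txt for some code points, and it is not the proposed patch.

```python
import unicodedata


def simple_upper(ch: str) -> str:
    """Approximate a one-to-one uppercase mapping for a single code point.

    Assumption: if the full mapping expands (e.g. 'ß' -> 'SS'), keep the
    character as-is instead of consulting the real simple-mapping table.
    """
    mapped = ch.upper()
    return mapped if len(mapped) == 1 else ch


def simple_lower(ch: str) -> str:
    """Same approximation for lowercase."""
    mapped = ch.lower()
    return mapped if len(mapped) == 1 else ch


def to_uppercase(s: str) -> str:
    # Map each code point independently; no locale, no context.
    return "".join(simple_upper(c) for c in s)


def to_lowercase(s: str) -> str:
    return "".join(simple_lower(c) for c in s)


def is_alpha(ch: str) -> bool:
    # General Category values starting with "L" are letters (Lu, Ll, Lt, Lm, Lo).
    return unicodedata.category(ch).startswith("L")


print(to_uppercase("á"))
print(to_uppercase("straße"))
print(unicodedata.category("á"), unicodedata.category("Á"))
```

Because each code point maps to exactly one code point, the result always has the same number of characters as the input, which is part of what keeps this approach simple.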
On Thu, 2023-10-26 at 09:21 -0700, Jeff Davis wrote:
> Our initcap() is not defined in the standard, and we document that it
> only differentiates between alphanumeric and non-alphanumeric
> characters, so we could get that behavior pretty easily as well. If we
> wanted to do it the Unicode way instead, we can follow the
> toTitlecase() part of the Default Case Algorithm, which is based on
> word breaks and would require another lookup table for that.

Correction: the rules for word breaks are fairly complex, so it would not be worth it to try to replicate that just to support initcap(). We could just use the simple, existing, and documented rules for initcap() which only differentiate between alphanumeric and not. Anyone who wants the more sophisticated rules can just use an ICU collation with initcap().

The point stands that it would be pretty simple to have a collation that handles upper() and lower() in a standards-compliant way without relying on libc or ICU. Unfortunately it's too late to call that collation UCS_BASIC, but it would still be useful.

Regards,
Jeff Davis
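The documented initcap() rule (words are runs of alphanumeric characters separated by non-alphanumeric ones) can be sketched in a few lines of Python. This is an illustration of the rule only, not the Postgres implementation; it relies on Python's own `isalnum()`, `upper()` and `lower()`, which is an assumption about how characters are classified and mapped.

```python
def initcap(s: str) -> str:
    """Uppercase the first character of each alphanumeric run, lowercase the rest."""
    out = []
    in_word = False  # True while the previous character was alphanumeric
    for ch in s:
        if ch.isalnum():
            out.append(ch.lower() if in_word else ch.upper())
            in_word = True
        else:
            # Any non-alphanumeric character ends the current word.
            out.append(ch)
            in_word = False
    return "".join(out)


print(initcap("hi THOMAS"))
print(initcap("o'neil-smith 2nd"))
```

No word-break table is needed: a single boolean tracks whether the previous character was alphanumeric, which is why this variant is so much cheaper than the toTitlecase() approach.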
Peter Eisentraut wrote:
> > That seems to suggest the standard answer should be 'Á' regardless of
> > any COLLATE clause (though I could be misreading). I'm a bit confused
> > by that... what's the standard-compatible way to specify the locale for
> > UPPER()/LOWER()? If there is none, then it makes sense that Postgres
> > overloads the COLLATE clause for that purpose so that users can use a
> > different locale if they want.
>
> The standard doesn't have the notion of locale-dependent case conversion.

Neither does Unicode, which is why the ICU functions like u_isupper() or u_toupper() don't take a locale argument.

With libc, isupper_l() and the other ctype functions need a locale argument, but given a locale's value of "language[_territory][.codeset]", in theory only the codeset part is actually useful.

To me the question of what we should put in pg_collation.collctype for the "ucs_basic" collation leads to another question which is: why do we even consider collctype in the first place?

Within a database, there's only one "codeset", which corresponds to pg_database.encoding, and there's a value in pg_database.lc_ctype that is normally compatible with that encoding.

ISTM that UPPER(string COLLATE "whatever") should always give the same result as UPPER(string COLLATE pg_catalog.default). And likewise all functions that depend on character categories could basically ignore the COLLATE specification, given that our database-wide properties are sufficient to characterize the strings within.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
"Daniel Verite" <daniel@manitou-mail.org> writes:
> To me the question of what we should put in pg_collation.collctype
> for the "ucs_basic" collation leads to another question which is:
> why do we even consider collctype in the first place?

For starters, C locale should certainly act different from others.

I'm not sold that arguing from Unicode's behavior to other encodings makes sense, either. Unicode can get away with defining that there's only one case-folding rule because they have the luxury of inventing new code points when the "same" glyph should act differently according to different languages' rules. Encodings with a small number of code points don't have that luxury. In particular see the mess around dotted and dotless I/J in Turkish vs. everywhere else.

regards, tom lane
On Thu, 2023-10-26 at 23:22 +0200, Daniel Verite wrote:
> Neither does Unicode, which is why the ICU functions like u_isupper()
> or u_toupper() don't take a locale argument.

u_strToUpper() accepts a locale argument:

https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ustring_8h.html#aa64fbd4ad23af84d01c931d7cfa25f89

See also the part about tailorings here:

https://www.unicode.org/versions/Unicode15.1.0/ch03.pdf#G33992

Regards,
Jeff Davis
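As a rough illustration of what a language tailoring changes, the Python sketch below contrasts the default mapping of i/I with a Turkish-style one. The functions `upper_tr` and `lower_tr` are hypothetical and cover only the dotted/dotless I pair; they ignore the combining-dot cases and are not a reimplementation of u_strToUpper().

```python
DOTTED_CAPITAL_I = "\u0130"   # İ LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"    # ı LATIN SMALL LETTER DOTLESS I


def upper_tr(s: str) -> str:
    """Hypothetical Turkish-style uppercase: i -> İ, ı -> I, rest as default."""
    return s.replace("i", DOTTED_CAPITAL_I).replace(DOTLESS_SMALL_I, "I").upper()


def lower_tr(s: str) -> str:
    """Hypothetical Turkish-style lowercase: İ -> i, I -> ı, rest as default."""
    return s.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I).lower()


# Default (untailored) behavior: i and I are a case pair.
print("istanbul".upper())
print("ISPARTA".lower())

# Tailored behavior: dotted and dotless forms are kept apart.
print(upper_tr("istanbul"))
print(lower_tr("ISPARTA"))
```

The same input string produces different results depending on the language, which is the case a locale argument exists to handle.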
On Thu, 2023-10-26 at 17:32 -0400, Tom Lane wrote:
> For starters, C locale should certainly act different from others.

Agreed. ctype of "C" is 100% stable (as implemented in Postgres with special ASCII-only semantics) and simple.

I'm looking for a way to offer a new middle ground between plain "C" and buying into all of the problems with collation providers and localization. We don't need to remove functionality to do so.

Providing Unicode ctype behavior doesn't look very hard. Collations could select it either with a special name or by using the "builtin" provider I proposed earlier. If the behavior does change with a new Unicode version it would be easier to see and less likely to affect on-disk structures than a collation change.

Regards,
Jeff Davis