Thread: Re: Unicode full case mapping: PG_UNICODE_FAST, and standard-compliant UCS_BASIC
Re: Unicode full case mapping: PG_UNICODE_FAST, and standard-compliant UCS_BASIC
From
Noah Misch
Date:
On Fri, Jan 17, 2025 at 04:06:20PM -0800, Jeff Davis wrote:
> Committed 0001 and 0002.
>
> Upon reviewing the discussion threads, I removed the Unicode "adjust to
> Cased" behavior when titlecasing. As Peter pointed out[1], it doesn't
> match the documentation or expectations for INITCAP().

While commit d3d0983 changed most of the non-test pg_u_*() "bool posix"
arguments, it left a pg_u_isalnum(u, true) in strtitle_builtin() subroutine
initcap_wbnext().  The above paragraph may or may not be saying that's
intentional.  Example of the consequence at non-ASCII decimal digits:

SELECT
  str, re,
  regexp_count(str COLLATE pg_c_utf8, re) AS count_c_utf8,
  regexp_count(str COLLATE pg_unicode_fast, re) AS count_unicode_fast,
  regexp_count(str COLLATE unicode, re) AS count_unicode,
  initcap(str COLLATE pg_c_utf8) AS initcap_c_utf8,
  initcap(str COLLATE pg_unicode_fast) AS initcap_unicode_fast,
  initcap(str COLLATE unicode) AS initcap_unicode
FROM
  (VALUES (U&'foo\0661bar baz')) AS str_t(str),
  (VALUES ('[[:digit:]]')) AS re_t(re)
ORDER BY 1, 2;

str                  │ foo١bar baz
re                   │ [[:digit:]]
count_c_utf8         │ 0
count_unicode_fast   │ 1
count_unicode        │ 1
initcap_c_utf8       │ Foo١Bar Baz
initcap_unicode_fast │ Foo١Bar Baz
initcap_unicode      │ Foo١bar Baz

Should initcap_wbnext() pass in a locale-dependent "bool posix" argument like
the other calls the commit changed?  Related message from the development of
pg_c_utf8, which you shared downthread:
https://www.postgresql.org/message-id/610d7f1b-c68c-4eb8-a03d-1515da304c58%40manitou-mail.org

Long-term, pg_u_isword() should have a "bool posix" argument.  Currently, only
tests call that function.  If it got a non-test caller,
https://www.unicode.org/reports/tr18/#word would have pg_u_isword() follow the
choice of posix compatibility like pg_u_isalnum() does.
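The effect of the hard-coded flag can be reproduced outside the server. The following is a minimal Python sketch, not PostgreSQL code: `is_alnum()` and `initcap()` are hypothetical stand-ins for pg_u_isalnum() and the word-boundary loop, `str.isalpha()` only approximates the Unicode Alphabetic property, and `upper()` stands in for titlecasing. It assumes that the only difference between the two modes is whether non-ASCII decimal digits count as alphanumeric.

```python
import unicodedata


def is_alnum(ch: str, posix: bool) -> bool:
    """Hypothetical stand-in for pg_u_isalnum(u, posix)."""
    if ch.isalpha():  # approximation of the Alphabetic property
        return True
    if posix:
        # POSIX-compatible mode: only ASCII digits count as digits.
        return "0" <= ch <= "9"
    # Unicode mode: any decimal digit (general category Nd) counts.
    return unicodedata.category(ch) == "Nd"


def initcap(s: str, posix: bool) -> str:
    """Uppercase the first character of each alnum run, lowercase the rest."""
    out = []
    in_word = False
    for ch in s:
        if is_alnum(ch, posix):
            out.append(ch.lower() if in_word else ch.upper())
            in_word = True
        else:
            # A non-alnum character ends the current word.
            out.append(ch)
            in_word = False
    return "".join(out)


s = "foo\u0661bar baz"  # U+0661 is ARABIC-INDIC DIGIT ONE
print(initcap(s, posix=True))   # digit acts as a word boundary
print(initcap(s, posix=False))  # digit is part of the word
```

With `posix=True` the digit splits "foo" and "bar" into separate words, which matches the initcap_c_utf8 and initcap_unicode_fast columns above; with `posix=False` the string stays one word, as in the initcap_unicode column.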
Re: Unicode full case mapping: PG_UNICODE_FAST, and standard-compliant UCS_BASIC
From
Jeff Davis
Date:
On Thu, 2025-04-17 at 06:58 -0700, Noah Misch wrote:
> Should initcap_wbnext() pass in a locale-dependent "bool posix"
> argument like the other calls the commit changed?

Yes, I believe you are correct. Patch and tests attached.

> Long-term, pg_u_isword() should have a "bool posix" argument.
> Currently, only tests call that function. If it got a non-test caller,
> https://www.unicode.org/reports/tr18/#word would have pg_u_isword()
> follow the choice of posix compatibility like pg_u_isalnum() does.

I based those functions on:

https://www.unicode.org/reports/tr18/#Compatibility_Properties

and the "word" class does not have a POSIX variant. But Postgres has
two documented definitions for "word" characters:

* for regexes, alnum + "_"
* for INITCAP(), just alnum

and the above definition doesn't match up with either one, which is why
we don't use it.

ICU INITCAP() uses the ICU definition of word boundaries, so it doesn't
match our documentation. We could adjust our documentation to allow for
provider-dependent definitions of word characters, which might be a
good idea. But that still doesn't quite capture ICU's more complex
definition of word boundaries.

Or, we could remove those unused functions for now, and figure out if
there's a reason to add them back later. They are probably adding more
confusion than anything.

Regards,
	Jeff Davis
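The mismatch between the three "word" definitions can be made concrete with a small Python sketch. This is illustrative only: the function names are hypothetical, `str.isalpha()` approximates the Alphabetic property, and the TR18 rule is written from my reading of the Compatibility Properties table (alpha, marks, decimal digits, connector punctuation, and the two join-control characters).

```python
import unicodedata


def is_alnum(ch: str) -> bool:
    """Approximate Unicode-mode alnum: alphabetic or decimal digit."""
    return ch.isalpha() or unicodedata.category(ch) == "Nd"


def is_word_regex(ch: str) -> bool:
    """Documented regex definition: alnum plus underscore."""
    return is_alnum(ch) or ch == "_"


def is_word_initcap(ch: str) -> bool:
    """Documented INITCAP() definition: just alnum."""
    return is_alnum(ch)


def is_word_tr18(ch: str) -> bool:
    """Assumed TR18 'word' class: alpha, Mark, Nd, Pc, Join_Control."""
    cat = unicodedata.category(ch)
    return (
        ch.isalpha()
        or cat.startswith("M")          # combining marks
        or cat == "Nd"                  # decimal digits
        or cat == "Pc"                  # connector punctuation, includes "_"
        or ch in ("\u200c", "\u200d")   # ZWNJ and ZWJ
    )


for ch in ["a", "_", "\u203f", "\u0301"]:
    print(
        f"U+{ord(ch):04X}",
        is_word_regex(ch),
        is_word_initcap(ch),
        is_word_tr18(ch),
    )
```

Under these assumptions, U+203F UNDERTIE and U+0301 COMBINING ACUTE ACCENT are word characters only under the TR18 rule, and "_" is a word character for regexes and TR18 but not for INITCAP(), so no single function satisfies both documented definitions.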