Re: Unicode full case mapping: PG_UNICODE_FAST, and standard-compliant UCS_BASIC - Mailing list pgsql-hackers

From Noah Misch
Subject Re: Unicode full case mapping: PG_UNICODE_FAST, and standard-compliant UCS_BASIC
Msg-id 20250417135841.33.nmisch@google.com
Responses Re: Unicode full case mapping: PG_UNICODE_FAST, and standard-compliant UCS_BASIC
List pgsql-hackers
On Fri, Jan 17, 2025 at 04:06:20PM -0800, Jeff Davis wrote:
> Committed 0001 and 0002.

> Upon reviewing the discussion threads, I removed the Unicode "adjust to
> Cased" behavior when titlecasing. As Peter pointed out[1], it doesn't
> match the documentation or expectations for INITCAP().

While commit d3d0983 changed most of the non-test pg_u_*() "bool posix"
arguments, it left a pg_u_isalnum(u, true) call in initcap_wbnext(), a
subroutine of strtitle_builtin().  The paragraph quoted above may or may not
be saying that's intentional.  Example of the consequence at non-ASCII
decimal digits:

SELECT
    str,
    re,
    regexp_count(str COLLATE pg_c_utf8, re) AS count_c_utf8,
    regexp_count(str COLLATE pg_unicode_fast, re) AS count_unicode_fast,
    regexp_count(str COLLATE unicode, re) AS count_unicode,
    initcap(str COLLATE pg_c_utf8) AS initcap_c_utf8,
    initcap(str COLLATE pg_unicode_fast) AS initcap_unicode_fast,
    initcap(str COLLATE unicode) AS initcap_unicode
FROM
    (VALUES (U&'foo\0661bar baz')) AS str_t(str),
    (VALUES ('[[:digit:]]')) AS re_t(re)
ORDER BY 1, 2;

str                  │ foo١bar baz
re                   │ [[:digit:]]
count_c_utf8         │ 0
count_unicode_fast   │ 1
count_unicode        │ 1
initcap_c_utf8       │ Foo١Bar Baz
initcap_unicode_fast │ Foo١Bar Baz
initcap_unicode      │ Foo١bar Baz
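
To illustrate why the "bool posix" argument moves the word boundary, here is a
minimal Python sketch of the logic, not the actual C code.  The names
is_alnum() and initcap_sketch() are hypothetical stand-ins for pg_u_isalnum()
and the strtitle_builtin() word-boundary loop, and str.isalpha() only
approximates the Unicode Alphabetic property.

```python
import unicodedata


def is_alnum(ch: str, posix: bool) -> bool:
    """Hypothetical stand-in for pg_u_isalnum(u, posix)."""
    if unicodedata.category(ch) == "Nd":
        # posix=True: only ASCII 0-9 count as digits.
        # posix=False: every Unicode decimal digit (category Nd) counts.
        return ch.isascii() if posix else True
    # Approximation: Python's isalpha() stands in for the Alphabetic property.
    return ch.isalpha()


def initcap_sketch(s: str, posix: bool) -> str:
    """Uppercase the first alnum character of each word, lowercase the rest."""
    out = []
    prev_alnum = False
    for ch in s:
        cur = is_alnum(ch, posix)
        if cur:
            # A word starts wherever an alnum character follows a non-alnum one.
            out.append(ch.lower() if prev_alnum else ch.upper())
        else:
            out.append(ch)
        prev_alnum = cur
    return "".join(out)


s = "foo\u0661bar baz"              # U+0661 ARABIC-INDIC DIGIT ONE
print(initcap_sketch(s, posix=True))   # U+0661 ends the word, so "bar" is capitalized
print(initcap_sketch(s, posix=False))  # U+0661 is part of the word, so "bar" stays lowercase
```

With posix=true the digit U+0661 is not alphanumeric, so it acts as a word
break; that matches the initcap_unicode_fast result above under the assumption
that the hard-coded true is what drives it.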

Should initcap_wbnext() pass in a locale-dependent "bool posix" argument like
the other calls the commit changed?  Related message from the development of
pg_c_utf8, which you shared downthread:
https://www.postgresql.org/message-id/610d7f1b-c68c-4eb8-a03d-1515da304c58%40manitou-mail.org


Long-term, pg_u_isword() should have a "bool posix" argument.  Currently, only
tests call that function.  If it got a non-test caller,
https://www.unicode.org/reports/tr18/#word would have pg_u_isword() follow the
choice of posix compatibility like pg_u_isalnum() does.


