Re: [18] Unintentional behavior change in commit e9931bfb75 - Mailing list pgsql-hackers
From | Noah Misch |
---|---|
Subject | Re: [18] Unintentional behavior change in commit e9931bfb75 |
Date | |
Msg-id | 20250414204408.5d.nmisch@google.com Whole thread Raw |
In response to | Re: [18] Unintentional behavior change in commit e9931bfb75 (Jeff Davis <pgsql@j-davis.com>) |
Responses |
Re: [18] Unintentional behavior change in commit e9931bfb75
|
List | pgsql-hackers |
On Mon, Apr 14, 2025 at 12:59:50PM -0700, Jeff Davis wrote: > On Sat, 2025-04-12 at 05:34 -0700, Noah Misch wrote: > > I think the code for (2) and for "I/i in Turkish" haven't returned. > > Given > > commit e3fa2b0 restored the v17 "I/i in Turkish" treatment for plain > > lower(), > > the regex code likely needs a similar restoration. If not, the regex > > comments > > would need to change to match the code. > > Great find, thank you! I'm curious how you came about this difference, > was it through testing or code inspection? Code inspection. I saw commit e9931bf remove "per comments above" references without editing the referenced "comments above". > --- a/src/backend/regex/regc_pg_locale.c > +++ b/src/backend/regex/regc_pg_locale.c > @@ -21,9 +21,10 @@ > #include "utils/pg_locale.h" > > /* > - * To provide as much functionality as possible on a variety of platforms, > - * without going so far as to implement everything from scratch, we use > - * several implementation strategies depending on the situation: > + * For the libc provider, to provide as much functionality as possible on a > + * variety of platforms without going so far as to implement everything from > + * scratch, we use several implementation strategies depending on the > + * situation: > * > * 1. In C/POSIX collations, we use hard-wired code. We can't depend on > * the <ctype.h> functions since those will obey LC_CTYPE. Note that these > @@ -33,8 +34,9 @@ > * > * 2a. When working in UTF8 encoding, we use the <wctype.h> functions. > * This assumes that every platform uses Unicode codepoints directly > - * as the wchar_t representation of Unicode. On some platforms > - * wchar_t is only 16 bits wide, so we have to punt for codepoints > 0xFFFF. > + * as the wchar_t representation of Unicode. (XXX: This could be a problem > + * for ICU in non-UTF8 encodings.) On some platforms wchar_t is only 16 bits > + * wide, so we have to punt for codepoints > 0xFFFF. > * > * 2b. In all other encodings, we use the <ctype.h> functions for pg_wchar > * values up to 255, and punt for values above that. This is 100% correct This leaves outdated material in the big comment edited here. Most prominent: * 2. In the "default" collation (which is supposed to obey LC_CTYPE): ... * 3. Here, we use the locale_t-extended forms of the <wctype.h> and <ctype.h> * functions, under exactly the same cases as #2. v18 regc_pg_locale.c uses only the locale_t-extended forms, and it aims not to depend on LC_CTYPE. v17 used both tolower() and tolower_l(), but v18 uses the latter only. > @@ -562,10 +564,16 @@ pg_wc_toupper(pg_wchar c) > case PG_REGEX_STRATEGY_BUILTIN: > return unicode_uppercase_simple(c); > case PG_REGEX_STRATEGY_LIBC_WIDE: > + /* force C behavior for ASCII characters, per comments above */ > + if (pg_regex_locale->is_default && c <= (pg_wchar) 127) > + return pg_ascii_toupper((unsigned char) c); v17 has distinct symbols PG_REGEX_LOCALE_WIDE (default-locale case) and PG_REGEX_LOCALE_WIDE_L (locale_t) case. Consolidating them looked attractive when this is_default case was going away. Now that is_default will remain a factor, I don't think overloading PG_REGEX_STRATEGY_LIBC_WIDE is working out well. The "_L" suffix no longer captures the distinction, since there's no case that uses tolower() as opposed to tolower_l(). However, I think v17's concept of separate PG_REGEX_ symbols for the default-locale case is still the right thing for v18. In other words, this code should change to look more like v17, with the removal of non-locale_t calls being the main change.
pgsql-hackers by date: