Re: [18] Unintentional behavior change in commit e9931bfb75 - Mailing list pgsql-hackers

From Noah Misch
Subject Re: [18] Unintentional behavior change in commit e9931bfb75
Date
Msg-id 20250414204408.5d.nmisch@google.com
Whole thread Raw
In response to Re: [18] Unintentional behavior change in commit e9931bfb75  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: [18] Unintentional behavior change in commit e9931bfb75
List pgsql-hackers
On Mon, Apr 14, 2025 at 12:59:50PM -0700, Jeff Davis wrote:
> On Sat, 2025-04-12 at 05:34 -0700, Noah Misch wrote:
> > I think the code for (2) and for "I/i in Turkish" haven't returned. 
> > Given
> > commit e3fa2b0 restored the v17 "I/i in Turkish" treatment for plain
> > lower(),
> > the regex code likely needs a similar restoration.  If not, the regex
> > comments
> > would need to change to match the code.
> 
> Great find, thank you! I'm curious how you came about this difference,
> was it through testing or code inspection?

Code inspection.  I saw commit e9931bf remove "per comments above" references
without editing the referenced "comments above".

> --- a/src/backend/regex/regc_pg_locale.c
> +++ b/src/backend/regex/regc_pg_locale.c
> @@ -21,9 +21,10 @@
>  #include "utils/pg_locale.h"
>  
>  /*
> - * To provide as much functionality as possible on a variety of platforms,
> - * without going so far as to implement everything from scratch, we use
> - * several implementation strategies depending on the situation:
> + * For the libc provider, to provide as much functionality as possible on a
> + * variety of platforms without going so far as to implement everything from
> + * scratch, we use several implementation strategies depending on the
> + * situation:
>   *
>   * 1. In C/POSIX collations, we use hard-wired code.  We can't depend on
>   * the <ctype.h> functions since those will obey LC_CTYPE.  Note that these
> @@ -33,8 +34,9 @@
>   *
>   * 2a. When working in UTF8 encoding, we use the <wctype.h> functions.
>   * This assumes that every platform uses Unicode codepoints directly
> - * as the wchar_t representation of Unicode.  On some platforms
> - * wchar_t is only 16 bits wide, so we have to punt for codepoints > 0xFFFF.
> + * as the wchar_t representation of Unicode.  (XXX: This could be a problem
> + * for ICU in non-UTF8 encodings.)  On some platforms wchar_t is only 16 bits
> + * wide, so we have to punt for codepoints > 0xFFFF.
>   *
>   * 2b. In all other encodings, we use the <ctype.h> functions for pg_wchar
>   * values up to 255, and punt for values above that.  This is 100% correct

This leaves outdated material in the big comment edited here.  Most prominent:

 * 2. In the "default" collation (which is supposed to obey LC_CTYPE):
...
 * 3. Here, we use the locale_t-extended forms of the <wctype.h> and <ctype.h>
 * functions, under exactly the same cases as #2.

v18 regc_pg_locale.c uses only the locale_t-extended forms, and it aims not to
depend on LC_CTYPE.  v17 used both tolower() and tolower_l(), but v18 uses the
latter only.

> @@ -562,10 +564,16 @@ pg_wc_toupper(pg_wchar c)
>          case PG_REGEX_STRATEGY_BUILTIN:
>              return unicode_uppercase_simple(c);
>          case PG_REGEX_STRATEGY_LIBC_WIDE:
> +            /* force C behavior for ASCII characters, per comments above */
> +            if (pg_regex_locale->is_default && c <= (pg_wchar) 127)
> +                return pg_ascii_toupper((unsigned char) c);

v17 has distinct symbols PG_REGEX_LOCALE_WIDE (default-locale case) and
PG_REGEX_LOCALE_WIDE_L (locale_t) case.  Consolidating them looked attractive
when this is_default case was going away.  Now that is_default will remain a
factor, I don't think overloading PG_REGEX_STRATEGY_LIBC_WIDE is working out
well.  The "_L" suffix no longer captures the distinction, since there's no
case that uses tolower() as opposed to tolower_l().  However, I think v17's
concept of separate PG_REGEX_ symbols for the default-locale case is still the
right thing for v18.  In other words, this code should change to look more
like v17, with the removal of non-locale_t calls being the main change.



pgsql-hackers by date:

Previous
From: Jacob Champion
Date:
Subject: Re: Adding support for SSLKEYLOGFILE in the frontend
Next
From: Ranier Vilela
Date:
Subject: Re: Fix a resource leak (src/backend/utils/adt/rowtypes.c)