Thread: [18] Unintentional behavior change in commit e9931bfb75

[18] Unintentional behavior change in commit e9931bfb75

From

Jeff Davis

Date:

03 December 2024, 00:17:14

Commit e9931bfb75 (version 18) contained an unexpected behavior change.
LOWER('I') returns:

  In the locale tr_TR.iso88599 (single byte encoding):
                              17      18
    default                    i       ı
    specified                  ı       ı

  In the locale tr_TR.utf8:
                              17      18
    default                    ı       ı
    specified                  ı       ı

(Look carefully to see the dotted vs dotless "i".)

The behavior is commented (commit 176d5bae1d) in formatting.c:

   * ...  When using the default
   * collation, we apply the traditional Postgres behavior that
   * forces ASCII-style treatment of I/i, but in non-default
   * collations you get exactly what the collation says.
   */
  for (p = result; *p; p++)
  {
      if (mylocale)
          *p = tolower_l((unsigned char) *p, mylocale->info.lt);
      else
          *p = pg_tolower((unsigned char) *p);
  }

That's a somewhat strange special case (along with similar ones for
INITCAP() and UPPER()) that applies to single-byte encodings with the
libc provider and the database default collation only. I assume it was
done for backwards compatibility?

My commit e9931bfb75 (version 18) unifies the code paths for default
and explicit collations, and in the process it eliminates the special
case, and always uses tolower_l() for single-byte libc (whether default
collation or not).

Should I put the special case back? If not, it could break expression
indexes on LOWER()/UPPER() after an upgrade for users with the database
default collation of tr_TR who use libc and a single-byte encoding. But
preserving the special case seems weirdly inconsistent -- surely the
results should not depend on the encoding, right?

Regards,
    Jeff Davis

Re: [18] Unintentional behavior change in commit e9931bfb75

From

Tom Lane

Date:

03 December 2024, 01:25:49

Jeff Davis <pgsql@j-davis.com> writes:
> The behavior is commented (commit 176d5bae1d) in formatting.c:

>    * ...  When using the default
>    * collation, we apply the traditional Postgres behavior that
>    * forces ASCII-style treatment of I/i, but in non-default
>    * collations you get exactly what the collation says.

> That's a somewhat strange special case (along with similar ones for
> INITCAP() and UPPER()) that applies to single-byte encodings with the
> libc provider and the database default collation only. I assume it was
> done for backwards compatibility?

Well, also for compatibility with our SQL parser's understanding
of identifier lowercasing.

> Should I put the special case back?

I think so.  It's stood for a lot of years now without field
complaints, and I'm fairly sure there *were* field complaints
before.  (I think that behavior long predates 176d5bae1d, which
was just restoring the status quo ante after somebody else's
overenthusiastic application of system locale infrastructure.)

            regards, tom lane

Re: [18] Unintentional behavior change in commit e9931bfb75

From

Jeff Davis

Date:

03 December 2024, 09:24:07

On Mon, 2024-12-02 at 17:25 -0500, Tom Lane wrote:
> Well, also for compatibility with our SQL parser's understanding
> of identifier lowercasing.

But why only for single-byte encodings? And why not for ICU?

> > Should I put the special case back?
>
> I think so.

Done. I put the special case back in (commit e3fa2b037c) because the
earlier commit wasn't intended to be a behavior change.

I'm still not convinced that the special case behavior is great, but a
lot of users are unaffected because they are on UTF8 anyway, so I'm
fine keeping it.

Regards,
    Jeff Davis

Re: [18] Unintentional behavior change in commit e9931bfb75

From

Tom Lane

Date:

03 December 2024, 09:32:47

Jeff Davis <pgsql@j-davis.com> writes:
> On Mon, 2024-12-02 at 17:25 -0500, Tom Lane wrote:
>> Well, also for compatibility with our SQL parser's understanding
>> of identifier lowercasing.

> But why only for single-byte encodings? And why not for ICU?

I think the not-for-utf8 exclusion was mostly because that was
how it was before, which was probably mostly historical accident.
(I do vaguely recall that there was discussion on the point, but
I'm too tired to go look in the archives for it.)

As for ICU, that didn't exist back then, and I'm not going to
defend whether it was a good idea for that code path to fail
to reproduce this behavior.

            regards, tom lane

Re: [18] Unintentional behavior change in commit e9931bfb75

From

Peter Eisentraut

Date:

04 December 2024, 20:14:48

On 02.12.24 23:25, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
>> The behavior is commented (commit 176d5bae1d) in formatting.c:
> 
>>     * ...  When using the default
>>     * collation, we apply the traditional Postgres behavior that
>>     * forces ASCII-style treatment of I/i, but in non-default
>>     * collations you get exactly what the collation says.
> 
>> That's a somewhat strange special case (along with similar ones for
>> INITCAP() and UPPER()) that applies to single-byte encodings with the
>> libc provider and the database default collation only. I assume it was
>> done for backwards compatibility?
> 
> Well, also for compatibility with our SQL parser's understanding
> of identifier lowercasing.

Maybe that was relevant before the "name" type got its own collation?

>> Should I put the special case back?
> 
> I think so.  It's stood for a lot of years now without field
> complaints, and I'm fairly sure there *were* field complaints
> before.  (I think that behavior long predates 176d5bae1d, which
> was just restoring the status quo ante after somebody else's
> overenthusiastic application of system locale infrastructure.)

I've been tempted several times recently to suggest that we should 
remove the separate libc-with-single-byte-encoding-but-not-C-locale code 
paths, because they have zero test coverage and probably about zero 
users.  But if those code paths actually have different semantics in 
some cases, then that will be difficult.  Just something to keep in mind.

Re: [18] Unintentional behavior change in commit e9931bfb75

From

Tom Lane

Date:

04 December 2024, 20:21:42

Peter Eisentraut <peter@eisentraut.org> writes:
> On 02.12.24 23:25, Tom Lane wrote:
>> Well, also for compatibility with our SQL parser's understanding
>> of identifier lowercasing.

> Maybe that was relevant before the "name" type got its own collation?

downcase_identifier doesn't give a fig about name's collation.

            regards, tom lane

Re: [18] Unintentional behavior change in commit e9931bfb75

From

Jeff Davis

Date:

12 December 2024, 22:19:01

On Wed, 2024-12-04 at 12:21 -0500, Tom Lane wrote:
> Peter Eisentraut <peter@eisentraut.org> writes:
> > On 02.12.24 23:25, Tom Lane wrote:
> > > Well, also for compatibility with our SQL parser's understanding
> > > of identifier lowercasing.
>
> > Maybe that was relevant before the "name" type got its own
> > collation?
>
> downcase_identifier doesn't give a fig about name's collation.

I'd like to understand better the relationship between the parser's
casefolding, object names in the catalog, the default collation, and
the default ctype.

The comment in downcase_identifier() says: "SQL99 specifies Unicode-
aware case normalization, which we don't yet have the infrastructure
for". The good news is that, with

https://commitfest.postgresql.org/51/5436/

we hopefully will have the infrastructure in place soon.

  (a) Do we want to use that infrastructure?
  (b) Do we want to use the default collation to provide the case
folding behavior, or do we want to always use the unicode behavior
that's built in to postgres?

I'm not quite sure where the single-byte encoding special case fits in
or how it helps. It seems like what it's really doing is locking the
database-wide ctype behavior down to either libc "C" or libc with any
non-Turkish-related locale. The reasoning behind this might answer
question (b) above, but I'm not sure which direction we should be
moving.

Thoughts?

Regards,
    Jeff Davis

Re: [18] Unintentional behavior change in commit e9931bfb75

From

Noah Misch

Date:

12 April, 15:34:30

On Mon, Dec 02, 2024 at 10:24:07PM -0800, Jeff Davis wrote:
> On Mon, 2024-12-02 at 17:25 -0500, Tom Lane wrote:

> > > Should I put the special case back?
> > 
> > I think so.
> 
> Done. I put the special case back in (commit e3fa2b037c) because the
> earlier commit wasn't intended to be a behavior change.

Commit e9931bf had also removed the corresponding regex special case:

> @@ -620,20 +545,6 @@ pg_wc_toupper(pg_wchar c)
>              return c;
>          case PG_REGEX_BUILTIN:
>              return unicode_uppercase_simple(c);
> -        case PG_REGEX_LOCALE_WIDE:
> -            /* force C behavior for ASCII characters, per comments above */
> -            if (c <= (pg_wchar) 127)
> -                return pg_ascii_toupper((unsigned char) c);
> -            if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
> -                return towupper((wint_t) c);

The "comments above" still exist:

 * 2. In the "default" collation (which is supposed to obey LC_CTYPE):
 *
 * 2a. When working in UTF8 encoding, we use the <wctype.h> functions.
 * This assumes that every platform uses Unicode codepoints directly
 * as the wchar_t representation of Unicode.  On some platforms
 * wchar_t is only 16 bits wide, so we have to punt for codepoints > 0xFFFF.
 *
 * 2b. In all other encodings, we use the <ctype.h> functions for pg_wchar
 * values up to 255, and punt for values above that.  This is 100% correct
 * only in single-byte encodings such as LATINn.  However, non-Unicode
 * multibyte encodings are mostly Far Eastern character sets for which the
 * properties being tested here aren't very relevant for higher code values
 * anyway.  The difficulty with using the <wctype.h> functions with
 * non-Unicode multibyte encodings is that we can have no certainty that
 * the platform's wchar_t representation matches what we do in pg_wchar
 * conversions.
 *
 * 3. Here, we use the locale_t-extended forms of the <wctype.h> and <ctype.h>
 * functions, under exactly the same cases as #2.
 *
 * There is one notable difference between cases 2 and 3: in the "default"
 * collation we force ASCII letters to follow ASCII upcase/downcase rules,
 * while in a non-default collation we just let the library functions do what
 * they will.  The case where this matters is treatment of I/i in Turkish,
 * and the behavior is meant to match the upper()/lower() SQL functions.

I think the code for (2) and for "I/i in Turkish" haven't returned.  Given
commit e3fa2b0 restored the v17 "I/i in Turkish" treatment for plain lower(),
the regex code likely needs a similar restoration.  If not, the regex comments
would need to change to match the code.

Re: [18] Unintentional behavior change in commit e9931bfb75

From

Jeff Davis

Date:

14 April, 22:59:50

On Sat, 2025-04-12 at 05:34 -0700, Noah Misch wrote:
> I think the code for (2) and for "I/i in Turkish" haven't returned. 
> Given
> commit e3fa2b0 restored the v17 "I/i in Turkish" treatment for plain
> lower(),
> the regex code likely needs a similar restoration.  If not, the regex
> comments
> would need to change to match the code.

Great find, thank you! I'm curious how you came about this difference,
was it through testing or code inspection?

Patch attached. I also updated the top of the comment so that it's
clear that it's referring to the libc provider specifically, and that
ICU still has an issue with non-UTF8 encodings.

Also, the force-to-ASCII-behavior special case is different for
pg_wc_tolower/uppper vs LOWER()/UPPER: the former depends only on
whether it's the default locale, whereas the latter depends on whether
it's the default locale and the encoding is single-byte. Therefore the
results in the tr_TR.UTF-8 locale for the libc provider are
inconsistent:

  => select 'i' ~* 'I', 'I' ~* 'i', lower('I') = 'i', upper('i') = 'I';
   ?column? | ?column? | ?column? | ?column?
  ----------+----------+----------+----------
   t        | t        | f        | f

That behavior goes back a long way, so I'm not suggesting that we
change it.

Regards,
    Jeff Davis

Attachment

v1-0001-Another-unintentional-behavior-change-in-commit-e.patch

Re: [18] Unintentional behavior change in commit e9931bfb75

From

Noah Misch

Date:

14 April, 23:44:08

On Mon, Apr 14, 2025 at 12:59:50PM -0700, Jeff Davis wrote:
> On Sat, 2025-04-12 at 05:34 -0700, Noah Misch wrote:
> > I think the code for (2) and for "I/i in Turkish" haven't returned. 
> > Given
> > commit e3fa2b0 restored the v17 "I/i in Turkish" treatment for plain
> > lower(),
> > the regex code likely needs a similar restoration.  If not, the regex
> > comments
> > would need to change to match the code.
> 
> Great find, thank you! I'm curious how you came about this difference,
> was it through testing or code inspection?

Code inspection.  I saw commit e9931bf remove "per comments above" references
without editing the referenced "comments above".

> --- a/src/backend/regex/regc_pg_locale.c
> +++ b/src/backend/regex/regc_pg_locale.c
> @@ -21,9 +21,10 @@
>  #include "utils/pg_locale.h"
>  
>  /*
> - * To provide as much functionality as possible on a variety of platforms,
> - * without going so far as to implement everything from scratch, we use
> - * several implementation strategies depending on the situation:
> + * For the libc provider, to provide as much functionality as possible on a
> + * variety of platforms without going so far as to implement everything from
> + * scratch, we use several implementation strategies depending on the
> + * situation:
>   *
>   * 1. In C/POSIX collations, we use hard-wired code.  We can't depend on
>   * the <ctype.h> functions since those will obey LC_CTYPE.  Note that these
> @@ -33,8 +34,9 @@
>   *
>   * 2a. When working in UTF8 encoding, we use the <wctype.h> functions.
>   * This assumes that every platform uses Unicode codepoints directly
> - * as the wchar_t representation of Unicode.  On some platforms
> - * wchar_t is only 16 bits wide, so we have to punt for codepoints > 0xFFFF.
> + * as the wchar_t representation of Unicode.  (XXX: This could be a problem
> + * for ICU in non-UTF8 encodings.)  On some platforms wchar_t is only 16 bits
> + * wide, so we have to punt for codepoints > 0xFFFF.
>   *
>   * 2b. In all other encodings, we use the <ctype.h> functions for pg_wchar
>   * values up to 255, and punt for values above that.  This is 100% correct

This leaves outdated material in the big comment edited here.  Most prominent:

 * 2. In the "default" collation (which is supposed to obey LC_CTYPE):
...
 * 3. Here, we use the locale_t-extended forms of the <wctype.h> and <ctype.h>
 * functions, under exactly the same cases as #2.

v18 regc_pg_locale.c uses only the locale_t-extended forms, and it aims not to
depend on LC_CTYPE.  v17 used both tolower() and tolower_l(), but v18 uses the
latter only.

> @@ -562,10 +564,16 @@ pg_wc_toupper(pg_wchar c)
>          case PG_REGEX_STRATEGY_BUILTIN:
>              return unicode_uppercase_simple(c);
>          case PG_REGEX_STRATEGY_LIBC_WIDE:
> +            /* force C behavior for ASCII characters, per comments above */
> +            if (pg_regex_locale->is_default && c <= (pg_wchar) 127)
> +                return pg_ascii_toupper((unsigned char) c);

v17 has distinct symbols PG_REGEX_LOCALE_WIDE (default-locale case) and
PG_REGEX_LOCALE_WIDE_L (locale_t) case.  Consolidating them looked attractive
when this is_default case was going away.  Now that is_default will remain a
factor, I don't think overloading PG_REGEX_STRATEGY_LIBC_WIDE is working out
well.  The "_L" suffix no longer captures the distinction, since there's no
case that uses tolower() as opposed to tolower_l().  However, I think v17's
concept of separate PG_REGEX_ symbols for the default-locale case is still the
right thing for v18.  In other words, this code should change to look more
like v17, with the removal of non-locale_t calls being the main change.

Re: [18] Unintentional behavior change in commit e9931bfb75

From

Jeff Davis

Date:

16 April, 02:08:55

On Mon, 2025-04-14 at 13:44 -0700, Noah Misch wrote:
> v18 regc_pg_locale.c uses only the locale_t-extended forms, and it
> aims not to
> depend on LC_CTYPE.  v17 used both tolower() and tolower_l(), but v18
> uses the
> latter only.

Fixed in v2-0001 by rewording the comment slightly.

>
> However, I think v17's
> concept of separate PG_REGEX_ symbols for the default-locale case is
> still the
> right thing for v18.  In other words, this code should change to look
> more
> like v17, with the removal of non-locale_t calls being the main
> change.

I tried that in v2-0003, but I think it ended up worse. Most
pg_wc_xyz() functions don't care if it's the default collation or not,
so there are a lot of duplicate cases.

The previous approach is still there as v2-0002.

Regrads,
    Jeff Davis

Attachment

Re: [18] Unintentional behavior change in commit e9931bfb75

From

Noah Misch

Date:

16 April, 04:02:13

On Tue, Apr 15, 2025 at 04:08:55PM -0700, Jeff Davis wrote:
> On Mon, 2025-04-14 at 13:44 -0700, Noah Misch wrote:
> > However, I think v17's
> > concept of separate PG_REGEX_ symbols for the default-locale case is
> > still the
> > right thing for v18.  In other words, this code should change to look
> > more
> > like v17, with the removal of non-locale_t calls being the main
> > change.
> 
> I tried that in v2-0003, but I think it ended up worse. Most
> pg_wc_xyz() functions don't care if it's the default collation or not,
> so there are a lot of duplicate cases.

I'd likely adopt 0003, but I'm fine if you stop at 0002.

> + * As a special case, in the "default" collation we force ASCII letters to

I'd change s/we force/(2) and (3) force/ to make explicit that this isn't
specific to (3).  That's all I would change in v2.