Thread: [18] Unintentional behavior change in commit e9931bfb75

[18] Unintentional behavior change in commit e9931bfb75

From
Jeff Davis
Date:
Commit e9931bfb75 (version 18) contained an unexpected behavior change.
LOWER('I') returns:

  In the locale tr_TR.iso88599 (single byte encoding):
                              17      18
    default                    i       ı
    specified                  ı       ı

  In the locale tr_TR.utf8:
                              17      18
    default                    ı       ı
    specified                  ı       ı

(Look carefully to see the dotted vs dotless "i".)

The behavior is commented (commit 176d5bae1d) in formatting.c:

   * ...  When using the default
   * collation, we apply the traditional Postgres behavior that
   * forces ASCII-style treatment of I/i, but in non-default
   * collations you get exactly what the collation says.
   */
  for (p = result; *p; p++)
  {
      if (mylocale)
          *p = tolower_l((unsigned char) *p, mylocale->info.lt);
      else
          *p = pg_tolower((unsigned char) *p);
  }

That's a somewhat strange special case (along with similar ones for
INITCAP() and UPPER()) that applies to single-byte encodings with the
libc provider and the database default collation only. I assume it was
done for backwards compatibility?

My commit e9931bfb75 (version 18) unifies the code paths for default
and explicit collations, and in the process it eliminates the special
case, and always uses tolower_l() for single-byte libc (whether default
collation or not).

Should I put the special case back? If not, it could break expression
indexes on LOWER()/UPPER() after an upgrade for users with the database
default collation of tr_TR who use libc and a single-byte encoding. But
preserving the special case seems weirdly inconsistent -- surely the
results should not depend on the encoding, right?

Regards,
    Jeff Davis




Re: [18] Unintentional behavior change in commit e9931bfb75

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> The behavior is commented (commit 176d5bae1d) in formatting.c:

>    * ...  When using the default
>    * collation, we apply the traditional Postgres behavior that
>    * forces ASCII-style treatment of I/i, but in non-default
>    * collations you get exactly what the collation says.

> That's a somewhat strange special case (along with similar ones for
> INITCAP() and UPPER()) that applies to single-byte encodings with the
> libc provider and the database default collation only. I assume it was
> done for backwards compatibility?

Well, also for compatibility with our SQL parser's understanding
of identifier lowercasing.

> Should I put the special case back?

I think so.  It's stood for a lot of years now without field
complaints, and I'm fairly sure there *were* field complaints
before.  (I think that behavior long predates 176d5bae1d, which
was just restoring the status quo ante after somebody else's
overenthusiastic application of system locale infrastructure.)

            regards, tom lane



Re: [18] Unintentional behavior change in commit e9931bfb75

From
Jeff Davis
Date:
On Mon, 2024-12-02 at 17:25 -0500, Tom Lane wrote:
> Well, also for compatibility with our SQL parser's understanding
> of identifier lowercasing.

But why only for single-byte encodings? And why not for ICU?

> > Should I put the special case back?
>
> I think so.

Done. I put the special case back in (commit e3fa2b037c) because the
earlier commit wasn't intended to be a behavior change.

I'm still not convinced that the special case behavior is great, but a
lot of users are unaffected because they are on UTF8 anyway, so I'm
fine keeping it.

Regards,
    Jeff Davis




Re: [18] Unintentional behavior change in commit e9931bfb75

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> On Mon, 2024-12-02 at 17:25 -0500, Tom Lane wrote:
>> Well, also for compatibility with our SQL parser's understanding
>> of identifier lowercasing.

> But why only for single-byte encodings? And why not for ICU?

I think the not-for-utf8 exclusion was mostly because that was
how it was before, which was probably mostly historical accident.
(I do vaguely recall that there was discussion on the point, but
I'm too tired to go look in the archives for it.)

As for ICU, that didn't exist back then, and I'm not going to
defend whether it was a good idea for that code path to fail
to reproduce this behavior.

            regards, tom lane



Re: [18] Unintentional behavior change in commit e9931bfb75

From
Peter Eisentraut
Date:
On 02.12.24 23:25, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
>> The behavior is commented (commit 176d5bae1d) in formatting.c:
> 
>>     * ...  When using the default
>>     * collation, we apply the traditional Postgres behavior that
>>     * forces ASCII-style treatment of I/i, but in non-default
>>     * collations you get exactly what the collation says.
> 
>> That's a somewhat strange special case (along with similar ones for
>> INITCAP() and UPPER()) that applies to single-byte encodings with the
>> libc provider and the database default collation only. I assume it was
>> done for backwards compatibility?
> 
> Well, also for compatibility with our SQL parser's understanding
> of identifier lowercasing.

Maybe that was relevant before the "name" type got its own collation?

>> Should I put the special case back?
> 
> I think so.  It's stood for a lot of years now without field
> complaints, and I'm fairly sure there *were* field complaints
> before.  (I think that behavior long predates 176d5bae1d, which
> was just restoring the status quo ante after somebody else's
> overenthusiastic application of system locale infrastructure.)

I've been tempted several times recently to suggest that we should 
remove the separate libc-with-single-byte-encoding-but-not-C-locale code 
paths, because they have zero test coverage and probably about zero 
users.  But if those code paths actually have different semantics in 
some cases, then that will be difficult.  Just something to keep in mind.




Re: [18] Unintentional behavior change in commit e9931bfb75

From
Tom Lane
Date:
Peter Eisentraut <peter@eisentraut.org> writes:
> On 02.12.24 23:25, Tom Lane wrote:
>> Well, also for compatibility with our SQL parser's understanding
>> of identifier lowercasing.

> Maybe that was relevant before the "name" type got its own collation?

downcase_identifier doesn't give a fig about name's collation.

            regards, tom lane



Re: [18] Unintentional behavior change in commit e9931bfb75

From
Jeff Davis
Date:
On Wed, 2024-12-04 at 12:21 -0500, Tom Lane wrote:
> Peter Eisentraut <peter@eisentraut.org> writes:
> > On 02.12.24 23:25, Tom Lane wrote:
> > > Well, also for compatibility with our SQL parser's understanding
> > > of identifier lowercasing.
>
> > Maybe that was relevant before the "name" type got its own
> > collation?
>
> downcase_identifier doesn't give a fig about name's collation.

I'd like to understand better the relationship between the parser's
casefolding, object names in the catalog, the default collation, and
the default ctype.

The comment in downcase_identifier() says: "SQL99 specifies Unicode-
aware case normalization, which we don't yet have the infrastructure
for". The good news is that, with

https://commitfest.postgresql.org/51/5436/

we hopefully will have the infrastructure in place soon.

  (a) Do we want to use that infrastructure?
  (b) Do we want to use the default collation to provide the case
folding behavior, or do we want to always use the unicode behavior
that's built in to postgres?

I'm not quite sure where the single-byte encoding special case fits in
or how it helps. It seems like what it's really doing is locking the
database-wide ctype behavior down to either libc "C" or libc with any
non-Turkish-related locale. The reasoning behind this might answer
question (b) above, but I'm not sure which direction we should be
moving.

Thoughts?

Regards,
    Jeff Davis