Thread: [18] Unintentional behavior change in commit e9931bfb75
Commit e9931bfb75 (version 18) contained an unexpected behavior change. LOWER('I') returns: In the locale tr_TR.iso88599 (single byte encoding): 17 18 default i ı specified ı ı In the locale tr_TR.utf8: 17 18 default ı ı specified ı ı (Look carefully to see the dotted vs dotless "i".) The behavior is commented (commit 176d5bae1d) in formatting.c: * ... When using the default * collation, we apply the traditional Postgres behavior that * forces ASCII-style treatment of I/i, but in non-default * collations you get exactly what the collation says. */ for (p = result; *p; p++) { if (mylocale) *p = tolower_l((unsigned char) *p, mylocale->info.lt); else *p = pg_tolower((unsigned char) *p); } That's a somewhat strange special case (along with similar ones for INITCAP() and UPPER()) that applies to single-byte encodings with the libc provider and the database default collation only. I assume it was done for backwards compatibility? My commit e9931bfb75 (version 18) unifies the code paths for default and explicit collations, and in the process it eliminates the special case, and always uses tolower_l() for single-byte libc (whether default collation or not). Should I put the special case back? If not, it could break expression indexes on LOWER()/UPPER() after an upgrade for users with the database default collation of tr_TR who use libc and a single-byte encoding. But preserving the special case seems weirdly inconsistent -- surely the results should not depend on the encoding, right? Regards, Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > The behavior is commented (commit 176d5bae1d) in formatting.c: > * ... When using the default > * collation, we apply the traditional Postgres behavior that > * forces ASCII-style treatment of I/i, but in non-default > * collations you get exactly what the collation says. > That's a somewhat strange special case (along with similar ones for > INITCAP() and UPPER()) that applies to single-byte encodings with the > libc provider and the database default collation only. I assume it was > done for backwards compatibility? Well, also for compatibility with our SQL parser's understanding of identifier lowercasing. > Should I put the special case back? I think so. It's stood for a lot of years now without field complaints, and I'm fairly sure there *were* field complaints before. (I think that behavior long predates 176d5bae1d, which was just restoring the status quo ante after somebody else's overenthusiastic application of system locale infrastructure.) regards, tom lane
On Mon, 2024-12-02 at 17:25 -0500, Tom Lane wrote: > Well, also for compatibility with our SQL parser's understanding > of identifier lowercasing. But why only for single-byte encodings? And why not for ICU? > > Should I put the special case back? > > I think so. Done. I put the special case back in (commit e3fa2b037c) because the earlier commit wasn't intended to be a behavior change. I'm still not convinced that the special case behavior is great, but a lot of users are unaffected because they are on UTF8 anyway, so I'm fine keeping it. Regards, Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > On Mon, 2024-12-02 at 17:25 -0500, Tom Lane wrote: >> Well, also for compatibility with our SQL parser's understanding >> of identifier lowercasing. > But why only for single-byte encodings? And why not for ICU? I think the not-for-utf8 exclusion was mostly because that was how it was before, which was probably mostly historical accident. (I do vaguely recall that there was discussion on the point, but I'm too tired to go look in the archives for it.) As for ICU, that didn't exist back then, and I'm not going to defend whether it was a good idea for that code path to fail to reproduce this behavior. regards, tom lane
On 02.12.24 23:25, Tom Lane wrote: > Jeff Davis <pgsql@j-davis.com> writes: >> The behavior is commented (commit 176d5bae1d) in formatting.c: > >> * ... When using the default >> * collation, we apply the traditional Postgres behavior that >> * forces ASCII-style treatment of I/i, but in non-default >> * collations you get exactly what the collation says. > >> That's a somewhat strange special case (along with similar ones for >> INITCAP() and UPPER()) that applies to single-byte encodings with the >> libc provider and the database default collation only. I assume it was >> done for backwards compatibility? > > Well, also for compatibility with our SQL parser's understanding > of identifier lowercasing. Maybe that was relevant before the "name" type got its own collation? >> Should I put the special case back? > > I think so. It's stood for a lot of years now without field > complaints, and I'm fairly sure there *were* field complaints > before. (I think that behavior long predates 176d5bae1d, which > was just restoring the status quo ante after somebody else's > overenthusiastic application of system locale infrastructure.) I've been tempted several times recently to suggest that we should remove the separate libc-with-single-byte-encoding-but-not-C-locale code paths, because they have zero test coverage and probably about zero users. But if those code paths actually have different semantics in some cases, then that will be difficult. Just something to keep in mind.
Peter Eisentraut <peter@eisentraut.org> writes: > On 02.12.24 23:25, Tom Lane wrote: >> Well, also for compatibility with our SQL parser's understanding >> of identifier lowercasing. > Maybe that was relevant before the "name" type got its own collation? downcase_identifier doesn't give a fig about name's collation. regards, tom lane
On Wed, 2024-12-04 at 12:21 -0500, Tom Lane wrote: > Peter Eisentraut <peter@eisentraut.org> writes: > > On 02.12.24 23:25, Tom Lane wrote: > > > Well, also for compatibility with our SQL parser's understanding > > > of identifier lowercasing. > > > Maybe that was relevant before the "name" type got its own > > collation? > > downcase_identifier doesn't give a fig about name's collation. I'd like to understand better the relationship between the parser's casefolding, object names in the catalog, the default collation, and the default ctype. The comment in downcase_identifier() says: "SQL99 specifies Unicode- aware case normalization, which we don't yet have the infrastructure for". The good news is that, with https://commitfest.postgresql.org/51/5436/ we hopefully will have the infrastructure in place soon. (a) Do we want to use that infrastructure? (b) Do we want to use the default collation to provide the case folding behavior, or do we want to always use the unicode behavior that's built in to postgres? I'm not quite sure where the single-byte encoding special case fits in or how it helps. It seems like what it's really doing is locking the database-wide ctype behavior down to either libc "C" or libc with any non-Turkish-related locale. The reasoning behind this might answer question (b) above, but I'm not sure which direction we should be moving. Thoughts? Regards, Jeff Davis
On Mon, Dec 02, 2024 at 10:24:07PM -0800, Jeff Davis wrote: > On Mon, 2024-12-02 at 17:25 -0500, Tom Lane wrote: > > > Should I put the special case back? > > > > I think so. > > Done. I put the special case back in (commit e3fa2b037c) because the > earlier commit wasn't intended to be a behavior change. Commit e9931bf had also removed the corresponding regex special case: > @@ -620,20 +545,6 @@ pg_wc_toupper(pg_wchar c) > return c; > case PG_REGEX_BUILTIN: > return unicode_uppercase_simple(c); > - case PG_REGEX_LOCALE_WIDE: > - /* force C behavior for ASCII characters, per comments above */ > - if (c <= (pg_wchar) 127) > - return pg_ascii_toupper((unsigned char) c); > - if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF) > - return towupper((wint_t) c); The "comments above" still exist: * 2. In the "default" collation (which is supposed to obey LC_CTYPE): * * 2a. When working in UTF8 encoding, we use the <wctype.h> functions. * This assumes that every platform uses Unicode codepoints directly * as the wchar_t representation of Unicode. On some platforms * wchar_t is only 16 bits wide, so we have to punt for codepoints > 0xFFFF. * * 2b. In all other encodings, we use the <ctype.h> functions for pg_wchar * values up to 255, and punt for values above that. This is 100% correct * only in single-byte encodings such as LATINn. However, non-Unicode * multibyte encodings are mostly Far Eastern character sets for which the * properties being tested here aren't very relevant for higher code values * anyway. The difficulty with using the <wctype.h> functions with * non-Unicode multibyte encodings is that we can have no certainty that * the platform's wchar_t representation matches what we do in pg_wchar * conversions. * * 3. Here, we use the locale_t-extended forms of the <wctype.h> and <ctype.h> * functions, under exactly the same cases as #2. * * There is one notable difference between cases 2 and 3: in the "default" * collation we force ASCII letters to follow ASCII upcase/downcase rules, * while in a non-default collation we just let the library functions do what * they will. The case where this matters is treatment of I/i in Turkish, * and the behavior is meant to match the upper()/lower() SQL functions. I think the code for (2) and for "I/i in Turkish" haven't returned. Given commit e3fa2b0 restored the v17 "I/i in Turkish" treatment for plain lower(), the regex code likely needs a similar restoration. If not, the regex comments would need to change to match the code.
On Sat, 2025-04-12 at 05:34 -0700, Noah Misch wrote: > I think the code for (2) and for "I/i in Turkish" haven't returned. > Given > commit e3fa2b0 restored the v17 "I/i in Turkish" treatment for plain > lower(), > the regex code likely needs a similar restoration. If not, the regex > comments > would need to change to match the code. Great find, thank you! I'm curious how you came about this difference, was it through testing or code inspection? Patch attached. I also updated the top of the comment so that it's clear that it's referring to the libc provider specifically, and that ICU still has an issue with non-UTF8 encodings. Also, the force-to-ASCII-behavior special case is different for pg_wc_tolower/uppper vs LOWER()/UPPER: the former depends only on whether it's the default locale, whereas the latter depends on whether it's the default locale and the encoding is single-byte. Therefore the results in the tr_TR.UTF-8 locale for the libc provider are inconsistent: => select 'i' ~* 'I', 'I' ~* 'i', lower('I') = 'i', upper('i') = 'I'; ?column? | ?column? | ?column? | ?column? ----------+----------+----------+---------- t | t | f | f That behavior goes back a long way, so I'm not suggesting that we change it. Regards, Jeff Davis
Attachment
On Mon, Apr 14, 2025 at 12:59:50PM -0700, Jeff Davis wrote: > On Sat, 2025-04-12 at 05:34 -0700, Noah Misch wrote: > > I think the code for (2) and for "I/i in Turkish" haven't returned. > > Given > > commit e3fa2b0 restored the v17 "I/i in Turkish" treatment for plain > > lower(), > > the regex code likely needs a similar restoration. If not, the regex > > comments > > would need to change to match the code. > > Great find, thank you! I'm curious how you came about this difference, > was it through testing or code inspection? Code inspection. I saw commit e9931bf remove "per comments above" references without editing the referenced "comments above". > --- a/src/backend/regex/regc_pg_locale.c > +++ b/src/backend/regex/regc_pg_locale.c > @@ -21,9 +21,10 @@ > #include "utils/pg_locale.h" > > /* > - * To provide as much functionality as possible on a variety of platforms, > - * without going so far as to implement everything from scratch, we use > - * several implementation strategies depending on the situation: > + * For the libc provider, to provide as much functionality as possible on a > + * variety of platforms without going so far as to implement everything from > + * scratch, we use several implementation strategies depending on the > + * situation: > * > * 1. In C/POSIX collations, we use hard-wired code. We can't depend on > * the <ctype.h> functions since those will obey LC_CTYPE. Note that these > @@ -33,8 +34,9 @@ > * > * 2a. When working in UTF8 encoding, we use the <wctype.h> functions. > * This assumes that every platform uses Unicode codepoints directly > - * as the wchar_t representation of Unicode. On some platforms > - * wchar_t is only 16 bits wide, so we have to punt for codepoints > 0xFFFF. > + * as the wchar_t representation of Unicode. (XXX: This could be a problem > + * for ICU in non-UTF8 encodings.) On some platforms wchar_t is only 16 bits > + * wide, so we have to punt for codepoints > 0xFFFF. > * > * 2b. In all other encodings, we use the <ctype.h> functions for pg_wchar > * values up to 255, and punt for values above that. This is 100% correct This leaves outdated material in the big comment edited here. Most prominent: * 2. In the "default" collation (which is supposed to obey LC_CTYPE): ... * 3. Here, we use the locale_t-extended forms of the <wctype.h> and <ctype.h> * functions, under exactly the same cases as #2. v18 regc_pg_locale.c uses only the locale_t-extended forms, and it aims not to depend on LC_CTYPE. v17 used both tolower() and tolower_l(), but v18 uses the latter only. > @@ -562,10 +564,16 @@ pg_wc_toupper(pg_wchar c) > case PG_REGEX_STRATEGY_BUILTIN: > return unicode_uppercase_simple(c); > case PG_REGEX_STRATEGY_LIBC_WIDE: > + /* force C behavior for ASCII characters, per comments above */ > + if (pg_regex_locale->is_default && c <= (pg_wchar) 127) > + return pg_ascii_toupper((unsigned char) c); v17 has distinct symbols PG_REGEX_LOCALE_WIDE (default-locale case) and PG_REGEX_LOCALE_WIDE_L (locale_t) case. Consolidating them looked attractive when this is_default case was going away. Now that is_default will remain a factor, I don't think overloading PG_REGEX_STRATEGY_LIBC_WIDE is working out well. The "_L" suffix no longer captures the distinction, since there's no case that uses tolower() as opposed to tolower_l(). However, I think v17's concept of separate PG_REGEX_ symbols for the default-locale case is still the right thing for v18. In other words, this code should change to look more like v17, with the removal of non-locale_t calls being the main change.
On Mon, 2025-04-14 at 13:44 -0700, Noah Misch wrote: > v18 regc_pg_locale.c uses only the locale_t-extended forms, and it > aims not to > depend on LC_CTYPE. v17 used both tolower() and tolower_l(), but v18 > uses the > latter only. Fixed in v2-0001 by rewording the comment slightly. > > However, I think v17's > concept of separate PG_REGEX_ symbols for the default-locale case is > still the > right thing for v18. In other words, this code should change to look > more > like v17, with the removal of non-locale_t calls being the main > change. I tried that in v2-0003, but I think it ended up worse. Most pg_wc_xyz() functions don't care if it's the default collation or not, so there are a lot of duplicate cases. The previous approach is still there as v2-0002. Regrads, Jeff Davis
Attachment
On Tue, Apr 15, 2025 at 04:08:55PM -0700, Jeff Davis wrote: > On Mon, 2025-04-14 at 13:44 -0700, Noah Misch wrote: > > However, I think v17's > > concept of separate PG_REGEX_ symbols for the default-locale case is > > still the > > right thing for v18. In other words, this code should change to look > > more > > like v17, with the removal of non-locale_t calls being the main > > change. > > I tried that in v2-0003, but I think it ended up worse. Most > pg_wc_xyz() functions don't care if it's the default collation or not, > so there are a lot of duplicate cases. I'd likely adopt 0003, but I'm fine if you stop at 0002. > + * As a special case, in the "default" collation we force ASCII letters to I'd change s/we force/(2) and (3) force/ to make explicit that this isn't specific to (3). That's all I would change in v2.