Thread: [18] Unintentional behavior change in commit e9931bfb75
Commit e9931bfb75 (version 18) contained an unexpected behavior change. LOWER('I') returns: In the locale tr_TR.iso88599 (single byte encoding): 17 18 default i ı specified ı ı In the locale tr_TR.utf8: 17 18 default ı ı specified ı ı (Look carefully to see the dotted vs dotless "i".) The behavior is commented (commit 176d5bae1d) in formatting.c: * ... When using the default * collation, we apply the traditional Postgres behavior that * forces ASCII-style treatment of I/i, but in non-default * collations you get exactly what the collation says. */ for (p = result; *p; p++) { if (mylocale) *p = tolower_l((unsigned char) *p, mylocale->info.lt); else *p = pg_tolower((unsigned char) *p); } That's a somewhat strange special case (along with similar ones for INITCAP() and UPPER()) that applies to single-byte encodings with the libc provider and the database default collation only. I assume it was done for backwards compatibility? My commit e9931bfb75 (version 18) unifies the code paths for default and explicit collations, and in the process it eliminates the special case, and always uses tolower_l() for single-byte libc (whether default collation or not). Should I put the special case back? If not, it could break expression indexes on LOWER()/UPPER() after an upgrade for users with the database default collation of tr_TR who use libc and a single-byte encoding. But preserving the special case seems weirdly inconsistent -- surely the results should not depend on the encoding, right? Regards, Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > The behavior is commented (commit 176d5bae1d) in formatting.c: > * ... When using the default > * collation, we apply the traditional Postgres behavior that > * forces ASCII-style treatment of I/i, but in non-default > * collations you get exactly what the collation says. > That's a somewhat strange special case (along with similar ones for > INITCAP() and UPPER()) that applies to single-byte encodings with the > libc provider and the database default collation only. I assume it was > done for backwards compatibility? Well, also for compatibility with our SQL parser's understanding of identifier lowercasing. > Should I put the special case back? I think so. It's stood for a lot of years now without field complaints, and I'm fairly sure there *were* field complaints before. (I think that behavior long predates 176d5bae1d, which was just restoring the status quo ante after somebody else's overenthusiastic application of system locale infrastructure.) regards, tom lane
On Mon, 2024-12-02 at 17:25 -0500, Tom Lane wrote: > Well, also for compatibility with our SQL parser's understanding > of identifier lowercasing. But why only for single-byte encodings? And why not for ICU? > > Should I put the special case back? > > I think so. Done. I put the special case back in (commit e3fa2b037c) because the earlier commit wasn't intended to be a behavior change. I'm still not convinced that the special case behavior is great, but a lot of users are unaffected because they are on UTF8 anyway, so I'm fine keeping it. Regards, Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > On Mon, 2024-12-02 at 17:25 -0500, Tom Lane wrote: >> Well, also for compatibility with our SQL parser's understanding >> of identifier lowercasing. > But why only for single-byte encodings? And why not for ICU? I think the not-for-utf8 exclusion was mostly because that was how it was before, which was probably mostly historical accident. (I do vaguely recall that there was discussion on the point, but I'm too tired to go look in the archives for it.) As for ICU, that didn't exist back then, and I'm not going to defend whether it was a good idea for that code path to fail to reproduce this behavior. regards, tom lane
On 02.12.24 23:25, Tom Lane wrote: > Jeff Davis <pgsql@j-davis.com> writes: >> The behavior is commented (commit 176d5bae1d) in formatting.c: > >> * ... When using the default >> * collation, we apply the traditional Postgres behavior that >> * forces ASCII-style treatment of I/i, but in non-default >> * collations you get exactly what the collation says. > >> That's a somewhat strange special case (along with similar ones for >> INITCAP() and UPPER()) that applies to single-byte encodings with the >> libc provider and the database default collation only. I assume it was >> done for backwards compatibility? > > Well, also for compatibility with our SQL parser's understanding > of identifier lowercasing. Maybe that was relevant before the "name" type got its own collation? >> Should I put the special case back? > > I think so. It's stood for a lot of years now without field > complaints, and I'm fairly sure there *were* field complaints > before. (I think that behavior long predates 176d5bae1d, which > was just restoring the status quo ante after somebody else's > overenthusiastic application of system locale infrastructure.) I've been tempted several times recently to suggest that we should remove the separate libc-with-single-byte-encoding-but-not-C-locale code paths, because they have zero test coverage and probably about zero users. But if those code paths actually have different semantics in some cases, then that will be difficult. Just something to keep in mind.
Peter Eisentraut <peter@eisentraut.org> writes: > On 02.12.24 23:25, Tom Lane wrote: >> Well, also for compatibility with our SQL parser's understanding >> of identifier lowercasing. > Maybe that was relevant before the "name" type got its own collation? downcase_identifier doesn't give a fig about name's collation. regards, tom lane
On Wed, 2024-12-04 at 12:21 -0500, Tom Lane wrote: > Peter Eisentraut <peter@eisentraut.org> writes: > > On 02.12.24 23:25, Tom Lane wrote: > > > Well, also for compatibility with our SQL parser's understanding > > > of identifier lowercasing. > > > Maybe that was relevant before the "name" type got its own > > collation? > > downcase_identifier doesn't give a fig about name's collation. I'd like to understand better the relationship between the parser's casefolding, object names in the catalog, the default collation, and the default ctype. The comment in downcase_identifier() says: "SQL99 specifies Unicode- aware case normalization, which we don't yet have the infrastructure for". The good news is that, with https://commitfest.postgresql.org/51/5436/ we hopefully will have the infrastructure in place soon. (a) Do we want to use that infrastructure? (b) Do we want to use the default collation to provide the case folding behavior, or do we want to always use the unicode behavior that's built in to postgres? I'm not quite sure where the single-byte encoding special case fits in or how it helps. It seems like what it's really doing is locking the database-wide ctype behavior down to either libc "C" or libc with any non-Turkish-related locale. The reasoning behind this might answer question (b) above, but I'm not sure which direction we should be moving. Thoughts? Regards, Jeff Davis