Thread: Multi-byte character case-folding
Hi,

At the moment, only single-byte characters in identifiers are case-folded, and multi-byte characters are not.

For example, abĉDĚF is case-folded to "abĉdĚf". This can be referred to as "abĉdĚf" or "ABĉDĚF", but not "abĉděf" or "ABĈDĚF".

downcase_identifier() has the following comment:

    /*
     * SQL99 specifies Unicode-aware case normalization, which we don't yet
     * have the infrastructure for.  Instead we use tolower() to provide a
     * locale-aware translation.  However, there are some locales where this
     * is not right either (eg, Turkish may do strange things with 'i' and
     * 'I').  Our current compromise is to use tolower() for characters with
     * the high bit set, as long as they aren't part of a multi-byte
     * character, and use an ASCII-only downcasing for 7-bit characters.
     */

So my question is, do we yet have the infrastructure to make case-folding consistent across all character widths?

Thanks

Thom
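The rule described in that comment can be modeled roughly as follows. This is a Python sketch, not PostgreSQL's C implementation: `fold_identifier` and its `encoding` parameter are hypothetical names, Python's `str.lower()` stands in for the locale's `tolower()`, and "part of a multi-byte character" is approximated by checking whether the character encodes to more than one byte.

```python
def fold_identifier(ident: str, encoding: str) -> str:
    """Model the downcasing rule from the comment (not PostgreSQL's C code)."""
    out = []
    for ch in ident:
        raw = ch.encode(encoding)
        if len(raw) > 1:
            # Part of a multi-byte character: left untouched.
            out.append(ch)
        elif raw[0] < 0x80:
            # 7-bit character: ASCII-only downcasing.
            out.append(ch.lower() if "A" <= ch <= "Z" else ch)
        else:
            # High-bit single-byte character: defer to the "locale".
            # str.lower() stands in for tolower() here (an assumption).
            low = ch.lower()
            try:
                fits = len(low.encode(encoding)) == 1
            except UnicodeEncodeError:
                fits = False
            out.append(low if fits else ch)
    return "".join(out)


# abĉDĚF in a UTF-8 database: the multi-byte letters keep their case.
print(fold_identifier("ab\u0109D\u011aF", "utf-8"))
# MYÉCLASS in a single-byte (LATIN9) database: every letter is folded.
print(fold_identifier("MY\u00c9CLASS", "iso8859-15"))
```

Under these assumptions the same identifier folds differently depending only on the server encoding, which is the inconsistency the question is about.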
Thom Brown <thom@linux.com> writes:
> At the moment, only single-byte characters in identifiers are
> case-folded, and multi-byte characters are not.
> ...
> So my question is, do we yet have the infrastructure to make
> case-folding consistent across all character widths?

We still lack any built-in knowledge about this, and would have to rely on libc, which means the results would likely be platform-dependent and probably LC_CTYPE-dependent.

More generally, I'd be mighty hesitant to change this behavior after it's stood for so many years. I suspect more people would complain that we broke their application than would be happy about it.

Having said that, we are already relying on towlower() in places, and could do similarly here if we didn't care about the above issues.

regards, tom lane
On 2020-Jul-06, Tom Lane wrote:
> More generally, I'd be mighty hesitant to change this behavior after
> it's stood for so many years. I suspect more people would complain
> that we broke their application than would be happy about it.
>
> Having said that, we are already relying on towlower() in places,
> and could do similarly here if we didn't care about the above issues.

I think the fact that identifiers fail to follow language-specific case folding rules is more a known gotcha than a desired property, but on principle I tend to agree that Turkish people would not be happy about the prospect of us changing the downcasing rule in a major release -- it would mean having to edit any affected application code as part of a pg_upgrade process, which is not great.

Now you could say that this can be fixed by adding a GUC that preserves the old behavior, but generally we don't like that too much.

The counter argument is that there are more future users than there are current users.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2020-Jul-06, Tom Lane wrote:
>> More generally, I'd be mighty hesitant to change this behavior after
>> it's stood for so many years. I suspect more people would complain
>> that we broke their application than would be happy about it.

> I think the fact that identifiers fail to follow language-specific case
> folding rules is more a known gotcha than a desired property, but on
> principle I tend to agree that Turkish people would not be happy about
> the prospect of us changing the downcasing rule in a major release -- it
> would mean having to edit any affected application code as part of a
> pg_upgrade process, which is not great.

It's not just the Turks. As near as I can tell, we'd likely break *every* app that's using such identifiers. For example, supposing I do

test=# create table MYÉCLASS (f1 text);
CREATE TABLE
test=# \dt
          List of relations
 Schema |   Name   | Type  |  Owner
--------+----------+-------+----------
 public | myÉclass | table | postgres
(1 row)

pg_dump will render this as

CREATE TABLE public."myÉclass" (
    f1 text
);

If we start to case-fold É, then the only way to access this table will be by double-quoting its name, which the application probably is not expecting (else it would have double-quoted in the original CREATE TABLE).

> Now you could say that this can be fixed by adding a GUC that preserves
> the old behavior, but generally we don't like that too much.

Yes, a GUC changing this would be a headache. It would be just as much of a compatibility and security hazard as standard_conforming_strings (which indeed I've been thinking of proposing that we get rid of; it's hung around long enough).

> The counter argument is that there are more future users than there are
> current users.

Especially if we drive away the current users :-(. In practice, we've heard very very few complaints about this, so my gut says to leave it alone.

regards, tom lane
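The upgrade hazard in this message can be illustrated with a small model. It is only a sketch: `fold_current`, `fold_unicode`, `catalog`, and `lookup` are hypothetical names, the "catalog" is a plain Python set, and `str.lower()` is used as a stand-in for whatever Unicode-aware folding a future release might adopt.

```python
def fold_current(ident: str) -> str:
    # Behavior in a UTF-8 database today: only ASCII letters are downcased.
    return "".join(c.lower() if "A" <= c <= "Z" else c for c in ident)


def fold_unicode(ident: str) -> str:
    # Hypothetical Unicode-aware folding, approximated with str.lower().
    return ident.lower()


NAME = "MY\u00c9CLASS"  # MYÉCLASS, as typed unquoted by the application

# Name recorded when the table was created under the current rule.
catalog = {fold_current(NAME)}


def lookup(name: str, quoted: bool, fold) -> bool:
    # Quoted identifiers are taken verbatim; unquoted ones are folded first.
    return (name if quoted else fold(name)) in catalog


print(lookup(NAME, False, fold_current))            # unquoted, current rule
print(lookup(NAME, False, fold_unicode))            # unquoted, new rule
print(lookup("my\u00c9class", True, fold_unicode))  # double-quoted stored name
```

In this model the stored name still contains an upper-case É, so an unquoted reference folded by the new rule no longer matches it, and only the double-quoted spelling finds the table.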
Tom Lane wrote:
> CREATE TABLE public."myÉclass" (
>     f1 text
> );
>
> If we start to case-fold É, then the only way to access this table will
> be by double-quoting its name, which the application probably is not
> expecting (else it would have double-quoted in the original CREATE TABLE).

This problem already exists when migrating from a mono-byte database to a multi-byte database, since downcase_identifier() does use tolower() for mono-byte databases.

db9=# show server_encoding ;
 server_encoding
-----------------
 LATIN9
(1 row)

db9=# create table MYÉCLASS (f1 text);
CREATE TABLE

db9=# \d
          List of relations
 Schema |   Name   | Type  |  Owner
--------+----------+-------+----------
 public | myéclass | table | postgres
(1 row)

db9=# select * from MYÉCLASS;
 f1
----
(0 rows)

pg_dump will dump this as

CREATE TABLE public."myéclass" (
    f1 text
);

So far so good. But after importing this into a UTF-8 database, the same "select * from MYÉCLASS" that used to work now fails:

u8=# show server_encoding ;
 server_encoding
-----------------
 UTF8
(1 row)

u8=# select * from MYÉCLASS;
ERROR:  relation "myÉclass" does not exist

The compromise that is mentioned in downcase_identifier() justifying this inconsistency is not very convincing, because the issues in case folding due to linguistic differences exist both in mono-byte and multi-byte encodings. For instance, if it's fine to trust the locale to downcase 'İ' in a LATIN5 db, it should be okay in a UTF-8 db too.

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite
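The dump-and-restore scenario above comes down to how many bytes É occupies in each server encoding. The following sketch models just that one case; `folds_non_ascii` and `unquoted` are hypothetical helpers, and `str.lower()` again stands in for the locale's `tolower()`.

```python
E_ACUTE = "\u00c9"            # É
NAME = "MY" + E_ACUTE + "CLASS"


def folds_non_ascii(server_encoding: str) -> bool:
    # In this model, É is downcased only where it is stored as a single byte.
    return len(E_ACUTE.encode(server_encoding)) == 1


def unquoted(name: str, server_encoding: str) -> str:
    # How an unquoted identifier would be folded under each encoding.
    if folds_non_ascii(server_encoding):
        return name.lower()
    return "".join(c.lower() if c.isascii() else c for c in name)


# Created in the LATIN9 database; pg_dump quotes this name verbatim.
stored = unquoted(NAME, "iso8859-15")

# After restoring into UTF-8, the application still sends the unquoted name.
requested = unquoted(NAME, "utf-8")

print(E_ACUTE.encode("iso8859-15").hex(), E_ACUTE.encode("utf-8").hex(" "))
print(stored == requested)
```

Because the quoted name in the dump preserves the lower-case é while the UTF-8 server leaves É alone in unquoted references, the two spellings no longer meet.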
"Daniel Verite" <daniel@manitou-mail.org> writes:
> Tom Lane wrote:
>> If we start to case-fold É, then the only way to access this table will
>> be by double-quoting its name, which the application probably is not
>> expecting (else it would have double-quoted in the original CREATE TABLE).

> This problem already exists when migrating from a mono-byte database
> to a multi-byte database, since downcase_identifier() does use
> tolower() for mono-byte databases.

Sure, but that's a tiny minority of use-cases. In particular it would not bite you after a straight upgrade to a new PG version.

[ thinks... ] Wait, actually the described case would occur if you migrated *from* UTF8 (no folding) to LATINn (with folding). That's gotta be an even tinier minority. Migration to UTF8 would show different, though perhaps just as annoying, symptoms.

Anyway, I freely concede that I'm ill-equipped to judge how annoying this is, since I don't program in any languages where it'd make a difference. But we mustn't fool ourselves: changing this would be just as dangerous as the standard_conforming_strings changeover was. I'm not really convinced it's worth it. In particular, I don't find the "it's required by the standard" argument convincing. The standard requires us to fold to upper case, too, but we've long ago decided to just say no to that.

(Which reminds me: there are extensive threads in the archives analyzing whether it's practical to support more than one folding behavior. Those discussions would likely be relevant here.)

regards, tom lane
On Mon, Jul 6, 2020 at 8:32 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> test=# create table MYÉCLASS (f1 text);
> CREATE TABLE
> test=# \dt
>           List of relations
>  Schema |   Name   | Type  |  Owner
> --------+----------+-------+----------
>  public | myÉclass | table | postgres
> (1 row)
>
> pg_dump will render this as
>
> CREATE TABLE public."myÉclass" (
>     f1 text
> );
>
> If we start to case-fold É, then the only way to access this table will
> be by double-quoting its name, which the application probably is not
> expecting (else it would have double-quoted in the original CREATE TABLE).

While this is true, it's also pretty hard to imagine a user being satisfied with a table that ends up with this kind of mixed-case name.

That's not to say that I have any good idea what to do about this. I just disagree with labelling the above case as a success.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2020-Jul-08, Robert Haas wrote:
> That's not to say that I have any good idea what to do about this. I
> just disagree with labelling the above case as a success.

Yeah, particularly since it works differently in single-char encodings.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
Robert Haas <robertmhaas@gmail.com> writes:
> That's not to say that I have any good idea what to do about this. I
> just disagree with labelling the above case as a success.

I can't say that I like it either. But I'm afraid that changing it now will create many more problems than it solves.

regards, tom lane
On Mon, Jul 6, 2020 at 08:32:22PM -0400, Tom Lane wrote:
> Yes, a GUC changing this would be a headache. It would be just as much of
> a compatibility and security hazard as standard_conforming_strings (which
> indeed I've been thinking of proposing that we get rid of; it's hung
> around long enough).

+1

--
Bruce Momjian <bruce@momjian.us>        https://momjian.us
EnterpriseDB                            https://enterprisedb.com

The usefulness of a cup is in its emptiness, Bruce Lee