Thread: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?
[HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?
From
Peter Geoghegan
Date:
As I pointed out a couple of times already [1], we don't currently sanitize ICU's BCP 47 language tags within CREATE COLLATION. CREATE COLLATION will accept literally any string as a language tag as things stand, even when the string is unambiguously bogus. While I accept that there are limits on how far you can take sanitizing the BCP 47 tag format, due to its extensibility and "best effort" emphasis on forward and backward compatibility, we can and should do more here, IMHO. We should at least do the bare minimum, which has no possible downside, and some notable upsides. If I hack the CREATE COLLATION code to put any language tag string provided by the user through the same sanitization process that initdb already puts ICU language tags through, then we do much better. CREATE COLLATION rejects syntax errors, which seems desirable: postgres=# CREATE COLLATION test1 (provider = icu, locale = 'en-x-icu'); CREATE COLLATION postgres=# CREATE COLLATION test2 (provider = icu, locale = 'foo bar baz'); ERROR: XX000: could not convert locale name "foo bar baz" to language tag: U_ILLEGAL_ARGUMENT_ERROR LOCATION: get_icu_language_tag, collationcmds.c:454 postgres=# CREATE COLLATION test3 (provider = icu, locale = 'en-gb-icu'); ERROR: XX000: could not convert locale name "en-gb-icu" to language tag: U_ILLEGAL_ARGUMENT_ERROR LOCATION: get_icu_language_tag, collationcmds.c:454 postgres=# CREATE COLLATION test4 (provider = icu, locale = 'not-a-country'); CREATE COLLATION (To be more specific, I'm calling get_icu_language_tag()/uloc_toLanguageTag() [2] as an extra step for CREATE COLLATION here.) It's not like the current behavior is a disaster, or that the alternative behavior that I propose is perfect. The collation behavior you end up with today, having specified a language tag with a syntax error is the totally generic base Ducet collation behavior. Using 'foo bar baz' is effectively the same as using the preinstalled 'und-x-icu' collation, which I'm pretty sure is the same as using any current English locale anyway. That said, falling back on the most useful collation behavior based on inferences about the language tag is supposed to be about rolling with the complexities of internationalization, like political changes that are not yet accounted for by the CLDR/ICU version, and so on. It's no justification for not letting the user know when they've fat fingered a language tag, which they could easily miss. These are *syntax* errors. At one point a couple of months back, it was understood that get_icu_language_tag() might not always work with (assumed) valid locale names -- that is at least the impression that the commit message of eccead9 left me with. But, that was only with ICU 4.2, and in any case we've since stopped creating keyword variants at initdb time for other reasons (see 2bfd1b1 for details of those other reasons). I tend to think that we should not install any language tag that uloc_toLanguageTag() does not accept as valid on general principle (so not just at initdb time, when it's actually least needed). Thoughts? I can write a patch for this, if that helps. It should be straightforward. [1] postgr.es/m/CAH2-Wzm22vtxvD-e1oz90DE8Z_M61_8amHsDOZf1PWRKfRmj1g@mail.gmail.com [2] https://ssl.icu-project.org/apiref/icu4c/uloc_8h.html#a1d50c91925ca3853fce6f28cf7390c3c -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?
From
Andreas Karlsson
Date:
On 09/19/2017 12:46 AM, Peter Geoghegan wrote:> At one point a couple of months back, it was understood that > get_icu_language_tag() might not always work with (assumed) valid > locale names -- that is at least the impression that the commit > message of eccead9 left me with. But, that was only with ICU 4.2, and > in any case we've since stopped creating keyword variants at initdb > time for other reasons (see 2bfd1b1 for details of those other > reasons). I tend to think that we should not install any language tag > that uloc_toLanguageTag() does not accept as valid on general > principle (so not just at initdb time, when it's actually least > needed). > > Thoughts? I can write a patch for this, if that helps. It should be > straightforward. Hm, I like the idea but I see some issues. Enforcing the BCP47 seems like a good thing to me. I do not see any reason to allow input with syntax errors. The issue though is that we do not want to break people's databases when they upgrade to PostgreSQL 11. What if they have specified the locale in the old non-ICU format or they have a bogus value and we then error out on pg_upgrade or pg_restore? Andreas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?
From
Tom Lane
Date:
Andreas Karlsson <andreas@proxel.se> writes: > Hm, I like the idea but I see some issues. > Enforcing the BCP47 seems like a good thing to me. I do not see any > reason to allow input with syntax errors. The issue though is that we do > not want to break people's databases when they upgrade to PostgreSQL 11. > What if they have specified the locale in the old non-ICU format or they > have a bogus value and we then error out on pg_upgrade or pg_restore? Well, if PG10 shipped with that restriction in place then it wouldn't be an issue ;-) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?
From
Peter Geoghegan
Date:
On Tue, Sep 19, 2017 at 2:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Andreas Karlsson <andreas@proxel.se> writes: >> Hm, I like the idea but I see some issues. > >> Enforcing the BCP47 seems like a good thing to me. I do not see any >> reason to allow input with syntax errors. The issue though is that we do >> not want to break people's databases when they upgrade to PostgreSQL 11. >> What if they have specified the locale in the old non-ICU format or they >> have a bogus value and we then error out on pg_upgrade or pg_restore? > > Well, if PG10 shipped with that restriction in place then it wouldn't > be an issue ;-) I was proposing that this be treated as an open item for v10; sorry if I was unclear on that. Much like the "ICU locales vs. ICU collations within pg_collation" issue, this seems like the kind of thing that we ought to go out of our way to get right in the *first* version. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?
From
Andreas Karlsson
Date:
On 09/19/2017 11:32 PM, Peter Geoghegan wrote: > On Tue, Sep 19, 2017 at 2:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Well, if PG10 shipped with that restriction in place then it wouldn't >> be an issue ;-) > > I was proposing that this be treated as an open item for v10; sorry if > I was unclear on that. Much like the "ICU locales vs. ICU collations > within pg_collation" issue, this seems like the kind of thing that we > ought to go out of our way to get right in the *first* version. If people think it is possible to get this done in time for PostgreSQL 10 and it does not break anything on older version of ICU (or the migration from older versions) I am all for it. Andreas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?
From
Peter Geoghegan
Date:
On Tue, Sep 19, 2017 at 3:23 PM, Andreas Karlsson <andreas@proxel.se> wrote: > If people think it is possible to get this done in time for PostgreSQL 10 > and it does not break anything on older version of ICU (or the migration > from older versions) I am all for it. The only behavioral difference would occur when CREATE COLLATION is run (no changes to the initdb behavior, where the real risk exists). What this boils down to is that we have every reason to think that the right thing is to reject something that ICU's uloc_toLanguageTag() itself rejects as invalid (through the get_icu_language_tag() function). It looked like we were equivocating on following this at one point, prior to 2bfd1b1, in order to suit ICU 4.2 (again, see commit eccead9). I tend to think that the way we used to concatenate variant keywords against locale names during initdb was simply wrong in ICU 4.2, for some unknown reason. I think that the behavior that I propose might prevent things from silently failing on ICU 4.2 that were previously *believed* to work there, but in fact these things (variant keywords) never really worked. There might be an argument to be made for passing strict = FALSE to uloc_toLanguageTag() instead, so that we replace the language tag with one that is known to have valid syntax, and store that in pg_collation instead (while possibly raising a NOTICE). I guess that that would actually be the minimal fix here. I still favor a hard reject of invalid BCP 47 syntax, though. PostgreSQL's CREATE COLLATION is not one of the places where this kind of leniency makes sense. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?
From
Peter Eisentraut
Date:
On 9/18/17 18:46, Peter Geoghegan wrote: > As I pointed out a couple of times already [1], we don't currently > sanitize ICU's BCP 47 language tags within CREATE COLLATION. There is no requirement that the locale strings for ICU need to be BCP 47. ICU locale names like 'de@collation=phonebook' are also acceptable. The reason they are not validated is that, as you know, ICU accepts any locale string as valid. You appear to have found a way to do some validation, but I would like to see that code. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?
From
Peter Geoghegan
Date:
On Tue, Sep 19, 2017 at 5:52 PM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > On 9/18/17 18:46, Peter Geoghegan wrote: >> As I pointed out a couple of times already [1], we don't currently >> sanitize ICU's BCP 47 language tags within CREATE COLLATION. > > There is no requirement that the locale strings for ICU need to be BCP > 47. ICU locale names like 'de@collation=phonebook' are also acceptable. Right. But, we only document that BCP 47 is supported by Postgres. Maybe we can use get_icu_language_tag()/uloc_toLanguageTag() to ensure that we end up with BCP 47, even when the user happens to specify the legacy syntax. Should we be "canonicalizing" legacy ICU locale strings as BCP 47, too? > The reason they are not validated is that, as you know, ICU accepts any > locale string as valid. You appear to have found a way to do some > validation, but I would like to see that code. As I mentioned, I'm simply calling get_icu_language_tag()/uloc_toLanguageTag() to do that sanitization. The code to get the extra sanitization is completely trivial. I didn't post the patch that generates the errors in my opening e-mail because I'm not confident it's correct just yet. And, I think that I see a bigger problem: we pass a string that is almost certainly a BCP 47 string to ucol_open() from within pg_newlocale_from_collation(). We do so despite the fact that ucol_open() apparently doesn't accept BCP 47 syntax locale strings until ICU 54 [1]. Seems entirely possible that this accounts for the problems you saw on ICU 4.2, back when we were still creating keyword variants (I guess that the keyword variants seem very "BCP 47-ish" to me). I really think we need to add some kind of debug mode that makes ICU optionally spit out a locale display name at key points. We need this to gain confidence that the behavior that ICU provides actually matches what is expected across ICU different versions for different locale formats. We talked about this as a user-facing feature before, which can wait till v11; I just want this to debug problems like this one. [1] https://ssl.icu-project.org/apiref/icu4c/ucol_8h.html#a4721e4c0a519bb0139a874e191223590 -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?
From
Peter Geoghegan
Date:
On Tue, Sep 19, 2017 at 7:01 PM, Peter Geoghegan <pg@bowt.ie> wrote: > I didn't post the patch that generates the errors in my opening e-mail > because I'm not confident it's correct just yet. And, I think that I > see a bigger problem: we pass a string that is almost certainly a BCP > 47 string to ucol_open() from within pg_newlocale_from_collation(). We > do so despite the fact that ucol_open() apparently doesn't accept BCP > 47 syntax locale strings until ICU 54 [1]. Seems entirely possible > that this accounts for the problems you saw on ICU 4.2, back when we > were still creating keyword variants (I guess that the keyword > variants seem very "BCP 47-ish" to me). ISTM that the proper fix here is to use uloc_forLanguageTag() [1] (not to be confused with uloc_toLanguageTag()) to get a valid locale identifier on versions of ICU where BCP 47 format tags are not directly accepted as locale identifiers (versions prior to ICU 54). This would happen as an extra step within pg_newlocale_from_collation(), since BCP 47 format would be what is stored in pg_collation. Since uloc_forLanguageTag() become stable in ICU 4.2, the earliest version that we support, I believe that that would leave us in good shape. [1] https://ssl.icu-project.org/apiref/icu4c/uloc_8h.html#aa45d6457f72867880f079e27a63c6fcb -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?
From
Peter Eisentraut
Date:
On 9/19/17 22:01, Peter Geoghegan wrote: > On Tue, Sep 19, 2017 at 5:52 PM, Peter Eisentraut > <peter.eisentraut@2ndquadrant.com> wrote: >> On 9/18/17 18:46, Peter Geoghegan wrote: >>> As I pointed out a couple of times already [1], we don't currently >>> sanitize ICU's BCP 47 language tags within CREATE COLLATION. >> >> There is no requirement that the locale strings for ICU need to be BCP >> 47. ICU locale names like 'de@collation=phonebook' are also acceptable. > > Right. But, we only document that BCP 47 is supported by Postgres. The documentation is admittedly not very concrete about what ICU locale names it accepts beyond talking about a "named collator provided by the ICU library". The examples we provide use the BCP 47 style, but that's just because we liked them that way. ICU <54 doesn't even support the BCP 47 style, so we need to keep supporting the old/other style anyway. > And, I think that I > see a bigger problem: we pass a string that is almost certainly a BCP > 47 string to ucol_open() from within pg_newlocale_from_collation(). We > do so despite the fact that ucol_open() apparently doesn't accept BCP > 47 syntax locale strings until ICU 54 [1]. pg_import_system_collations() takes care to use the non-BCP-47 style for such versions, so I think this is working correctly. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?
From
Peter Geoghegan
Date:
On Wed, Sep 20, 2017 at 4:04 PM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > ICU <54 doesn't even support the BCP 47 style, so we need to keep > supporting the old/other style anyway. Yes it does -- just not as a native locale name. ucol_open() apparently became more liberal in what it would accept in ICU 54. >> And, I think that I >> see a bigger problem: we pass a string that is almost certainly a BCP >> 47 string to ucol_open() from within pg_newlocale_from_collation(). We >> do so despite the fact that ucol_open() apparently doesn't accept BCP >> 47 syntax locale strings until ICU 54 [1]. > > pg_import_system_collations() takes care to use the non-BCP-47 style for > such versions, so I think this is working correctly. But CREATE COLLATION doesn't use pg_import_system_collations(). -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?
From
Peter Geoghegan
Date:
On Wed, Sep 20, 2017 at 4:08 PM, Peter Geoghegan <pg@bowt.ie> wrote: >> pg_import_system_collations() takes care to use the non-BCP-47 style for >> such versions, so I think this is working correctly. > > But CREATE COLLATION doesn't use pg_import_system_collations(). And perhaps more to the point: it highly confusing that we use one or the other of those 2 things ("langtag"/BCP 47 tag or "name"/legacy locale name) as "colcollate", depending on ICU version, thereby *behaving* as if ICU < 54 really didn't know anything about BCP 47 tags. Because, obviously earlier ICU versions know plenty about BCP 47, since 9 lines further down we use "langtag"/BCP 47 tag as collname for CollationCreate() -- regardless of ICU version. How can you say "ICU <54 doesn't even support the BCP 47 style", given all that? Those versions will still have locales named "*-x-icu" when users do "\dOS". Users will be highly confused when they quite reasonably try to generalize from the example in the docs and what "\dOS" shows, and get results that are wrong, often only in a very subtle way. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?
From
Andreas Karlsson
Date:
On 09/21/2017 01:40 AM, Peter Geoghegan wrote: > On Wed, Sep 20, 2017 at 4:08 PM, Peter Geoghegan <pg@bowt.ie> wrote: >>> pg_import_system_collations() takes care to use the non-BCP-47 style for >>> such versions, so I think this is working correctly. >> >> But CREATE COLLATION doesn't use pg_import_system_collations(). > > And perhaps more to the point: it highly confusing that we use one or > the other of those 2 things ("langtag"/BCP 47 tag or "name"/legacy > locale name) as "colcollate", depending on ICU version, thereby > *behaving* as if ICU < 54 really didn't know anything about BCP 47 > tags. Because, obviously earlier ICU versions know plenty about BCP > 47, since 9 lines further down we use "langtag"/BCP 47 tag as collname > for CollationCreate() -- regardless of ICU version. > > How can you say "ICU <54 doesn't even support the BCP 47 style", given > all that? Those versions will still have locales named "*-x-icu" when > users do "\dOS". Users will be highly confused when they quite > reasonably try to generalize from the example in the docs and what > "\dOS" shows, and get results that are wrong, often only in a very > subtle way. If we are fine with supporting only ICU 4.2 and later (which I think we are given that ICU 4.2 was released in 2009) then using uloc_forLanguageTag()[1] to validate and canonize seems like the right solution. I had missed that this function even existed when I last read the documentation. Does it return a BCP 47 tag in modern versions of ICU? I strongly prefer if there, as much as possible, is only one format for inputting ICU locales. 1. http://www.icu-project.org/apiref/icu4c/uloc_8h.html#aa45d6457f72867880f079e27a63c6fcb Andreas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?
From
Peter Geoghegan
Date:
On Thu, Sep 21, 2017 at 2:49 AM, Andreas Karlsson <andreas@proxel.se> wrote: > If we are fine with supporting only ICU 4.2 and later (which I think we are > given that ICU 4.2 was released in 2009) then using uloc_forLanguageTag()[1] > to validate and canonize seems like the right solution. I had missed that > this function even existed when I last read the documentation. Does it > return a BCP 47 tag in modern versions of ICU? The decision to support ICU >= 4.2 was already made (see commit eccead9). I have no reason to think that that's a problem for us. As I've said, we currently use uloc_toLanguageTag() on all supported ICU versions, to at least create a collation name at initdb time, but also to get our new collation's colcollate when ICU >= 54. If we now put a uloc_forLanguageTag() in place before each ucol_open() call, then we can safely store a BCP 47 format tag as colcollate on *every* ICU version. We can reconstruct a traditional format locale string *on-the-fly* within pg_newlocale_from_collation() and get_collation_actual_version(), which would be what we'd pass to ucol_open() (we'd never pass "raw colcollate"). To keep things simple, we could always convert colcollate to the traditional format using uloc_forLanguageTag(), just in case we're on a version of ICU where ucol_open() doesn't like locales that are spelled using the BCP 47 format. It doesn't seem worth it to take advantage of the leniency on ICU >= 54, since that would be a special case (though we could if we wanted to). > I strongly prefer if there, as much as possible, is only one format for > inputting ICU locales. I agree, but the bigger issue is that we're *half way* between supporting only one format, and supporting two formats. AFAICT, there is no reason that we can't simply support one format on all ICU versions, and keep what ends up within pg_collation at initdb time essentially the same across ICU versions (except for those that are due to cultural/political developments). -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?
From
Peter Geoghegan
Date:
On Tue, Sep 19, 2017 at 7:01 PM, Peter Geoghegan <pg@bowt.ie> wrote: > I really think we need to add some kind of debug mode that makes ICU > optionally spit out a locale display name at key points. We need this > to gain confidence that the behavior that ICU provides actually > matches what is expected across ICU different versions for different > locale formats. We talked about this as a user-facing feature before, > which can wait till v11; I just want this to debug problems like this > one. I patched CREATE COLLATION to show ICU display name, which produces output like this: postgres=# create collation basque (provider=icu, locale='eu-u-kf-upper-kr-latn-digit-em-emoji-kn-true-co-eor'); NOTICE: 00000: ICU collation description is "Basque (colcasefirst=upper, Sort Order=European Ordering Rules, colnumeric=yes, colreorder=latn-digit, em=emoji)" CREATE COLLATION I used an ISO 639-1 language code (2 letter language code) above, which, as we can see, is recognized as Basque. ICU is also fine with the 3 letter 639-2 code "eus-", recognizing that as Basque, too. If I use an ISO 639-2 code for Basque that ICU/CLDR doesn't like, "baq-*", I can see that my expectations have only partially been met, since the notice doesn't say anything about the language Basque: postgres=# create collation actually_not_basque (provider=icu, locale='baq-u-kf-upper-kr-latn-digit-em-emoji-kn-true-co-eor'); NOTICE: 00000: ICU collation description is "baq (colcasefirst=upper, Sort Order=European Ordering Rules, colnumeric=yes, colreorder=latn-digit, em=emoji)" CREATE COLLATION Functionality like this is starting to look essential to me, rather than just a nice to have. Having this NOTICE would have made me realize our problems with ICU versions < 54 much earlier, if nothing else. If the purpose of NOTICE messages is to "Provide[s] information that might be helpful to users", then I'd say that this definitely qualifies. And, the extra code is trivial (we already get display name in the context of initdb). I strongly recommend that we slip this into v10, as part of fixing the problem with language tags that earlier ICU versions have. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?
From
Andreas Karlsson
Date:
On 09/21/2017 10:55 PM, Peter Geoghegan wrote: > I agree, but the bigger issue is that we're *half way* between > supporting only one format, and supporting two formats. AFAICT, there > is no reason that we can't simply support one format on all ICU > versions, and keep what ends up within pg_collation at initdb time > essentially the same across ICU versions (except for those that are > due to cultural/political developments). I think we are in agreement then, but I do not have the time to get this done before the release of 10 so would be happy if you implemented it. Peter E, what do you say in this? Andreas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?
From
Peter Geoghegan
Date:
On Fri, Sep 22, 2017 at 1:50 AM, Andreas Karlsson <andreas@proxel.se> wrote: > On 09/21/2017 10:55 PM, Peter Geoghegan wrote: >> >> I agree, but the bigger issue is that we're *half way* between >> supporting only one format, and supporting two formats. AFAICT, there >> is no reason that we can't simply support one format on all ICU >> versions, and keep what ends up within pg_collation at initdb time >> essentially the same across ICU versions (except for those that are >> due to cultural/political developments). > > > I think we are in agreement then, but I do not have the time to get this > done before the release of 10 so would be happy if you implemented it. Peter > E, what do you say in this? I can write a patch for this, but will not without some kind of buy-in from Peter E. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?
From
Peter Eisentraut
Date:
On 9/21/17 16:55, Peter Geoghegan wrote: >> I strongly prefer if there, as much as possible, is only one format for >> inputting ICU locales. > I agree, but the bigger issue is that we're *half way* between > supporting only one format, and supporting two formats. AFAICT, there > is no reason that we can't simply support one format on all ICU > versions, and keep what ends up within pg_collation at initdb time > essentially the same across ICU versions (except for those that are > due to cultural/political developments). After reviewing this thread and testing around a bit, I think the code is mostly fine as it is, but the documentation is lacking. Hence, attached is a patch to expand the documentation quite a bit, especially to document in more detail what ICU locale strings are accepted. The documentation has always stated, albeit perhaps in obscure ways, that we accept for locales whatever ICU accepts. I don't think we should restrict or override that in any way. That would cause existing documentation and lore on ICU to be invalid for PostgreSQL. I specifically disagree that we should, as appears to be suggested here, restrict the input locale strings to BCP 47 and reject or transform the traditional ICU-specific locale syntax. Again, that would cause existing ICU documentation material to become less applicable to PostgreSQL. And basically, there is no reason for it, because I am not aware that ICU plans to get rid of that syntax. Moreover, we need to support that syntax for older ICU versions anyway. A patch has been posted that, as I understand it, would allow BCP 47 syntax to be used with older versions as well. That's a nice idea, but it's a new feature and not appropriate for PG10 at this time. (Note that we also don't canonicalize libc locale names.) The attached documentation patch documents both locale naming forms in parallel. The other attached patch contains a test suite that verifies that the examples in the documentation actually work. It's not meant for committing into the mainline, but it was useful for me. During testing with various versions I have also discovered the following things: - The current code appears to be of the opinion that BCP 47 locale names are accepted as of ICU 54. That appears to be incorrect; I find that they work fine in ICU 52, but they don't work in ICU 4.2. I don't know where the cutoff is. That might be worth changing in the code if we can obtain more information. - What might have led to the above mistake is the following in the ucol_open() documentation: q{Starting with ICU 54, collation attributes can be specified via locale keywords as well, in the old locale extension syntax ("el@colCaseFirst=upper") or in language tag syntax ("el-u-kf-upper").} That is correct. If you use that syntax in earlier versions, the case-first specification in this example is ignored. You need to use ucol_setAttribute() then. (Note that the phonebook stuff still works, because that is not a "collation attribute".) (One of my plans for PG11 had been to allow specifying such collation attributes via additional CREATE COLLATION clauses and pg_collation columns, but that might be obsolete now.) - Moreover, it is not the case that ICU accepts just any sort of nonsense as a locale name. For example, "el@colCaseFirst=whatever" is rejected with U_ILLEGAL_ARGUMENT_ERROR. Now, it might in other cases be more liberal than we might be hoping for. But I think they have reasons for what they do. One conclusion here is that there are multiple dimensions to what sort of thing is accepted as a locale in different versions of ICU and what the canonical format is. If we insist that everything has to be written in the form that is preferred today, then we'll have awkward problems if a future version of ICU establishes even more variants that will then be preferred. I think there is room for incremental refinement here. Features like optionally checking or printing the canonical or preferred locale format or making the locale description available via a function or system view might all be valuable additions to a future PostgreSQL release. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?
From
Peter Geoghegan
Date:
On Fri, Sep 22, 2017 at 11:34 AM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > After reviewing this thread and testing around a bit, I think the code > is mostly fine as it is, but the documentation is lacking. Hence, > attached is a patch to expand the documentation quite a bit, especially > to document in more detail what ICU locale strings are accepted. > > The documentation has always stated, albeit perhaps in obscure ways, > that we accept for locales whatever ICU accepts. I don't think we > should restrict or override that in any way. That would cause existing > documentation and lore on ICU to be invalid for PostgreSQL. I think that this thread is mostly about three fairly different things. I didn't plan it that way, but then I only realized the second two while investigating the first. Those are: 1. The question of whether and to what extent we should sanitize the colcollate string provided by the user within CREATE COLLATION for the ICU collation provider. 2. The question of what ends up in pg_collation at initdb time. In particular, the format of colcollate. 3. The question of whether or not we should ever accept a locale in the traditional format, rather than insisting on BCP 47 in every context. (This may have become conflated with issue 1.) > I specifically disagree that we should, as appears to be suggested here, > restrict the input locale strings to BCP 47 and reject or transform the > traditional ICU-specific locale syntax. Again, that would cause > existing ICU documentation material to become less applicable to > PostgreSQL. And basically, there is no reason for it, because I am not > aware that ICU plans to get rid of that syntax. I disagree, because ICU/CLDR seems to want to standardize on BCP 47 (with custom extensions), but I acknowledge that you have a point here. This isn't what I think is important, among all the points raised on this thread. I can let this go. > Moreover, we need to > support that syntax for older ICU versions anyway. FWIW, I don't think that we *need* to support it for older ICU versions, except as an implementation detail that gets us to and from BCP 47 as needed. > A patch has been > posted that, as I understand it, would allow BCP 47 syntax to be used > with older versions as well. That's a nice idea, but it's a new feature, which may have been my fault, which may havebeen my fault > and not appropriate for PG10 at this time. > > (Note that we also don't canonicalize libc locale names.) But you are *already* canonicalizing ICU collation names as BCP 47. My point here is: Why not finish the job off, and *also* canonicalize colcollate in the same way? This won't break ucol_open() if we take appropriate precautions when we go to use the Postgres collation/ICU locale. One concern that makes me suggest this is: What happens when the user *downgrades* ICU version, from a version where colcollate is BCP 47 to one where it would not have been at initdb time? That will break the downgrade in an unpleasant way, including in installations that never do a CREATE COLLATION themselves. We want to be able to restore a basebackup on a somewhat different OS, and have that still work following REINDEX. At least, that seems like it should be an important goal for us. I want Postgres to insist on always using BCP 47 in CREATE COLLATION for a few reasons. One that is relevant to this colcollate question is that by insisting on BCP 47 within CREATE COLLATION, there is no question of CREATE COLLATION having to consider the legacy locale format on ICU versions that don't handle both at the same time too well (this at least includes ICU 42). We'd always only accept BCP 47 from users, we'd always store BCP 47 as colcollate (and collation name), and we'd always use the traditional locale string format as an argument to ucol_open(). Consistency of interface and implementation, across all ICU versions. I might actually be convinced by what you say here if we weren't already canonicalizing collation name as BCP 47 (though I also understand why you did that). > During testing with various versions I have also discovered the > following things: > > - The current code appears to be of the opinion that BCP 47 locale names > are accepted as of ICU 54. That appears to be incorrect; I find that > they work fine in ICU 52, but they don't work in ICU 4.2. I don't know > where the cutoff is. That might be worth changing in the code if we can > obtain more information. I fear that BCP 47 is only quasi supported (without the proper conversion by uloc_forLanguageTag()) prior to ICU 54 (the version where it is actually documented as supported). We've already seen plenty of cases where ucol_open() locale string interpretation soldiers on, when it arguably shouldn't, so I hope that that isn't what you actually see on ICU 52. I strongly suggest looking at a variety of display names at CREATE COLLATION time (I can provide a rough patch that shows display name [1]), to make sure that it matches expectations on all ICU versions. Your testing patch does not satisfy me that versions prior to ICU 54 have a ucol_open() that accepts BCP 47 without *any* problem (it could be a subtle issue). Besides, if it isn't going to work on ICU 4.2 anyway, I think that the leniency of ICU 52 isn't actually interesting. We still have an awkward special case to deal with. > One conclusion here is that there are multiple dimensions to what sort > of thing is accepted as a locale in different versions of ICU and what > the canonical format is. If we insist that everything has to be written > in the form that is preferred today, then we'll have awkward problems if > a future version of ICU establishes even more variants that will then be > preferred. I'd feel a lot better about that if there was a NOTICE showing the ICU display name for the locale shown when the user does a CREATE COLLATION. I think that the risks from not doing that outweigh the risks of doing it, even at this late date. I could live without any sanitization if we did this much. Without this, the user is flying blind when using CREATE COLLATION. Why should the DBA be able to verify that the sorting they get is culturally appropriate? That's just not practical. > I think there is room for incremental refinement here. Features like > optionally checking or printing the canonical or preferred locale format > or making the locale description available via a function or system view > might all be valuable additions to a future PostgreSQL release. I don't think that there is room for incremental refinement when it comes to what ends up in pg_collation at initdb time. I don't like to create commotion at this late date, but there are some things that we'll only really get one chance at here. Admittedly, that's not true of everything I raise here; it's hard to strictly limit discussion to issues that have that quality. [1] postgr.es/m/CAH2-WznOpmJ+3xh6bvea_YUyd4ZdGiwG9ycE31Q09oU3XXw5vA@mail.gmail.com -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?
From
Robert Haas
Date:
On Fri, Sep 22, 2017 at 4:46 PM, Peter Geoghegan <pg@bowt.ie> wrote: > But you are *already* canonicalizing ICU collation names as BCP 47. My > point here is: Why not finish the job off, and *also* canonicalize > colcollate in the same way? Peter, with respect, it's time to let this argument go. We're scheduled to wrap a GA release in just over 72 hours. It is far too late to change behavior like this. There is no time for other people who may be interested in this issue to form a well-considered opinion on the topic and carefully review a proposed patch. There is also no time for users to notice it in the next beta and complain before we go final. This ship has sailed. On the substantive issue, I am inclined (admittedly without deep study) to agree with Peter Eisentraut. We have never canonicalized collations before and therefore it is not essential that we do that now. That would be a new feature, and I don't think I'd be prepared to endorse adding it three days after feature freeze let alone three days before the GA wrap. I do agree that the lack of canonicalization is utterly terrible. The APIs that Unix-like operating systems provide for collations are poorly suited to our purposes and hopelessly squishy about semantics, and it's not clear how much better ICU will be. But that's a problem that we should address, if at all, at a deliberate pace and with adequate time for reflection, research, and comment, not precipitously and under extreme time pressure. I simply do not buy the theory that this cannot be changed later. It's been the case for as long as we've had pg_collate that a new system could have different collations than the old one, resulting in a dump/restore failure. I expect somebody's had that problem at some point, but I don't think it's become a major pain point because most people don't use exotic collations, and if they do they probably understand that they need those exotic collations to be on the new system too. So, if we decide to change this later, we'll want to find ways to make the upgrade as pain-free as possible and document whatever the situation may be, but we've made many backward-incompatible changes in the past and this one would hardly be the worst. I also believe that Peter Eisentraut is entirely correct to be concerned about whether BCP 47 (or anything else) can really be regarded as a stable canonical form for ICU purposes. His email indicates that the acceptable and canonical forms have changed multiple times in the course of releases new enough for us to care about them. Assuming that statement is correct, it would be extremely short-sighted of us to bank on them not changing any more. But even if all of the above argumentation is utterly and completely wrong, dredged up from the universe's deepest and most profound reserves of stupidity and destined for future entry into Webster's as the canonical example of cluelessness, we still shouldn't change it the weekend before the GA wraps. I'm afraid that this new RMT process has lulled us into believing that the release will happen on time no matter how much stuff we whack around at the last minute, which is a very dangerous idea for a group of software engineers to have. Before, we thought we had infinite time to fix our bugs; now, we think we have infinite latitude to classify anything we don't like as a bug. Neither of those ideas is good software engineering. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?
From
Peter Geoghegan
Date:
On Fri, Sep 22, 2017 at 5:58 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Sep 22, 2017 at 4:46 PM, Peter Geoghegan <pg@bowt.ie> wrote: >> But you are *already* canonicalizing ICU collation names as BCP 47. My >> point here is: Why not finish the job off, and *also* canonicalize >> colcollate in the same way? > > Peter, with respect, it's time to let this argument go. We're > scheduled to wrap a GA release in just over 72 hours. It is far too > late to change behavior like this. I didn't say that it wasn't. That's above my paygrade. > On the substantive issue, I am inclined (admittedly without deep > study) to agree with Peter Eisentraut. We have never canonicalized > collations before and therefore it is not essential that we do that > now. As I've said, we're *already* canonicalizing them for ICU. Just not consistently (across ICU versions, and arguably even within ICU versions). That's the problem -- we're half way between both positions. The problem is most emphatically *not* that we've failed to canonicalize them in the way that I happen to favor. > That would be a new feature, and I don't think I'd be prepared > to endorse adding it three days after feature freeze let alone three > days before the GA wrap. I do agree that the lack of canonicalization > is utterly terrible. The APIs that Unix-like operating systems > provide for collations are poorly suited to our purposes and > hopelessly squishy about semantics, and it's not clear how much better > ICU will be. In one important sense, this is a regression against libc, because you never had something like en_US.UTF-8 break on downgrading glibc version (like, when you restore a basebackup on a different OS with the same arch). Sure, you probably had to REINDEX text indexes, to be on the safe side, but once you did that there was no question about the older glibc having never heard of "en_US.UTF-8" as a LC_COLLATE/collcollate. I regret that I didn't catch it sooner. It now seems very obvious, and totally preventable given enough time. > I simply do not buy the theory that this cannot be changed later.in It can be changed later, of course -- at greater, though indeterminate cost. > It's been the case for as long as we've had pg_collate that a new > system could have different collations than the old one, resulting in > a dump/restore failure. I expect somebody's had that problem at some > point, but I don't think it's become a major pain point because most > people don't use exotic collations, and if they do they probably > understand that they need those exotic collations to be on the new > system too. Like I said, you don't need exotic collations to have the downgrade problem, unless *any* initdb ICU collation counts as exotic. No CREATE COLLATION is needed. > I also believe that Peter Eisentraut is entirely correct to be > concerned about whether BCP 47 (or anything else) can really be > regarded as a stable canonical form for ICU purposes. His email > indicates that the acceptable and canonical forms have changed > multiple times in the course of releases new enough for us to care > about them. Assuming that statement is correct, it would be extremely > short-sighted of us to bank on them not changing any more. That statement isn't correct. Including even the suggestion that Peter Eisentraut ever thought it. ICU uses BCP 47 for collation name *across all versions*. Just not as the collcollate value (that's only the case on versions of ICU >= 54). > But even if all of the above argumentation is utterly and completely > wrong, dredged up from the universe's deepest and most profound > reserves of stupidity and destined for future entry into Webster's as > the canonical example of cluelessness, we still shouldn't change it > the weekend before the GA wraps. That seems like a value judgement. I'm not going to tell you that you're wrong. What I will say is that I think we've done poorly here. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes: > Peter, with respect, it's time to let this argument go. We're > scheduled to wrap a GA release in just over 72 hours. FWIW, the release is a week from Monday, not Monday. (Or if it is Monday, somebody else is wrapping it.) We have some other embarrassingly critical things to fix, like bug #14825, so I can certainly sympathize with an argument that there's not enough committer bandwidth left to deal with this; but not with an argument that it's too late to change behavior period. The big concern I have here is that this feels a lot like something that we'll regret at leisure, if it's not right in the first release. I'd much rather be restrictive in v10 and then loosen the rules later, than be lax in v10 and then have to argue about whether to break backwards compatibility in order to gain saner behavior. > We have never canonicalized collations before and therefore it is not > essential that we do that now. Actually, we try; see initdb.c's check_locale_name(). It's not our fault that setlocale(3) fails to play along on many platforms. > I simply do not buy the theory that this cannot be changed later. OK, so you're promising not to whine when we break backwards compatibility on this point in v11? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?
From
Peter Geoghegan
Date:
On Fri, Sep 22, 2017 at 8:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > The big concern I have here is that this feels a lot like something that > we'll regret at leisure, if it's not right in the first release. I'd > much rather be restrictive in v10 and then loosen the rules later, than > be lax in v10 and then have to argue about whether to break backwards > compatibility in order to gain saner behavior. To the bests of my knowledge, the only restriction implied by limiting ourselves to the BCP 47 format (as part of standardizing what is stored in pg_collation) is that users might know about the traditional locale strings from some other place, and be surprised when their knowledge doesn't transfer to Postgres. Personally, I don't think that that's a big deal. If it actually is important, then I'm surprised that it took this long for a doc change mentioning it to be proposed (though the docs *do* say "Collations provided by ICU are created with names in BCP 47 language tag format"). >> We have never canonicalized collations before and therefore it is not >> essential that we do that now. > > Actually, we try; see initdb.c's check_locale_name(). It's not our > fault that setlocale(3) fails to play along on many platforms. But it will be our fault if we ship a v10 that does the kind of unsettled canonicalization you see within pg_import_system_collations() (the "collcollate = U_ICU_VERSION_MAJOR_NUM >= 54 ? langtag : name" thing). That looks very much like the tail wagging the dog to me. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags.Should it?
From
Noah Misch
Date:
On Tue, Sep 19, 2017 at 07:01:47PM -0700, Peter Geoghegan wrote: > On Tue, Sep 19, 2017 at 5:52 PM, Peter Eisentraut > <peter.eisentraut@2ndquadrant.com> wrote: > > On 9/18/17 18:46, Peter Geoghegan wrote: > >> As I pointed out a couple of times already [1], we don't currently > >> sanitize ICU's BCP 47 language tags within CREATE COLLATION. > > > > There is no requirement that the locale strings for ICU need to be BCP > > 47. ICU locale names like 'de@collation=phonebook' are also acceptable. > > Right. But, we only document that BCP 47 is supported by Postgres. > Maybe we can use get_icu_language_tag()/uloc_toLanguageTag() to ensure > that we end up with BCP 47, even when the user happens to specify the > legacy syntax. Should we be "canonicalizing" legacy ICU locale strings > as BCP 47, too? > > > The reason they are not validated is that, as you know, ICU accepts any > > locale string as valid. You appear to have found a way to do some > > validation, but I would like to see that code. > > As I mentioned, I'm simply calling > get_icu_language_tag()/uloc_toLanguageTag() to do that sanitization. > The code to get the extra sanitization is completely trivial. > > I didn't post the patch that generates the errors in my opening e-mail > because I'm not confident it's correct just yet. And, I think that I > see a bigger problem: we pass a string that is almost certainly a BCP > 47 string to ucol_open() from within pg_newlocale_from_collation(). We > do so despite the fact that ucol_open() apparently doesn't accept BCP > 47 syntax locale strings until ICU 54 [1]. Seems entirely possible > that this accounts for the problems you saw on ICU 4.2, back when we > were still creating keyword variants (I guess that the keyword > variants seem very "BCP 47-ish" to me). > > I really think we need to add some kind of debug mode that makes ICU > optionally spit out a locale display name at key points. We need this > to gain confidence that the behavior that ICU provides actually > matches what is expected across ICU different versions for different > locale formats. We talked about this as a user-facing feature before, > which can wait till v11; I just want this to debug problems like this > one. > > [1] https://ssl.icu-project.org/apiref/icu4c/ucol_8h.html#a4721e4c0a519bb0139a874e191223590 [Action required within three days. This is a generic notification.] The above-described topic is currently a PostgreSQL 10 open item. Peter, since you committed the patch believed to have created it, you own this open item. If some other commit is more relevant or if this does not belong as a v10 open item, please let us know. Otherwise, please observe the policy on open item ownership[1] and send a status update within three calendar days of this message. Include a date for your subsequent status update. Testers may discover new open items at any time, and I want to plan to get them all fixed well in advance of shipping v10. Consequently, I will appreciate your efforts toward speedy resolution. Thanks. [1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?
From
Robert Haas
Date:
On Fri, Sep 22, 2017 at 11:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > FWIW, the release is a week from Monday, not Monday. (Or if it is > Monday, somebody else is wrapping it.) Oops. > We have some other embarrassingly critical things to fix, like bug #14825, > so I can certainly sympathize with an argument that there's not enough > committer bandwidth left to deal with this; but not with an argument that > it's too late to change behavior period. Really? Traditionally, you've been one of the biggest opponents of whacking things around post-beta. Even post-beta1, let alone post-beta5. > The big concern I have here is that this feels a lot like something that > we'll regret at leisure, if it's not right in the first release. I'd > much rather be restrictive in v10 and then loosen the rules later, than > be lax in v10 and then have to argue about whether to break backwards > compatibility in order to gain saner behavior. I think it's inevitable that a certain number of users are going to have to cope with ICU version changes breaking stuff. If ICU decides a collation is stupid or unused and drops it, or is mis-defined and redefines it to some behavior that breaks things for somebody, they are going to have to deal with it. I don't think you can make that problem go away by any amount of strictness introduced into v10, but if you make the checks zealous enough, you can probably make them rule out input that users would have preferred to have accepted. I also think that if there's a compelling reason to bet on BCP 47 to be a stable canonical form, I haven't heard it presented here. At the risk of repeating myself, it's not even supported in some ICU versions we support, so how's that going to work? And if it's been changed in the recent past, why not again? Peter Geoghegan said that he doesn't know of any plans to eliminate BCP 47 support, but that doesn't seem like it's much proof of anything. >> I simply do not buy the theory that this cannot be changed later. > > OK, so you're promising not to whine when we break backwards compatibility > on this point in v11? If somebody has a collation that appears to work on v10 but really is doing nothing, and when the upgrade to v11 they get an error because we diagnose that the collation definition was not valid whereas v10 was unable to make that diagnosis, I promise not to whine about the backward compatibility break thereby introduced. If, on the other hand, you introduce an error check that is overly stringent and precludes people from defining collations that are legal and useful (in their judgement, not yours), I intend to whine about that extensively. And that applies to 10, 11, and any future versions for which I may be around. In short, I judge that allowing users access to *all* of the things that ICU has now, has had in the past in versions we support, or may have in the future is an important goal, but that preventing them from relying on options that may go away is not a goal at all, since barring the ability to predict the future, it's impossible anyway. If it's possible to prevent to write a precisely-targeted check that rules out only actually-meaningless collations and nothing else, I'm fine with that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?
From
Peter Geoghegan
Date:
On Mon, Sep 25, 2017 at 9:06 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> The big concern I have here is that this feels a lot like something that >> we'll regret at leisure, if it's not right in the first release. I'd >> much rather be restrictive in v10 and then loosen the rules later, than >> be lax in v10 and then have to argue about whether to break backwards >> compatibility in order to gain saner behavior. > > I think it's inevitable that a certain number of users are going to > have to cope with ICU version changes breaking stuff. Wasn't the main point of adopting ICU that that doesn't happen when it isn't essential? There will be some risk with pg_dump across ICU versions, due rare to political changes. But, on pg_upgrade, the old collations will continue to work, even if they would not have appeared at initdb time on that ICU version, because ICU has all kinds of fallbacks. It will work with a language it has heard of, but a country it has never heard of, for example. I just want to have a consistent collcollate format, to consistently take advantage of that. My patch has no behavioral changes, apart from the sanitization, on ICU >= 54, BTW. It's really not very much code, especially when you exclude objective bug fixes. > If ICU decides > a collation is stupid or unused and drops it, or is mis-defined and > redefines it to some behavior that breaks things for somebody, they > are going to have to deal with it. I don't think you can make that > problem go away by any amount of strictness introduced into v10, but > if you make the checks zealous enough, you can probably make them rule > out input that users would have preferred to have accepted. The sanitization thing isn't my main concern. The stability issue is the major issue, and it isn't all that connected to sanitization. > I also think that if there's a compelling reason to bet on BCP 47 to > be a stable canonical form, I haven't heard it presented here. At the > risk of repeating myself, it's not even supported in some ICU versions > we support, so how's that going to work? And if it's been changed in > the recent past, why not again? Peter Geoghegan said that he doesn't > know of any plans to eliminate BCP 47 support, but that doesn't seem > like it's much proof of anything. It's described by an IETF standard, and the extensions are described by Unicode Technical Standard #35, a formal standard [1]. The BCP 47 format is not an ICU thing at all -- it's a Unicode consortium thing. An express goal of BCP 47 is forward and backward compatibility [2]. And, ICU themselves have said that that's what they're standardizing on, since they are upstream consumers of the CLDR (Unicode consortium) locale data. They've done that because they want to move away from their own format towards a universal format, usable in a wide variety of contexts. While I don't pretend to be able to predict the future, I think that this is something that is as safe as is practically achievable to standardize on. (It also has all kinds of provision for further extension, should that prove necessary.) > If, on the other hand, you introduce an error check that is overly > stringent and precludes people from defining collations that are legal > and useful (in their judgement, not yours), I intend to whine about > that extensively. And that applies to 10, 11, and any future versions > for which I may be around. That's not going to happen. > In short, I judge that allowing users access to *all* of the things > that ICU has now, has had in the past in versions we support, or may > have in the future is an important goal, but that preventing them from > relying on options that may go away is not a goal at all, since > barring the ability to predict the future, it's impossible anyway. +1 > If it's possible to prevent to write a precisely-targeted check that > rules out only actually-meaningless collations and nothing else, I'm > fine with that. That's what I've done with my patch, IMV. I admit that it's still a tiny bit subjective that we should reject an empty string, and not allow the fallback there, for example (we could instead fall back on that as "undefined"). I think that that's not going to be a problem for anyone. [1] http://unicode.org/reports/tr35/tr35-collation.html [2] https://tools.ietf.org/html/bcp47#section-3.4 -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?
From
Peter Eisentraut
Date:
On 9/22/17 16:46, Peter Geoghegan wrote: > But you are *already* canonicalizing ICU collation names as BCP 47. My > point here is: Why not finish the job off, and *also* canonicalize > colcollate in the same way? This won't break ucol_open() if we take > appropriate precautions when we go to use the Postgres collation/ICU > locale. Reading over this code again, it is admittedly not quite clear why this "canonicalization" is in there right now. I think it had something to do with how we built the keyword variants at one point. It might not make sense. I'd be glad to take that out and use the result straight from uloc_getAvailable() for collcollate. That is, after all, the "canonical" version that ICU chooses to report to us. > One concern that makes me suggest this is: What happens when > the user *downgrades* ICU version, from a version where colcollate is > BCP 47 to one where it would not have been at initdb time? That will > break the downgrade in an unpleasant way, including in installations > that never do a CREATE COLLATION themselves. We want to be able to > restore a basebackup on a somewhat different OS, and have that still > work following REINDEX. At least, that seems like it should be an > important goal for us. This is an interesting point, and my proposal above would fix that. However, I think that taking a PostgreSQL data directory and moving or copying it to an *older* OS installation is always going to have a potential for problems. So I wouldn't spend a huge amount of effort just to fix this specific case. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?
From
Peter Geoghegan
Date:
On Mon, Sep 25, 2017 at 11:40 AM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > On 9/22/17 16:46, Peter Geoghegan wrote: >> But you are *already* canonicalizing ICU collation names as BCP 47. My >> point here is: Why not finish the job off, and *also* canonicalize >> colcollate in the same way? This won't break ucol_open() if we take >> appropriate precautions when we go to use the Postgres collation/ICU >> locale. > > Reading over this code again, it is admittedly not quite clear why this > "canonicalization" is in there right now. I think it had something to > do with how we built the keyword variants at one point. It might not > make sense. I'd be glad to take that out and use the result straight > from uloc_getAvailable() for collcollate. That is, after all, the > "canonical" version that ICU chooses to report to us. But then our users categorically have to know about both formats, without any practical benefit to make up for it. You will also get people that don't realize that only one format is supported on some versions if go this way. >> One concern that makes me suggest this is: What happens when >> the user *downgrades* ICU version, from a version where colcollate is >> BCP 47 to one where it would not have been at initdb time? That will >> break the downgrade in an unpleasant way, including in installations >> that never do a CREATE COLLATION themselves. We want to be able to >> restore a basebackup on a somewhat different OS, and have that still >> work following REINDEX. At least, that seems like it should be an >> important goal for us. > > This is an interesting point, and my proposal above would fix that. I've already written a patch to standardize collcollate. If we do the way you describe above instead, then what happens when ICU finally removes the already deprecated legacy format? > However, I think that taking a PostgreSQL data directory and moving or > copying it to an *older* OS installation is always going to have a > potential for problems. So I wouldn't spend a huge amount of effort > just to fix this specific case. The downgrade thing is just the simplest, most immediate example of where failing to standardize collcollate now could cause problems. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?
From
Peter Geoghegan
Date:
On Mon, Sep 25, 2017 at 12:14 PM, Peter Geoghegan <pg@bowt.ie> wrote: > But then our users categorically have to know about both formats, > without any practical benefit to make up for it. You will also get > people that don't realize that only one format is supported on some > versions if go this way. Oh, and if you go this way there is also a much bigger disadvantage: you also break pg_restore when you cross the ICU 54 version boundary in *either* direction (pg_collation.collname will be different, so actually equivalent collations will not be recognized as such for dump/restore purposes). This seems very likely, even, because ICU 54 isn't that old (we support versions as old as ICU 4.2), and because you only have to have used ICU initdb collation for this to happen (no CREATE COLLATION is necessary). Sure, you *may* be able to first do a manual CREATE COLLATION when that happens, and that will be picked up during the restore. But that's very user hostile. That must have been the real reason why you canonicalized pg_collation.collname (I doubt it had anything to do with how keyword variants used to be created during initdb, as you suggested). As Tom pointed out recently, we've actually always canonicalized collation name for libc. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?
From
Peter Geoghegan
Date:
On Mon, Sep 25, 2017 at 12:52 PM, Peter Geoghegan <pg@bowt.ie> wrote: > That must have been the real reason why you canonicalized > pg_collation.collname (I doubt it had anything to do with how keyword > variants used to be created during initdb, as you suggested). As Tom > pointed out recently, we've actually always canonicalized collation > name for libc. On further examination, none of this really matters, because you simply cannot store ICU locale names like "en_US" within pg_collation; it's impossible to do that without breaking many things that have worked for a long time. initdb already canonicalizes the available libc collations to produce collations whose names have exactly the same "en_US" format. There will typically be both "en_US" and "en_US.utf8" entries within pg_collation with Glibc on Linux, for example (the former is created a convenient alias for the latter when the database encoding is UTF-8). -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?
From
Peter Geoghegan
Date:
On Mon, Sep 25, 2017 at 11:40 AM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > Reading over this code again, it is admittedly not quite clear why this > "canonicalization" is in there right now. I think it had something to > do with how we built the keyword variants at one point. It might not > make sense. I'd be glad to take that out and use the result straight > from uloc_getAvailable() for collcollate. That is, after all, the > "canonical" version that ICU chooses to report to us. Is anything going to happen here ahead of shipping v10? This remains an open item. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags.Should it?
From
Noah Misch
Date:
On Mon, Sep 25, 2017 at 08:26:21AM +0000, Noah Misch wrote: > On Tue, Sep 19, 2017 at 07:01:47PM -0700, Peter Geoghegan wrote: > > On Tue, Sep 19, 2017 at 5:52 PM, Peter Eisentraut > > <peter.eisentraut@2ndquadrant.com> wrote: > > > On 9/18/17 18:46, Peter Geoghegan wrote: > > >> As I pointed out a couple of times already [1], we don't currently > > >> sanitize ICU's BCP 47 language tags within CREATE COLLATION. > > > > > > There is no requirement that the locale strings for ICU need to be BCP > > > 47. ICU locale names like 'de@collation=phonebook' are also acceptable. > > > > Right. But, we only document that BCP 47 is supported by Postgres. > > Maybe we can use get_icu_language_tag()/uloc_toLanguageTag() to ensure > > that we end up with BCP 47, even when the user happens to specify the > > legacy syntax. Should we be "canonicalizing" legacy ICU locale strings > > as BCP 47, too? > > > > > The reason they are not validated is that, as you know, ICU accepts any > > > locale string as valid. You appear to have found a way to do some > > > validation, but I would like to see that code. > > > > As I mentioned, I'm simply calling > > get_icu_language_tag()/uloc_toLanguageTag() to do that sanitization. > > The code to get the extra sanitization is completely trivial. > > > > I didn't post the patch that generates the errors in my opening e-mail > > because I'm not confident it's correct just yet. And, I think that I > > see a bigger problem: we pass a string that is almost certainly a BCP > > 47 string to ucol_open() from within pg_newlocale_from_collation(). We > > do so despite the fact that ucol_open() apparently doesn't accept BCP > > 47 syntax locale strings until ICU 54 [1]. Seems entirely possible > > that this accounts for the problems you saw on ICU 4.2, back when we > > were still creating keyword variants (I guess that the keyword > > variants seem very "BCP 47-ish" to me). > > > > I really think we need to add some kind of debug mode that makes ICU > > optionally spit out a locale display name at key points. We need this > > to gain confidence that the behavior that ICU provides actually > > matches what is expected across ICU different versions for different > > locale formats. We talked about this as a user-facing feature before, > > which can wait till v11; I just want this to debug problems like this > > one. > > > > [1] https://ssl.icu-project.org/apiref/icu4c/ucol_8h.html#a4721e4c0a519bb0139a874e191223590 > > [Action required within three days. This is a generic notification.] > > The above-described topic is currently a PostgreSQL 10 open item. Peter, > since you committed the patch believed to have created it, you own this open > item. If some other commit is more relevant or if this does not belong as a > v10 open item, please let us know. Otherwise, please observe the policy on > open item ownership[1] and send a status update within three calendar days of > this message. Include a date for your subsequent status update. Testers may > discover new open items at any time, and I want to plan to get them all fixed > well in advance of shipping v10. Consequently, I will appreciate your efforts > toward speedy resolution. Thanks. > > [1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com This PostgreSQL 10 open item is past due for your status update. On the worst week to be violating open item policies. Kindly send a status update within 24 hours, and include a date for your subsequent status update. Refer to the policy on open item ownership: https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags.Should it?
From
Noah Misch
Date:
On Mon, Sep 25, 2017 at 09:36:44AM -0700, Peter Geoghegan wrote: > On Mon, Sep 25, 2017 at 9:06 AM, Robert Haas <robertmhaas@gmail.com> wrote: > >> The big concern I have here is that this feels a lot like something that > >> we'll regret at leisure, if it's not right in the first release. I'd > >> much rather be restrictive in v10 and then loosen the rules later, than > >> be lax in v10 and then have to argue about whether to break backwards > >> compatibility in order to gain saner behavior. > > > > I think it's inevitable that a certain number of users are going to > > have to cope with ICU version changes breaking stuff. > > Wasn't the main point of adopting ICU that that doesn't happen when it > isn't essential? There will be some risk with pg_dump across ICU > versions, due rare to political changes. But, on pg_upgrade, the old > collations will continue to work, even if they would not have appeared > at initdb time on that ICU version, because ICU has all kinds of > fallbacks. I wouldn't describe it that way. I agree that few, if any, ICU upgrades will remove country or language codes. Overall, though, almost every ICU upgrade will be difficult. Each ICU release, even a minor release like 58.2, changes the sorting rules in some tiny way. You then see "Rebuild all objects affected by this collation" messages. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?
From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes: > On Mon, Sep 25, 2017 at 09:36:44AM -0700, Peter Geoghegan wrote: >>> I think it's inevitable that a certain number of users are going to >>> have to cope with ICU version changes breaking stuff. >> Wasn't the main point of adopting ICU that that doesn't happen when it >> isn't essential? > I wouldn't describe it that way. I agree that few, if any, ICU upgrades will > remove country or language codes. Overall, though, almost every ICU upgrade > will be difficult. Each ICU release, even a minor release like 58.2, changes > the sorting rules in some tiny way. You then see "Rebuild all objects > affected by this collation" messages. Sure, but dealing with that is mechanical: reindex the necessary indexes and you're done. I think it's important to distinguish that from a case where users have to change their collation definitions. That is a qualitatively different, and much harder, upgrade problem. I'd also argue that the point of adopting ICU was exactly so we *could* distinguish those cases, and limit the scope of a normal upgrade to "reindex these identifiable indexes and you're done". In the libc world, when you upgrade libc's locale definitions, you have no idea what the consequences are. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?
From
Peter Geoghegan
Date:
On Sat, Sep 30, 2017 at 8:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I'd also argue that the point of adopting ICU was exactly so we *could* > distinguish those cases, and limit the scope of a normal upgrade to > "reindex these identifiable indexes and you're done". In the libc world, > when you upgrade libc's locale definitions, you have no idea what the > consequences are. Right. With libc, we think of collations as something that there is a small, fixed number of on a system, that we cannot safely assume anything about. But with ICU, all of the semantics of how natural languages should be sorted are exposed via various APIs, and there are literally more possible sets of collation behaviors than there are grains of sand in the Sahara (there are hundreds of distinct scripts, which we can change the overall ordering of arbitrarily, on top of all the other customizations). Clearly the libc way of looking at things doesn't really carry over. BCP 47 is supposed to be universal -- it's an IETF standard. That's where all the stability guarantees are. The officially recognized 'u' extension that ICU uses is a CLDR/Unicode thing, not an ICU thing. The same format could, in the future, be used by other collation providers, since there actually are other CLDR consumers/UCA implementations. And, ICU have said that they have deprecated the old locale format, and have standardized on BCP 47. As of ICU 54, it is recommended that ucol_open() be passed a string in BCP 47 format. I'm surprised that this issue was not resolved earlier in the week. I presumed that all of this was obvious to Peter E., but I seem to have been wrong about that. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?
From
Noah Misch
Date:
On Sat, Sep 30, 2017 at 11:25:43AM -0400, Tom Lane wrote: > Noah Misch <noah@leadboat.com> writes: > > On Mon, Sep 25, 2017 at 09:36:44AM -0700, Peter Geoghegan wrote: > >>> I think it's inevitable that a certain number of users are going to > >>> have to cope with ICU version changes breaking stuff. > > >> Wasn't the main point of adopting ICU that that doesn't happen when it > >> isn't essential? > > > I wouldn't describe it that way. I agree that few, if any, ICU upgrades will > > remove country or language codes. Overall, though, almost every ICU upgrade > > will be difficult. Each ICU release, even a minor release like 58.2, changes > > the sorting rules in some tiny way. You then see "Rebuild all objects > > affected by this collation" messages. > > Sure, but dealing with that is mechanical: reindex the necessary indexes > and you're done. In the general case, one must revalidate CHECK constraints, re-partition tables, revalidate range values, and reindex. > In the libc world, > when you upgrade libc's locale definitions, you have no idea what the > consequences are. Yep. It's strictly better than the libc case. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?
From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes: > On Sat, Sep 30, 2017 at 11:25:43AM -0400, Tom Lane wrote: >> Sure, but dealing with that is mechanical: reindex the necessary indexes >> and you're done. > In the general case, one must revalidate CHECK constraints, re-partition > tables, revalidate range values, and reindex. True, but that is what it is: nothing we can do is going to affect the consequences of a collation behavior change, if there is one. What's more useful for our immediate purposes is to ask whether we can reliably detect a collation behavior change. False negatives are bad, but so are false positives, because those would force DBAs to jump through lots of hoops unnecessarily. So: are canonicalized locale descriptions any better or worse by that metric than non-canonicalized descriptions? In principle I think a canonicalized description might be more likely to be recognized as the "same" locale by another ICU version than one that isn't, but I don't know whether there's any meaningful difference in practice. Another point here is whether, even if a new ICU version recognizes a locale description as being "the same" interpretation that an old ICU version used, will it report the same collation version? Limited experimentation suggests that the collversions we're actually getting out of ICU depend on little other than the libicu version. "select distinct collversion from pg_collation where collversion is not null" produces this on ICU 4.2.1: 49.192.5.4149.192.0.41 and this on 52.1: 58.0.6.5058.0.0.50 and this on 57.1: 153.64.29153.64 This suggests to me that arguing about canonicalization is moot so far as avoiding reindexing goes: if you change ICU library versions, you're screwed and will have to jump through all the reindexing hoops, no matter what we do or don't do. (Maybe we are looking at the wrong information to populate collversion?) Now, it may still be worthwhile to argue about whether canonicalization will help the other component of the problem, which is whether you can dump and reload CREATE COLLATION commands into a new ICU version and expect to get more-or-less-the-same behavior as before. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?
From
Peter Geoghegan
Date:
On Sat, Sep 30, 2017 at 12:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > This suggests to me that arguing about canonicalization is moot so > far as avoiding reindexing goes: if you change ICU library versions, > you're screwed and will have to jump through all the reindexing hoops, > no matter what we do or don't do. (Maybe we are looking at the wrong > information to populate collversion?) Just to be clear, I was never arguing that canonicalization would avoid reindexing, and I didn't think anyone else was. I believe that the version string will change when the behavior of its collator changes for any reason and in any way. This includes changes to how binary sort keys are generated. (We don't currently store binary keys on disk, so a change to that representation doesn't really necessitate a REINDEX for us. The UCA spec explicitly decouples compatibility for indexing with binary keys from changes needed due to human requirements. ICU binary sort keys are compressed, and this is sometimes improved, independently of the evolution of how natural language experts say text should be sorted.) > Now, it may still be worthwhile to argue about whether canonicalization > will help the other component of the problem, which is whether you can > dump and reload CREATE COLLATION commands into a new ICU version and > expect to get more-or-less-the-same behavior as before. Again, to be clear, that's what I was arguing all along. I think that users will get very good results with this approach. When the sorting behavior of a locale is refined by natural language experts, it's almost certainly a very small change that most users that are affected won't actually notice. For example, when en-us-x-icu is changed, that's actually due to a general change in the inherited root collator that doesn't really affect English speakers. There is no practical question about how you're supposed to sort English text. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?
From
Peter Eisentraut
Date:
On 9/29/17 19:38, Peter Geoghegan wrote: > On Mon, Sep 25, 2017 at 11:40 AM, Peter Eisentraut > <peter.eisentraut@2ndquadrant.com> wrote: >> Reading over this code again, it is admittedly not quite clear why this >> "canonicalization" is in there right now. I think it had something to >> do with how we built the keyword variants at one point. It might not >> make sense. I'd be glad to take that out and use the result straight >> from uloc_getAvailable() for collcollate. That is, after all, the >> "canonical" version that ICU chooses to report to us. > > Is anything going to happen here ahead of shipping v10? This remains > an open item. I'm still pondering committing my documentation patch I had posted, and I've been reviewing your patches to see if there is anything else documentation-wise that could be added. As I had mentioned before, I disagree with the substance of the functionality changes you are proposing, and I think it would be way too late to make such significant changes. The general community opinion also seems to be in favor of letting things be. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?
From
Peter Eisentraut
Date:
On 9/30/17 15:28, Tom Lane wrote: > Now, it may still be worthwhile to argue about whether canonicalization > will help the other component of the problem, which is whether you can > dump and reload CREATE COLLATION commands into a new ICU version and > expect to get more-or-less-the-same behavior as before. Arguably, you don't always want that. Maybe you selected a locale name like 'en-NEWCOUNTRY', and in old ICU versions you want that to fall back to default 'en' behavior and in newer you get the updated custom behavior. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?
From
Peter Eisentraut
Date:
On 9/30/17 15:28, Tom Lane wrote: > This suggests to me that arguing about canonicalization is moot so > far as avoiding reindexing goes: if you change ICU library versions, > you're screwed and will have to jump through all the reindexing hoops, > no matter what we do or don't do. One reason for that is that the collation version also encodes things like if the internal method for computing sort keys changes. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?
From
Peter Eisentraut
Date:
On 9/30/17 03:01, Noah Misch wrote: > On Mon, Sep 25, 2017 at 08:26:21AM +0000, Noah Misch wrote: >> On Tue, Sep 19, 2017 at 07:01:47PM -0700, Peter Geoghegan wrote: >>> On Tue, Sep 19, 2017 at 5:52 PM, Peter Eisentraut >>> <peter.eisentraut@2ndquadrant.com> wrote: >>>> On 9/18/17 18:46, Peter Geoghegan wrote: >>>>> As I pointed out a couple of times already [1], we don't currently >>>>> sanitize ICU's BCP 47 language tags within CREATE COLLATION. >>>> >>>> There is no requirement that the locale strings for ICU need to be BCP >>>> 47. ICU locale names like 'de@collation=phonebook' are also acceptable. >>> >>> Right. But, we only document that BCP 47 is supported by Postgres. >>> Maybe we can use get_icu_language_tag()/uloc_toLanguageTag() to ensure >>> that we end up with BCP 47, even when the user happens to specify the >>> legacy syntax. Should we be "canonicalizing" legacy ICU locale strings >>> as BCP 47, too? >>> >>>> The reason they are not validated is that, as you know, ICU accepts any >>>> locale string as valid. You appear to have found a way to do some >>>> validation, but I would like to see that code. >>> >>> As I mentioned, I'm simply calling >>> get_icu_language_tag()/uloc_toLanguageTag() to do that sanitization. >>> The code to get the extra sanitization is completely trivial. >>> >>> I didn't post the patch that generates the errors in my opening e-mail >>> because I'm not confident it's correct just yet. And, I think that I >>> see a bigger problem: we pass a string that is almost certainly a BCP >>> 47 string to ucol_open() from within pg_newlocale_from_collation(). We >>> do so despite the fact that ucol_open() apparently doesn't accept BCP >>> 47 syntax locale strings until ICU 54 [1]. Seems entirely possible >>> that this accounts for the problems you saw on ICU 4.2, back when we >>> were still creating keyword variants (I guess that the keyword >>> variants seem very "BCP 47-ish" to me). >>> >>> I really think we need to add some kind of debug mode that makes ICU >>> optionally spit out a locale display name at key points. We need this >>> to gain confidence that the behavior that ICU provides actually >>> matches what is expected across ICU different versions for different >>> locale formats. We talked about this as a user-facing feature before, >>> which can wait till v11; I just want this to debug problems like this >>> one. >>> >>> [1] https://ssl.icu-project.org/apiref/icu4c/ucol_8h.html#a4721e4c0a519bb0139a874e191223590 >> >> [Action required within three days. This is a generic notification.] >> >> The above-described topic is currently a PostgreSQL 10 open item. Peter, >> since you committed the patch believed to have created it, you own this open >> item. If some other commit is more relevant or if this does not belong as a >> v10 open item, please let us know. Otherwise, please observe the policy on >> open item ownership[1] and send a status update within three calendar days of >> this message. Include a date for your subsequent status update. Testers may >> discover new open items at any time, and I want to plan to get them all fixed >> well in advance of shipping v10. Consequently, I will appreciate your efforts >> toward speedy resolution. Thanks. >> >> [1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com > > This PostgreSQL 10 open item is past due for your status update. On the worst > week to be violating open item policies. Kindly send a status update within > 24 hours, and include a date for your subsequent status update. Refer to the > policy on open item ownership: > https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com I had previously replied that I think nothing should be changed. What process should be applied in that case? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?
From
Peter Geoghegan
Date:
On Sun, Oct 1, 2017 at 6:27 AM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > I'm still pondering committing my documentation patch I had posted, and > I've been reviewing your patches to see if there is anything else > documentation-wise that could be added. -1 from me -- I don't think we should tell users about the deprecated locale format at all. I hope that ICU continue to support the deprecated locale string format within ucol_open() into the future, and that users will only ever use BCP 47 within CREATE COLLATION (as the 'locale' string). Perhaps it won't matter that much in the end, and will turn out to be more of a wart than something that causes pain for users. It's hard to predict. > As I had mentioned before, I disagree with the substance of the > functionality changes you are proposing, and I think it would be way too > late to make such significant changes. The general community opinion > also seems to be in favor of letting things be. It does seem too late. It's disappointing that we did not do better here. This problem was entirely avoidable. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?
From
Noah Misch
Date:
On Sun, Oct 01, 2017 at 09:56:11AM -0400, Peter Eisentraut wrote: > On 9/30/17 03:01, Noah Misch wrote: > > This PostgreSQL 10 open item is past due for your status update. On the worst > > week to be violating open item policies. Kindly send a status update within > > 24 hours, and include a date for your subsequent status update. Refer to the > > policy on open item ownership: > > https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com > > I had previously replied that I think nothing should be changed. What > process should be applied in that case? If you determine community decision-making is not against that conclusion, edit the list to move the item from "Open Issues" to "Non-bugs". -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?
From
Peter Eisentraut
Date:
On 10/1/17 14:26, Peter Geoghegan wrote: > It does seem too late. It's disappointing that we did not do better > here. This problem was entirely avoidable. I understand where you are coming from. My experience with developing this feature has been that it is quite fragile in some ways. We have had this out there for testing for many months, and we have fixed many bugs, and follow-up bugs, up to very recently. We don't have good automatic test coverage, so we need to rely on user testing. Making significant code and design changes a week or two before the final release is just too late, even if the changes were undisputed. And this wasn't just one specific isolated change, it was a set of overlapping changes that seemingly kept changing and expanding. Analyzing and reviewing that under pressure, and then effectively owning it going forward, too, is not something I could commit to. I'm satisfied that what we are shipping is "good enough". It has been in development for over a year, it has been reviewed and tested for months, and a lot of credit for that goes to you. The "old" locale support took many years to take shape, and this one probably will, too, but at some point I feel we have to let it be for a while and take a breather and get some feedback and field experience. I'm available to discuss the changes you are envisioning, preferably one at a time, in the near future as part of the PG11 development process. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?
From
Peter Geoghegan
Date:
On Mon, Oct 2, 2017 at 9:42 AM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > I understand where you are coming from. My experience with developing > this feature has been that it is quite fragile in some ways. We have > had this out there for testing for many months, and we have fixed many > bugs, and follow-up bugs, up to very recently. We don't have good > automatic test coverage, so we need to rely on user testing. Making > significant code and design changes a week or two before the final > release is just too late, even if the changes were undisputed. And this > wasn't just one specific isolated change, it was a set of overlapping > changes that seemingly kept changing and expanding. For the record, I don't accept that summary. We could have not bothered with the sanitization thing at all, and I wouldn't have cared very much. I even revised my proposal such that you would never have needed to do the language tag mapping at all [1] (albeit while desupporting ICU versions prior to 4.6). And, as recently as 7 days ago, you yourself said: "Reading over this code again, it is admittedly not quite clear why this "canonicalization" is in there right now. I think it had something to do with how we built the keyword variants at one point. It might not make sense. I'd be glad to take that out and use the result straight from uloc_getAvailable() for collcollate." [2] This would have been far more radical than what I proposed. > I'm satisfied that what we are shipping is "good enough". It has been > in development for over a year, it has been reviewed and tested for > months, and a lot of credit for that goes to you. It is "good enough", I suppose. I am disappointed by this particular outcome, but that's mostly because it could have been avoided. I'm still very happy that you took on this project, and I don't want that to be overshadowed by this particular disagreement. > I'm available to discuss the changes you are envisioning, preferably one > at a time, in the near future as part of the PG11 development process. I don't really think that there is anything more to be done about the issues raised on this thread, with the possible exception of the DEBUG1 display name string thing. The sanitization just isn't valuable enough to add after the fact. Apart from the risks of adding it after the fact, another reason not to bother with it is that it's *incredibly* forgiving. So forgiving that you could reasonably argue that we might as well not have it at all. I may well do more on ICU with you in the future, but not in the area of how things are canonicalized or sanitized. It's too late for that now. [1] https://www.postgresql.org/message-id/CAH2-WzmVtRyNg2gT=u=ktEC-jM3aKq4bYzJ0u2=OsxE+O3KQ4g@mail.gmail.com [2] https://www.postgresql.org/message-id/f6c0fca7-e277-3f46-c0c1-adc001bffdd7@2ndquadrant.com -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers