Thread: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?

[HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?

From

Peter Geoghegan

Date:

19 September 2017, 01:46:39

As I pointed out a couple of times already [1], we don't currently
sanitize ICU's BCP 47 language tags within CREATE COLLATION. CREATE
COLLATION will accept literally any string as a language tag as things
stand, even when the string is unambiguously bogus. While I accept
that there are limits on how far you can take sanitizing the BCP 47
tag format, due to its extensibility and "best effort" emphasis on
forward and backward compatibility, we can and should do more here,
IMHO. We should at least do the bare minimum, which has no possible
downside, and some notable upsides.

If I hack the CREATE COLLATION code to put any language tag string
provided by the user through the same sanitization process that initdb
already puts ICU language tags through, then we do much better. CREATE
COLLATION rejects syntax errors, which seems desirable:

postgres=# CREATE COLLATION test1 (provider = icu, locale = 'en-x-icu');
CREATE COLLATION
postgres=# CREATE COLLATION test2 (provider = icu, locale = 'foo bar baz');
ERROR:  XX000: could not convert locale name "foo bar baz" to language
tag: U_ILLEGAL_ARGUMENT_ERROR
LOCATION:  get_icu_language_tag, collationcmds.c:454
postgres=# CREATE COLLATION test3 (provider = icu, locale = 'en-gb-icu');
ERROR:  XX000: could not convert locale name "en-gb-icu" to language
tag: U_ILLEGAL_ARGUMENT_ERROR
LOCATION:  get_icu_language_tag, collationcmds.c:454
postgres=# CREATE COLLATION test4 (provider = icu, locale = 'not-a-country');
CREATE COLLATION

(To be more specific, I'm calling
get_icu_language_tag()/uloc_toLanguageTag() [2] as an extra step for
CREATE COLLATION here.)

It's not like the current behavior is a disaster, or that the
alternative behavior that I propose is perfect. The collation behavior
you end up with today, having specified a language tag with a syntax
error is the totally generic base Ducet collation behavior. Using 'foo
bar baz' is effectively the same as using the preinstalled 'und-x-icu'
collation, which I'm pretty sure is the same as using any current
English locale anyway. That said, falling back on the most useful
collation behavior based on inferences about the language tag is
supposed to be about rolling with the complexities of
internationalization, like political changes that are not yet
accounted for by the CLDR/ICU version, and so on. It's no
justification for not letting the user know when they've fat fingered
a language tag, which they could easily miss. These are *syntax*
errors.

At one point a couple of months back, it was understood that
get_icu_language_tag() might not always work with (assumed) valid
locale names -- that is at least the impression that the commit
message of eccead9 left me with. But, that was only with ICU 4.2, and
in any case we've since stopped creating keyword variants at initdb
time for other reasons (see 2bfd1b1 for details of those other
reasons). I tend to think that we should not install any language tag
that uloc_toLanguageTag() does not accept as valid on general
principle (so not just at initdb time, when it's actually least
needed).

Thoughts? I can write a patch for this, if that helps. It should be
straightforward.

[1] postgr.es/m/CAH2-Wzm22vtxvD-e1oz90DE8Z_M61_8amHsDOZf1PWRKfRmj1g@mail.gmail.com
[2] https://ssl.icu-project.org/apiref/icu4c/uloc_8h.html#a1d50c91925ca3853fce6f28cf7390c3c
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?

From

Andreas Karlsson

Date:

20 September 2017, 00:19:54

On 09/19/2017 12:46 AM, Peter Geoghegan wrote:> At one point a couple of 
months back, it was understood that
> get_icu_language_tag() might not always work with (assumed) valid
> locale names -- that is at least the impression that the commit
> message of eccead9 left me with. But, that was only with ICU 4.2, and
> in any case we've since stopped creating keyword variants at initdb
> time for other reasons (see 2bfd1b1 for details of those other
> reasons). I tend to think that we should not install any language tag
> that uloc_toLanguageTag() does not accept as valid on general
> principle (so not just at initdb time, when it's actually least
> needed).
> 
> Thoughts? I can write a patch for this, if that helps. It should be
> straightforward.

Hm, I like the idea but I see some issues.

Enforcing the BCP47 seems like a good thing to me. I do not see any 
reason to allow input with syntax errors. The issue though is that we do 
not want to break people's databases when they upgrade to PostgreSQL 11. 
What if they have specified the locale in the old non-ICU format or they 
have a bogus value and we then error out on pg_upgrade or pg_restore?

Andreas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?

From

Tom Lane

Date:

20 September 2017, 00:22:58

Andreas Karlsson <andreas@proxel.se> writes:
> Hm, I like the idea but I see some issues.

> Enforcing the BCP47 seems like a good thing to me. I do not see any 
> reason to allow input with syntax errors. The issue though is that we do 
> not want to break people's databases when they upgrade to PostgreSQL 11. 
> What if they have specified the locale in the old non-ICU format or they 
> have a bogus value and we then error out on pg_upgrade or pg_restore?

Well, if PG10 shipped with that restriction in place then it wouldn't
be an issue ;-)
        regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?

From

Peter Geoghegan

Date:

20 September 2017, 00:32:32

On Tue, Sep 19, 2017 at 2:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andreas Karlsson <andreas@proxel.se> writes:
>> Hm, I like the idea but I see some issues.
>
>> Enforcing the BCP47 seems like a good thing to me. I do not see any
>> reason to allow input with syntax errors. The issue though is that we do
>> not want to break people's databases when they upgrade to PostgreSQL 11.
>> What if they have specified the locale in the old non-ICU format or they
>> have a bogus value and we then error out on pg_upgrade or pg_restore?
>
> Well, if PG10 shipped with that restriction in place then it wouldn't
> be an issue ;-)

I was proposing that this be treated as an open item for v10; sorry if
I was unclear on that. Much like the "ICU locales vs. ICU collations
within pg_collation" issue, this seems like the kind of thing that we
ought to go out of our way to get right in the *first* version.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?

From

Andreas Karlsson

Date:

20 September 2017, 01:23:10

On 09/19/2017 11:32 PM, Peter Geoghegan wrote:
> On Tue, Sep 19, 2017 at 2:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Well, if PG10 shipped with that restriction in place then it wouldn't
>> be an issue ;-)
> 
> I was proposing that this be treated as an open item for v10; sorry if
> I was unclear on that. Much like the "ICU locales vs. ICU collations
> within pg_collation" issue, this seems like the kind of thing that we
> ought to go out of our way to get right in the *first* version.

If people think it is possible to get this done in time for PostgreSQL 
10 and it does not break anything on older version of ICU (or the 
migration from older versions) I am all for it.

Andreas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?

From

Peter Geoghegan

Date:

20 September 2017, 01:47:04

On Tue, Sep 19, 2017 at 3:23 PM, Andreas Karlsson <andreas@proxel.se> wrote:
> If people think it is possible to get this done in time for PostgreSQL 10
> and it does not break anything on older version of ICU (or the migration
> from older versions) I am all for it.

The only behavioral difference would occur when CREATE COLLATION is
run (no changes to the initdb behavior, where the real risk exists).

What this boils down to is that we have every reason to think that the
right thing is to reject something that ICU's uloc_toLanguageTag()
itself rejects as invalid (through the get_icu_language_tag()
function). It looked like we were equivocating on following this at
one point, prior to 2bfd1b1, in order to suit ICU 4.2 (again, see
commit eccead9). I tend to think that the way we used to concatenate
variant keywords against locale names during initdb was simply wrong
in ICU 4.2, for some unknown reason. I think that the behavior that I
propose might prevent things from silently failing on ICU 4.2 that
were previously *believed* to work there, but in fact these things
(variant keywords) never really worked.

There might be an argument to be made for passing strict = FALSE to
uloc_toLanguageTag() instead, so that we replace the language tag with
one that is known to have valid syntax, and store that in pg_collation
instead (while possibly raising a NOTICE). I guess that that would
actually be the minimal fix here. I still favor a hard reject of
invalid BCP 47 syntax, though. PostgreSQL's CREATE COLLATION is not
one of the places where this kind of leniency makes sense.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?

From

Peter Eisentraut

Date:

20 September 2017, 03:52:22

On 9/18/17 18:46, Peter Geoghegan wrote:
> As I pointed out a couple of times already [1], we don't currently
> sanitize ICU's BCP 47 language tags within CREATE COLLATION.

There is no requirement that the locale strings for ICU need to be BCP
47.  ICU locale names like 'de@collation=phonebook' are also acceptable.

The reason they are not validated is that, as you know, ICU accepts any
locale string as valid.  You appear to have found a way to do some
validation, but I would like to see that code.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?

From

Peter Geoghegan

Date:

20 September 2017, 05:01:47

On Tue, Sep 19, 2017 at 5:52 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> On 9/18/17 18:46, Peter Geoghegan wrote:
>> As I pointed out a couple of times already [1], we don't currently
>> sanitize ICU's BCP 47 language tags within CREATE COLLATION.
>
> There is no requirement that the locale strings for ICU need to be BCP
> 47.  ICU locale names like 'de@collation=phonebook' are also acceptable.

Right. But, we only document that BCP 47 is supported by Postgres.
Maybe we can use get_icu_language_tag()/uloc_toLanguageTag() to ensure
that we end up with BCP 47, even when the user happens to specify the
legacy syntax. Should we be "canonicalizing" legacy ICU locale strings
as BCP 47, too?

> The reason they are not validated is that, as you know, ICU accepts any
> locale string as valid.  You appear to have found a way to do some
> validation, but I would like to see that code.

As I mentioned, I'm simply calling
get_icu_language_tag()/uloc_toLanguageTag() to do that sanitization.
The code to get the extra sanitization is completely trivial.

I didn't post the patch that generates the errors in my opening e-mail
because I'm not confident it's correct just yet. And, I think that I
see a bigger problem: we pass a string that is almost certainly a BCP
47 string to ucol_open() from within pg_newlocale_from_collation(). We
do so despite the fact that ucol_open() apparently doesn't accept BCP
47 syntax locale strings until ICU 54 [1]. Seems entirely possible
that this accounts for the problems you saw on ICU 4.2, back when we
were still creating keyword variants (I guess that the keyword
variants seem very "BCP 47-ish" to me).

I really think we need to add some kind of debug mode that makes ICU
optionally spit out a locale display name at key points. We need this
to gain confidence that the behavior that ICU provides actually
matches what is expected across ICU different versions for different
locale formats. We talked about this as a user-facing feature before,
which can wait till v11; I just want this to debug problems like this
one.

[1] https://ssl.icu-project.org/apiref/icu4c/ucol_8h.html#a4721e4c0a519bb0139a874e191223590
-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?

From

Peter Geoghegan

Date:

20 September 2017, 21:23:58

On Tue, Sep 19, 2017 at 7:01 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> I didn't post the patch that generates the errors in my opening e-mail
> because I'm not confident it's correct just yet. And, I think that I
> see a bigger problem: we pass a string that is almost certainly a BCP
> 47 string to ucol_open() from within pg_newlocale_from_collation(). We
> do so despite the fact that ucol_open() apparently doesn't accept BCP
> 47 syntax locale strings until ICU 54 [1]. Seems entirely possible
> that this accounts for the problems you saw on ICU 4.2, back when we
> were still creating keyword variants (I guess that the keyword
> variants seem very "BCP 47-ish" to me).

ISTM that the proper fix here is to use uloc_forLanguageTag() [1] (not
to be confused with uloc_toLanguageTag()) to get a valid locale
identifier on versions of ICU where BCP 47 format tags are not
directly accepted as locale identifiers (versions prior to ICU 54).
This would happen as an extra step within
pg_newlocale_from_collation(), since BCP 47 format would be what is
stored in pg_collation.

Since uloc_forLanguageTag() become stable in ICU 4.2, the earliest
version that we support, I believe that that would leave us in good
shape.

[1] https://ssl.icu-project.org/apiref/icu4c/uloc_8h.html#aa45d6457f72867880f079e27a63c6fcb
-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?

From

Peter Eisentraut

Date:

21 September 2017, 02:04:52

On 9/19/17 22:01, Peter Geoghegan wrote:
> On Tue, Sep 19, 2017 at 5:52 PM, Peter Eisentraut
> <peter.eisentraut@2ndquadrant.com> wrote:
>> On 9/18/17 18:46, Peter Geoghegan wrote:
>>> As I pointed out a couple of times already [1], we don't currently
>>> sanitize ICU's BCP 47 language tags within CREATE COLLATION.
>>
>> There is no requirement that the locale strings for ICU need to be BCP
>> 47.  ICU locale names like 'de@collation=phonebook' are also acceptable.
> 
> Right. But, we only document that BCP 47 is supported by Postgres.

The documentation is admittedly not very concrete about what ICU locale
names it accepts beyond talking about a "named collator provided by the
ICU library".  The examples we provide use the BCP 47 style, but that's
just because we liked them that way.

ICU <54 doesn't even support the BCP 47 style, so we need to keep
supporting the old/other style anyway.

> And, I think that I
> see a bigger problem: we pass a string that is almost certainly a BCP
> 47 string to ucol_open() from within pg_newlocale_from_collation(). We
> do so despite the fact that ucol_open() apparently doesn't accept BCP
> 47 syntax locale strings until ICU 54 [1].

pg_import_system_collations() takes care to use the non-BCP-47 style for
such versions, so I think this is working correctly.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?

From

Peter Geoghegan

Date:

21 September 2017, 02:08:27

On Wed, Sep 20, 2017 at 4:04 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> ICU <54 doesn't even support the BCP 47 style, so we need to keep
> supporting the old/other style anyway.

Yes it does -- just not as a native locale name. ucol_open()
apparently became more liberal in what it would accept in ICU 54.

>> And, I think that I
>> see a bigger problem: we pass a string that is almost certainly a BCP
>> 47 string to ucol_open() from within pg_newlocale_from_collation(). We
>> do so despite the fact that ucol_open() apparently doesn't accept BCP
>> 47 syntax locale strings until ICU 54 [1].
>
> pg_import_system_collations() takes care to use the non-BCP-47 style for
> such versions, so I think this is working correctly.

But CREATE COLLATION doesn't use pg_import_system_collations().

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?

From

Peter Geoghegan

Date:

21 September 2017, 02:40:45

On Wed, Sep 20, 2017 at 4:08 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> pg_import_system_collations() takes care to use the non-BCP-47 style for
>> such versions, so I think this is working correctly.
>
> But CREATE COLLATION doesn't use pg_import_system_collations().

And perhaps more to the point: it highly confusing that we use one or
the other of those 2 things ("langtag"/BCP 47 tag or "name"/legacy
locale name) as "colcollate", depending on ICU version, thereby
*behaving* as if ICU < 54 really didn't know anything about BCP 47
tags. Because, obviously earlier ICU versions know plenty about BCP
47, since 9 lines further down we use "langtag"/BCP 47 tag as collname
for CollationCreate() -- regardless of ICU version.

How can you say "ICU <54 doesn't even support the BCP 47 style", given
all that? Those versions will still have locales named "*-x-icu" when
users do "\dOS". Users will be highly confused when they quite
reasonably try to generalize from the example in the docs and what
"\dOS" shows, and get results that are wrong, often only in a very
subtle way.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?

From

Andreas Karlsson

Date:

21 September 2017, 12:49:36

On 09/21/2017 01:40 AM, Peter Geoghegan wrote:
> On Wed, Sep 20, 2017 at 4:08 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>>> pg_import_system_collations() takes care to use the non-BCP-47 style for
>>> such versions, so I think this is working correctly.
>>
>> But CREATE COLLATION doesn't use pg_import_system_collations().
> 
> And perhaps more to the point: it highly confusing that we use one or
> the other of those 2 things ("langtag"/BCP 47 tag or "name"/legacy
> locale name) as "colcollate", depending on ICU version, thereby
> *behaving* as if ICU < 54 really didn't know anything about BCP 47
> tags. Because, obviously earlier ICU versions know plenty about BCP
> 47, since 9 lines further down we use "langtag"/BCP 47 tag as collname
> for CollationCreate() -- regardless of ICU version.
> 
> How can you say "ICU <54 doesn't even support the BCP 47 style", given
> all that? Those versions will still have locales named "*-x-icu" when
> users do "\dOS". Users will be highly confused when they quite
> reasonably try to generalize from the example in the docs and what
> "\dOS" shows, and get results that are wrong, often only in a very
> subtle way.

If we are fine with supporting only ICU 4.2 and later (which I think we 
are given that ICU 4.2 was released in 2009) then using 
uloc_forLanguageTag()[1] to validate and canonize seems like the right 
solution. I had missed that this function even existed when I last read 
the documentation. Does it return a BCP 47 tag in modern versions of ICU?

I strongly prefer if there, as much as possible, is only one format for 
inputting ICU locales.

1. 
http://www.icu-project.org/apiref/icu4c/uloc_8h.html#aa45d6457f72867880f079e27a63c6fcb

Andreas



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?

From

Peter Geoghegan

Date:

21 September 2017, 23:55:53

On Thu, Sep 21, 2017 at 2:49 AM, Andreas Karlsson <andreas@proxel.se> wrote:
> If we are fine with supporting only ICU 4.2 and later (which I think we are
> given that ICU 4.2 was released in 2009) then using uloc_forLanguageTag()[1]
> to validate and canonize seems like the right solution. I had missed that
> this function even existed when I last read the documentation. Does it
> return a BCP 47 tag in modern versions of ICU?

The decision to support ICU >= 4.2 was already made (see commit
eccead9). I have no reason to think that that's a problem for us.

As I've said, we currently use uloc_toLanguageTag() on all supported
ICU versions, to at least create a collation name at initdb time, but
also to get our new collation's colcollate when ICU >= 54. If we now
put a uloc_forLanguageTag() in place before each ucol_open() call,
then we can safely store a BCP 47 format tag as colcollate on *every*
ICU version. We can reconstruct a traditional format locale string
*on-the-fly* within pg_newlocale_from_collation() and
get_collation_actual_version(), which would be what we'd pass to
ucol_open() (we'd never pass "raw colcollate").

To keep things simple, we could always convert colcollate to the
traditional format using uloc_forLanguageTag(), just in case we're on
a version of ICU where ucol_open() doesn't like locales that are
spelled using the BCP 47 format. It doesn't seem worth it to take
advantage of the leniency on ICU >= 54, since that would be a special
case (though we could if we wanted to).

> I strongly prefer if there, as much as possible, is only one format for
> inputting ICU locales.

I agree, but the bigger issue is that we're *half way* between
supporting only one format, and supporting two formats. AFAICT, there
is no reason that we can't simply support one format on all ICU
versions, and keep what ends up within pg_collation at initdb time
essentially the same across ICU versions (except for those that are
due to cultural/political developments).

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?

From

Peter Geoghegan

Date:

22 September 2017, 03:53:05

On Tue, Sep 19, 2017 at 7:01 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> I really think we need to add some kind of debug mode that makes ICU
> optionally spit out a locale display name at key points. We need this
> to gain confidence that the behavior that ICU provides actually
> matches what is expected across ICU different versions for different
> locale formats. We talked about this as a user-facing feature before,
> which can wait till v11; I just want this to debug problems like this
> one.

I patched CREATE COLLATION to show ICU display name, which produces
output like this:

postgres=# create collation basque (provider=icu,
locale='eu-u-kf-upper-kr-latn-digit-em-emoji-kn-true-co-eor');
NOTICE:  00000: ICU collation description is "Basque
(colcasefirst=upper, Sort Order=European Ordering Rules,
colnumeric=yes, colreorder=latn-digit, em=emoji)"
CREATE COLLATION

I used an ISO 639-1 language code (2 letter language code) above,
which, as we can see, is recognized as Basque. ICU is also fine with
the 3 letter 639-2 code "eus-", recognizing that as Basque, too. If I
use an ISO 639-2 code for Basque that ICU/CLDR doesn't like, "baq-*",
I can see that my expectations have only partially been met, since the
notice doesn't say anything about the language Basque:

postgres=# create collation actually_not_basque (provider=icu,
locale='baq-u-kf-upper-kr-latn-digit-em-emoji-kn-true-co-eor');
NOTICE:  00000: ICU collation description is "baq (colcasefirst=upper,
Sort Order=European Ordering Rules, colnumeric=yes,
colreorder=latn-digit, em=emoji)"
CREATE COLLATION

Functionality like this is starting to look essential to me, rather
than just a nice to have. Having this NOTICE would have made me
realize our problems with ICU versions < 54 much earlier, if nothing
else. If the purpose of NOTICE messages is to "Provide[s] information
that might be helpful to users", then I'd say that this definitely
qualifies. And, the extra code is trivial (we already get display name
in the context of initdb). I strongly recommend that we slip this into
v10, as part of fixing the problem with language tags that earlier ICU
versions have.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?

From

Andreas Karlsson

Date:

22 September 2017, 11:50:01

On 09/21/2017 10:55 PM, Peter Geoghegan wrote:
> I agree, but the bigger issue is that we're *half way* between
> supporting only one format, and supporting two formats. AFAICT, there
> is no reason that we can't simply support one format on all ICU
> versions, and keep what ends up within pg_collation at initdb time
> essentially the same across ICU versions (except for those that are
> due to cultural/political developments).

I think we are in agreement then, but I do not have the time to get this 
done before the release of 10 so would be happy if you implemented it. 
Peter E, what do you say in this?

Andreas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?

From

Peter Geoghegan

Date:

22 September 2017, 18:43:24

On Fri, Sep 22, 2017 at 1:50 AM, Andreas Karlsson <andreas@proxel.se> wrote:
> On 09/21/2017 10:55 PM, Peter Geoghegan wrote:
>>
>> I agree, but the bigger issue is that we're *half way* between
>> supporting only one format, and supporting two formats. AFAICT, there
>> is no reason that we can't simply support one format on all ICU
>> versions, and keep what ends up within pg_collation at initdb time
>> essentially the same across ICU versions (except for those that are
>> due to cultural/political developments).
>
>
> I think we are in agreement then, but I do not have the time to get this
> done before the release of 10 so would be happy if you implemented it. Peter
> E, what do you say in this?

I can write a patch for this, but will not without some kind of buy-in
from Peter E.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?

From

Peter Eisentraut

Date:

22 September 2017, 21:34:45

On 9/21/17 16:55, Peter Geoghegan wrote:
>> I strongly prefer if there, as much as possible, is only one format for
>> inputting ICU locales.
> I agree, but the bigger issue is that we're *half way* between
> supporting only one format, and supporting two formats. AFAICT, there
> is no reason that we can't simply support one format on all ICU
> versions, and keep what ends up within pg_collation at initdb time
> essentially the same across ICU versions (except for those that are
> due to cultural/political developments).

After reviewing this thread and testing around a bit, I think the code
is mostly fine as it is, but the documentation is lacking.  Hence,
attached is a patch to expand the documentation quite a bit, especially
to document in more detail what ICU locale strings are accepted.

The documentation has always stated, albeit perhaps in obscure ways,
that we accept for locales whatever ICU accepts.  I don't think we
should restrict or override that in any way.  That would cause existing
documentation and lore on ICU to be invalid for PostgreSQL.

I specifically disagree that we should, as appears to be suggested here,
restrict the input locale strings to BCP 47 and reject or transform the
traditional ICU-specific locale syntax.  Again, that would cause
existing ICU documentation material to become less applicable to
PostgreSQL.  And basically, there is no reason for it, because I am not
aware that ICU plans to get rid of that syntax.  Moreover, we need to
support that syntax for older ICU versions anyway.  A patch has been
posted that, as I understand it, would allow BCP 47 syntax to be used
with older versions as well.  That's a nice idea, but it's a new feature
and not appropriate for PG10 at this time.

(Note that we also don't canonicalize libc locale names.)

The attached documentation patch documents both locale naming forms in
parallel.

The other attached patch contains a test suite that verifies that the
examples in the documentation actually work.  It's not meant for
committing into the mainline, but it was useful for me.

During testing with various versions I have also discovered the
following things:

- The current code appears to be of the opinion that BCP 47 locale names
are accepted as of ICU 54.  That appears to be incorrect; I find that
they work fine in ICU 52, but they don't work in ICU 4.2.  I don't know
where the cutoff is.  That might be worth changing in the code if we can
obtain more information.

- What might have led to the above mistake is the following in the
ucol_open() documentation: q{Starting with ICU 54, collation attributes
can be specified via locale keywords as well, in the old locale
extension syntax ("el@colCaseFirst=upper") or in language tag syntax
("el-u-kf-upper").}  That is correct.  If you use that syntax in earlier
versions, the case-first specification in this example is ignored.  You
need to use ucol_setAttribute() then.  (Note that the phonebook stuff
still works, because that is not a "collation attribute".)

(One of my plans for PG11 had been to allow specifying such collation
attributes via additional CREATE COLLATION clauses and pg_collation
columns, but that might be obsolete now.)

- Moreover, it is not the case that ICU accepts just any sort of
nonsense as a locale name.  For example, "el@colCaseFirst=whatever" is
rejected with U_ILLEGAL_ARGUMENT_ERROR.  Now, it might in other cases be
more liberal than we might be hoping for.  But I think they have reasons
for what they do.

One conclusion here is that there are multiple dimensions to what sort
of thing is accepted as a locale in different versions of ICU and what
the canonical format is.  If we insist that everything has to be written
in the form that is preferred today, then we'll have awkward problems if
a future version of ICU establishes even more variants that will then be
preferred.

I think there is room for incremental refinement here.  Features like
optionally checking or printing the canonical or preferred locale format
or making the locale description available via a function or system view
might all be valuable additions to a future PostgreSQL release.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?

From

Peter Geoghegan

Date:

22 September 2017, 23:46:34

On Fri, Sep 22, 2017 at 11:34 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> After reviewing this thread and testing around a bit, I think the code
> is mostly fine as it is, but the documentation is lacking.  Hence,
> attached is a patch to expand the documentation quite a bit, especially
> to document in more detail what ICU locale strings are accepted.
>
> The documentation has always stated, albeit perhaps in obscure ways,
> that we accept for locales whatever ICU accepts.  I don't think we
> should restrict or override that in any way.  That would cause existing
> documentation and lore on ICU to be invalid for PostgreSQL.

I think that this thread is mostly about three fairly different
things. I didn't plan it that way, but then I only realized the second
two while investigating the first. Those are:

1. The question of whether and to what extent we should sanitize the
colcollate string provided by the user within CREATE COLLATION for the
ICU collation provider.

2. The question of what ends up in pg_collation at initdb time. In
particular, the format of colcollate.

3. The question of whether or not we should ever accept a locale in
the traditional format, rather than insisting on BCP 47 in every
context. (This may have become conflated with issue 1.)

> I specifically disagree that we should, as appears to be suggested here,
> restrict the input locale strings to BCP 47 and reject or transform the
> traditional ICU-specific locale syntax.  Again, that would cause
> existing ICU documentation material to become less applicable to
> PostgreSQL.  And basically, there is no reason for it, because I am not
> aware that ICU plans to get rid of that syntax.

I disagree, because ICU/CLDR seems to want to standardize on BCP 47
(with custom extensions), but I acknowledge that you have a point
here. This isn't what I think is important, among all the points
raised on this thread. I can let this go.

> Moreover, we need to
> support that syntax for older ICU versions anyway.

FWIW, I don't think that we *need* to support it for older ICU
versions, except as an implementation detail that gets us to and from
BCP 47 as needed.

> A patch has been
> posted that, as I understand it, would allow BCP 47 syntax to be used
> with older versions as well.  That's a nice idea, but it's a new feature, which may have been my fault, which may
havebeen my fault

> and not appropriate for PG10 at this time.
>
> (Note that we also don't canonicalize libc locale names.)

But you are *already* canonicalizing ICU collation names as BCP 47. My
point here is: Why not finish the job off, and *also* canonicalize
colcollate in the same way? This won't break ucol_open() if we take
appropriate precautions when we go to use the Postgres collation/ICU
locale. One concern that makes me suggest this is: What happens when
the user *downgrades* ICU version, from a version where colcollate is
BCP 47 to one where it would not have been at initdb time? That will
break the downgrade in an unpleasant way, including in installations
that never do a CREATE COLLATION themselves. We want to be able to
restore a basebackup on a somewhat different OS, and have that still
work following REINDEX. At least, that seems like it should be an
important goal for us.

I want Postgres to insist on always using BCP 47 in CREATE COLLATION
for a few reasons. One that is relevant to this colcollate question is
that by insisting on BCP 47 within CREATE COLLATION, there is no
question of CREATE COLLATION having to consider the legacy locale
format on ICU versions that don't handle both at the same time too
well (this at least includes ICU 42). We'd always only accept BCP 47
from users, we'd always store BCP 47 as colcollate (and collation
name), and we'd always use the traditional locale string format as an
argument to ucol_open(). Consistency of interface and implementation,
across all ICU versions.

I might actually be convinced by what you say here if we weren't
already canonicalizing collation name as BCP 47 (though I also
understand why you did that).

> During testing with various versions I have also discovered the
> following things:
>
> - The current code appears to be of the opinion that BCP 47 locale names
> are accepted as of ICU 54.  That appears to be incorrect; I find that
> they work fine in ICU 52, but they don't work in ICU 4.2.  I don't know
> where the cutoff is.  That might be worth changing in the code if we can
> obtain more information.

I fear that BCP 47 is only quasi supported (without the proper
conversion by uloc_forLanguageTag()) prior to ICU 54 (the version
where it is actually documented as supported). We've already seen
plenty of cases where ucol_open() locale string interpretation
soldiers on, when it arguably shouldn't, so I hope that that isn't
what you actually see on ICU 52. I strongly suggest looking at a
variety of display names at CREATE COLLATION time (I can provide a
rough patch that shows display name [1]), to make sure that it matches
expectations on all ICU versions. Your testing patch does not satisfy
me that versions prior to ICU 54 have a ucol_open() that accepts BCP
47 without *any* problem (it could be a subtle issue).

Besides, if it isn't going to work on ICU 4.2 anyway, I think that the
leniency of ICU 52 isn't actually interesting. We still have an
awkward special case to deal with.

> One conclusion here is that there are multiple dimensions to what sort
> of thing is accepted as a locale in different versions of ICU and what
> the canonical format is.  If we insist that everything has to be written
> in the form that is preferred today, then we'll have awkward problems if
> a future version of ICU establishes even more variants that will then be
> preferred.

I'd feel a lot better about that if there was a NOTICE showing the ICU
display name for the locale shown when the user does a CREATE
COLLATION. I think that the risks from not doing that outweigh the
risks of doing it, even at this late date. I could live without any
sanitization if we did this much. Without this, the user is flying
blind when using CREATE COLLATION. Why should the DBA be able to
verify that the sorting they get is culturally appropriate? That's
just not practical.

> I think there is room for incremental refinement here.  Features like
> optionally checking or printing the canonical or preferred locale format
> or making the locale description available via a function or system view
> might all be valuable additions to a future PostgreSQL release.

I don't think that there is room for incremental refinement when it
comes to what ends up in pg_collation at initdb time. I don't like to
create commotion at this late date, but there are some things that
we'll only really get one chance at here. Admittedly, that's not true
of everything I raise here; it's hard to strictly limit discussion to
issues that have that quality.

[1] postgr.es/m/CAH2-WznOpmJ+3xh6bvea_YUyd4ZdGiwG9ycE31Q09oU3XXw5vA@mail.gmail.com

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?

From

Robert Haas

Date:

23 September 2017, 03:58:16

On Fri, Sep 22, 2017 at 4:46 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> But you are *already* canonicalizing ICU collation names as BCP 47. My
> point here is: Why not finish the job off, and *also* canonicalize
> colcollate in the same way?

Peter, with respect, it's time to let this argument go.  We're
scheduled to wrap a GA release in just over 72 hours.  It is far too
late to change behavior like this.  There is no time for other people
who may be interested in this issue to form a well-considered opinion
on the topic and carefully review a proposed patch.  There is also no
time for users to notice it in the next beta and complain before we go
final.  This ship has sailed.

On the substantive issue, I am inclined (admittedly without deep
study) to agree with Peter Eisentraut.  We have never canonicalized
collations before and therefore it is not essential that we do that
now.  That would be a new feature, and I don't think I'd be prepared
to endorse adding it three days after feature freeze let alone three
days before the GA wrap.  I do agree that the lack of canonicalization
is utterly terrible.  The APIs that Unix-like operating systems
provide for collations are poorly suited to our purposes and
hopelessly squishy about semantics, and it's not clear how much better
ICU will be.  But that's a problem that we should address, if at all,
at a deliberate pace and with adequate time for reflection, research,
and comment, not precipitously and under extreme time pressure.

I simply do not buy the theory that this cannot be changed later.
It's been the case for as long as we've had pg_collate that a new
system could have different collations than the old one, resulting in
a dump/restore failure.  I expect somebody's had that problem at some
point, but I don't think it's become a major pain point because most
people don't use exotic collations, and if they do they probably
understand that they need those exotic collations to be on the new
system too.  So, if we decide to change this later, we'll want to find
ways to make the upgrade as pain-free as possible and document
whatever the situation may be, but we've made many
backward-incompatible changes in the past and this one would hardly be
the worst.

I also believe that Peter Eisentraut is entirely correct to be
concerned about whether BCP 47 (or anything else) can really be
regarded as a stable canonical form for ICU purposes.  His email
indicates that the acceptable and canonical forms have changed
multiple times in the course of releases new enough for us to care
about them.  Assuming that statement is correct, it would be extremely
short-sighted of us to bank on them not changing any more.

But even if all of the above argumentation is utterly and completely
wrong, dredged up from the universe's deepest and most profound
reserves of stupidity and destined for future entry into Webster's as
the canonical example of cluelessness, we still shouldn't change it
the weekend before the GA wraps.  I'm afraid that this new RMT process
has lulled us into believing that the release will happen on time no
matter how much stuff we whack around at the last minute, which is a
very dangerous idea for a group of software engineers to have.
Before, we thought we had infinite time to fix our bugs; now, we think
we have infinite latitude to classify anything we don't like as a bug.
Neither of those ideas is good software engineering.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?

From

Peter Geoghegan

Date:

23 September 2017, 04:53:02

On Fri, Sep 22, 2017 at 5:58 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Sep 22, 2017 at 4:46 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> But you are *already* canonicalizing ICU collation names as BCP 47. My
>> point here is: Why not finish the job off, and *also* canonicalize
>> colcollate in the same way?
>
> Peter, with respect, it's time to let this argument go.  We're
> scheduled to wrap a GA release in just over 72 hours.  It is far too
> late to change behavior like this.

I didn't say that it wasn't. That's above my paygrade.

> On the substantive issue, I am inclined (admittedly without deep
> study) to agree with Peter Eisentraut.  We have never canonicalized
> collations before and therefore it is not essential that we do that
> now.

As I've said, we're *already* canonicalizing them for ICU. Just not
consistently (across ICU versions, and arguably even within ICU
versions). That's the problem -- we're half way between both
positions.

The problem is most emphatically *not* that we've failed to
canonicalize them in the way that I happen to favor.

> That would be a new feature, and I don't think I'd be prepared
> to endorse adding it three days after feature freeze let alone three
> days before the GA wrap.  I do agree that the lack of canonicalization
> is utterly terrible.  The APIs that Unix-like operating systems
> provide for collations are poorly suited to our purposes and
> hopelessly squishy about semantics, and it's not clear how much better
> ICU will be.

In one important sense, this is a regression against libc, because you
never had something like en_US.UTF-8 break on downgrading glibc
version (like, when you restore a basebackup on a different OS with
the same arch). Sure, you probably had to REINDEX text indexes, to be
on the safe side, but once you did that there was no question about
the older glibc having never heard of "en_US.UTF-8" as a
LC_COLLATE/collcollate.

I regret that I didn't catch it sooner. It now seems very obvious, and
totally preventable given enough time.

> I simply do not buy the theory that this cannot be changed later.in

It can be changed later, of course -- at greater, though indeterminate cost.

> It's been the case for as long as we've had pg_collate that a new
> system could have different collations than the old one, resulting in
> a dump/restore failure.  I expect somebody's had that problem at some
> point, but I don't think it's become a major pain point because most
> people don't use exotic collations, and if they do they probably
> understand that they need those exotic collations to be on the new
> system too.

Like I said, you don't need exotic collations to have the downgrade
problem, unless *any* initdb ICU collation counts as exotic. No CREATE
COLLATION is needed.

> I also believe that Peter Eisentraut is entirely correct to be
> concerned about whether BCP 47 (or anything else) can really be
> regarded as a stable canonical form for ICU purposes.  His email
> indicates that the acceptable and canonical forms have changed
> multiple times in the course of releases new enough for us to care
> about them.  Assuming that statement is correct, it would be extremely
> short-sighted of us to bank on them not changing any more.

That statement isn't correct. Including even the suggestion that Peter
Eisentraut ever thought it. ICU uses BCP 47 for collation name *across
all versions*. Just not as the collcollate value (that's only the case
on versions of ICU >= 54).

> But even if all of the above argumentation is utterly and completely
> wrong, dredged up from the universe's deepest and most profound
> reserves of stupidity and destined for future entry into Webster's as
> the canonical example of cluelessness, we still shouldn't change it
> the weekend before the GA wraps.

That seems like a value judgement. I'm not going to tell you that
you're wrong. What I will say is that I think we've done poorly here.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?

From

Tom Lane

Date:

23 September 2017, 06:56:38

Robert Haas <robertmhaas@gmail.com> writes:
> Peter, with respect, it's time to let this argument go.  We're
> scheduled to wrap a GA release in just over 72 hours.

FWIW, the release is a week from Monday, not Monday.  (Or if it is
Monday, somebody else is wrapping it.)

We have some other embarrassingly critical things to fix, like bug #14825,
so I can certainly sympathize with an argument that there's not enough
committer bandwidth left to deal with this; but not with an argument that
it's too late to change behavior period.

The big concern I have here is that this feels a lot like something that
we'll regret at leisure, if it's not right in the first release.  I'd
much rather be restrictive in v10 and then loosen the rules later, than
be lax in v10 and then have to argue about whether to break backwards
compatibility in order to gain saner behavior.

> We have never canonicalized collations before and therefore it is not
> essential that we do that now.

Actually, we try; see initdb.c's check_locale_name().  It's not our
fault that setlocale(3) fails to play along on many platforms.

> I simply do not buy the theory that this cannot be changed later.

OK, so you're promising not to whine when we break backwards compatibility
on this point in v11?
        regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?

From

Peter Geoghegan

Date:

23 September 2017, 07:51:16

On Fri, Sep 22, 2017 at 8:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The big concern I have here is that this feels a lot like something that
> we'll regret at leisure, if it's not right in the first release.  I'd
> much rather be restrictive in v10 and then loosen the rules later, than
> be lax in v10 and then have to argue about whether to break backwards
> compatibility in order to gain saner behavior.

To the bests of my knowledge, the only restriction implied by limiting
ourselves to the BCP 47 format (as part of standardizing what is
stored in pg_collation) is that users might know about the traditional
locale strings from some other place, and be surprised when their
knowledge doesn't transfer to Postgres. Personally, I don't think that
that's a big deal. If it actually is important, then I'm surprised
that it took this long for a doc change mentioning it to be proposed
(though the docs *do* say "Collations provided by ICU are created with
names in BCP 47 language tag format").

>> We have never canonicalized collations before and therefore it is not
>> essential that we do that now.
>
> Actually, we try; see initdb.c's check_locale_name().  It's not our
> fault that setlocale(3) fails to play along on many platforms.

But it will be our fault if we ship a v10 that does the kind of
unsettled canonicalization you see within
pg_import_system_collations() (the "collcollate =
U_ICU_VERSION_MAJOR_NUM >= 54 ? langtag : name" thing). That looks
very much like the tail wagging the dog to me.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags.Should it?

From

Noah Misch

Date:

25 September 2017, 11:26:21

On Tue, Sep 19, 2017 at 07:01:47PM -0700, Peter Geoghegan wrote:
> On Tue, Sep 19, 2017 at 5:52 PM, Peter Eisentraut
> <peter.eisentraut@2ndquadrant.com> wrote:
> > On 9/18/17 18:46, Peter Geoghegan wrote:
> >> As I pointed out a couple of times already [1], we don't currently
> >> sanitize ICU's BCP 47 language tags within CREATE COLLATION.
> >
> > There is no requirement that the locale strings for ICU need to be BCP
> > 47.  ICU locale names like 'de@collation=phonebook' are also acceptable.
> 
> Right. But, we only document that BCP 47 is supported by Postgres.
> Maybe we can use get_icu_language_tag()/uloc_toLanguageTag() to ensure
> that we end up with BCP 47, even when the user happens to specify the
> legacy syntax. Should we be "canonicalizing" legacy ICU locale strings
> as BCP 47, too?
> 
> > The reason they are not validated is that, as you know, ICU accepts any
> > locale string as valid.  You appear to have found a way to do some
> > validation, but I would like to see that code.
> 
> As I mentioned, I'm simply calling
> get_icu_language_tag()/uloc_toLanguageTag() to do that sanitization.
> The code to get the extra sanitization is completely trivial.
> 
> I didn't post the patch that generates the errors in my opening e-mail
> because I'm not confident it's correct just yet. And, I think that I
> see a bigger problem: we pass a string that is almost certainly a BCP
> 47 string to ucol_open() from within pg_newlocale_from_collation(). We
> do so despite the fact that ucol_open() apparently doesn't accept BCP
> 47 syntax locale strings until ICU 54 [1]. Seems entirely possible
> that this accounts for the problems you saw on ICU 4.2, back when we
> were still creating keyword variants (I guess that the keyword
> variants seem very "BCP 47-ish" to me).
> 
> I really think we need to add some kind of debug mode that makes ICU
> optionally spit out a locale display name at key points. We need this
> to gain confidence that the behavior that ICU provides actually
> matches what is expected across ICU different versions for different
> locale formats. We talked about this as a user-facing feature before,
> which can wait till v11; I just want this to debug problems like this
> one.
> 
> [1] https://ssl.icu-project.org/apiref/icu4c/ucol_8h.html#a4721e4c0a519bb0139a874e191223590

[Action required within three days.  This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item.  Peter,
since you committed the patch believed to have created it, you own this open
item.  If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know.  Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message.  Include a date for your subsequent status update.  Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10.  Consequently, I will appreciate your efforts
toward speedy resolution.  Thanks.

[1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?

From

Robert Haas

Date:

25 September 2017, 19:06:19

On Fri, Sep 22, 2017 at 11:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> FWIW, the release is a week from Monday, not Monday.  (Or if it is
> Monday, somebody else is wrapping it.)

Oops.

> We have some other embarrassingly critical things to fix, like bug #14825,
> so I can certainly sympathize with an argument that there's not enough
> committer bandwidth left to deal with this; but not with an argument that
> it's too late to change behavior period.

Really?  Traditionally, you've been one of the biggest opponents of
whacking things around post-beta.  Even post-beta1, let alone
post-beta5.

> The big concern I have here is that this feels a lot like something that
> we'll regret at leisure, if it's not right in the first release.  I'd
> much rather be restrictive in v10 and then loosen the rules later, than
> be lax in v10 and then have to argue about whether to break backwards
> compatibility in order to gain saner behavior.

I think it's inevitable that a certain number of users are going to
have to cope with ICU version changes breaking stuff.  If ICU decides
a collation is stupid or unused and drops it, or is mis-defined and
redefines it to some behavior that breaks things for somebody, they
are going to have to deal with it.  I don't think you can make that
problem go away by any amount of strictness introduced into v10, but
if you make the checks zealous enough, you can probably make them rule
out input that users would have preferred to have accepted.

I also think that if there's a compelling reason to bet on BCP 47 to
be a stable canonical form, I haven't heard it presented here.  At the
risk of repeating myself, it's not even supported in some ICU versions
we support, so how's that going to work?  And if it's been changed in
the recent past, why not again?  Peter Geoghegan said that he doesn't
know of any plans to eliminate BCP 47 support, but that doesn't seem
like it's much proof of anything.

>> I simply do not buy the theory that this cannot be changed later.
>
> OK, so you're promising not to whine when we break backwards compatibility
> on this point in v11?

If somebody has a collation that appears to work on v10 but really is
doing nothing, and when the upgrade to v11 they get an error because
we diagnose that the collation definition was not valid whereas v10
was unable to make that diagnosis, I promise not to whine about the
backward compatibility break thereby introduced.

If, on the other hand, you introduce an error check that is overly
stringent and precludes people from defining collations that are legal
and useful (in their judgement, not yours), I intend to whine about
that extensively.  And that applies to 10, 11, and any future versions
for which I may be around.

In short, I judge that allowing users access to *all* of the things
that ICU has now, has had in the past in versions we support, or may
have in the future is an important goal, but that preventing them from
relying on options that may go away is not a goal at all, since
barring the ability to predict the future, it's impossible anyway.

If it's possible to prevent to write a precisely-targeted check that
rules out only actually-meaningless collations and nothing else, I'm
fine with that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?

From

Peter Geoghegan

Date:

25 September 2017, 19:36:44

On Mon, Sep 25, 2017 at 9:06 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> The big concern I have here is that this feels a lot like something that
>> we'll regret at leisure, if it's not right in the first release.  I'd
>> much rather be restrictive in v10 and then loosen the rules later, than
>> be lax in v10 and then have to argue about whether to break backwards
>> compatibility in order to gain saner behavior.
>
> I think it's inevitable that a certain number of users are going to
> have to cope with ICU version changes breaking stuff.

Wasn't the main point of adopting ICU that that doesn't happen when it
isn't essential? There will be some risk with pg_dump across ICU
versions, due rare to political changes. But, on pg_upgrade, the old
collations will continue to work, even if they would not have appeared
at initdb time on that ICU version, because ICU has all kinds of
fallbacks. It will work with a language it has heard of, but a country
it has never heard of, for example. I just want to have a consistent
collcollate format, to consistently take advantage of that.

My patch has no behavioral changes, apart from the sanitization, on
ICU >= 54, BTW. It's really not very much code, especially when you
exclude objective bug fixes.

> If ICU decides
> a collation is stupid or unused and drops it, or is mis-defined and
> redefines it to some behavior that breaks things for somebody, they
> are going to have to deal with it.  I don't think you can make that
> problem go away by any amount of strictness introduced into v10, but
> if you make the checks zealous enough, you can probably make them rule
> out input that users would have preferred to have accepted.

The sanitization thing isn't my main concern. The stability issue is
the major issue, and it isn't all that connected to sanitization.

> I also think that if there's a compelling reason to bet on BCP 47 to
> be a stable canonical form, I haven't heard it presented here.  At the
> risk of repeating myself, it's not even supported in some ICU versions
> we support, so how's that going to work?  And if it's been changed in
> the recent past, why not again?  Peter Geoghegan said that he doesn't
> know of any plans to eliminate BCP 47 support, but that doesn't seem
> like it's much proof of anything.

It's described by an IETF standard, and the extensions are described
by Unicode Technical Standard #35, a formal standard [1]. The BCP 47
format is not an ICU thing at all -- it's a Unicode consortium thing.
An express goal of BCP 47 is forward and backward compatibility [2].
And, ICU themselves have said that that's what they're standardizing
on, since they are upstream consumers of the CLDR (Unicode consortium)
locale data. They've done that because they want to move away from
their own format towards a universal format, usable in a wide variety
of contexts.

While I don't pretend to be able to predict the future, I think that
this is something that is as safe as is practically achievable to
standardize on. (It also has all kinds of provision for further
extension, should that prove necessary.)

> If, on the other hand, you introduce an error check that is overly
> stringent and precludes people from defining collations that are legal
> and useful (in their judgement, not yours), I intend to whine about
> that extensively.  And that applies to 10, 11, and any future versions
> for which I may be around.

That's not going to happen.

> In short, I judge that allowing users access to *all* of the things
> that ICU has now, has had in the past in versions we support, or may
> have in the future is an important goal, but that preventing them from
> relying on options that may go away is not a goal at all, since
> barring the ability to predict the future, it's impossible anyway.

+1

> If it's possible to prevent to write a precisely-targeted check that
> rules out only actually-meaningless collations and nothing else, I'm
> fine with that.

That's what I've done with my patch, IMV. I admit that it's still a
tiny bit subjective that we should reject an empty string, and not
allow the fallback there, for example (we could instead fall back on
that as "undefined"). I think that that's not going to be a problem
for anyone.

[1] http://unicode.org/reports/tr35/tr35-collation.html
[2] https://tools.ietf.org/html/bcp47#section-3.4
-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?

From

Peter Eisentraut

Date:

25 September 2017, 21:40:41

On 9/22/17 16:46, Peter Geoghegan wrote:
> But you are *already* canonicalizing ICU collation names as BCP 47. My
> point here is: Why not finish the job off, and *also* canonicalize
> colcollate in the same way? This won't break ucol_open() if we take
> appropriate precautions when we go to use the Postgres collation/ICU
> locale.

Reading over this code again, it is admittedly not quite clear why this
"canonicalization" is in there right now.  I think it had something to
do with how we built the keyword variants at one point.  It might not
make sense.  I'd be glad to take that out and use the result straight
from uloc_getAvailable() for collcollate.  That is, after all, the
"canonical" version that ICU chooses to report to us.

> One concern that makes me suggest this is: What happens when
> the user *downgrades* ICU version, from a version where colcollate is
> BCP 47 to one where it would not have been at initdb time? That will
> break the downgrade in an unpleasant way, including in installations
> that never do a CREATE COLLATION themselves. We want to be able to
> restore a basebackup on a somewhat different OS, and have that still
> work following REINDEX. At least, that seems like it should be an
> important goal for us.

This is an interesting point, and my proposal above would fix that.
However, I think that taking a PostgreSQL data directory and moving or
copying it to an *older* OS installation is always going to have a
potential for problems.  So I wouldn't spend a huge amount of effort
just to fix this specific case.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?

From

Peter Geoghegan

Date:

25 September 2017, 22:14:37

On Mon, Sep 25, 2017 at 11:40 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> On 9/22/17 16:46, Peter Geoghegan wrote:
>> But you are *already* canonicalizing ICU collation names as BCP 47. My
>> point here is: Why not finish the job off, and *also* canonicalize
>> colcollate in the same way? This won't break ucol_open() if we take
>> appropriate precautions when we go to use the Postgres collation/ICU
>> locale.
>
> Reading over this code again, it is admittedly not quite clear why this
> "canonicalization" is in there right now.  I think it had something to
> do with how we built the keyword variants at one point.  It might not
> make sense.  I'd be glad to take that out and use the result straight
> from uloc_getAvailable() for collcollate.  That is, after all, the
> "canonical" version that ICU chooses to report to us.

But then our users categorically have to know about both formats,
without any practical benefit to make up for it. You will also get
people that don't realize that only one format is supported on some
versions if go this way.

>> One concern that makes me suggest this is: What happens when
>> the user *downgrades* ICU version, from a version where colcollate is
>> BCP 47 to one where it would not have been at initdb time? That will
>> break the downgrade in an unpleasant way, including in installations
>> that never do a CREATE COLLATION themselves. We want to be able to
>> restore a basebackup on a somewhat different OS, and have that still
>> work following REINDEX. At least, that seems like it should be an
>> important goal for us.
>
> This is an interesting point, and my proposal above would fix that.

I've already written a patch to standardize collcollate. If we do the
way you describe above instead, then what happens when ICU finally
removes the already deprecated legacy format?

> However, I think that taking a PostgreSQL data directory and moving or
> copying it to an *older* OS installation is always going to have a
> potential for problems.  So I wouldn't spend a huge amount of effort
> just to fix this specific case.

The downgrade thing is just the simplest, most immediate example of
where failing to standardize collcollate now could cause problems.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?

From

Peter Geoghegan

Date:

25 September 2017, 22:52:49

On Mon, Sep 25, 2017 at 12:14 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> But then our users categorically have to know about both formats,
> without any practical benefit to make up for it. You will also get
> people that don't realize that only one format is supported on some
> versions if go this way.

Oh, and if you go this way there is also a much bigger disadvantage:
you also break pg_restore when you cross the ICU 54 version boundary
in *either* direction (pg_collation.collname will be different, so
actually equivalent collations will not be recognized as such for
dump/restore purposes). This seems very likely, even, because ICU 54
isn't that old (we support versions as old as ICU 4.2), and because
you only have to have used ICU initdb collation for this to happen (no
CREATE COLLATION is necessary). Sure, you *may* be able to first do a
manual CREATE COLLATION when that happens, and that will be picked up
during the restore. But that's very user hostile.

That must have been the real reason why you canonicalized
pg_collation.collname (I doubt it had anything to do with how keyword
variants used to be created during initdb, as you suggested). As Tom
pointed out recently, we've actually always canonicalized collation
name for libc.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?

From

Peter Geoghegan

Date:

26 September 2017, 00:49:37

On Mon, Sep 25, 2017 at 12:52 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> That must have been the real reason why you canonicalized
> pg_collation.collname (I doubt it had anything to do with how keyword
> variants used to be created during initdb, as you suggested). As Tom
> pointed out recently, we've actually always canonicalized collation
> name for libc.

On further examination, none of this really matters, because you
simply cannot store ICU locale names like "en_US" within pg_collation;
it's impossible to do that without breaking many things that have
worked for a long time. initdb already canonicalizes the available
libc collations to produce collations whose names have exactly the
same "en_US" format. There will typically be both "en_US" and
"en_US.utf8" entries within pg_collation with Glibc on Linux, for example
(the former is created a convenient alias for the latter when the
database encoding is UTF-8).

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?

From

Peter Geoghegan

Date:

30 September 2017, 02:38:00

On Mon, Sep 25, 2017 at 11:40 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> Reading over this code again, it is admittedly not quite clear why this
> "canonicalization" is in there right now.  I think it had something to
> do with how we built the keyword variants at one point.  It might not
> make sense.  I'd be glad to take that out and use the result straight
> from uloc_getAvailable() for collcollate.  That is, after all, the
> "canonical" version that ICU chooses to report to us.

Is anything going to happen here ahead of shipping v10? This remains
an open item.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags.Should it?

From

Noah Misch

Date:

30 September 2017, 10:01:12

On Mon, Sep 25, 2017 at 08:26:21AM +0000, Noah Misch wrote:
> On Tue, Sep 19, 2017 at 07:01:47PM -0700, Peter Geoghegan wrote:
> > On Tue, Sep 19, 2017 at 5:52 PM, Peter Eisentraut
> > <peter.eisentraut@2ndquadrant.com> wrote:
> > > On 9/18/17 18:46, Peter Geoghegan wrote:
> > >> As I pointed out a couple of times already [1], we don't currently
> > >> sanitize ICU's BCP 47 language tags within CREATE COLLATION.
> > >
> > > There is no requirement that the locale strings for ICU need to be BCP
> > > 47.  ICU locale names like 'de@collation=phonebook' are also acceptable.
> > 
> > Right. But, we only document that BCP 47 is supported by Postgres.
> > Maybe we can use get_icu_language_tag()/uloc_toLanguageTag() to ensure
> > that we end up with BCP 47, even when the user happens to specify the
> > legacy syntax. Should we be "canonicalizing" legacy ICU locale strings
> > as BCP 47, too?
> > 
> > > The reason they are not validated is that, as you know, ICU accepts any
> > > locale string as valid.  You appear to have found a way to do some
> > > validation, but I would like to see that code.
> > 
> > As I mentioned, I'm simply calling
> > get_icu_language_tag()/uloc_toLanguageTag() to do that sanitization.
> > The code to get the extra sanitization is completely trivial.
> > 
> > I didn't post the patch that generates the errors in my opening e-mail
> > because I'm not confident it's correct just yet. And, I think that I
> > see a bigger problem: we pass a string that is almost certainly a BCP
> > 47 string to ucol_open() from within pg_newlocale_from_collation(). We
> > do so despite the fact that ucol_open() apparently doesn't accept BCP
> > 47 syntax locale strings until ICU 54 [1]. Seems entirely possible
> > that this accounts for the problems you saw on ICU 4.2, back when we
> > were still creating keyword variants (I guess that the keyword
> > variants seem very "BCP 47-ish" to me).
> > 
> > I really think we need to add some kind of debug mode that makes ICU
> > optionally spit out a locale display name at key points. We need this
> > to gain confidence that the behavior that ICU provides actually
> > matches what is expected across ICU different versions for different
> > locale formats. We talked about this as a user-facing feature before,
> > which can wait till v11; I just want this to debug problems like this
> > one.
> > 
> > [1] https://ssl.icu-project.org/apiref/icu4c/ucol_8h.html#a4721e4c0a519bb0139a874e191223590
> 
> [Action required within three days.  This is a generic notification.]
> 
> The above-described topic is currently a PostgreSQL 10 open item.  Peter,
> since you committed the patch believed to have created it, you own this open
> item.  If some other commit is more relevant or if this does not belong as a
> v10 open item, please let us know.  Otherwise, please observe the policy on
> open item ownership[1] and send a status update within three calendar days of
> this message.  Include a date for your subsequent status update.  Testers may
> discover new open items at any time, and I want to plan to get them all fixed
> well in advance of shipping v10.  Consequently, I will appreciate your efforts
> toward speedy resolution.  Thanks.
> 
> [1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

This PostgreSQL 10 open item is past due for your status update.  On the worst
week to be violating open item policies.  Kindly send a status update within
24 hours, and include a date for your subsequent status update.  Refer to the
policy on open item ownership:
https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags.Should it?

From

Noah Misch

Date:

30 September 2017, 10:31:48

On Mon, Sep 25, 2017 at 09:36:44AM -0700, Peter Geoghegan wrote:
> On Mon, Sep 25, 2017 at 9:06 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> >> The big concern I have here is that this feels a lot like something that
> >> we'll regret at leisure, if it's not right in the first release.  I'd
> >> much rather be restrictive in v10 and then loosen the rules later, than
> >> be lax in v10 and then have to argue about whether to break backwards
> >> compatibility in order to gain saner behavior.
> >
> > I think it's inevitable that a certain number of users are going to
> > have to cope with ICU version changes breaking stuff.
> 
> Wasn't the main point of adopting ICU that that doesn't happen when it
> isn't essential? There will be some risk with pg_dump across ICU
> versions, due rare to political changes. But, on pg_upgrade, the old
> collations will continue to work, even if they would not have appeared
> at initdb time on that ICU version, because ICU has all kinds of
> fallbacks.

I wouldn't describe it that way.  I agree that few, if any, ICU upgrades will
remove country or language codes.  Overall, though, almost every ICU upgrade
will be difficult.  Each ICU release, even a minor release like 58.2, changes
the sorting rules in some tiny way.  You then see "Rebuild all objects
affected by this collation" messages.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?

From

Tom Lane

Date:

30 September 2017, 18:25:43

Noah Misch <noah@leadboat.com> writes:
> On Mon, Sep 25, 2017 at 09:36:44AM -0700, Peter Geoghegan wrote:
>>> I think it's inevitable that a certain number of users are going to
>>> have to cope with ICU version changes breaking stuff.

>> Wasn't the main point of adopting ICU that that doesn't happen when it
>> isn't essential?

> I wouldn't describe it that way.  I agree that few, if any, ICU upgrades will
> remove country or language codes.  Overall, though, almost every ICU upgrade
> will be difficult.  Each ICU release, even a minor release like 58.2, changes
> the sorting rules in some tiny way.  You then see "Rebuild all objects
> affected by this collation" messages.

Sure, but dealing with that is mechanical: reindex the necessary indexes
and you're done.  I think it's important to distinguish that from a case
where users have to change their collation definitions.  That is a
qualitatively different, and much harder, upgrade problem.

I'd also argue that the point of adopting ICU was exactly so we *could*
distinguish those cases, and limit the scope of a normal upgrade to
"reindex these identifiable indexes and you're done".  In the libc world,
when you upgrade libc's locale definitions, you have no idea what the
consequences are.
        regards, tom lane


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?

From

Peter Geoghegan

Date:

30 September 2017, 19:47:06

On Sat, Sep 30, 2017 at 8:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'd also argue that the point of adopting ICU was exactly so we *could*
> distinguish those cases, and limit the scope of a normal upgrade to
> "reindex these identifiable indexes and you're done".  In the libc world,
> when you upgrade libc's locale definitions, you have no idea what the
> consequences are.

Right. With libc, we think of collations as something that there is a
small, fixed number of on a system, that we cannot safely assume
anything about. But with ICU, all of the semantics of how natural
languages should be sorted are exposed via various APIs, and there are
literally more possible sets of collation behaviors than there are
grains of sand in the Sahara (there are hundreds of distinct scripts,
which we can change the overall ordering of arbitrarily, on top of all
the other customizations). Clearly the libc way of looking at things
doesn't really carry over.

BCP 47 is supposed to be universal -- it's an IETF standard. That's
where all the stability guarantees are. The officially recognized 'u'
extension that ICU uses is a CLDR/Unicode thing, not an ICU thing. The
same format could, in the future, be used by other collation
providers, since there actually are other CLDR consumers/UCA
implementations. And, ICU have said that they have deprecated the old
locale format, and have standardized on BCP 47. As of ICU 54, it is
recommended that ucol_open() be passed a string in BCP 47 format.

I'm surprised that this issue was not resolved earlier in the week. I
presumed that all of this was obvious to Peter E., but I seem to have
been wrong about that.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?

From

Noah Misch

Date:

30 September 2017, 21:29:06

On Sat, Sep 30, 2017 at 11:25:43AM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > On Mon, Sep 25, 2017 at 09:36:44AM -0700, Peter Geoghegan wrote:
> >>> I think it's inevitable that a certain number of users are going to
> >>> have to cope with ICU version changes breaking stuff.
> 
> >> Wasn't the main point of adopting ICU that that doesn't happen when it
> >> isn't essential?
> 
> > I wouldn't describe it that way.  I agree that few, if any, ICU upgrades will
> > remove country or language codes.  Overall, though, almost every ICU upgrade
> > will be difficult.  Each ICU release, even a minor release like 58.2, changes
> > the sorting rules in some tiny way.  You then see "Rebuild all objects
> > affected by this collation" messages.
> 
> Sure, but dealing with that is mechanical: reindex the necessary indexes
> and you're done.

In the general case, one must revalidate CHECK constraints, re-partition
tables, revalidate range values, and reindex.

> In the libc world,
> when you upgrade libc's locale definitions, you have no idea what the
> consequences are.

Yep.  It's strictly better than the libc case.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?

From

Tom Lane

Date:

30 September 2017, 22:28:14

Noah Misch <noah@leadboat.com> writes:
> On Sat, Sep 30, 2017 at 11:25:43AM -0400, Tom Lane wrote:
>> Sure, but dealing with that is mechanical: reindex the necessary indexes
>> and you're done.

> In the general case, one must revalidate CHECK constraints, re-partition
> tables, revalidate range values, and reindex.

True, but that is what it is: nothing we can do is going to affect the
consequences of a collation behavior change, if there is one.  What's more
useful for our immediate purposes is to ask whether we can reliably detect
a collation behavior change.  False negatives are bad, but so are false
positives, because those would force DBAs to jump through lots of hoops
unnecessarily.

So: are canonicalized locale descriptions any better or worse by that
metric than non-canonicalized descriptions?  In principle I think a
canonicalized description might be more likely to be recognized as
the "same" locale by another ICU version than one that isn't, but
I don't know whether there's any meaningful difference in practice.

Another point here is whether, even if a new ICU version recognizes
a locale description as being "the same" interpretation that an old
ICU version used, will it report the same collation version?  Limited
experimentation suggests that the collversions we're actually getting
out of ICU depend on little other than the libicu version.  "select
distinct collversion from pg_collation where collversion is not null"
produces this on ICU 4.2.1:
49.192.5.4149.192.0.41

and this on 52.1:
58.0.6.5058.0.0.50

and this on 57.1:
153.64.29153.64

This suggests to me that arguing about canonicalization is moot so
far as avoiding reindexing goes: if you change ICU library versions,
you're screwed and will have to jump through all the reindexing hoops,
no matter what we do or don't do.  (Maybe we are looking at the wrong
information to populate collversion?)

Now, it may still be worthwhile to argue about whether canonicalization
will help the other component of the problem, which is whether you can
dump and reload CREATE COLLATION commands into a new ICU version and
expect to get more-or-less-the-same behavior as before.
        regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?

From

Peter Geoghegan

Date:

30 September 2017, 22:46:43

On Sat, Sep 30, 2017 at 12:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> This suggests to me that arguing about canonicalization is moot so
> far as avoiding reindexing goes: if you change ICU library versions,
> you're screwed and will have to jump through all the reindexing hoops,
> no matter what we do or don't do.  (Maybe we are looking at the wrong
> information to populate collversion?)

Just to be clear, I was never arguing that canonicalization would
avoid reindexing, and I didn't think anyone else was. I believe that
the version string will change when the behavior of its collator
changes for any reason and in any way. This includes changes to how
binary sort keys are generated.

(We don't currently store binary keys on disk, so a change to that
representation doesn't really necessitate a REINDEX for us. The UCA
spec explicitly decouples compatibility for indexing with binary keys
from changes needed due to human requirements. ICU binary sort keys
are compressed, and this is sometimes improved, independently of the
evolution of how natural language experts say text should be sorted.)

> Now, it may still be worthwhile to argue about whether canonicalization
> will help the other component of the problem, which is whether you can
> dump and reload CREATE COLLATION commands into a new ICU version and
> expect to get more-or-less-the-same behavior as before.

Again, to be clear, that's what I was arguing all along. I think that
users will get very good results with this approach. When the sorting
behavior of a locale is refined by natural language experts, it's
almost certainly a very small change that most users that are affected
won't actually notice. For example, when en-us-x-icu is changed,
that's actually due to a general change in the inherited root collator
that doesn't really affect English speakers. There is no practical
question about how you're supposed to sort English text.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?

From

Peter Eisentraut

Date:

01 October 2017, 16:27:56

On 9/29/17 19:38, Peter Geoghegan wrote:
> On Mon, Sep 25, 2017 at 11:40 AM, Peter Eisentraut
> <peter.eisentraut@2ndquadrant.com> wrote:
>> Reading over this code again, it is admittedly not quite clear why this
>> "canonicalization" is in there right now.  I think it had something to
>> do with how we built the keyword variants at one point.  It might not
>> make sense.  I'd be glad to take that out and use the result straight
>> from uloc_getAvailable() for collcollate.  That is, after all, the
>> "canonical" version that ICU chooses to report to us.
> 
> Is anything going to happen here ahead of shipping v10? This remains
> an open item.

I'm still pondering committing my documentation patch I had posted, and
I've been reviewing your patches to see if there is anything else
documentation-wise that could be added.

As I had mentioned before, I disagree with the substance of the
functionality changes you are proposing, and I think it would be way too
late to make such significant changes.  The general community opinion
also seems to be in favor of letting things be.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?

From

Peter Eisentraut

Date:

01 October 2017, 16:50:15

On 9/30/17 15:28, Tom Lane wrote:
> Now, it may still be worthwhile to argue about whether canonicalization
> will help the other component of the problem, which is whether you can
> dump and reload CREATE COLLATION commands into a new ICU version and
> expect to get more-or-less-the-same behavior as before.

Arguably, you don't always want that.  Maybe you selected a locale name
like 'en-NEWCOUNTRY', and in old ICU versions you want that to fall back
to default 'en' behavior and in newer you get the updated custom behavior.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?

From

Peter Eisentraut

Date:

01 October 2017, 16:52:55

On 9/30/17 15:28, Tom Lane wrote:
> This suggests to me that arguing about canonicalization is moot so
> far as avoiding reindexing goes: if you change ICU library versions,
> you're screwed and will have to jump through all the reindexing hoops,
> no matter what we do or don't do.

One reason for that is that the collation version also encodes things
like if the internal method for computing sort keys changes.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?

From

Peter Eisentraut

Date:

01 October 2017, 16:56:11

On 9/30/17 03:01, Noah Misch wrote:
> On Mon, Sep 25, 2017 at 08:26:21AM +0000, Noah Misch wrote:
>> On Tue, Sep 19, 2017 at 07:01:47PM -0700, Peter Geoghegan wrote:
>>> On Tue, Sep 19, 2017 at 5:52 PM, Peter Eisentraut
>>> <peter.eisentraut@2ndquadrant.com> wrote:
>>>> On 9/18/17 18:46, Peter Geoghegan wrote:
>>>>> As I pointed out a couple of times already [1], we don't currently
>>>>> sanitize ICU's BCP 47 language tags within CREATE COLLATION.
>>>>
>>>> There is no requirement that the locale strings for ICU need to be BCP
>>>> 47.  ICU locale names like 'de@collation=phonebook' are also acceptable.
>>>
>>> Right. But, we only document that BCP 47 is supported by Postgres.
>>> Maybe we can use get_icu_language_tag()/uloc_toLanguageTag() to ensure
>>> that we end up with BCP 47, even when the user happens to specify the
>>> legacy syntax. Should we be "canonicalizing" legacy ICU locale strings
>>> as BCP 47, too?
>>>
>>>> The reason they are not validated is that, as you know, ICU accepts any
>>>> locale string as valid.  You appear to have found a way to do some
>>>> validation, but I would like to see that code.
>>>
>>> As I mentioned, I'm simply calling
>>> get_icu_language_tag()/uloc_toLanguageTag() to do that sanitization.
>>> The code to get the extra sanitization is completely trivial.
>>>
>>> I didn't post the patch that generates the errors in my opening e-mail
>>> because I'm not confident it's correct just yet. And, I think that I
>>> see a bigger problem: we pass a string that is almost certainly a BCP
>>> 47 string to ucol_open() from within pg_newlocale_from_collation(). We
>>> do so despite the fact that ucol_open() apparently doesn't accept BCP
>>> 47 syntax locale strings until ICU 54 [1]. Seems entirely possible
>>> that this accounts for the problems you saw on ICU 4.2, back when we
>>> were still creating keyword variants (I guess that the keyword
>>> variants seem very "BCP 47-ish" to me).
>>>
>>> I really think we need to add some kind of debug mode that makes ICU
>>> optionally spit out a locale display name at key points. We need this
>>> to gain confidence that the behavior that ICU provides actually
>>> matches what is expected across ICU different versions for different
>>> locale formats. We talked about this as a user-facing feature before,
>>> which can wait till v11; I just want this to debug problems like this
>>> one.
>>>
>>> [1] https://ssl.icu-project.org/apiref/icu4c/ucol_8h.html#a4721e4c0a519bb0139a874e191223590
>>
>> [Action required within three days.  This is a generic notification.]
>>
>> The above-described topic is currently a PostgreSQL 10 open item.  Peter,
>> since you committed the patch believed to have created it, you own this open
>> item.  If some other commit is more relevant or if this does not belong as a
>> v10 open item, please let us know.  Otherwise, please observe the policy on
>> open item ownership[1] and send a status update within three calendar days of
>> this message.  Include a date for your subsequent status update.  Testers may
>> discover new open items at any time, and I want to plan to get them all fixed
>> well in advance of shipping v10.  Consequently, I will appreciate your efforts
>> toward speedy resolution.  Thanks.
>>
>> [1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
> 
> This PostgreSQL 10 open item is past due for your status update.  On the worst
> week to be violating open item policies.  Kindly send a status update within
> 24 hours, and include a date for your subsequent status update.  Refer to the
> policy on open item ownership:
> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

I had previously replied that I think nothing should be changed.  What
process should be applied in that case?


-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?

From

Peter Geoghegan

Date:

01 October 2017, 21:26:27

On Sun, Oct 1, 2017 at 6:27 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> I'm still pondering committing my documentation patch I had posted, and
> I've been reviewing your patches to see if there is anything else
> documentation-wise that could be added.

-1 from me -- I don't think we should tell users about the deprecated
locale format at all.

I hope that ICU continue to support the deprecated locale string
format within ucol_open() into the future, and that users will only
ever use BCP 47 within CREATE COLLATION (as the 'locale' string).
Perhaps it won't matter that much in the end, and will turn out to be
more of a wart than something that causes pain for users. It's hard to
predict.

> As I had mentioned before, I disagree with the substance of the
> functionality changes you are proposing, and I think it would be way too
> late to make such significant changes.  The general community opinion
> also seems to be in favor of letting things be.

It does seem too late. It's disappointing that we did not do better
here. This problem was entirely avoidable.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?

From

Noah Misch

Date:

01 October 2017, 22:00:07

On Sun, Oct 01, 2017 at 09:56:11AM -0400, Peter Eisentraut wrote:
> On 9/30/17 03:01, Noah Misch wrote:
> > This PostgreSQL 10 open item is past due for your status update.  On the worst
> > week to be violating open item policies.  Kindly send a status update within
> > 24 hours, and include a date for your subsequent status update.  Refer to the
> > policy on open item ownership:
> > https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
> 
> I had previously replied that I think nothing should be changed.  What
> process should be applied in that case?

If you determine community decision-making is not against that conclusion,
edit the list to move the item from "Open Issues" to "Non-bugs".


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?

From

Peter Eisentraut

Date:

02 October 2017, 19:42:36

On 10/1/17 14:26, Peter Geoghegan wrote:
> It does seem too late. It's disappointing that we did not do better
> here. This problem was entirely avoidable.

I understand where you are coming from.  My experience with developing
this feature has been that it is quite fragile in some ways.  We have
had this out there for testing for many months, and we have fixed many
bugs, and follow-up bugs, up to very recently.  We don't have good
automatic test coverage, so we need to rely on user testing.  Making
significant code and design changes a week or two before the final
release is just too late, even if the changes were undisputed.  And this
wasn't just one specific isolated change, it was a set of overlapping
changes that seemingly kept changing and expanding.  Analyzing and
reviewing that under pressure, and then effectively owning it going
forward, too, is not something I could commit to.

I'm satisfied that what we are shipping is "good enough".  It has been
in development for over a year, it has been reviewed and tested for
months, and a lot of credit for that goes to you.  The "old" locale
support took many years to take shape, and this one probably will, too,
but at some point I feel we have to let it be for a while and take a
breather and get some feedback and field experience.

I'm available to discuss the changes you are envisioning, preferably one
at a time, in the near future as part of the PG11 development process.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?

From

Peter Geoghegan

Date:

02 October 2017, 20:27:21

On Mon, Oct 2, 2017 at 9:42 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> I understand where you are coming from.  My experience with developing
> this feature has been that it is quite fragile in some ways.  We have
> had this out there for testing for many months, and we have fixed many
> bugs, and follow-up bugs, up to very recently.  We don't have good
> automatic test coverage, so we need to rely on user testing.  Making
> significant code and design changes a week or two before the final
> release is just too late, even if the changes were undisputed.  And this
> wasn't just one specific isolated change, it was a set of overlapping
> changes that seemingly kept changing and expanding.

For the record, I don't accept that summary. We could have not
bothered with the sanitization thing at all, and I wouldn't have cared
very much. I even revised my proposal such that you would never have
needed to do the language tag mapping at all [1] (albeit while
desupporting ICU versions prior to 4.6). And, as recently as 7 days
ago, you yourself said:

"Reading over this code again, it is admittedly not quite clear why
this "canonicalization" is in there right now.  I think it had
something to do with how we built the keyword variants at one point.
It might not make sense.  I'd be glad to take that out and use the
result straight  from uloc_getAvailable() for collcollate." [2]

This would have been far more radical than what I proposed.

> I'm satisfied that what we are shipping is "good enough".  It has been
> in development for over a year, it has been reviewed and tested for
> months, and a lot of credit for that goes to you.

It is "good enough", I suppose. I am disappointed by this particular
outcome, but that's mostly because it could have been avoided. I'm
still very happy that you took on this project, and I don't want that
to be overshadowed by this particular disagreement.

> I'm available to discuss the changes you are envisioning, preferably one
> at a time, in the near future as part of the PG11 development process.

I don't really think that there is anything more to be done about the
issues raised on this thread, with the possible exception of the
DEBUG1 display name string thing. The sanitization just isn't valuable
enough to add after the fact. Apart from the risks of adding it after
the fact, another reason not to bother with it is that it's
*incredibly* forgiving. So forgiving that you could reasonably argue
that we might as well not have it at all.

I may well do more on ICU with you in the future, but not in the area
of how things are canonicalized or sanitized. It's too late for that
now.

[1] https://www.postgresql.org/message-id/CAH2-WzmVtRyNg2gT=u=ktEC-jM3aKq4bYzJ0u2=OsxE+O3KQ4g@mail.gmail.com
[2] https://www.postgresql.org/message-id/f6c0fca7-e277-3f46-c0c1-adc001bffdd7@2ndquadrant.com
-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers