Thread: Pg_upgrade and collation

Pg_upgrade and collation

From
Bruce Momjian
Date:
The attached patch documents that pg_upgrade requires old/new servers to
use compatible collation library versions as well.

I would like to apply this to all PG branches.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +


Re: Pg_upgrade and collation

From
Alvaro Herrera
Date:
Bruce Momjian wrote:
> The attached patch documents that pg_upgrade requires old/new servers to
> use compatible collation library versions as well.

I think this is way too thin to be helpful:

> --- 61,68 ----
>     checking for compatible compile-time settings, including 32/64-bit
>     binaries.  It is important that
>     any external modules are also binary compatible, though this cannot
> !   be checked by <application>pg_upgrade</>.  Compatible collation
> !   library versions must also be used.
>    </para>

I think it would be useful to indicate what to do if they are not
compatible.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Pg_upgrade and collation

From
Bruce Momjian
Date:
On Fri, Jun 17, 2016 at 05:51:54PM -0400, Alvaro Herrera wrote:
> Bruce Momjian wrote:
> > The attached patch documents that pg_upgrade requires old/new servers to
> > use compatible collation library versions as well.
>
> I think this is way too thin to be helpful:

Well, this is a much larger issue than pg_upgrade, e.g. moving a data
directory from one cluster to another with a different collation library
version could also cause problems, and I don't know that is documented
at all.

If we want to go larger, we have to do this in a more central location.

>
> > --- 61,68 ----
> >     checking for compatible compile-time settings, including 32/64-bit
> >     binaries.  It is important that
> >     any external modules are also binary compatible, though this cannot
> > !   be checked by <application>pg_upgrade</>.  Compatible collation
> > !   library versions must also be used.
> >    </para>
>
> I think it would be useful to indicate what to do if they are not
> compatible.

The indexes don't work reliably.  We don't document what happens if
shared objects don't match either, but again, if we want to clarify
this, we need to do it more centrally.  Ideas?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
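
To make "the indexes don't work reliably" concrete, here is a minimal
sketch, not PostgreSQL code. It assumes two made-up sort rules
(sort_key_v1 and sort_key_v2, both hypothetical) standing in for two
collation library versions. An index built under the first rule is
then searched under the second, and a value that is present is no
longer found:

```python
import bisect

# Two hypothetical "library versions" that disagree on how to order text.
def sort_key_v1(s: str) -> str:
    # Old rule: plain code-point order (uppercase sorts before lowercase).
    return s

def sort_key_v2(s: str) -> tuple:
    # New rule: case-insensitive first, code-point order as a tie-breaker.
    return (s.casefold(), s)

def build_index(values, key):
    """Return the values sorted under the given collation key."""
    return sorted(values, key=key)

def index_contains(index, value, key):
    """Binary search that assumes `index` is sorted under `key`."""
    keys = [key(v) for v in index]
    pos = bisect.bisect_left(keys, key(value))
    return pos < len(index) and index[pos] == value

words = ["apple", "Banana", "cherry", "Date"]
index = build_index(words, sort_key_v1)   # built with the "old" rule

found_with_old_rule = index_contains(index, "Banana", sort_key_v1)
found_with_new_rule = index_contains(index, "Banana", sort_key_v2)
print(found_with_old_rule, found_with_new_rule)
```

The data is untouched in both lookups. Only the comparison rule
changed, which is why rebuilding the index under the new rule is the
usual remedy.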



Re: Pg_upgrade and collation

From
Bruce Momjian
Date:
On Fri, Jun 17, 2016 at 06:01:59PM -0400, Bruce Momjian wrote:
> On Fri, Jun 17, 2016 at 05:51:54PM -0400, Alvaro Herrera wrote:
> > Bruce Momjian wrote:
> > > The attached patch documents that pg_upgrade requires old/new servers to
> > > use compatible collation library versions as well.
> >
> > I think this is way too thin to be helpful:
>
> Well, this is a much larger issue than pg_upgrade, e.g. moving a data
> directory from one cluster to another with a different collation library
> version could also cause problems, and I don't know that is documented
> at all.
>
> If we want to go larger, we have to do this in a more central location.

Frankly, pg_upgrade is, by definition, upgrading on the same server, so
I don't even see how they could have mismatched collation library
versions, but it seemed good to document it.  The larger issue of moving
clusters is a separate issue that needs documentation somewhere else.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com



Re: Pg_upgrade and collation

From
Alvaro Herrera
Date:
Bruce Momjian wrote:
> On Fri, Jun 17, 2016 at 06:01:59PM -0400, Bruce Momjian wrote:
> > On Fri, Jun 17, 2016 at 05:51:54PM -0400, Alvaro Herrera wrote:
> > > Bruce Momjian wrote:
> > > > The attached patch documents that pg_upgrade requires old/new servers to
> > > > use compatible collation library versions as well.
> > >
> > > I think this is way too thin to be helpful:
> >
> > Well, this is a much larger issue than pg_upgrade, e.g. moving a data
> > directory from one cluster to another with a different collation library
> > version could also cause problems, and I don't know that is documented
> > at all.
> >
> > If we want to go larger, we have to do this in a more central location.
>
> Frankly, pg_upgrade is, by definition, upgrading on the same server, so
> I don't even see how they could have mismatched collation library
> versions, but it seemed good to document it.

By this argument, the proposed patch seems pointless to me.

> The larger issue of moving clusters is a separate issue that needs
> documentation somewhere else.

Sure.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Pg_upgrade and collation

From
Bruce Momjian
Date:
On Fri, Jun 17, 2016 at 06:11:58PM -0400, Alvaro Herrera wrote:
> > Frankly, pg_upgrade is, by definition, upgrading on the same server, so
> > I don't even see how they could have mismatched collation library
> > versions, but it seemed good to document it.
>
> By this argument, the proposed patch seems pointless to me.
>
> > The larger issue of moving clusters is a separate issue that needs
> > documentation somewhere else.
>
> Sure.

In looking at the docs, it seems it would go in the Backup section
somewhere:

    https://www.postgresql.org/docs/9.6/static/backup.html

Seems it would apply to both of these backup sections:

    24.2. File System Level Backup
    24.3. Continuous Archiving and Point-in-Time Recovery (PITR)

and also here:

    25.2. Log-Shipping Standby Servers

It seems odd to put it in all of these places, but where can we
centrally put it?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com



Re: Pg_upgrade and collation

From
Bruce Momjian
Date:
On Mon, Jun 20, 2016 at 11:16:36AM -0400, Bruce Momjian wrote:
> In looking at the docs, it seems it would go in the Backup section
> somewhere:
>
>     https://www.postgresql.org/docs/9.6/static/backup.html
>
> Seems it would apply to both of these backup sections:
>
>     24.2. File System Level Backup
>     24.3. Continuous Archiving and Point-in-Time Recovery (PITR)
>
> and also here:
>
>     25.2. Log-Shipping Standby Servers
>
> It seems odd to put it in all of these places, but where can we
> centrally put it?

In looking at the docs, I found the section "Creating a Database
Cluster", which covers initdb and collations, to be the best place to put
this warning.  Patch attached.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com



Re: Pg_upgrade and collation

From
Peter Geoghegan
Date:
On Fri, Jun 17, 2016 at 2:51 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> I think this is way too thin to be helpful:
>
>> --- 61,68 ----
>>     checking for compatible compile-time settings, including 32/64-bit
>>     binaries.  It is important that
>>     any external modules are also binary compatible, though this cannot
>> !   be checked by <application>pg_upgrade</>.  Compatible collation
>> !   library versions must also be used.
>>    </para>

Unfortunately, the reality is that as things stand, there is no way to
test compatibility on all platforms. Glibc does have a notion of
collation versioning, though [1].

I have long advocated adopting ICU as our de facto standard "collation
provider", primarily so that we can directly control collations and
collation versioning. I think that doing this would solve many
problems. Besides, even SQLite has optional ICU support. PostgreSQL is
the only major database system that I'm aware of that relies on
operating system collations exclusively.

I've avoided committing to work on it because I'm concerned that it
would not be well received.

[1] https://www.gnu.org/software/autoconf/manual/autoconf-2.63/html_node/Special-Shell-Variables.html
--
Peter Geoghegan


Re: Pg_upgrade and collation

From
Bruce Momjian
Date:
On Tue, Jun 28, 2016 at 02:58:58PM -0700, Peter Geoghegan wrote:
> On Fri, Jun 17, 2016 at 2:51 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > I think this is way too thin to be helpful:
> >
> >> --- 61,68 ----
> >>     checking for compatible compile-time settings, including 32/64-bit
> >>     binaries.  It is important that
> >>     any external modules are also binary compatible, though this cannot
> >> !   be checked by <application>pg_upgrade</>.  Compatible collation
> >> !   library versions must also be used.
> >>    </para>
>
> Unfortunately, the reality is that as things stand, there is no way to
> test compatibility on all platforms. Glibc does have a notion of
> collation versioning, though [1].

Yes, the patch text is clearly weasel-words in that we can't explain how
to detect incompatibility.

> I have long advocated adopting ICU as our de facto standard "collation
> provider", primarily so that we can directly control collations and
> collation versioning. I think that doing this would solve many
> problems. Besides, even SQLite has optional ICU support. PostgreSQL is
> the only major database system that I'm aware of that relies on
> operating system collations exclusively.

I am hopeful ICU has improved enough since we last researched that
support for it will soon be added.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com



Re: Pg_upgrade and collation

From
Peter Geoghegan
Date:
On Tue, Jun 28, 2016 at 3:20 PM, Bruce Momjian <bruce@momjian.us> wrote:
>> I have long advocated adopting ICU as our de facto standard "collation
>> provider", primarily so that we can directly control collations and
>> collation versioning. I think that doing this would solve many
>> problems. Besides, even SQLite has optional ICU support. PostgreSQL is
>> the only major database system that I'm aware of that relies on
>> operating system collations exclusively.
>
> I am hopeful ICU has improved enough since we last researched that
> support for it will soon be added.

There is a patch available that is not ready to be submitted, and
doesn't have a real advocate, but is at least enough to convince me
that it's very doable. Performance is certainly no impediment to
adopting ICU, even without considering that it effectively
re-introduces abbreviated keys for text when the C collation is not
used.

The best argument for ICU is the evidently lax attitude that the glibc
people have towards the correctness and consistency of their
collations:

https://bugzilla.redhat.com/show_bug.cgi?id=1320356#c3

Here, Carlos O'Donnell, a glibc committer, says "Regarding (b), the
collations in glibc may change from build to build depending on
changes in the algorithms or locales. You cannot rely on the collation
stay the same once the process exits (nor can you rely upon it via a
shared memory mapping to another process sorting strings in memory)".
Frankly, we have no excuse for not heeding his warning.

I'm not annoyed at the glibc people for taking this position. There
is, quite simply, a misalignment of incentives. For the glibc people,
the assumption is that any problem with collations leads only to
slight annoyance from end users, as when the GUI produces subtly wrong
ordering. Whereas, for us, any inconsistency is an extremely serious
problem. Here we have the maintainers of glibc telling us that they
feel like it's okay that that can happen at any time. Surely that
isn't good enough.

ICU as a project has every incentive to see things the same way as we
do. The library explicitly decouples collation rule versions from
algorithm versions. All of this is carefully considered, for the
benefit of the numerous major database systems that use ICU.

--
Peter Geoghegan
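
As a rough illustration of what controlling collation versioning could
buy, the sketch below records the provider's version string when an
index is built and compares it with the version reported later. It is
only a sketch: IndexMeta, indexes_needing_rebuild, the index names and
the version strings are all hypothetical, and nothing here is an
existing PostgreSQL, glibc or ICU interface.

```python
from dataclasses import dataclass

@dataclass
class IndexMeta:
    name: str
    collation: str
    recorded_version: str  # provider version seen when the index was built

def indexes_needing_rebuild(indexes, current_versions):
    """Return names of indexes whose collation version no longer matches.

    `current_versions` maps a collation name to the version string the
    collation provider reports now (a hypothetical lookup, not a real API).
    A collation missing from the map is treated as a mismatch.
    """
    stale = []
    for idx in indexes:
        current = current_versions.get(idx.collation)
        if current != idx.recorded_version:
            stale.append(idx.name)
    return stale

# Made-up catalog contents and version strings, for illustration only.
catalog = [
    IndexMeta("users_name_idx", "en_US", "v1"),
    IndexMeta("cities_name_idx", "de_DE", "v1"),
]
reported_now = {"en_US": "v2", "de_DE": "v1"}

print(indexes_needing_rebuild(catalog, reported_now))
```

A check like this only works if the provider exposes a version that
changes whenever sort order can change, which is the property being
argued for here.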


Re: Pg_upgrade and collation

From
Alvaro Herrera
Date:
Peter Geoghegan wrote:

> The best argument for ICU is the evidently lax attitude that the glibc
> people have towards the correctness and consistency of their
> collations:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1320356#c3
>
> Here, Carlos O'Donnell, a glibc committer, says "Regarding (b), the
> collations in glibc may change from build to build depending on
> changes in the algorithms or locales. You cannot rely on the collation
> stay the same once the process exits (nor can you rely upon it via a
> shared memory mapping to another process sorting strings in memory)".
> Frankly, we have no excuse for not heeding his warning.
>
> I'm not annoyed at the glibc people for taking this position. There
> is, quite simply, a misalignment of incentives. For the glibc people,
> the assumption is that any problem with collations leads only to
> slight annoyance from end users, as when the GUI produces subtly wrong
> ordering. Whereas, for us, any inconsistency is an extremely serious
> problem. Here we have the maintainers of glibc telling us that they
> feel like it's okay that that can happen at any time. Surely that
> isn't good enough.

Uhmm.  Until now I saw all this ICU thing as having fringe benefit on
strange platforms only, but it is seeming more and more like we need to
take it seriously.  I'm not prepared to spend effort on it myself,
though.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Pg_upgrade and collation

From
Peter Geoghegan
Date:
On Tue, Jun 28, 2016 at 3:50 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Uhmm.  Until now I saw all this ICU thing as having fringe benefit on
> strange platforms only, but it is seeming more and more like we need to
> take it seriously.  I'm not prepared to spend effort on it myself,
> though.

Let me put it this way: If we lived in a world where
internationalization was a new idea, and someone proposed collation
support that relied on the OS today, the patch would be rejected in
about 2 minutes. The author would be pointed in the direction of
"Notes to Operator Class Implementors" within the nbtree README.

There are numerous user-visible benefits to ICU support, too, like:

* Case-insensitive collations become possible (with work in other
areas). No more contrib/citext hack. This is something that we seem to
want to work towards.

* Abbreviated keys in indexes with collated text become possible.
(Already mentioned that abbreviated keys for collated text + sorting
are effectively reintroduced.)

* More useful collations available for certain languages, such as
Japanese. Apparently, the JIS X 4061 algorithm produces results that
Japanese people find more useful, but glibc doesn't support it, and
never will.

* We might be able to document WAL compatibility usefully, now. The
documentation never gets around to explaining when two instances are
compatible for the purposes of physical replication. I can't think of
any other factor that prevents us from locking that down.

* Upgrade major OS versions without difficulty.

* User-defined collations, where you can mix and match certain facets
of how text is sorted as you please. Basically, ICU offers rich
functionality that we can bubble up to our users without too much
effort, as other database systems have.

--
Peter Geoghegan
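
For readers unfamiliar with the abbreviated keys mentioned in the list
above, the general idea can be sketched as follows. This is not the
PostgreSQL implementation. full_key is a stand-in (an assumption) for
a collation transform such as strxfrm(), and ABBREV_LEN is an
arbitrary choice. The sort compares a short fixed-size prefix of the
transformed key first and falls back to the full key only on ties:

```python
from functools import cmp_to_key

ABBREV_LEN = 8  # bytes kept in the abbreviated key (arbitrary for this sketch)

def full_key(s: str) -> bytes:
    # Stand-in for a collation transform such as strxfrm(): it must return
    # bytes whose binary order matches the desired collation order.
    return s.casefold().encode("utf-8")

def make_entry(s: str):
    # Store the abbreviated key, the full key, and the original string.
    key = full_key(s)
    return (key[:ABBREV_LEN], key, s)

def compare(a, b) -> int:
    # Cheap comparison on the fixed-size prefix first.
    if a[0] != b[0]:
        return -1 if a[0] < b[0] else 1
    # Prefixes tie: fall back to the authoritative full comparison.
    if a[1] != b[1]:
        return -1 if a[1] < b[1] else 1
    return 0

def sort_with_abbreviated_keys(strings):
    entries = [make_entry(s) for s in strings]
    entries.sort(key=cmp_to_key(compare))
    return [e[2] for e in entries]

sample = ["internationalization", "International", "apple", "internal"]
print(sort_with_abbreviated_keys(sample))
```

The scheme is only safe when the transform and the full comparison
agree with each other, so it depends on the collation provider being
internally consistent.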


Re: Pg_upgrade and collation

From
Bruce Momjian
Date:
On Tue, Jun 28, 2016 at 05:21:51PM -0400, Bruce Momjian wrote:
> On Mon, Jun 20, 2016 at 11:16:36AM -0400, Bruce Momjian wrote:
> > In looking at the docs, it seems it would go in the Backup section
> > somewhere:
> >
> >     https://www.postgresql.org/docs/9.6/static/backup.html
> >
> > Seems it would apply to both of these backup sections:
> >
> >     24.2. File System Level Backup
> >     24.3. Continuous Archiving and Point-in-Time Recovery (PITR)
> >
> > and also here:
> >
> >     25.2. Log-Shipping Standby Servers
> >
> > It seems odd to put it in all of these places, but where can we
> > centrally put it?
>
> In looking at the docs, I found the section "Creating a Database
> Cluster", which covers initdb and collations, to be the best place to put
> this warning.  Patch attached.

Patch applied and backpatched.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com



Re: Pg_upgrade and collation

From
Peter Eisentraut
Date:
On 6/28/16 5:58 PM, Peter Geoghegan wrote:
> I have long advocated adopting ICU as our de facto standard "collation
> provider", primarily so that we can directly control collations and
> collation versioning. I think that doing this would solve many
> problems.

I plan to submit a patch for ICU support for September.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Pg_upgrade and collation

From
Peter Geoghegan
Date:
On Sat, Jul 9, 2016 at 7:02 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> On 6/28/16 5:58 PM, Peter Geoghegan wrote:
>>
>> I have long advocated adopting ICU as our de facto standard "collation
>> provider", primarily so that we can directly control collations and
>> collation versioning. I think that doing this would solve many
>> problems.
>
>
> I plan to submit a patch for ICU support for September.

That's fantastic news! Your knowledge of packaging will be useful
here. I will review your patch.

--
Peter Geoghegan