Thread: 16: Collation versioning and dependency helpers

16: Collation versioning and dependency helpers

From
Jeff Davis
Date:
Motivation:

We haven't fully solved the changing collation-provider problem. An
upgrade of the OS may change the version of libc or icu, and that might
affect the collation, which could leave you with various corrupt
database objects including:

  * indexes
  * constraints
  * range types or multiranges (or other types dependent
    on collation for internal consistency)
  * materialized views
  * partitioned tables (range or hash)

There's discussion about trying to reliably detect these changes and
remedy them. But there are major challenges; for instance, glibc
doesn't give a reliable signal that a collation may have changed, which
would leave us with a lot of false positives and create a new set of
problems (e.g. reindexing when it's unnecessary). And even with ICU, we
don't have a way to support multiple versions of a provider or of a
single collation, so trying to upgrade would still be a hassle.

Proposal:

Add in some tools to make it easier for administrators to find out if
they are at risk and solve the problem for themselves in a systematic
way.

Patches:

  0001: Treat "default" collation as unpinned, so that entries in
pg_depend are created. The rationale is that, since the "default"
collation can change, it's not really an immutable system object, and
it's worth tracking which objects are affected by it. It seems to bloat
pg_depend by about 5-10% though -- that doesn't seem great, but I'm not
sure if it's a real problem or not.

  0002: Enable pg_collation_actual_version() to work on the default
collation (oid=100) so that it doesn't need to be treated as a special
case.

  0003: Fix ALTER COLLATION "default" REFRESH VERSION, which currently
throws an unhelpful internal error. Instead, issue a more helpful error
that suggests "ALTER DATABASE ... REFRESH COLLATION VERSION" instead.

  0004: Add system views:
    pg_collation_versions: quickly see the current (from the catalog)
and actual (from the provider) versions of each collation
    pg_collation_dependencies: map of objects to the collations they
depend on

Along with these patches, you can use some tricks to verify data, such
as /contrib/amcheck; or fix the data with things like:

  * REINDEX
  * VACUUM FULL/TRUNCATE/CLUSTER
  * REFRESH MATERIALIZED VIEW

And then refresh the collation version when you're confident that your
data is valid.

TODO:

  * The dependencies view is not rigorously complete, because the
directed dependency graph doesn't quite establish an "affected by"
relationship. One exception is that a composite type doesn't depend on
its associated relation, so a composite type over a range type doesn't
depend on the range type.
  * Consider adding in some verification helpers that can verify that a
value is still valid (e.g. a range type that depends on a collation
might have corrupt values). We could have a collation verifier for
types that are collation-dependenent, or perhaps just go through the
input and output functions and catch any errors.
  * Consider better tracking of which collation versions were active on
a particular object since the last REINDEX (or REFRESH MATERIALIZED
VIEW, TRUNCATE, or other command that would remove any trace of data
affected by the previous collation version).

Regards,
    Jeff Davis


Attachment

Re: 16: Collation versioning and dependency helpers

From
Thomas Munro
Date:
On Sun, Oct 30, 2022 at 5:41 PM Jeff Davis <pgsql@j-davis.com> wrote:
> We haven't fully solved the changing collation-provider problem. An
> upgrade of the OS may change the version of libc or icu, and that might
> affect the collation, which could leave you with various corrupt
> database objects including:
>
>   * indexes
>   * constraints
>   * range types or multiranges (or other types dependent
>     on collation for internal consistency)
>   * materialized views
>   * partitioned tables (range or hash)

Check.

> There's discussion about trying to reliably detect these changes and
> remedy them. But there are major challenges; for instance, glibc
> doesn't give a reliable signal that a collation may have changed, which
> would leave us with a lot of false positives and create a new set of
> problems (e.g. reindexing when it's unnecessary). And even with ICU, we
> don't have a way to support multiple versions of a provider or of a
> single collation, so trying to upgrade would still be a hassle.

FWIW some experimental code for multi-version ICU is proposed for
discussion here:

https://commitfest.postgresql.org/40/3956/

> Proposal:
>
> Add in some tools to make it easier for administrators to find out if
> they are at risk and solve the problem for themselves in a systematic
> way.

Excellent goal.

> Patches:
>
>   0001: Treat "default" collation as unpinned, so that entries in
> pg_depend are created. The rationale is that, since the "default"
> collation can change, it's not really an immutable system object, and
> it's worth tracking which objects are affected by it. It seems to bloat
> pg_depend by about 5-10% though -- that doesn't seem great, but I'm not
> sure if it's a real problem or not.

FWIW we did this (plus a lot more) in the per-index version tracking
feature reverted from 14.

>   0002: Enable pg_collation_actual_version() to work on the default
> collation (oid=100) so that it doesn't need to be treated as a special
> case.

Makes sense.

>   0003: Fix ALTER COLLATION "default" REFRESH VERSION, which currently
> throws an unhelpful internal error. Instead, issue a more helpful error
> that suggests "ALTER DATABASE ... REFRESH COLLATION VERSION" instead.

Makes sense.

>   0004: Add system views:
>     pg_collation_versions: quickly see the current (from the catalog)
> and actual (from the provider) versions of each collation
>     pg_collation_dependencies: map of objects to the collations they
> depend on
>
> Along with these patches, you can use some tricks to verify data, such
> as /contrib/amcheck; or fix the data with things like:
>
>   * REINDEX
>   * VACUUM FULL/TRUNCATE/CLUSTER
>   * REFRESH MATERIALIZED VIEW
>
> And then refresh the collation version when you're confident that your
> data is valid.

Here you run into an argument that we had many times in that cycle:
what's the point of views that suffer both false positives and false
negatives?

> TODO:

>   * Consider better tracking of which collation versions were active on
> a particular object since the last REINDEX (or REFRESH MATERIALIZED
> VIEW, TRUNCATE, or other command that would remove any trace of data
> affected by the previous collation version).

Right, the per-object dependency tracking feature, reverted from 14,
aimed to do exactly that.  It fell down on (1) some specific bugs that
were hard to fix, like dependencies inherited via composite types when
you change the composite type, and (2) doubt expressed by Tom, and
earlier Stephen, that pg_depend was a good place to store version
information.



Re: 16: Collation versioning and dependency helpers

From
Jeff Davis
Date:
On Sun, 2022-10-30 at 19:10 +1300, Thomas Munro wrote:
> FWIW we did this (plus a lot more) in the per-index version tracking
> feature reverted from 14.

Thank you. I will catch up on that patch/thread.

> >   0002: Enable pg_collation_actual_version() to work on the default
>
> Makes sense.
>
> >   0003: Fix ALTER COLLATION "default" REFRESH VERSION, which
>
> Makes sense.

Committed these two small changes.

> >   0004: Add system views:
> >     pg_collation_versions: quickly see the current (from the
> > catalog)
> > and actual (from the provider) versions of each collation
> >     pg_collation_dependencies: map of objects to the collations
> > they
> > depend on
> >
> > Along with these patches, you can use some tricks to verify data,
> > such
> > as /contrib/amcheck; or fix the data with things like:
> >
> >   * REINDEX
> >   * VACUUM FULL/TRUNCATE/CLUSTER
> >   * REFRESH MATERIALIZED VIEW
> >
> > And then refresh the collation version when you're confident that
> > your
> > data is valid.
>
> Here you run into an argument that we had many times in that cycle:
> what's the point of views that suffer both false positives and false
> negatives?

The pg_collation_versions view is just a convenience, useful because
the default collation isn't represented normally in pg_collation so it
needs to be special-cased.

I could see how it would be tricky to precisely track the dependencies
through composite types (that is, create the proper pg_depend records),
but to just provide a view of the affected-by relationship seems more
doable. I'll review the previous discussion and see what I come up
with.

Of course, the view will just show an "affected by" relationship, it
won't show which objects are actually in violation of the current
collation version. But it at least gives the administrator a starting
place.

Regards,
    Jeff Davis




Re: 16: Collation versioning and dependency helpers

From
Andres Freund
Date:
Hi,

On 2022-10-31 16:36:54 -0700, Jeff Davis wrote:
> Committed these two small changes.

FWIW, as it stands cfbot can't apply the remaining changes:
http://cfbot.cputube.org/patch_40_3977.log

Perhaps worth posting a new version? Or are the remaining patches abandoned in
favor of the other threads?

Greetings,

Andres Freund



Re: 16: Collation versioning and dependency helpers

From
Jeff Davis
Date:
On Tue, 2022-12-06 at 10:53 -0800, Andres Freund wrote:
> Perhaps worth posting a new version? Or are the remaining patches
> abandoned in
> favor of the other threads?

Marked what is there as committed, and the remainder is abandoned in
favor of other threads.

Thanks,
    Jeff Davis