Thread: Q: documentation improvement re collation version mismatch

Q: documentation improvement re collation version mismatch

From

Karsten Hilbert

Date:

09 November 2022, 11:45:17

Dear all,

regarding changed collation versions this

    https://www.postgresql.org/docs/devel/sql-altercollation.html

says:

    The following query can be used to identify all
    collations in the current database that need to be
    refreshed and the objects that depend on them:

    SELECT pg_describe_object(refclassid, refobjid, refobjsubid) AS "Collation",
           pg_describe_object(classid, objid, objsubid) AS "Object"
      FROM pg_depend d JOIN pg_collation c
           ON refclassid = 'pg_collation'::regclass AND refobjid = c.oid
      WHERE c.collversion <> pg_collation_actual_version(c.oid)
      ORDER BY 1, 2;

I feel the result of that query can be slightly surprising
because it does not return (to my testing) any objects
depending on the database default collation, nor the database
itself (as per a collation version mismatch in pg_database).

Now, there is a line

    For the database default collation, there is an analogous
    command ALTER DATABASE ... REFRESH COLLATION VERSION.

right above that query but the query comment does not really
make it clear that the database default collation is _not_
identified to be in mismatch, if so. IOW, the database
default collation may still need to be refreshed even if the
query does not return any rows.

Perhaps this query (taken from the net)

    SELECT    -- get collation-change endangered indices
        indrelid::regclass::text,
        indexrelid::regclass::text,
        collname,
        pg_get_indexdef(indexrelid)
    FROM (
            SELECT
                indexrelid,
                indrelid,
                indcollation[i] coll
            FROM
                pg_index, generate_subscripts(indcollation, 1) g(i)
        ) s
            JOIN pg_collation c ON coll=c.oid
    WHERE
        collprovider IN ('d', 'c')
            AND
        collname NOT IN ('C', 'POSIX');

could be added to the paragraph (or it could be folded into
the first query by a UNION or some such) ?

Or perhaps one could move the "ALTER DATABASE ... REFRESH
..." hint _below_ the query paragraph and add "Note: you may
need to refresh the default collation even if the query above
does not show any objects directly affected by a collation
version change" ?

Thanks for considering.

Best,
Karsten
--
GPG  40BE 5B0E C98E 1713 AFA6  5BC0 3BEA AC80 7D4F C89B

Re: Q: documentation improvement re collation version mismatch

From

Julien Rouhaud

Date:

10 November 2022, 08:33:22

Hi,

On Wed, Nov 09, 2022 at 12:45:17PM +0100, Karsten Hilbert wrote:
> Dear all,
>
> regarding changed collation versions this
>
>     https://www.postgresql.org/docs/devel/sql-altercollation.html
>
> says:
>
>     The following query can be used to identify all
>     collations in the current database that need to be
>     refreshed and the objects that depend on them:
>
>     SELECT pg_describe_object(refclassid, refobjid, refobjsubid) AS "Collation",
>            pg_describe_object(classid, objid, objsubid) AS "Object"
>       FROM pg_depend d JOIN pg_collation c
>            ON refclassid = 'pg_collation'::regclass AND refobjid = c.oid
>       WHERE c.collversion <> pg_collation_actual_version(c.oid)
>       ORDER BY 1, 2;
>
> I feel the result of that query can be slightly surprising
> because it does not return (to my testing) any objects
> depending on the database default collation, nor the database
> itself (as per a collation version mismatch in pg_database).

Indeed.  The default collation is "pinned", so we don't record any dependency
on it.

But also, getting the list of direct dependency to a collation is also almost
useless as there are so many other scenario where we wouldn't record an index
dependency on a collation.

> Now, there is a line
>
> Perhaps this query (taken from the net)
>
>     SELECT    -- get collation-change endangered indices
>         indrelid::regclass::text,
>         indexrelid::regclass::text,
>         collname,
>         pg_get_indexdef(indexrelid)
>     FROM (
>             SELECT
>                 indexrelid,
>                 indrelid,
>                 indcollation[i] coll
>             FROM
>                 pg_index, generate_subscripts(indcollation, 1) g(i)
>         ) s
>             JOIN pg_collation c ON coll=c.oid
>     WHERE
>         collprovider IN ('d', 'c')
>             AND
>         collname NOT IN ('C', 'POSIX');
>
> could be added to the paragraph (or it could be folded into
> the first query by a UNION or some such) ?

That query is a bit better, but unfortunately there are a lot of cases it won't
detect (like some use of collation in expressions or WHERE clauses), so if you
had a collation library upgrade that breaks your collations you can't use that
to reliably fix your indexes.

For now, the only safe way to go is either reindex everything, or everything
except some safe cases (non-partial indexes on plain-non-collatable datatypes
only).  Usually, those safe cases are usually enough to avoid most of useless
reindex activity.

Aw: Re: Q: documentation improvement re collation version mismatch

From

Karsten Hilbert

Date:

10 November 2022, 10:47:01

Thanks, Julien, for your explanation.

> > regarding changed collation versions this
> >
> >     https://www.postgresql.org/docs/devel/sql-altercollation.html
> >
> > says:
> >
> >     The following query can be used to identify all
> >     collations in the current database that need to be
> >     refreshed and the objects that depend on them:
> >
> >     SELECT pg_describe_object(refclassid, refobjid, refobjsubid) AS "Collation",
> >            pg_describe_object(classid, objid, objsubid) AS "Object"
> >       FROM pg_depend d JOIN pg_collation c
> >            ON refclassid = 'pg_collation'::regclass AND refobjid = c.oid
> >       WHERE c.collversion <> pg_collation_actual_version(c.oid)
> >       ORDER BY 1, 2;
> >
> > I feel the result of that query can be slightly surprising
> > because it does not return (to my testing) any objects
> > depending on the database default collation, nor the database
> > itself (as per a collation version mismatch in pg_database).
>
> Indeed.  The default collation is "pinned", so we don't record any dependency
> on it.

Indirectly we do, don't we ?  Or else

> >     WHERE
> >         collprovider IN ('d', 'c')

would not make much sense, right ?

The comment above the query in the official documentation is rather assertive
(even if may true to the letter) and may warrant some more cautionary
wording ?   Added, perhaps, some variation of this:

> For now, the only safe way to go is either reindex everything, or everything
> except some safe cases (non-partial indexes on plain-non-collatable datatypes
> only).

Best,
Karsten

Re: Q: documentation improvement re collation version mismatch

From

Julien Rouhaud

Date:

10 November 2022, 11:36:21

On Thu, Nov 10, 2022 at 11:47:01AM +0100, Karsten Hilbert wrote:
> Thanks, Julien, for your explanation.
> 
> > > regarding changed collation versions this
> > >
> > >     https://www.postgresql.org/docs/devel/sql-altercollation.html
> > >
> > > says:
> > >
> > >     The following query can be used to identify all
> > >     collations in the current database that need to be
> > >     refreshed and the objects that depend on them:
> > >
> > >     SELECT pg_describe_object(refclassid, refobjid, refobjsubid) AS "Collation",
> > >            pg_describe_object(classid, objid, objsubid) AS "Object"
> > >       FROM pg_depend d JOIN pg_collation c
> > >            ON refclassid = 'pg_collation'::regclass AND refobjid = c.oid
> > >       WHERE c.collversion <> pg_collation_actual_version(c.oid)
> > >       ORDER BY 1, 2;
> > >
> > > I feel the result of that query can be slightly surprising
> > > because it does not return (to my testing) any objects
> > > depending on the database default collation, nor the database
> > > itself (as per a collation version mismatch in pg_database).
> >
> > Indeed.  The default collation is "pinned", so we don't record any dependency
> > on it.
> 
> Indirectly we do, don't we ?  Or else
> 
> > >     WHERE
> > >         collprovider IN ('d', 'c')
> 
> would not make much sense, right ?

What I meant is that we don't insert record in pg_depend to track dependencies
on pinned object, including the default collation.  The collprovider here comes
from pg_index.indcollation which is a different thing.  It can indeed store the
default collation, but it's only a small step toward less false negative.

Try that query with e.g.

CREATE INDEX ON sometable ( (somecol > 'somevalue') );

or

CREATE INDEX ON sometable (someid) WHERE somecol > 'somevalue';

Both clearly can get corrupted if the underlying collation library changes the
result of somecol > 'somevalue', but wouldn't be detected by that query.  There
are likely a lot more cases that would be missed, you can refer to the
discussions from a couple years ago when we tried to properly track all index
collation dependencies.

> The comment above the query in the official documentation is rather assertive
> (even if may true to the letter) and may warrant some more cautionary
> wording ?   Added, perhaps, some variation of this:
> 
> > For now, the only safe way to go is either reindex everything, or everything
> > except some safe cases (non-partial indexes on plain-non-collatable datatypes
> > only).

I think the comment is very poorly worded, as it leads readers to believe that
objects with a pg_depend dependency on a collation are the only one that would
get corrupted in case of glibc/ICU upgrade.

I agree that there should be a big fat red warning saying something like
"reindex everything if there's any discrepancy between the recorded collation
version and the currently reported one unless you REALLY know what you're
doing."

Re: Q: documentation improvement re collation version mismatch

From

Ron

Date:

10 November 2022, 14:04:37

On 11/10/22 02:33, Julien Rouhaud wrote:
[snip]
> For now, the only safe way to go is either reindex everything, or everything
> except some safe cases (non-partial indexes on plain-non-collatable datatypes
> only).  Usually, those safe cases are usually enough to avoid most of useless
> reindex activity.

In this situation, I queried for all indices with text-type columns 
(including UUID) and reindexed them.

-- 
Angular momentum makes the world go 'round.

Re: Q: documentation improvement re collation version mismatch

From

Julien Rouhaud

Date:

10 November 2022, 14:33:29

On Thu, Nov 10, 2022 at 08:04:37AM -0600, Ron wrote:
> On 11/10/22 02:33, Julien Rouhaud wrote:
> [snip]
> > For now, the only safe way to go is either reindex everything, or everything
> > except some safe cases (non-partial indexes on plain-non-collatable datatypes
> > only).  Usually, those safe cases are usually enough to avoid most of useless
> > reindex activity.
> 
> In this situation, I queried for all indices with text-type columns
> (including UUID) and reindexed them.

That may work in your personal use case, but it's not generally safe.  I hope
you don't have partial indexes, expressions or functions that internally relied
on collatable datatypes, or indexes on other plain collatable datatypes.

Re: Q: documentation improvement re collation version mismatch

From

Ron

Date:

10 November 2022, 14:39:03

On 11/10/22 08:33, Julien Rouhaud wrote:
> On Thu, Nov 10, 2022 at 08:04:37AM -0600, Ron wrote:
>> On 11/10/22 02:33, Julien Rouhaud wrote:
>> [snip]
>>> For now, the only safe way to go is either reindex everything, or everything
>>> except some safe cases (non-partial indexes on plain-non-collatable datatypes
>>> only).  Usually, those safe cases are usually enough to avoid most of useless
>>> reindex activity.
>> In this situation, I queried for all indices with text-type columns
>> (including UUID) and reindexed them.
> That may work in your personal use case, but it's not generally safe.  I hope
> you don't have partial indexes,

Aren't partial indices listed in pg_indexes?

> expressions or functions that internally relied
> on collatable datatypes,

No such indices.

>   or indexes on other plain collatable datatypes.

What other data types besides the text types are collatable?

-- 
Angular momentum makes the world go 'round.

Re: Q: documentation improvement re collation version mismatch

From

Julien Rouhaud

Date:

10 November 2022, 14:50:39

On Thu, Nov 10, 2022 at 08:39:03AM -0600, Ron wrote:
> On 11/10/22 08:33, Julien Rouhaud wrote:
> > On Thu, Nov 10, 2022 at 08:04:37AM -0600, Ron wrote:
> > > On 11/10/22 02:33, Julien Rouhaud wrote:
> > > [snip]
> > > > For now, the only safe way to go is either reindex everything, or everything
> > > > except some safe cases (non-partial indexes on plain-non-collatable datatypes
> > > > only).  Usually, those safe cases are usually enough to avoid most of useless
> > > > reindex activity.
> > > In this situation, I queried for all indices with text-type columns
> > > (including UUID) and reindexed them.
> > That may work in your personal use case, but it's not generally safe.  I hope
> > you don't have partial indexes,
> 
> Aren't partial indices listed in pg_indexes?

They are, but you mentioned "indices with text-type columns".  Did that also
include an analysis of the expressions stored in pg_index.indpred (and
pg_index.indexpr), and all underlying code it could call?

> >   or indexes on other plain collatable datatypes.
> 
> What other data types besides the text types are collatable?

varchar, name, custom types, composite types, domains...

Aw: Re: Q: documentation improvement re collation version mismatch

From

Karsten Hilbert

Date:

10 November 2022, 17:36:04

> > The comment above the query in the official documentation is rather assertive
> > (even if may true to the letter) and may warrant some more cautionary
> > wording ?   Added, perhaps, some variation of this:
> >
> > > For now, the only safe way to go is either reindex everything, or everything
> > > except some safe cases (non-partial indexes on plain-non-collatable datatypes
> > > only).
>
> I think the comment is very poorly worded, as it leads readers to believe that
> objects with a pg_depend dependency on a collation are the only one that would
> get corrupted in case of glibc/ICU upgrade.
>
> I agree that there should be a big fat red warning saying something like
> "reindex everything if there's any discrepancy between the recorded collation
> version and the currently reported one unless you REALLY know what you're
> doing."

Given that it does not seem straightforward to mechanically detect objects
in need of a collation-associated rebuild I would think that such a warning
would change matters for the better, documentation-wise.

Karsten