Thread: Q: documentation improvement re collation version mismatch
Dear all, regarding changed collation versions this https://www.postgresql.org/docs/devel/sql-altercollation.html says: The following query can be used to identify all collations in the current database that need to be refreshed and the objects that depend on them: SELECT pg_describe_object(refclassid, refobjid, refobjsubid) AS "Collation", pg_describe_object(classid, objid, objsubid) AS "Object" FROM pg_depend d JOIN pg_collation c ON refclassid = 'pg_collation'::regclass AND refobjid = c.oid WHERE c.collversion <> pg_collation_actual_version(c.oid) ORDER BY 1, 2; I feel the result of that query can be slightly surprising because it does not return (to my testing) any objects depending on the database default collation, nor the database itself (as per a collation version mismatch in pg_database). Now, there is a line For the database default collation, there is an analogous command ALTER DATABASE ... REFRESH COLLATION VERSION. right above that query but the query comment does not really make it clear that the database default collation is _not_ identified to be in mismatch, if so. IOW, the database default collation may still need to be refreshed even if the query does not return any rows. Perhaps this query (taken from the net) SELECT -- get collation-change endangered indices indrelid::regclass::text, indexrelid::regclass::text, collname, pg_get_indexdef(indexrelid) FROM ( SELECT indexrelid, indrelid, indcollation[i] coll FROM pg_index, generate_subscripts(indcollation, 1) g(i) ) s JOIN pg_collation c ON coll=c.oid WHERE collprovider IN ('d', 'c') AND collname NOT IN ('C', 'POSIX'); could be added to the paragraph (or it could be folded into the first query by a UNION or some such) ? Or perhaps one could move the "ALTER DATABASE ... REFRESH ..." hint _below_ the query paragraph and add "Note: you may need to refresh the default collation even if the query above does not show any objects directly affected by a collation version change" ? Thanks for considering. Best, Karsten -- GPG 40BE 5B0E C98E 1713 AFA6 5BC0 3BEA AC80 7D4F C89B
Hi, On Wed, Nov 09, 2022 at 12:45:17PM +0100, Karsten Hilbert wrote: > Dear all, > > regarding changed collation versions this > > https://www.postgresql.org/docs/devel/sql-altercollation.html > > says: > > The following query can be used to identify all > collations in the current database that need to be > refreshed and the objects that depend on them: > > SELECT pg_describe_object(refclassid, refobjid, refobjsubid) AS "Collation", > pg_describe_object(classid, objid, objsubid) AS "Object" > FROM pg_depend d JOIN pg_collation c > ON refclassid = 'pg_collation'::regclass AND refobjid = c.oid > WHERE c.collversion <> pg_collation_actual_version(c.oid) > ORDER BY 1, 2; > > I feel the result of that query can be slightly surprising > because it does not return (to my testing) any objects > depending on the database default collation, nor the database > itself (as per a collation version mismatch in pg_database). Indeed. The default collation is "pinned", so we don't record any dependency on it. But also, getting the list of direct dependency to a collation is also almost useless as there are so many other scenario where we wouldn't record an index dependency on a collation. > Now, there is a line > > Perhaps this query (taken from the net) > > SELECT -- get collation-change endangered indices > indrelid::regclass::text, > indexrelid::regclass::text, > collname, > pg_get_indexdef(indexrelid) > FROM ( > SELECT > indexrelid, > indrelid, > indcollation[i] coll > FROM > pg_index, generate_subscripts(indcollation, 1) g(i) > ) s > JOIN pg_collation c ON coll=c.oid > WHERE > collprovider IN ('d', 'c') > AND > collname NOT IN ('C', 'POSIX'); > > could be added to the paragraph (or it could be folded into > the first query by a UNION or some such) ? That query is a bit better, but unfortunately there are a lot of cases it won't detect (like some use of collation in expressions or WHERE clauses), so if you had a collation library upgrade that breaks your collations you can't use that to reliably fix your indexes. For now, the only safe way to go is either reindex everything, or everything except some safe cases (non-partial indexes on plain-non-collatable datatypes only). Usually, those safe cases are usually enough to avoid most of useless reindex activity.
Thanks, Julien, for your explanation. > > regarding changed collation versions this > > > > https://www.postgresql.org/docs/devel/sql-altercollation.html > > > > says: > > > > The following query can be used to identify all > > collations in the current database that need to be > > refreshed and the objects that depend on them: > > > > SELECT pg_describe_object(refclassid, refobjid, refobjsubid) AS "Collation", > > pg_describe_object(classid, objid, objsubid) AS "Object" > > FROM pg_depend d JOIN pg_collation c > > ON refclassid = 'pg_collation'::regclass AND refobjid = c.oid > > WHERE c.collversion <> pg_collation_actual_version(c.oid) > > ORDER BY 1, 2; > > > > I feel the result of that query can be slightly surprising > > because it does not return (to my testing) any objects > > depending on the database default collation, nor the database > > itself (as per a collation version mismatch in pg_database). > > Indeed. The default collation is "pinned", so we don't record any dependency > on it. Indirectly we do, don't we ? Or else > > WHERE > > collprovider IN ('d', 'c') would not make much sense, right ? The comment above the query in the official documentation is rather assertive (even if may true to the letter) and may warrant some more cautionary wording ? Added, perhaps, some variation of this: > For now, the only safe way to go is either reindex everything, or everything > except some safe cases (non-partial indexes on plain-non-collatable datatypes > only). Best, Karsten
On Thu, Nov 10, 2022 at 11:47:01AM +0100, Karsten Hilbert wrote: > Thanks, Julien, for your explanation. > > > > regarding changed collation versions this > > > > > > https://www.postgresql.org/docs/devel/sql-altercollation.html > > > > > > says: > > > > > > The following query can be used to identify all > > > collations in the current database that need to be > > > refreshed and the objects that depend on them: > > > > > > SELECT pg_describe_object(refclassid, refobjid, refobjsubid) AS "Collation", > > > pg_describe_object(classid, objid, objsubid) AS "Object" > > > FROM pg_depend d JOIN pg_collation c > > > ON refclassid = 'pg_collation'::regclass AND refobjid = c.oid > > > WHERE c.collversion <> pg_collation_actual_version(c.oid) > > > ORDER BY 1, 2; > > > > > > I feel the result of that query can be slightly surprising > > > because it does not return (to my testing) any objects > > > depending on the database default collation, nor the database > > > itself (as per a collation version mismatch in pg_database). > > > > Indeed. The default collation is "pinned", so we don't record any dependency > > on it. > > Indirectly we do, don't we ? Or else > > > > WHERE > > > collprovider IN ('d', 'c') > > would not make much sense, right ? What I meant is that we don't insert record in pg_depend to track dependencies on pinned object, including the default collation. The collprovider here comes from pg_index.indcollation which is a different thing. It can indeed store the default collation, but it's only a small step toward less false negative. Try that query with e.g. CREATE INDEX ON sometable ( (somecol > 'somevalue') ); or CREATE INDEX ON sometable (someid) WHERE somecol > 'somevalue'; Both clearly can get corrupted if the underlying collation library changes the result of somecol > 'somevalue', but wouldn't be detected by that query. There are likely a lot more cases that would be missed, you can refer to the discussions from a couple years ago when we tried to properly track all index collation dependencies. > The comment above the query in the official documentation is rather assertive > (even if may true to the letter) and may warrant some more cautionary > wording ? Added, perhaps, some variation of this: > > > For now, the only safe way to go is either reindex everything, or everything > > except some safe cases (non-partial indexes on plain-non-collatable datatypes > > only). I think the comment is very poorly worded, as it leads readers to believe that objects with a pg_depend dependency on a collation are the only one that would get corrupted in case of glibc/ICU upgrade. I agree that there should be a big fat red warning saying something like "reindex everything if there's any discrepancy between the recorded collation version and the currently reported one unless you REALLY know what you're doing."
On 11/10/22 02:33, Julien Rouhaud wrote: [snip] > For now, the only safe way to go is either reindex everything, or everything > except some safe cases (non-partial indexes on plain-non-collatable datatypes > only). Usually, those safe cases are usually enough to avoid most of useless > reindex activity. In this situation, I queried for all indices with text-type columns (including UUID) and reindexed them. -- Angular momentum makes the world go 'round.
On Thu, Nov 10, 2022 at 08:04:37AM -0600, Ron wrote: > On 11/10/22 02:33, Julien Rouhaud wrote: > [snip] > > For now, the only safe way to go is either reindex everything, or everything > > except some safe cases (non-partial indexes on plain-non-collatable datatypes > > only). Usually, those safe cases are usually enough to avoid most of useless > > reindex activity. > > In this situation, I queried for all indices with text-type columns > (including UUID) and reindexed them. That may work in your personal use case, but it's not generally safe. I hope you don't have partial indexes, expressions or functions that internally relied on collatable datatypes, or indexes on other plain collatable datatypes.
On 11/10/22 08:33, Julien Rouhaud wrote: > On Thu, Nov 10, 2022 at 08:04:37AM -0600, Ron wrote: >> On 11/10/22 02:33, Julien Rouhaud wrote: >> [snip] >>> For now, the only safe way to go is either reindex everything, or everything >>> except some safe cases (non-partial indexes on plain-non-collatable datatypes >>> only). Usually, those safe cases are usually enough to avoid most of useless >>> reindex activity. >> In this situation, I queried for all indices with text-type columns >> (including UUID) and reindexed them. > That may work in your personal use case, but it's not generally safe. I hope > you don't have partial indexes, Aren't partial indices listed in pg_indexes? > expressions or functions that internally relied > on collatable datatypes, No such indices. > or indexes on other plain collatable datatypes. What other data types besides the text types are collatable? -- Angular momentum makes the world go 'round.
On Thu, Nov 10, 2022 at 08:39:03AM -0600, Ron wrote: > On 11/10/22 08:33, Julien Rouhaud wrote: > > On Thu, Nov 10, 2022 at 08:04:37AM -0600, Ron wrote: > > > On 11/10/22 02:33, Julien Rouhaud wrote: > > > [snip] > > > > For now, the only safe way to go is either reindex everything, or everything > > > > except some safe cases (non-partial indexes on plain-non-collatable datatypes > > > > only). Usually, those safe cases are usually enough to avoid most of useless > > > > reindex activity. > > > In this situation, I queried for all indices with text-type columns > > > (including UUID) and reindexed them. > > That may work in your personal use case, but it's not generally safe. I hope > > you don't have partial indexes, > > Aren't partial indices listed in pg_indexes? They are, but you mentioned "indices with text-type columns". Did that also include an analysis of the expressions stored in pg_index.indpred (and pg_index.indexpr), and all underlying code it could call? > > or indexes on other plain collatable datatypes. > > What other data types besides the text types are collatable? varchar, name, custom types, composite types, domains...
> > The comment above the query in the official documentation is rather assertive > > (even if may true to the letter) and may warrant some more cautionary > > wording ? Added, perhaps, some variation of this: > > > > > For now, the only safe way to go is either reindex everything, or everything > > > except some safe cases (non-partial indexes on plain-non-collatable datatypes > > > only). > > I think the comment is very poorly worded, as it leads readers to believe that > objects with a pg_depend dependency on a collation are the only one that would > get corrupted in case of glibc/ICU upgrade. > > I agree that there should be a big fat red warning saying something like > "reindex everything if there's any discrepancy between the recorded collation > version and the currently reported one unless you REALLY know what you're > doing." Given that it does not seem straightforward to mechanically detect objects in need of a collation-associated rebuild I would think that such a warning would change matters for the better, documentation-wise. Karsten