Thread: pg15b2: large objects lost on upgrade
I noticed this during beta1, but dismissed the issue when it wasn't easily
reproducible. Now, I saw the same problem while upgrading from beta1 to beta2,
so couldn't dismiss it. It turns out that LOs are lost if VACUUM FULL was run.

| /usr/pgsql-15b1/bin/initdb --no-sync -D pg15b1.dat -k
| /usr/pgsql-15b1/bin/postgres -D pg15b1.dat -c logging_collector=no -p 5678 -k /tmp&
| psql -h /tmp postgres -p 5678 -c '\lo_import /etc/shells' -c 'VACUUM FULL pg_largeobject'
| rm -fr pg15b2.dat && /usr/pgsql-15b2/bin/initdb --no-sync -k -D pg15b2.dat && /usr/pgsql-15b2/bin/pg_upgrade -d pg15b1.dat -D pg15b2.dat -b /usr/pgsql-15b1/bin
| /usr/pgsql-15b2/bin/postgres -D pg15b2.dat -c logging_collector=no -p 5678 -k /tmp&

Or, for your convenience, with paths in tmp_install:

| ./tmp_install/usr/local/pgsql/bin/initdb --no-sync -D pg15b1.dat -k
| ./tmp_install/usr/local/pgsql/bin/postgres -D pg15b1.dat -c logging_collector=no -p 5678 -k /tmp&
| psql -h /tmp postgres -p 5678 -c '\lo_import /etc/shells' -c 'VACUUM FULL pg_largeobject'
| rm -fr pg15b2.dat && ./tmp_install/usr/local/pgsql/bin/initdb --no-sync -k -D pg15b2.dat && ./tmp_install/usr/local/pgsql/bin/pg_upgrade -d pg15b1.dat -D pg15b2.dat -b ./tmp_install/usr/local/pgsql/bin
| ./tmp_install/usr/local/pgsql/bin/postgres -D pg15b2.dat -c logging_collector=no -p 5678 -k /tmp&

postgres=# table pg_largeobject_metadata ;
 16384 | 10 |

postgres=# \lo_list
 16384 | pryzbyj |

postgres=# \lo_export 16384 /dev/stdout
lo_export

postgres=# table pg_largeobject;

postgres=# \dt+ pg_largeobject
 pg_catalog | pg_largeobject | table | pryzbyj | permanent | heap | 0 bytes |

I reproduced the problem at 9a974cbcba but not its parent commit.

commit 9a974cbcba005256a19991203583a94b4f9a21a9
Author: Robert Haas <rhaas@postgresql.org>
Date:   Mon Jan 17 13:32:44 2022 -0500

    pg_upgrade: Preserve relfilenodes and tablespace OIDs.

--
Justin
On Fri, Jul 01, 2022 at 06:14:13PM -0500, Justin Pryzby wrote:
> I reproduced the problem at 9a974cbcba but not its parent commit.
>
> commit 9a974cbcba005256a19991203583a94b4f9a21a9
> Author: Robert Haas <rhaas@postgresql.org>
> Date:   Mon Jan 17 13:32:44 2022 -0500
>
>     pg_upgrade: Preserve relfilenodes and tablespace OIDs.

Oops. Robert? This reproduces as well when using pg_upgrade with the same
version as origin and target, meaning that this extra step in the TAP test
is able to reproduce the issue (extra VACUUM FULL before chdir'ing):

--- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl
+++ b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
@@ -208,6 +208,8 @@ if (defined($ENV{oldinstall}))
 	}
 }
 
+$oldnode->safe_psql("regression", "VACUUM FULL pg_largeobject;");
+
 # In a VPATH build, we'll be started in the source directory, but we want
--
Michael
Attachment
On Fri, Jul 1, 2022 at 7:14 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> I noticed this during beta1, but dismissed the issue when it wasn't easily
> reproducible. Now, I saw the same problem while upgrading from beta1 to beta2,
> so couldn't dismiss it. It turns out that LOs are lost if VACUUM FULL was run.

Yikes. That's really bad, and I have no idea what might be causing it,
either. I'll plan to investigate this on Tuesday unless someone gets to it
before then.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Sat, Jul 02, 2022 at 08:34:04AM -0400, Robert Haas wrote:
> On Fri, Jul 1, 2022 at 7:14 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > I noticed this during beta1, but dismissed the issue when it wasn't easily
> > reproducible. Now, I saw the same problem while upgrading from beta1 to beta2,
> > so couldn't dismiss it. It turns out that LOs are lost if VACUUM FULL was run.
>
> Yikes. That's really bad, and I have no idea what might be causing it,
> either. I'll plan to investigate this on Tuesday unless someone gets
> to it before then.

I suppose it's like Bruce said, here.
https://www.postgresql.org/message-id/20210601140949.GC22012%40momjian.us

|One tricky case is pg_largeobject, which is copied from the old to new
|cluster since it has user data. To preserve that relfilenode, you would
|need to have pg_upgrade perform cluster surgery in each database to
|renumber its relfilenode to match since it is created by initdb. I
|can't think of a case where pg_upgrade already does something like that.

Rather than setting the filenode of the next object as for user tables,
pg_upgrade needs to UPDATE the relfilenode.

This patch "works for me" but feel free to improve on it.
Attachment
On Sat, Jul 02, 2022 at 08:34:04AM -0400, Robert Haas wrote:
> On Fri, Jul 1, 2022 at 7:14 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > I noticed this during beta1, but dismissed the issue when it wasn't easily
> > reproducible. Now, I saw the same problem while upgrading from beta1 to beta2,
> > so couldn't dismiss it. It turns out that LOs are lost if VACUUM FULL was run.
>
> Yikes. That's really bad, and I have no idea what might be causing it,
> either. I'll plan to investigate this on Tuesday unless someone gets
> to it before then.

As far as I can see the data is still there, it's just that the new cluster
keeps its default relfilenode instead of preserving the old cluster's value:

regression=# table pg_largeobject;
 loid | pageno | data
------+--------+------
(0 rows)

regression=# select oid, relfilenode from pg_class where relname = 'pg_largeobject';
 oid  | relfilenode
------+-------------
 2613 |        2613
(1 row)

-- using the value from the old cluster
regression=# update pg_class set relfilenode = 39909 where oid = 2613;
UPDATE 1

regression=# table pg_largeobject;
 loid  | pageno |      data
-------+--------+------------------
 33211 |      0 | \x0a4920776[...]
 34356 |      0 | \xdeadbeef
(2 rows)
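A quick way to see the mismatch described above is to compare what pg_class
claims against the path the server actually reads, using the stock functions
pg_relation_filenode()/pg_relation_filepath() (a sketch; 39909 is the value
from the example above, not a fixed constant):

    regression=# SELECT oid, relfilenode, pg_relation_filepath(oid)
                 FROM pg_class WHERE relname = 'pg_largeobject';

After the broken upgrade this reports relfilenode 2613, i.e. the empty
initdb-created file, while the data copied over from the old cluster sits in
a file (39909 here) that nothing in the catalogs points at anymore.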
I was able to reproduce the issue. Also, the issue does not occur with code
from before the preserve-relfilenode commit.
I tested your patch and it fixes the problem.
I am currently analyzing a few things related to the issue. I will come back once my analysis is completed.
On Sat, Jul 2, 2022 at 9:19 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> On Sat, Jul 02, 2022 at 08:34:04AM -0400, Robert Haas wrote:
> > On Fri, Jul 1, 2022 at 7:14 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > > I noticed this during beta1, but dismissed the issue when it wasn't easily
> > > reproducible. Now, I saw the same problem while upgrading from beta1 to beta2,
> > > so couldn't dismiss it. It turns out that LOs are lost if VACUUM FULL was run.
> >
> > Yikes. That's really bad, and I have no idea what might be causing it,
> > either. I'll plan to investigate this on Tuesday unless someone gets
> > to it before then.
>
> I suppose it's like Bruce said, here.
> https://www.postgresql.org/message-id/20210601140949.GC22012%40momjian.us
>
> |One tricky case is pg_largeobject, which is copied from the old to new
> |cluster since it has user data. To preserve that relfilenode, you would
> |need to have pg_upgrade perform cluster surgery in each database to
> |renumber its relfilenode to match since it is created by initdb. I
> |can't think of a case where pg_upgrade already does something like that.
>
> Rather than setting the filenode of the next object as for user tables,
> pg_upgrade needs to UPDATE the relfilenode.
>
> This patch "works for me" but feel free to improve on it.
On Sat, Jul 2, 2022 at 11:49 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> I suppose it's like Bruce said, here.
>
> https://www.postgresql.org/message-id/20210601140949.GC22012%40momjian.us

Well, I feel dumb. I remember reading that email back when Bruce sent it, but
it seems that it slipped out of my head between then and when I committed. I
think your patch is fine, except that I think maybe we should adjust this
dump comment:

-- For binary upgrade, set pg_largeobject relfrozenxid, relminmxid and relfilenode

Perhaps:

-- For binary upgrade, preserve values for pg_largeobject and its index

Listing the exact properties preserved seems less important to me than
mentioning that the second UPDATE statement is for its index -- because if
you look at the SQL that's generated, you can see what's being preserved, but
you don't automatically know why there are two UPDATE statements or what the
rows with those OIDs are.

I had a moment of panic this morning where I thought maybe the whole patch
needed to be reverted. I was worried that we might need to preserve the OID
of every system table and index. Otherwise, what happens if the OID counter
in the old cluster wraps around and some user-created object gets an OID that
the system tables are using in the new cluster? However, I think this can't
happen, because when the OID counter wraps around, it wraps around to 16384,
and the relfilenode values for newly created system tables and indexes are
all less than 16384. So I believe we only need to fix pg_largeobject and its
index, and I think your patch does that.

Anyone else see it differently?

--
Robert Haas
EDB: http://www.enterprisedb.com
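For reference, this is roughly the shape of the SQL the binary-upgrade dump
emits with the patch applied (a sketch, with illustrative relfilenode values;
2613 is pg_largeobject and 2683 is pg_largeobject_loid_pn_index):

    -- For binary upgrade, preserve values for pg_largeobject and its index
    UPDATE pg_catalog.pg_class
    SET relfrozenxid = '990', relminmxid = '1', relfilenode = '39909'
    WHERE oid = 2613;
    UPDATE pg_catalog.pg_class
    SET relfrozenxid = '0', relminmxid = '0', relfilenode = '39912'
    WHERE oid = 2683;

The second UPDATE, for the index, is the one the proposed comment is meant to
explain.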
On Tue, Jul 05, 2022 at 12:43:54PM -0400, Robert Haas wrote:
> On Sat, Jul 2, 2022 at 11:49 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > I suppose it's like Bruce said, here.
> >
> > https://www.postgresql.org/message-id/20210601140949.GC22012%40momjian.us
>
> Well, I feel dumb. I remember reading that email back when Bruce sent
> it, but it seems that it slipped out of my head between then and when
> I committed. I think your patch is fine, except that I think maybe we

My patch also leaves a 0 byte file around from initdb, which is harmless, but
dirty.

I've seen before where a bunch of 0 byte files are abandoned in an
otherwise-empty tablespace, with no associated relation, and I have to "rm"
them to be able to drop the tablespace. Maybe that's a known issue, maybe
it's due to crashes or some other edge case, maybe it's of no consequence,
and maybe it's already been fixed or is being fixed already. But it'd be nice
to avoid another way to have 0 byte files - especially ones named with system
OIDs.

> Listing the exact properties preserved seems less important to me than
> mentioning that the second UPDATE statement is for its index --

+1

--
Justin
On Tue, Jul 5, 2022 at 12:56 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> My patch also leaves a 0 byte file around from initdb, which is harmless, but
> dirty.
>
> I've seen before where a bunch of 0 byte files are abandoned in an
> otherwise-empty tablespace, with no associated relation, and I have to "rm"
> them to be able to drop the tablespace. Maybe that's a known issue, maybe
> it's due to crashes or some other edge case, maybe it's of no consequence,
> and maybe it's already been fixed or is being fixed already. But it'd be nice
> to avoid another way to have 0 byte files - especially ones named with system
> OIDs.

Do you want to add something to the patch to have pg_upgrade remove the stray
file?

--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Jul 05, 2022 at 02:40:21PM -0400, Robert Haas wrote:
> On Tue, Jul 5, 2022 at 12:56 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > My patch also leaves a 0 byte file around from initdb, which is harmless, but
> > dirty.
> >
> > I've seen before where a bunch of 0 byte files are abandoned in an
> > otherwise-empty tablespace, with no associated relation, and I have to "rm"
> > them to be able to drop the tablespace. Maybe that's a known issue, maybe
> > it's due to crashes or some other edge case, maybe it's of no consequence,
> > and maybe it's already been fixed or is being fixed already. But it'd be nice
> > to avoid another way to have 0 byte files - especially ones named with system
> > OIDs.
>
> Do you want to add something to the patch to have pg_upgrade remove
> the stray file?

I'm looking into it, but it'd help to hear suggestions about where to put it.
My current ideas aren't very good.

--
Justin
On Wed, Jul 6, 2022 at 7:56 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> I'm looking into it, but it'd help to hear suggestions about where to put it.
> My current ideas aren't very good.

In main() there is a comment that begins "Most failures happen in
create_new_objects(), which has just completed at this point." I am thinking
you might want to insert a new function call just before that comment, like
remove_orphaned_files() or tidy_up_new_cluster().

Another option could be to do something at the beginning of
transfer_all_new_tablespaces().

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Jul 06, 2022 at 08:25:04AM -0400, Robert Haas wrote:
> On Wed, Jul 6, 2022 at 7:56 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > I'm looking into it, but it'd help to hear suggestions about where to put it.
> > My current ideas aren't very good.
>
> In main() there is a comment that begins "Most failures happen in
> create_new_objects(), which has just completed at this point." I am
> thinking you might want to insert a new function call just before that
> comment, like remove_orphaned_files() or tidy_up_new_cluster().
>
> Another option could be to do something at the beginning of
> transfer_all_new_tablespaces().

That seems like the better option, since it has access to the cluster's
filenodes.

I checked upgrades from 9.2, upgrades with/without vacuum full, and upgrades
with a DB tablespace.

Maybe it's a good idea to check that the file is empty before unlinking...

--
Justin
Attachment
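One way to express the emptiness check Justin floats here, sketched as SQL
against the running new cluster using the stock pg_stat_file() and
pg_relation_filepath() functions (pg_upgrade itself would do the equivalent
stat() on the file path from C):

    SELECT size = 0 AS initdb_file_is_empty
    FROM pg_stat_file(pg_relation_filepath('pg_catalog.pg_largeobject'));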
On Thu, Jul 7, 2022 at 1:10 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> Maybe it's a good idea to check that the file is empty before unlinking...

If we want to verify that there are no large objects in the cluster, we could
do that in check_new_cluster_is_empty(). However, even if there aren't, the
length of the file could still be more than 0, if there were some large
objects previously and then they were removed. So it's not entirely obvious
to me that we should refuse to remove a non-empty file.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Jul 7, 2022 at 01:38:44PM -0400, Robert Haas wrote:
> On Thu, Jul 7, 2022 at 1:10 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > Maybe it's a good idea to check that the file is empty before unlinking...
>
> If we want to verify that there are no large objects in the cluster,
> we could do that in check_new_cluster_is_empty(). However, even if
> there aren't, the length of the file could still be more than 0, if
> there were some large objects previously and then they were removed.
> So it's not entirely obvious to me that we should refuse to remove a
> non-empty file.

Uh, that initdb-created pg_largeobject file should not have any data in it
ever, as far as I know at that point in pg_upgrade. How would values have
gotten in there? Via pg_dump?

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson
On Thu, Jul 7, 2022 at 2:24 PM Bruce Momjian <bruce@momjian.us> wrote:
> On Thu, Jul 7, 2022 at 01:38:44PM -0400, Robert Haas wrote:
> > On Thu, Jul 7, 2022 at 1:10 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > > Maybe it's a good idea to check that the file is empty before unlinking...
> >
> > If we want to verify that there are no large objects in the cluster,
> > we could do that in check_new_cluster_is_empty(). However, even if
> > there aren't, the length of the file could still be more than 0, if
> > there were some large objects previously and then they were removed.
> > So it's not entirely obvious to me that we should refuse to remove a
> > non-empty file.
>
> Uh, that initdb-created pg_largeobject file should not have any data in it
> ever, as far as I know at that point in pg_upgrade. How would values have
> gotten in there? Via pg_dump?

I was thinking of the case where the user had done it manually before running
pg_upgrade.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Jul 07, 2022 at 02:38:44PM -0400, Robert Haas wrote:
> On Thu, Jul 7, 2022 at 2:24 PM Bruce Momjian <bruce@momjian.us> wrote:
> > Uh, that initdb-created pg_largeobject file should not have any data in it
> > ever, as far as I know at that point in pg_upgrade. How would values have
> > gotten in there? Via pg_dump?
>
> I was thinking of the case where the user had done it manually before running
> pg_upgrade.

We're referring to the new cluster, which should have been initdb'd more or
less immediately before running pg_upgrade [0].

It'd be weird to me if someone were to initdb a new cluster, then create some
large objects, and then maybe delete them, and then run pg_upgrade. Why
wouldn't they do their work on the old cluster before upgrading, or on the
new cluster afterwards?

Does anybody actually do anything significant between initdb and pg_upgrade?
Is that considered to be supported? It seems like pg_upgrade could itself run
initdb, with the correct options for locale, checksum, block size, etc.
(although it probably has to support the existing way to handle "compatible
encodings").

Actually, I think check_new_cluster_is_empty() ought to prohibit doing work
between initdb and pg_upgrade by checking that no objects have *ever* been
created in the new cluster, by checking that NextOid == 16384. But I have a
separate thread about "pg_upgrade allows itself to be re-run", and this has
more to do with that than with whether to check that the file is empty before
removing it.

--
Justin
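A sketch of the kind of guard Justin suggests, using pg_control_checkpoint()
(hedged: the OID counter recorded in pg_control only advances at checkpoints
and OIDs are logged ahead in batches, so pg_upgrade would read it via
pg_controldata rather than a live query; this only illustrates the idea):

    SELECT next_oid FROM pg_control_checkpoint();
    -- Justin's proposed test: the counter should still sit at
    -- FirstNormalObjectId (16384); a larger value suggests objects have
    -- been created in the supposedly pristine new cluster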
Justin Pryzby <pryzby@telsasoft.com> writes:
> On Thu, Jul 07, 2022 at 02:38:44PM -0400, Robert Haas wrote:
>> On Thu, Jul 7, 2022 at 2:24 PM Bruce Momjian <bruce@momjian.us> wrote:
>>> Uh, that initdb-created pg_largeobject file should not have any data in it
>>> ever, as far as I know at that point in pg_upgrade. How would values have
>>> gotten in there? Via pg_dump?

>> I was thinking of the case where the user had done it manually before running
>> pg_upgrade.

> We're referring to the new cluster, which should have been initdb'd more or
> less immediately before running pg_upgrade [0].

> It'd be weird to me if someone were to initdb a new cluster, then create some
> large objects, and then maybe delete them, and then run pg_upgrade.

AFAIK you're voiding the warranty if you make any changes at all in the
destination cluster before pg_upgrade'ing. As an example, if you created a
table there you'd be risking an OID and/or relfilenode collision with
something due to be imported from the source cluster.

> Actually, I think check_new_cluster_is_empty() ought to prohibit doing work
> between initdb and pg_upgrade by checking that no objects have *ever* been
> created in the new cluster, by checking that NextOid == 16384.

It would be good to have some such check; I'm not sure that that one in
particular is the best option.

> But I have a
> separate thread about "pg_upgrade allows itself to be re-run", and this has
> more to do with that than with whether to check that the file is empty before
> removing it.

Yeah, that's another foot-gun in the same area.

			regards, tom lane
On Thu, Jul 7, 2022 at 2:44 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> We're referring to the new cluster, which should have been initdb'd more or
> less immediately before running pg_upgrade [0].
>
> It'd be weird to me if someone were to initdb a new cluster, then create some
> large objects, and then maybe delete them, and then run pg_upgrade. Why
> wouldn't they do their work on the old cluster before upgrading, or on the
> new cluster afterwards?
>
> Does anybody actually do anything significant between initdb and pg_upgrade?
> Is that considered to be supported? It seems like pg_upgrade could itself run
> initdb, with the correct options for locale, checksum, block size, etc.
> (although it probably has to support the existing way to handle "compatible
> encodings").
>
> Actually, I think check_new_cluster_is_empty() ought to prohibit doing work
> between initdb and pg_upgrade by checking that no objects have *ever* been
> created in the new cluster, by checking that NextOid == 16384. But I have a
> separate thread about "pg_upgrade allows itself to be re-run", and this has
> more to do with that than with whether to check that the file is empty before
> removing it.

I think you're getting at a really good point here which is also my point: we
assume that nothing significant has happened between when the cluster was
created and when pg_upgrade is run, but we don't check it. Either we
shouldn't assume it, or we should check it.

So, is such activity ever legitimate? I think there are people doing it. The
motivation is that maybe you have a dump from the old database that doesn't
quite restore on the new version, but by doing something to the new cluster,
you can make it restore. For instance, maybe there are some functions that
used to be part of core and are now only available in an extension. That's
going to make pg_upgrade's dump-and-restore workflow fail, but if you install
that extension onto the new cluster, perhaps you can work around the problem.
It doesn't have to be an extension, even. Maybe some function in core just
got an extra argument, and you're using it, so the calls to that function
cause dump-and-restore to fail. You might try overloading it in the new
database with the old calling sequence to get things working.

Now, are these kinds of things considered to be supported? Well, I don't know
that we've made any statement about that one way or the other. Perhaps they
are not. But I can see why people want to use workarounds like this. The
alternative is having to dump-and-restore instead of an in-place upgrade, and
that's painfully slow by comparison. pg_upgrade itself doesn't give you any
tools to deal with this kind of situation, but the process is just loose
enough to allow people to insert their own workarounds, so they do. I'm sure
I'd do the same, in their shoes.

My view on this is that, while we probably don't want to make such things
officially supported, I don't think we should ban it outright, either. We
probably can't support an upgrade after the next cluster has been subjected
to arbitrary amounts of tinkering, but we're making a mistake if we write
code that has fragile assumptions for no really good reason. I think we can
do better than this excerpt from your patch, for example:

+       /* Keep track of whether a filenode matches the OID */
+       if (maps[mapnum].relfilenumber == LargeObjectRelationId)
+               *has_lotable = true;
+       if (maps[mapnum].relfilenumber == LargeObjectLOidPNIndexId)
+               *has_loindex = true;

I spent a while struggling to understand this because it seems to me that
every database has an LO table and an LO index, so what's up with these
names? I think what these names are really tracking is whether the
relfilenumber of pg_largeobject and its index in the old database had their
default values. But this makes the assumption that the LO table and LO index
in the new database have never been subjected to VACUUM FULL or CLUSTER and,
while there's no real reason to do that, I can't quite see what the point of
such an unnecessary and fragile assumption might be. If Dilip's patch to make
relfilenodes 56 bits gets committed, this is going to break anyway, because
with that patch, the relfilenode for a table or index doesn't start out being
equal to its OID.

Perhaps a better solution to this particular problem is to remove the backing
files for the large object table and index *before* restoring the dump,
deciding what files to remove by asking the running server for the file path.
It might seem funny to allow for dangling pg_class entries, but we're going
to create that situation for all other user rels anyway, and pg_upgrade
treats pg_largeobject as a user rel.

--
Robert Haas
EDB: http://www.enterprisedb.com
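A sketch of how pg_upgrade could ask the running new cluster which files
those are before the dump is restored (pg_relation_filepath() is a stock
function; the unlink step itself would live in pg_upgrade's C code):

    SELECT pg_relation_filepath('pg_catalog.pg_largeobject') AS lo_heap,
           pg_relation_filepath('pg_catalog.pg_largeobject_loid_pn_index') AS lo_index;
    -- returns paths such as base/<dboid>/2613, relative to the new data
    -- directory; those files (and any extra forks/segments) would be
    -- removed before pg_restore recreates pg_largeobject's contents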
On Tue, Jul 5, 2022 at 12:43:54PM -0400, Robert Haas wrote:
> On Sat, Jul 2, 2022 at 11:49 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > I suppose it's like Bruce said, here.
> >
> > https://www.postgresql.org/message-id/20210601140949.GC22012%40momjian.us
>
> Well, I feel dumb. I remember reading that email back when Bruce sent
> it, but it seems that it slipped out of my head between then and when
> I committed. I think your patch is fine, except that I think maybe we

It happens to us all.

> I had a moment of panic this morning where I thought maybe the whole

Yes, I have had those panics too.

> patch needed to be reverted. I was worried that we might need to
> preserve the OID of every system table and index. Otherwise, what
> happens if the OID counter in the old cluster wraps around and some
> user-created object gets an OID that the system tables are using in
> the new cluster? However, I think this can't happen, because when the
> OID counter wraps around, it wraps around to 16384, and the
> relfilenode values for newly created system tables and indexes are all
> less than 16384. So I believe we only need to fix pg_largeobject and
> its index, and I think your patch does that.

So, let me explain how I look at this. There are two number-spaces, oid and
relfilenode. In each number-space, there are system-assigned ones less than
16384, and higher ones for post-initdb use. What we did in pre-PG15 was to
preserve only oids, and have the relfilenode match the oid, and we have
discussed the negatives of this. For PG 15+, we preserve relfilenodes too.

These number assignment cases only work if we handle _all_ numbering, except
for non-pg_largeobject system tables. In pre-PG15, pg_largeobject was easily
handled because initdb already assigned the oid and relfilenode to be the
same for pg_largeobject, so a simple copy worked fine.

pg_largeobject is an anomaly in PG 15 because it is assigned a relfilenode in
the system number space by initdb, but then it needs to be potentially
renamed into the relfilenode user number space. This is the basis for my
email as already posted:

	https://www.postgresql.org/message-id/20210601140949.GC22012%40momjian.us

You are right to be concerned since you are spanning number spaces, but I
think you are fine because the relfilenode in the user-space cannot have been
used since it already was being used in each database. It is true we never
had a per-database rename like this before.

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson
On Thu, Jul 7, 2022 at 4:16 PM Bruce Momjian <bruce@momjian.us> wrote:
> You are right to be concerned since you are spanning number spaces, but
> I think you are fine because the relfilenode in the user-space cannot
> have been used since it already was being used in each database. It is
> true we never had a per-database rename like this before.

Thanks for checking over the reasoning, and the kind words in general.

I just committed Justin's fix for the bug, without fixing the fact that the
new cluster's original pg_largeobject files will be left orphaned afterward.
That's a relatively minor problem by comparison, and it seemed best to me not
to wait too long to get the main issue addressed.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Jul 07, 2022 at 03:11:38PM -0400, Robert Haas wrote:
> point: we assume that nothing significant has happened between when
> the cluster was created and when pg_upgrade is run, but we don't check
> it. Either we shouldn't assume it, or we should check it.
>
> So, is such activity ever legitimate? I think there are people doing
> it. The motivation is that maybe you have a dump from the old database
> that doesn't quite restore on the new version, but by doing something
> to the new cluster, you can make it restore. For instance, maybe there
> are some functions that used to be part of core and are now only
> available in an extension. That's going to make pg_upgrade's
> dump-and-restore workflow fail, but if you install that extension onto
> the new cluster, perhaps you can work around the problem. It doesn't
> have to be an extension, even. Maybe some function in core just got an
> extra argument, and you're using it, so the calls to that function
> cause dump-and-restore to fail. You might try overloading it in the
> new database with the old calling sequence to get things working.

I don't think that's even possible. pg_upgrade drops template1 and postgres
before upgrading:

 * template1 database will already exist in the target installation,
 * so tell pg_restore to drop and recreate it; otherwise we would fail
 * to propagate its database-level properties.

 * postgres database will already exist in the target installation, so
 * tell pg_restore to drop and recreate it; otherwise we would fail to
 * propagate its database-level properties.

For any other DBs, you'd hit an error if, after initdb'ing, you started the
new cluster, connected to it, created a DB (?!) and then tried to upgrade:

pg_restore: error: could not execute query: ERROR:  database "pryzbyj" already exists

So if people start, connect, and then futz with a cluster before upgrading
it, it must be for global stuff (roles, tablespaces), and not per-DB stuff.
Also, pg_upgrade refuses to run if additional roles are defined... So I'm not
seeing what someone could do on the new cluster. That supports the idea that
it'd be okay to refuse to upgrade anything other than a pristine cluster.

> Now, are these kinds of things considered to be supported? Well, I
> don't know that we've made any statement about that one way or the
> other. Perhaps they are not. But I can see why people want to use
> workarounds like this. The alternative is having to dump-and-restore
> instead of an in-place upgrade, and that's painfully slow by
> comparison.

The alternative in cases that I know about is to fix the old DB to allow it
to be upgraded. check.c has a list of the things that aren't upgradable, and
the fixes are things like ALTER TABLE ... DROP OIDS. We just added another
one to handle v14 aggregates (09878cdd4).

> My view on this is that, while we probably don't want to make such
> things officially supported, I don't think we should ban it outright,
> either. We probably can't support an upgrade after the next cluster
> has been subjected to arbitrary amounts of tinkering, but we're making
> a mistake if we write code that has fragile assumptions for no really
> good reason. I think we can do better than this excerpt from your
> patch, for example:
>
> +       /* Keep track of whether a filenode matches the OID */
> +       if (maps[mapnum].relfilenumber == LargeObjectRelationId)
> +               *has_lotable = true;
> +       if (maps[mapnum].relfilenumber == LargeObjectLOidPNIndexId)
> +               *has_loindex = true;
>
> I spent a while struggling to understand this because it seems to me
> that every database has an LO table and an LO index, so what's up with
> these names? I think what these names are really tracking is whether
> the relfilenumber of pg_largeobject and its index in the old database
> had their default values.

Yes, has_lotable means "has a LO table whose filenode matches the OID".
I will solicit suggestions for a better name.

> But this makes the assumption that the LO
> table and LO index in the new database have never been subjected to
> VACUUM FULL or CLUSTER and, while there's no real reason to do that, I
> can't quite see what the point of such an unnecessary and fragile
> assumption might be.

The idea of running initdb, starting the cluster, and connecting to it to run
VACUUM FULL scares me. Now that I think about it, it might be almost
inconsequential, since the initial DBs are dropped, and the upgrade will fail
if any non-template DB exists. But .. maybe something exciting happens if you
vacuum full a shared catalog...

Yup. With my patch:

./tmp_install/usr/local/pgsql/bin/initdb --no-sync -D pg15b1.dat -k
./tmp_install/usr/local/pgsql/bin/postgres -D pg15b1.dat -c logging_collector=no -p 5678 -k /tmp&
postgres=# \lo_import /etc/shells
postgres=# VACUUM FULL pg_largeobject;

./tmp_install/usr/local/pgsql/bin/initdb --no-sync -D pg15b2.dat -k
./tmp_install/usr/local/pgsql/bin/postgres -D pg15b2.dat -c logging_collector=no -p 5678 -k /tmp&
postgres=# VACUUM FULL pg_database;

tmp_install/usr/local/pgsql/bin/pg_upgrade -D pg15b2.dat -b tmp_install/usr/local/pgsql/bin -d pg15b1.dat

postgres=# SELECT COUNT(1), pg_relation_filenode(oid), array_agg(relname)
           FROM pg_class WHERE pg_relation_filenode(oid) IS NOT NULL
           GROUP BY 2 HAVING COUNT(1)>1 ORDER BY 1 DESC;
 count | pg_relation_filenode |                     array_agg
-------+----------------------+----------------------------------------------------
     2 |                16388 | {pg_toast_1262_index,pg_largeobject_loid_pn_index}

I don't have a deep understanding of why the DB hasn't imploded at this
point, maybe related to the filenode map file, but it seems very close to
being catastrophic.

It seems like pg_upgrade should at least check that the new cluster has no
objects with either OID or relfilenodes in the user range..

You could blame my patch, since the issue is limited to pg_largeobject.

> Perhaps a better solution to this particular problem is to remove the
> backing files for the large object table and index *before* restoring
> the dump, deciding what files to remove by asking the running server
> for the file path. It might seem funny to allow for dangling pg_class
> entries, but we're going to create that situation for all other user
> rels anyway, and pg_upgrade treats pg_largeobject as a user rel.

I'll think about it more later.

--
Justin
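A sketch of the minimal check Justin mentions, in the same style as his query
above, run against each database of the new cluster before anything is
restored:

    SELECT oid::regclass AS rel, oid, relfilenode
    FROM pg_class
    WHERE oid >= 16384 OR relfilenode >= 16384;
    -- a pristine, freshly initdb'd cluster should return no rows; any hit
    -- means an OID or relfilenode in the user range is already taken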
On Fri, Jul 8, 2022 at 11:53 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> pg_upgrade drops template1 and postgres before upgrading:

Hmm, but I bet you could fiddle with template0. Indeed, what's the difference
between a user fiddling with template0 and me committing a patch that bumps
catversion? If the latter doesn't prevent pg_upgrade from working when the
release comes out, why should the former?

> I don't have a deep understanding of why the DB hasn't imploded at this
> point, maybe related to the filenode map file, but it seems very close to
> being catastrophic.

Yeah, good example.

> It seems like pg_upgrade should at least check that the new cluster has no
> objects with either OID or relfilenodes in the user range..

Well... I think it's not quite that simple. There's an argument for that
rule, to be sure, but in some sense it's far too strict. We only preserve the
OIDs of tablespaces, types, enums, roles, and now relations. So if you create
any other type of object in the new cluster, like say a function, it's
totally fine. You could still fail if the old cluster happens to contain a
function with the same signature, but that's kind of a different issue. An
OID collision for any of the many object types for which OIDs are not
preserved is no problem at all.

But even if we restrict ourselves to talking about an object type for which
OIDs are preserved, it's still not quite that simple. For example, if I
create a relation in the new cluster, its OID or relfilenode might be the
same as a relation that exists in the old cluster. In such a case, a failure
is inevitable. We're definitely in big trouble, and the question is only
whether pg_upgrade will notice. But it's also possible that, either by
planning or by sheer dumb luck, neither the relation nor the OID that I've
created in the new cluster is in use in the old cluster. In such a case, the
upgrade can succeed without breaking anything, or at least nothing other than
our sense of order in the universe.

Without a doubt, there are holes in pg_upgrade's error checking that need to
be plugged, but I think there is room to debate exactly what size plug we
want to use. I can't really say that it's definitely stupid to use a plug
that's definitely big enough to catch all the scenarios that might break
stuff, but I think my own preference would be to try to craft it so that it
isn't too much larger than necessary. That's partly because I do think there
are some scenarios in which modifying the new cluster might be the easiest
way of working around some problem, but also because, as a matter of
principle, I like the idea of making rules that correspond to the real
dangers. If we write a rule that says essentially "it's no good if there are
two relations sharing a relfilenode," nobody with any understanding of how
the system works can argue that bypassing it is a sensible thing to do, and
probably nobody will even try, because it's so obviously bonkers to do so.
It's a lot less obviously bonkers to try to bypass the broader prohibition
which you suggest should never be bypassed, so someone may do it, and get
themselves in trouble.

Now I'm not saying such a person will get any sympathy from this list. If for
example you #if out the sanity check and hose yourself, people here,
including me, are going to suggest that you've hosed yourself and it's not
our problem. But ... the world is full of warnings about problems that aren't
really that serious, and sometimes those have the effect of discouraging
people from taking warnings about very serious problems as seriously as they
should. I know that I no longer panic when the national weather service texts
me to say that there's a tornado or a flash flood in my area. They've just
done that too many times when there was no real issue with which I needed to
be concerned. If I get caught out by a tornado at some point, they're
probably going to say "well that's why you should always take our warnings
seriously," but I'm going to say "well that's why you shouldn't send spurious
warnings."

> > Perhaps a better solution to this particular problem is to remove the
> > backing files for the large object table and index *before* restoring
> > the dump, deciding what files to remove by asking the running server
> > for the file path. It might seem funny to allow for dangling pg_class
> > entries, but we're going to create that situation for all other user
> > rels anyway, and pg_upgrade treats pg_largeobject as a user rel.
>
> I'll think about it more later.

Sounds good.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, Jul 08, 2022 at 10:44:07AM -0400, Robert Haas wrote:
> Thanks for checking over the reasoning, and the kind words in general.

Thanks for fixing the main issue.

> I just committed Justin's fix for the bug, without fixing the fact
> that the new cluster's original pg_largeobject files will be left
> orphaned afterward. That's a relatively minor problem by comparison,
> and it seemed best to me not to wait too long to get the main issue
> addressed.

Hmm. That would mean that the more LOs a cluster has, the more bloat there
will be in the new cluster once the upgrade is done. That could be quite a
few gigs worth of data laying around depending on the data inserted in the
source cluster, and we don't have a way to know which files to remove
post-upgrade, do we?

--
Michael
Attachment
On Sun, Jul 10, 2022 at 9:31 PM Michael Paquier <michael@paquier.xyz> wrote:
> Hmm. That would mean that the more LOs a cluster has, the more bloat
> there will be in the new cluster once the upgrade is done. That could
> be quite a few gigs worth of data laying around depending on the data
> inserted in the source cluster, and we don't have a way to know which
> files to remove post-upgrade, do we?

The files that are being leaked here are the files backing the pg_largeobject
table and the corresponding index as they existed in the new cluster just
prior to the upgrade. Hopefully, the table is a zero-length file and the
index is just one block, because you're supposed to use a newly-initdb'd
cluster as the target for a pg_upgrade operation. Now, you don't actually
have to do that: as we've been discussing, there aren't as many sanity checks
in this code as there probably should be. But it would still be surprising to
initdb a new cluster, load gigabytes of large objects into it, and then use
it as the target cluster for a pg_upgrade.

As for whether it's possible to know which files to remove post-upgrade,
that's the same problem as trying to figure out whether there are orphaned
files in any other PostgreSQL cluster. We don't have a tool for it, but if
you're sure that the system is more or less quiescent - no uncommitted DDL,
in particular - you can compare pg_class.relfilenode to the actual filesystem
contents and figure out what extra stuff is present on the filesystem level.

I am not saying we shouldn't try to fix this up more thoroughly, just that I
think you are overestimating the consequences.

--
Robert Haas
EDB: http://www.enterprisedb.com
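A rough version of the comparison Robert describes, to be diffed against a
directory listing of base/<dboid>/ (a sketch only; a real check would also
have to account for relation forks, extra segment files, and mapped
relations):

    SELECT relname, pg_relation_filenode(oid) AS filenode
    FROM pg_class
    WHERE pg_relation_filenode(oid) IS NOT NULL
    ORDER BY 2;
    -- filenodes present on disk but absent from this list are candidates
    -- for being orphaned files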
On Mon, Jul 11, 2022 at 9:16 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I am not saying we shouldn't try to fix this up more thoroughly, just
> that I think you are overestimating the consequences.

I spent a bunch of time looking at this today and I have more sympathy for
Justin's previous proposal now. I found it somewhat hacky that he was relying
on the hard-coded values of LargeObjectRelationId and
LargeObjectLOidPNIndexId, but I discovered that it's harder to do better than
I had assumed. Suppose we don't want to compare against a hard-coded constant
but against the value that is actually present before the dump overwrites the
pg_class row's relfilenode. Well, we can't get that value from the database
in question before restoring the dump, because restoring the dump either
creates or recreates the database in all cases. The CREATE DATABASE command
that will be part of the dump always specifies TEMPLATE template0, so if we
want something other than a hard-coded constant, we need the
pg_class.relfilenode values from template0 for pg_largeobject and
pg_largeobject_loid_pn_index. But we can't connect to that database to query
those values, because it has datallowconn = false. Oops.

I have a few more ideas to try here. It occurs to me that we could fix this
more cleanly if we could get the dump itself to set the relfilenode for
pg_largeobject to the desired value. Right now, it's just overwriting the
relfilenode stored in the catalog without actually doing anything that would
cause a change on disk. But if we could make it change the relfilenode in a
more principled way that would actually cause an on-disk change, then the
orphaned-file problem would be fixed, because we'd always be installing the
new file over top of the old file. I'm going to investigate how hard it would
be to make that work.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Jul 12, 2022 at 04:51:44PM -0400, Robert Haas wrote:
> I spent a bunch of time looking at this today and I have more sympathy
> for Justin's previous proposal now. I found it somewhat hacky that he
> was relying on the hard-coded values of LargeObjectRelationId and
> LargeObjectLOidPNIndexId, but I discovered that it's harder to do
> better than I had assumed. Suppose we don't want to compare against a
> hard-coded constant but against the value that is actually present
> before the dump overwrites the pg_class row's relfilenode. Well, we
> can't get that value from the database in question before restoring
> the dump, because restoring the dump either creates or recreates the
> database in all cases. The CREATE DATABASE command that will be part
> of the dump always specifies TEMPLATE template0, so if we want
> something other than a hard-coded constant, we need the
> pg_class.relfilenode values from template0 for pg_largeobject and
> pg_largeobject_loid_pn_index. But we can't connect to that database to
> query those values, because it has datallowconn = false. Oops.
>
> I have a few more ideas to try here. It occurs to me that we could fix
> this more cleanly if we could get the dump itself to set the
> relfilenode for pg_largeobject to the desired value. Right now, it's
> just overwriting the relfilenode stored in the catalog without
> actually doing anything that would cause a change on disk. But if we
> could make it change the relfilenode in a more principled way that
> would actually cause an on-disk change, then the orphaned-file problem
> would be fixed, because we'd always be installing the new file over
> top of the old file. I'm going to investigate how hard it would be to
> make that work.

Thanks for all the details here. This originally sounded like the new cluster
was keeping around some orphaned relation files with the old LOs still stored
in them. But as that's just the freshly initdb'd relfilenodes of
pg_largeobject, that does not strike me as something absolutely critical to
fix for v15, as orphaned relfilenodes are an existing problem. If we finish
with a solution rather simple in design, I'd be fine to stick a fix in
REL_15_STABLE, but playing with this stable branch more than necessary may be
risky after beta2. At the end, I would be fine to drop the open item now that
the main issue has been fixed.

--
Michael
Attachment
On Tue, Jul 12, 2022 at 4:51 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I have a few more ideas to try here. It occurs to me that we could fix
> this more cleanly if we could get the dump itself to set the
> relfilenode for pg_largeobject to the desired value. Right now, it's
> just overwriting the relfilenode stored in the catalog without
> actually doing anything that would cause a change on disk. But if we
> could make it change the relfilenode in a more principled way that
> would actually cause an on-disk change, then the orphaned-file problem
> would be fixed, because we'd always be installing the new file over
> top of the old file. I'm going to investigate how hard it would be to
> make that work.

Well, it took a while to figure out how to make that work, but I believe I've
got it now. Attached please find a couple of patches that should get the job
done. They might need a bit of polish, but I think the basic concepts are
sound.

My first thought was to have the dump issue VACUUM FULL pg_largeobject after
first calling binary_upgrade_set_next_heap_relfilenode() and
binary_upgrade_set_next_index_relfilenode(), and have the VACUUM FULL use the
values configured by those calls for the new heap and index OID. I got this
working in standalone testing, only to find that this doesn't work inside
pg_upgrade. The complaint is "ERROR: VACUUM cannot run inside a transaction
block", and I think that happens because pg_restore sends the entire TOC
entry for a single object to the server as a single query, and here it
contains multiple SQL commands. That multi-command string ends up being
treated like an implicit transaction block.

So my second thought was to have the dump still call
binary_upgrade_set_next_heap_relfilenode() and
binary_upgrade_set_next_index_relfilenode(), but then afterwards call
TRUNCATE rather than VACUUM FULL. I found that a simple change to
RelationSetNewRelfilenumber() did the trick: I could then easily change the
heap and index relfilenodes for pg_largeobject to any new values I liked.
However, I realized that this approach had a problem: what if the
pg_largeobject relation had never been rewritten in the old cluster? Then the
heap and index relfilenodes would be unchanged. It's also possible that
someone might have run REINDEX in the old cluster, in which case it might
happen that the heap relfilenode was unchanged, but the index relfilenode had
changed. I spent some time fumbling around with trying to get the
non-transactional truncate path to cover these cases, but the fact that we
might need to change the relfilenode for the index but not the heap makes
this approach seem pretty awful.

But it occurred to me that in the case of a pg_upgrade, we don't really need
to keep the old storage around until commit time. We can unlink it first,
before creating the new storage, and then if the old and new storage happen
to be the same, it still works. I can think of two possible objections to
this way forward. First, it means that the operation isn't properly
transactional. However, if the upgrade fails at this stage, the new cluster
is going to have to be nuked and recreated anyway, so the fact that things
might be in an unclean state after an ERROR is irrelevant. Second, one might
wonder whether such a fix is really sufficient. For example, what happens if
the relfilenode allocated to pg_largeobject in the old cluster is assigned to
its index in the new cluster, and vice versa? To make that work, we would
need to remove all storage for all relfilenodes first, and then recreate them
all afterward. However, I don't think we need to make that work. If an object
in the old cluster has a relfilenode < 16384, then that should mean it's
never been rewritten and therefore its relfilenode in the new cluster should
be the same. The only way this wouldn't be true is if we shuffled around the
initial relfilenode assignments in a new version of PG so that the same
values were used but now for different objects, which would be a really dumb
idea. And on the other hand, if the object in the old cluster has a
relfilenode > 16384, then that relfilenode value should be unused in the new
cluster. If not, the user has tinkered with the new cluster more than they
ought.

So I tried implementing this but I didn't get it quite right the first time.
It's not enough to call smgrdounlinkall() instead of RelationDropStorage(),
because just as RelationDropStorage() does not actually drop the storage but
only schedules it to be dropped later, so also smgrdounlinkall() does not in
fact unlink all, but only some. It leaves the first segment of the main
relation fork around to guard against the hazards discussed in the header
comments for mdunlink(). But those hazards don't really matter here either,
because, again, any failure will necessarily require that the entire new
cluster be destroyed, and also because there shouldn't be any concurrent
activity in the new cluster while pg_upgrade is running. So I adjusted
smgrdounlinkall() to actually remove everything when IsBinaryUpgrade = true.
And then it all seems to work: pg_upgrade of a cluster that has had a rewrite
of pg_largeobject works, and pg_upgrade of a cluster that has not had such a
rewrite works too. Wa-hoo.

As to whether this is a good fix, I think someone could certainly argue
otherwise. This is all a bit grotty. However, I don't find it all that bad.
As long as we're moving files between one PG cluster and another using an
external tool rather than logic inside the server itself, I think we're bound
to have some hacks someplace to make it all work. To me, extending them to a
few more places to avoid leaving files behind on disk seems like a good
trade-off. Your mileage may vary.

--
Robert Haas
EDB: http://www.enterprisedb.com
Attachment
Hi,

On 2022-07-18 14:57:40 -0400, Robert Haas wrote:
> As to whether this is a good fix, I think someone could certainly
> argue otherwise. This is all a bit grotty. However, I don't find it
> all that bad. As long as we're moving files between one PG
> cluster and another using an external tool rather than logic inside
> the server itself, I think we're bound to have some hacks someplace to
> make it all work. To me, extending them to a few more places to avoid
> leaving files behind on disk seems like a good trade-off. Your mileage
> may vary.

How about adding a new binary_upgrade_* helper function for this purpose,
instead of tying it into truncate?

Greetings,

Andres Freund
On Mon, Jul 18, 2022 at 02:57:40PM -0400, Robert Haas wrote:
> So I tried implementing this but I didn't get it quite right the first
> time. It's not enough to call smgrdounlinkall() instead of
> RelationDropStorage(), because just as RelationDropStorage() does not
> actually drop the storage but only schedules it to be dropped later,
> so also smgrdounlinkall() does not in fact unlink all, but only some.
> It leaves the first segment of the main relation fork around to guard
> against the hazards discussed in the header comments for mdunlink().
> But those hazards don't really matter here either, because, again, any
> failure will necessarily require that the entire new cluster be
> destroyed, and also because there shouldn't be any concurrent activity
> in the new cluster while pg_upgrade is running. So I adjusted
> smgrdounlinkall() to actually remove everything when IsBinaryUpgrade =
> true. And then it all seems to work: pg_upgrade of a cluster that has
> had a rewrite of pg_largeobject works, and pg_upgrade of a cluster
> that has not had such a rewrite works too. Wa-hoo.

Using the IsBinaryUpgrade flag makes sense to me.

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson
On Mon, Jul 18, 2022 at 4:06 PM Andres Freund <andres@anarazel.de> wrote:
> How about adding a new binary_upgrade_* helper function for this purpose,
> instead of tying it into truncate?

I considered that briefly, but it would need to do a lot of things that
TRUNCATE already knows how to do, so it does not seem like a good idea.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, Jul 18, 2022 at 2:57 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Well, it took a while to figure out how to make that work, but I
> believe I've got it now. Attached please find a couple of patches that
> should get the job done. They might need a bit of polish, but I think
> the basic concepts are sound.

So, would people like these patches (1) committed to master only, (2)
committed to master and back-patched into v15, or (3) not committed at all?

Michael argued upthread that it was too risky to be tinkering with things at
this stage in the release cycle and, certainly, the more time goes by, the
more true that gets. But I'm not convinced that these patches involve an
inordinate degree of risk, and using beta as a time to fix things that turned
out to be buggy is part of the point of the whole thing. On the other hand,
the underlying issue isn't that serious either, and nobody seems to have
reviewed the patches in detail, either. I don't mind committing them on my
own recognizance if that's what people would prefer; I can take
responsibility for fixing anything that is further broken, as I suppose would
be expected even if someone else did review. But, I don't want to do
something that other people feel is the wrong thing to have done.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Jul 26, 2022 at 03:45:11PM -0400, Robert Haas wrote:
> On Mon, Jul 18, 2022 at 2:57 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > Well, it took a while to figure out how to make that work, but I
> > believe I've got it now. Attached please find a couple of patches that
> > should get the job done. They might need a bit of polish, but I think
> > the basic concepts are sound.
>
> So, would people like these patches (1) committed to master only, (2)
> committed to master and back-patched into v15, or (3) not committed at
> all? Michael argued upthread that it was too risky to be tinkering
> with things at this stage in the release cycle and, certainly, the
> more time goes by, the more true that gets. But I'm not convinced that
> these patches involve an inordinate degree of risk, and using beta as
> a time to fix things that turned out to be buggy is part of the point
> of the whole thing. On the other hand, the underlying issue isn't that
> serious either, and nobody seems to have reviewed the patches in
> detail, either. I don't mind committing them on my own recognizance if
> that's what people would prefer; I can take responsibility for fixing
> anything that is further broken, as I suppose would be expected even
> if someone else did review. But, I don't want to do something that
> other people feel is the wrong thing to have done.

This behavior is new in PG 15, and I would be worried to have one new
behavior in PG 15 and another one in PG 16. Therefore, I would like to see it
in PG 15 and master. I also think not doing anything and leaving these
zero-length files around would also be risky.

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson
On Tue, Jul 26, 2022 at 9:09 PM Bruce Momjian <bruce@momjian.us> wrote:
> This behavior is new in PG 15, and I would be worried to have one new
> behavior in PG 15 and another one in PG 16. Therefore, I would like to
> see it in PG 15 and master.

That's also my preference, so committed and back-patched to v15.

--
Robert Haas
EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > That's also my preference, so committed and back-patched to v15. crake has been failing its cross-version-upgrade tests [1] since this went in: log files for step xversion-upgrade-REL9_4_STABLE-HEAD: ==~_~===-=-===~_~== /home/andrew/bf/root/upgrade.crake/HEAD/REL9_4_STABLE-amcheck-1.log ==~_~===-=-===~_~== heap table "regression.pg_catalog.pg_largeobject", block 0, offset 7: xmin 7707 precedes relation freeze threshold 0:14779 heap table "regression.pg_catalog.pg_largeobject", block 201, offset 5: xmin 8633 precedes relation freeze threshold 0:14779 I'm not very sure what to make of that, but it's failed identically four times in four attempts. regards, tom lane [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2022-07-28%2020%3A33%3A20
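The "relation freeze threshold" in those amcheck complaints is pg_class.relfrozenxid (and relminmxid for multixacts). For anyone poking at a cluster by hand, the values pg_upgrade carried over can be inspected with a query along these lines (illustrative only, not part of the test):

    SELECT oid::regclass AS rel, relfrozenxid, relminmxid, age(relfrozenxid)
    FROM pg_catalog.pg_class
    WHERE oid = 'pg_catalog.pg_largeobject'::regclass;

A relfrozenxid newer than the xmin of tuples still present in the table is exactly the inconsistency amcheck is reporting above.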
On Fri, Jul 29, 2022 at 1:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > crake has been failing its cross-version-upgrade tests [1] since > this went in: > > log files for step xversion-upgrade-REL9_4_STABLE-HEAD: > ==~_~===-=-===~_~== /home/andrew/bf/root/upgrade.crake/HEAD/REL9_4_STABLE-amcheck-1.log ==~_~===-=-===~_~== > heap table "regression.pg_catalog.pg_largeobject", block 0, offset 7: > xmin 7707 precedes relation freeze threshold 0:14779 > heap table "regression.pg_catalog.pg_largeobject", block 201, offset 5: > xmin 8633 precedes relation freeze threshold 0:14779 > > I'm not very sure what to make of that, but it's failed identically > four times in four attempts. That's complaining about two tuples in the pg_largeobject table with xmin values that precede relfrozenxid -- which suggests that even after 80d6907219, relfrozenxid isn't being correctly preserved in this test case, since the last run still failed the same way. But what exactly is this test case testing? I've previously complained about buildfarm outputs not being labelled as well as they need to be in order to be easily understood by, well, me anyway. It'd sure help if the commands that led up to this problem were included in the output. I downloaded latest-client.tgz from the build farm server and am looking at TestUpgradeXversion.pm, but there's no mention of -amcheck-1.log in there, just -analyse.log, -copy.log, and following. So I suppose this is running some different code or special configuration... -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Jul 29, 2022 at 2:35 PM Robert Haas <robertmhaas@gmail.com> wrote: > But what exactly is this test case testing? I've previously complained > about buildfarm outputs not being as labelled as well as they need to > be in order to be easily understood by, well, me anyway. It'd sure > help if the commands that led up to this problem were included in the > output. I downloaded latest-client.tgz from the build farm server and > am looking at TestUpgradeXversion.pm, but there's no mention of > -amcheck-1.log in there, just -analyse.log, -copy.log, and following. > So I suppose this is running some different code or special > configuration... I was able to reproduce the problem by running 'make installcheck' against a 9.4 instance and then doing a pg_upgrade to 16devel (which took many tries because it told me about many different kinds of things that it didn't like one at a time; I just dropped objects from the regression DB until it worked). The dump output looks like this: -- For binary upgrade, set pg_largeobject relfrozenxid and relminmxid UPDATE pg_catalog.pg_class SET relfrozenxid = '0', relminmxid = '0' WHERE oid = 2683; UPDATE pg_catalog.pg_class SET relfrozenxid = '990', relminmxid = '1' WHERE oid = 2613; -- For binary upgrade, preserve pg_largeobject and index relfilenodes SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('12364'::pg_catalog.oid); SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('12362'::pg_catalog.oid); TRUNCATE pg_catalog.pg_largeobject; However, the catalogs show the relfilenode being correct, and the relfrozenxid set to a larger value. I suspect the problem here is that this needs to be done in the other order, with the TRUNCATE first and the update to the pg_class columns afterward. I think I better look into improving the TAP tests for this, too. -- Robert Haas EDB: http://www.enterprisedb.com
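Spelled out, the reordering being suggested here looks roughly like this, reusing the statements and values from the dump fragment above (a sketch only; the real pg_dump --binary-upgrade output differs in detail). The TRUNCATE has to run first because it assigns pg_largeobject a fresh relfrozenxid and relminmxid, which would otherwise overwrite whatever the UPDATEs had just installed:

    -- For binary upgrade, preserve pg_largeobject and index relfilenodes
    SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('12364'::pg_catalog.oid);
    SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('12362'::pg_catalog.oid);
    TRUNCATE pg_catalog.pg_largeobject;

    -- Only afterwards restore the horizons taken from the old cluster
    UPDATE pg_catalog.pg_class SET relfrozenxid = '0', relminmxid = '0' WHERE oid = 2683;
    UPDATE pg_catalog.pg_class SET relfrozenxid = '990', relminmxid = '1' WHERE oid = 2613;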
On Fri, Jul 29, 2022 at 3:10 PM Robert Haas <robertmhaas@gmail.com> wrote: > However, the catalogs show the relfilenode being correct, and the > relfrozenxid set to a larger value. I suspect the problem here is that > this needs to be done in the other order, with the TRUNCATE first and > the update to the pg_class columns afterward. That fix appears to be correct. Patch attached. > I think I better look into improving the TAP tests for this, too. TAP test enhancement also included in the attached patch. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Jul 29, 2022 at 3:10 PM Robert Haas <robertmhaas@gmail.com> wrote: >> However, the catalogs show the relfilenode being correct, and the >> relfrozenxid set to a larger value. I suspect the problem here is that >> this needs to be done in the other order, with the TRUNCATE first and >> the update to the pg_class columns afterward. > That fix appears to be correct. Patch attached. Looks plausible. regards, tom lane
On Fri, Jul 29, 2022 at 4:00 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Looks plausible. Committed. -- Robert Haas EDB: http://www.enterprisedb.com
On 2022-07-29 Fr 14:35, Robert Haas wrote: > On Fri, Jul 29, 2022 at 1:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> crake has been failing its cross-version-upgrade tests [1] since >> this went in: >> >> log files for step xversion-upgrade-REL9_4_STABLE-HEAD: >> ==~_~===-=-===~_~== /home/andrew/bf/root/upgrade.crake/HEAD/REL9_4_STABLE-amcheck-1.log ==~_~===-=-===~_~== >> heap table "regression.pg_catalog.pg_largeobject", block 0, offset 7: >> xmin 7707 precedes relation freeze threshold 0:14779 >> heap table "regression.pg_catalog.pg_largeobject", block 201, offset 5: >> xmin 8633 precedes relation freeze threshold 0:14779 >> >> I'm not very sure what to make of that, but it's failed identically >> four times in four attempts. > That's complaining about two tuples in the pg_largeobject table with > xmin values that precedes relfrozenxid -- which suggests that even > after 80d6907219, relfrozenxid isn't being correctly preserved in this > test case, since the last run still failed the same way. > > But what exactly is this test case testing? I've previously complained > about buildfarm outputs not being as labelled as well as they need to > be in order to be easily understood by, well, me anyway. It'd sure > help if the commands that led up to this problem were included in the > output. I downloaded latest-client.tgz from the build farm server and > am looking at TestUpgradeXversion.pm, but there's no mention of > -amcheck-1.log in there, just -analyse.log, -copy.log, and following. > So I suppose this is running some different code or special > configuration... Not really, but it is running git bleeding edge. This code comes from <https://github.com/PGBuildFarm/client-code/commit/191df23bd25eb5546b0989d71ae92747151f9f39> at lines 704-705 cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On Fri, Jul 29, 2022 at 5:13 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Jul 29, 2022 at 4:00 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Looks plausible. > > Committed. wrasse just failed the new test: [00:09:44.167](0.001s) not ok 16 - old and new horizons match after pg_upgrade [00:09:44.167](0.001s) [00:09:44.167](0.000s) # Failed test 'old and new horizons match after pg_upgrade' # at t/002_pg_upgrade.pl line 345. [00:09:44.168](0.000s) # got: '1' # expected: '0' === diff of /export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/tmp_test_D3cJ/horizon1.txt and /export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/tmp_test_D3cJ/horizon2.txt === stdout === 1c1 < pg_backend_pid|21767 --- > pg_backend_pid|22045=== stderr === === EOF === I'm slightly befuddled as to how we're ending up with a table named pg_backend_pid. That said, perhaps this is just a case of needing to prevent autovacuum from running on the new cluster before we've had a chance to record the horizons? But I'm not very confident in that explanation at this point. -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > wrasse just failed the new test: > [00:09:44.167](0.001s) not ok 16 - old and new horizons match after pg_upgrade > [00:09:44.167](0.001s) > [00:09:44.167](0.000s) # Failed test 'old and new horizons match > after pg_upgrade' > # at t/002_pg_upgrade.pl line 345. > [00:09:44.168](0.000s) # got: '1' > # expected: '0' > === diff of /export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/tmp_test_D3cJ/horizon1.txt > and /export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/tmp_test_D3cJ/horizon2.txt > === stdout === > 1c1 > < pg_backend_pid|21767 > --- > > pg_backend_pid|22045=== stderr === > === EOF === > I'm slightly befuddled as to how we're ending up with a table named > pg_backend_pid. That's not the only thing weird about this printout: there should be three columns not two in that query's output, and what happened to the trailing newline? I don't think we're looking at desired output at all. I am suspicious that the problem stems from the nonstandard way you've invoked psql to collect the horizon data. At the very least this code is duplicating bits of Cluster::psql that it'd be better not to; and I wonder if the issue is that it's not duplicating enough. The lack of -X and the lack of use of installed_command() are red flags. Do you really need to do it like this? regards, tom lane
On Fri, Jul 29, 2022 at 7:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > That's not the only thing weird about this printout: there should be > three columns not two in that query's output, and what happened to > the trailing newline? I don't think we're looking at desired > output at all. Well, that's an awfully good point. > I am suspicious that the problem stems from the nonstandard > way you've invoked psql to collect the horizon data. At the very > least this code is duplicating bits of Cluster::psql that it'd be > better not to; and I wonder if the issue is that it's not duplicating > enough. The lack of -X and the lack of use of installed_command() > are red flags. Do you really need to do it like this? Well, I just copied the pg_dump block which occurs directly beforehand and modified it. I think that must take care of setting the path properly, else we'd have things blowing up all over the place. But the lack of -X could be an issue. The missing newline thing happens for me locally too, if I revert the bug fix portion of that commit, but I do seem to get the right columns in the output. It looks like this: 19:24:16.057](0.000s) not ok 16 - old and new horizons match after pg_upgrade [19:24:16.058](0.000s) [19:24:16.058](0.000s) # Failed test 'old and new horizons match after pg_upgrade' # at t/002_pg_upgrade.pl line 345. [19:24:16.058](0.000s) # got: '1' # expected: '0' === diff of /Users/rhaas/pgsql/src/bin/pg_upgrade/tmp_check/tmp_test_K8Fs/horizon1.txt and /Users/rhaas/pgsql/src/bin/pg_upgrade/tmp_check/tmp_test_K8Fs/horizon2.txt === stdout === 1c1 < pg_largeobject|718|1 --- > pg_largeobject|17518|3=== stderr === === EOF === [19:24:16.066](0.008s) 1..16 I can't account for the absence of a newline there, because hexdump says that both horizon1.txt and horizon2.txt end with one, and if I run diff on those two files and pipe the output into hexdump, it sees a newline at the end of that output too. -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Jul 29, 2022 at 7:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I am suspicious that the problem stems from the nonstandard >> way you've invoked psql to collect the horizon data. > Well, I just copied the pg_dump block which occurs directly beforehand > and modified it. I think that must take care of setting the path > properly, else we'd have things blowing up all over the place. But the > lack of -X could be an issue. Hmm. Now that I look, I do see two pre-existing "naked" invocations of psql in 002_pg_upgrade.pl, ie $oldnode->command_ok([ 'psql', '-X', '-f', $olddumpfile, 'postgres' ], 'loaded old dump file'); $oldnode->command_ok( [ 'psql', '-X', '-f', "$srcdir/src/bin/pg_upgrade/upgrade_adapt.sql", 'regression' ], 'ran adapt script'); Those suggest that maybe all you need is -X. However, I don't think either of those calls is reached by the majority of buildfarm animals, only ones that are doing cross-version-upgrade tests. So there could be more secret sauce needed to get this to pass everywhere. Personally I'd try to replace the two horizon-collection steps with $newnode->psql calls, using extra_params to inject the '-o' and target filename command line words. But if you want to try adding -X as a quicker answer, maybe that will be enough. regards, tom lane
On Fri, Jul 29, 2022 at 8:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Personally I'd try to replace the two horizon-collection steps with > $newnode->psql calls, using extra_params to inject the '-o' and target > filename command line words. But if you want to try adding -X as > a quicker answer, maybe that will be enough. Here's a patch that uses a variant of that approach: it just runs safe_psql straight up and gets the output, then writes it out to temp files if the output doesn't match and we need to run diff. Let me know what you think of this. While working on this, I noticed a few other problems. One is that the query doesn't have an ORDER BY clause, which it really should, or the output won't be stable. And the other is that I think we should be testing against the regression database, not the postgres database, because it's got a bunch of user tables in it, not just pg_largeobject. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
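In outline, the comparison in the attached patch amounts to something like the following (a simplified sketch, not the exact code in 002_pg_upgrade.pl; variable names and the exact query are illustrative):

    # Same query against the old and the new cluster; the ORDER BY keeps
    # the captured output stable across runs.
    my $horizon_query =
        "SELECT relname, relfrozenxid, relminmxid "
      . "FROM pg_catalog.pg_class WHERE relkind IN ('r', 'm') "
      . "ORDER BY relname";
    my $old_horizons = $oldnode->safe_psql('regression', $horizon_query);
    # ... pg_upgrade runs in between ...
    my $new_horizons = $newnode->safe_psql('regression', $horizon_query);
    is($new_horizons, $old_horizons,
        'old and new horizons match after pg_upgrade');

Since safe_psql() goes through Cluster::psql, it also picks up -X and the other standard psql options instead of reinventing the invocation.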
Robert Haas <robertmhaas@gmail.com> writes: > Here's a patch that uses a variant of that approach: it just runs > safe_psql straight up and gets the output, then writes it out to temp > files if the output doesn't match and we need to run diff. Let me know > what you think of this. That looks good to me, although obviously I don't know for sure if it will make wrasse happy. > While working on this, I noticed a few other problems. One is that the > query doesn't have an ORDER BY clause, which it really should, or the > output won't be stable. And the other is that I think we should be > testing against the regression database, not the postgres database, > because it's got a bunch of user tables in it, not just > pg_largeobject. Both of those sound like "d'oh" observations to me. +1 regards, tom lane
On Fri, Jul 29, 2022 at 07:16:34PM -0400, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > wrasse just failed the new test: > > > [00:09:44.167](0.001s) not ok 16 - old and new horizons match after pg_upgrade > > [00:09:44.167](0.001s) > > [00:09:44.167](0.000s) # Failed test 'old and new horizons match > > after pg_upgrade' > > # at t/002_pg_upgrade.pl line 345. > > [00:09:44.168](0.000s) # got: '1' > > # expected: '0' > > === diff of /export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/tmp_test_D3cJ/horizon1.txt > > and /export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/tmp_test_D3cJ/horizon2.txt > > === stdout === > > 1c1 > > < pg_backend_pid|21767 > > --- > > > pg_backend_pid|22045=== stderr === > > === EOF === > > > I'm slightly befuddled as to how we're ending up with a table named > > pg_backend_pid. > The lack of -X and the lack of use of installed_command() > are red flags. The pg_backend_pid is from "SELECT pg_catalog.pg_backend_pid();" in ~/.psqlrc, so the lack of -X caused that. The latest commit fixes things on a normal GNU/Linux box, so I bet it will fix wrasse. (thorntail managed not to fail that way. For unrelated reasons, I override thorntail's $HOME to a mostly-empty directory.)
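For anyone reproducing this: with a ~/.psqlrc like the one Noah describes, the difference is just the -X switch (illustrative invocations, not the test's exact command lines):

    psql regression -c 'SELECT relname, relfrozenxid, relminmxid FROM pg_class ORDER BY relname'     # also runs ~/.psqlrc; its output leaks into the capture
    psql -X regression -c 'SELECT relname, relfrozenxid, relminmxid FROM pg_class ORDER BY relname'  # --no-psqlrc: only the query's own output appears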
Noah Misch <noah@leadboat.com> writes: > The pg_backend_pid is from "SELECT pg_catalog.pg_backend_pid();" in ~/.psqlrc, > so the lack of -X caused that. The latest commit fixes things on a normal > GNU/Linux box, so I bet it will fix wrasse. Yup, looks like we're all good now. Thanks! regards, tom lane
On 7/30/22 10:40 AM, Tom Lane wrote: > Noah Misch <noah@leadboat.com> writes: >> The pg_backend_pid is from "SELECT pg_catalog.pg_backend_pid();" in ~/.psqlrc, >> so the lack of -X caused that. The latest commit fixes things on a normal >> GNU/Linux box, so I bet it will fix wrasse. > > Yup, looks like we're all good now. Thanks! Given this appears to be resolved, I have removed this from "Open Items". Thanks! Jonathan
Attachment
"Jonathan S. Katz" <jkatz@postgresql.org> writes: > Given this appears to be resolved, I have removed this from "Open > Items". Thanks! Sadly, we're still not out of the woods. I see three buildfarm failures in this test since Robert resolved the "-X" problem [1][2][3]: grassquit reported [19:34:15.249](0.001s) not ok 14 - old and new horizons match after pg_upgrade [19:34:15.249](0.001s) [19:34:15.249](0.000s) # Failed test 'old and new horizons match after pg_upgrade' # at t/002_pg_upgrade.pl line 336. === diff of /mnt/resource/bf/build/grassquit/REL_15_STABLE/pgsql.build/src/bin/pg_upgrade/tmp_check/tmp_test_z3zV/horizon1.txtand /mnt/resource/bf/build/grassquit/REL_15_STABLE/pgsql.build/src/bin/pg_upgrade/tmp_check/tmp_test_z3zV/horizon2.txt === stdout === 785c785 < spgist_point_tbl|7213|1 --- > spgist_point_tbl|7356|3 787c787 < spgist_text_tbl|7327|1 --- > spgist_text_tbl|8311|3=== stderr === === EOF === wrasse reported [06:36:35.834](0.001s) not ok 14 - old and new horizons match after pg_upgrade [06:36:35.835](0.001s) [06:36:35.835](0.000s) # Failed test 'old and new horizons match after pg_upgrade' # at t/002_pg_upgrade.pl line 336. === diff of /export/home/nm/farm/studio64v12_6/REL_15_STABLE/pgsql.build/src/bin/pg_upgrade/tmp_check/tmp_test_mLle/horizon1.txtand /export/home/nm/farm/studio64v12_6/REL_15_STABLE/pgsql.build/src/bin/pg_upgrade/tmp_check/tmp_test_mLle/horizon2.txt === stdout === 138c138 < delete_test_table|7171|1 --- > delete_test_table|7171|3=== stderr === Warning: missing newline at end of file /export/home/nm/farm/studio64v12_6/REL_15_STABLE/pgsql.build/src/bin/pg_upgrade/tmp_check/tmp_test_mLle/horizon1.txt Warning: missing newline at end of file /export/home/nm/farm/studio64v12_6/REL_15_STABLE/pgsql.build/src/bin/pg_upgrade/tmp_check/tmp_test_mLle/horizon2.txt=== EOF=== conchuela doesn't seem to have preserved the detailed log, but it failed at the same place: # Failed test 'old and new horizons match after pg_upgrade' # at t/002_pg_upgrade.pl line 336. Not sure what to make of this, except that maybe the test is telling us about an actual bug of exactly the kind it's designed to expose. regards, tom lane [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2022-08-01%2019%3A25%3A43 [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2022-08-02%2004%3A18%3A18 [3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2022-08-02%2014%3A56%3A49
On 8/2/22 1:12 PM, Tom Lane wrote: > "Jonathan S. Katz" <jkatz@postgresql.org> writes: >> Given this appears to be resolved, I have removed this from "Open >> Items". Thanks! > > Sadly, we're still not out of the woods. I see three buildfarm > failures in this test since Robert resolved the "-X" problem [1][2][3]: > > Not sure what to make of this, except that maybe the test is telling > us about an actual bug of exactly the kind it's designed to expose. Looking at the test code, is there anything that could have changed the relfrozenxid or relminmxid independently of the test on these systems? That said, I do think we should reopen the item to figure out what's going on. Doing so now. Jonathan
Attachment
On Tue, Aug 2, 2022 at 1:12 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Not sure what to make of this, except that maybe the test is telling > us about an actual bug of exactly the kind it's designed to expose. That could be, but what would the bug be exactly? It's hard to think of a more direct way of setting relminmxid and relfrozenxid than updating pg_class. It doesn't seem realistic to suppose that we have a bug where setting a column in a system table to an integer value sometimes sets it to a slightly larger integer instead. If the values on the new cluster seemed like they had never been set, or if it seemed like they had been set to completely random values, then I'd suspect a bug in the mechanism, but here it seems more believable to me to think that we're actually setting the correct values and then something - maybe autovacuum - bumps them again before we have a chance to look at them. I'm not quite sure how to rule that theory in or out, though. -- Robert Haas EDB: http://www.enterprisedb.com
On 8/2/22 3:23 PM, Robert Haas wrote: > On Tue, Aug 2, 2022 at 1:12 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Not sure what to make of this, except that maybe the test is telling >> us about an actual bug of exactly the kind it's designed to expose. > > That could be, but what would the bug be exactly? It's hard to think > of a more direct way of setting relminmxid and relfrozenxid than > updating pg_class. It doesn't seem realistic to suppose that we have a > bug where setting a column in a system table to an integer value > sometimes sets it to a slightly larger integer instead. If the values > on the new cluster seemed like they had never been set, or if it > seemed like they had been set to completely random values, then I'd > suspect a bug in the mechanism, but here it seems more believable to > me to think that we're actually setting the correct values and then > something - maybe autovacuum - bumps them again before we have a > chance to look at them. FWIW (and I have not looked deeply at the code), I was thinking it could be something along those lines, given 1. the randomness of the underlying systems of the impacted farm animals and 2. it was only the three mentioned. > I'm not quite sure how to rule that theory in or out, though. Without overcomplicating this, are we able to check to see if autovacuum ran during the course of the test? Jonathan
Attachment
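Without overcomplicating the test, one way to answer that from inside the old cluster before it shuts down is the cumulative statistics view (a sketch; the server log, which Tom quotes a couple of messages below, is the other obvious source):

    SELECT schemaname, relname, last_autovacuum, autovacuum_count
    FROM pg_stat_all_tables
    WHERE autovacuum_count > 0
    ORDER BY last_autovacuum DESC NULLS LAST;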
"Jonathan S. Katz" <jkatz@postgresql.org> writes: > On 8/2/22 1:12 PM, Tom Lane wrote: >> Sadly, we're still not out of the woods. I see three buildfarm >> failures in this test since Robert resolved the "-X" problem [1][2][3]: > Looking at the test code, is there anything that could have changed the > relfrozenxid or relminxid independently of the test on these systems? Hmmm ... now that you mention it, I see nothing in 002_pg_upgrade.pl that attempts to turn off autovacuum on either the source server or the destination. So one plausible theory is that autovac moved the numbers since we checked. If that is the explanation, then it leaves us with few good options. I am not in favor of disabling autovacuum in the test: ordinary users are not going to do that while pg_upgrade'ing, so it'd make the test less representative of real-world usage, which seems like a bad idea. We could either drop this particular check again, or weaken it to allow new relfrozenxid >= old relfrozenxid, likewise relminxid. regards, tom lane
"Jonathan S. Katz" <jkatz@postgresql.org> writes: > On 8/2/22 3:23 PM, Robert Haas wrote: >> I'm not quite sure how to rule that theory in or out, though. > Without overcomplicating this, are we able to check to see if autovacuum > ran during the course of the test? Looks like we're all thinking along the same lines. grassquit shows this at the end of the old server's log, immediately after the query to retrieve the old horizons: 2022-08-01 19:33:41.608 UTC [1487114][postmaster][:0] LOG: received fast shutdown request 2022-08-01 19:33:41.611 UTC [1487114][postmaster][:0] LOG: aborting any active transactions 2022-08-01 19:33:41.643 UTC [1487114][postmaster][:0] LOG: background worker "logical replication launcher" (PID 1487132)exited with exit code 1 2022-08-01 19:33:41.643 UTC [1493875][autovacuum worker][5/6398:0] FATAL: terminating autovacuum process due to administratorcommand 2022-08-01 19:33:41.932 UTC [1487121][checkpointer][:0] LOG: checkpoint complete: wrote 1568 buffers (9.6%); 0 WAL file(s)added, 0 removed, 33 recycled; write=31.470 s, sync=0.156 s, total=31.711 s; sync files=893, longest=0.002 s, average=0.001s; distance=33792 kB, estimate=34986 kB 2022-08-01 19:33:41.933 UTC [1487121][checkpointer][:0] LOG: shutting down and wrasse shows this: 2022-08-02 06:35:01.974 CEST [5606:6] LOG: received fast shutdown request 2022-08-02 06:35:01.974 CEST [5606:7] LOG: aborting any active transactions 2022-08-02 06:35:01.975 CEST [6758:1] FATAL: terminating autovacuum process due to administrator command 2022-08-02 06:35:01.975 CEST [6758:2] CONTEXT: while vacuuming index "spgist_point_idx" of relation "public.spgist_point_tbl" 2022-08-02 06:35:01.981 CEST [5606:8] LOG: background worker "logical replication launcher" (PID 5612) exited with exitcode 1 2022-08-02 06:35:01.995 CEST [5607:42] LOG: shutting down While not smoking guns, these definitely prove that autovac was active. regards, tom lane
On 8/2/22 3:39 PM, Tom Lane wrote: > "Jonathan S. Katz" <jkatz@postgresql.org> writes: >> On 8/2/22 3:23 PM, Robert Haas wrote: >>> I'm not quite sure how to rule that theory in or out, though. > >> Without overcomplicating this, are we able to check to see if autovacuum >> ran during the course of the test? > > Looks like we're all thinking along the same lines. > > While not smoking guns, these definitely prove that autovac was active. > If that is the explanation, then it leaves us with few good options. > I am not in favor of disabling autovacuum in the test: ordinary > users are not going to do that while pg_upgrade'ing, so it'd make > the test less representative of real-world usage, which seems like > a bad idea. We could either drop this particular check again, or > weaken it to allow new relfrozenxid >= old relfrozenxid, likewise > relminxid. The test does look helpful and it would catch regressions. Loosely quoting Robert on a different point upthread, we don't want to turn off the alarm just because it's spuriously going off. I think the weakened check is OK (and possibly mimics the real-world where autovacuum runs), unless you see a major drawback to it? Jonathan
Attachment
"Jonathan S. Katz" <jkatz@postgresql.org> writes: > On 8/2/22 3:39 PM, Tom Lane wrote: >>> I am not in favor of disabling autovacuum in the test: ordinary >>> users are not going to do that while pg_upgrade'ing, so it'd make >>> the test less representative of real-world usage, which seems like >>> a bad idea. We could either drop this particular check again, or >>> weaken it to allow new relfrozenxid >= old relfrozenxid, likewise >>> relminxid. > The test does look helpful and it would catch regressions. Loosely > quoting Robert on a different point upthread, we don't want to turn off > the alarm just because it's spuriously going off. > I think the weakened check is OK (and possibly mimics the real-world > where autovacuum runs), unless you see a major drawback to it? I also think that ">=" is a sufficient requirement. It'd be a bit painful to test if we had to cope with potential XID wraparound, but we know that these installations haven't been around nearly long enough for that, so a plain ">=" test ought to be good enough. (Replacing the simple "eq" code with something that can handle that doesn't look like much fun, though.) regards, tom lane
On 8/2/22 3:51 PM, Tom Lane wrote: > "Jonathan S. Katz" <jkatz@postgresql.org> writes: >> On 8/2/22 3:39 PM, Tom Lane wrote: >>>> I am not in favor of disabling autovacuum in the test: ordinary >>>> users are not going to do that while pg_upgrade'ing, so it'd make >>>> the test less representative of real-world usage, which seems like >>>> a bad idea. We could either drop this particular check again, or >>>> weaken it to allow new relfrozenxid >= old relfrozenxid, likewise >>>> relminxid. > >> The test does look helpful and it would catch regressions. Loosely >> quoting Robert on a different point upthread, we don't want to turn off >> the alarm just because it's spuriously going off. >> I think the weakened check is OK (and possibly mimics the real-world >> where autovacuum runs), unless you see a major drawback to it? > > I also think that ">=" is a sufficient requirement. It'd be a > bit painful to test if we had to cope with potential XID wraparound, > but we know that these installations haven't been around nearly > long enough for that, so a plain ">=" test ought to be good enough. > (Replacing the simple "eq" code with something that can handle that > doesn't look like much fun, though.) ...if these systems are hitting XID wraparound, we have another issue to worry about. I started modifying the test to support this behavior, but thought that because 1. we want to ensure the OID is still equal and 2. in the examples you showed, both relfrozenxid or relminxid could increment, we may want to have the individual checks on each column. I may be able to conjure something up that does the above, but it's been a minute since I wrote anything in Perl. Jonathan
Attachment
On 8/2/22 4:20 PM, Jonathan S. Katz wrote: > On 8/2/22 3:51 PM, Tom Lane wrote: >> "Jonathan S. Katz" <jkatz@postgresql.org> writes: >>> On 8/2/22 3:39 PM, Tom Lane wrote: >>>>> I am not in favor of disabling autovacuum in the test: ordinary >>>>> users are not going to do that while pg_upgrade'ing, so it'd make >>>>> the test less representative of real-world usage, which seems like >>>>> a bad idea. We could either drop this particular check again, or >>>>> weaken it to allow new relfrozenxid >= old relfrozenxid, likewise >>>>> relminxid. >> >>> The test does look helpful and it would catch regressions. Loosely >>> quoting Robert on a different point upthread, we don't want to turn off >>> the alarm just because it's spuriously going off. >>> I think the weakened check is OK (and possibly mimics the real-world >>> where autovacuum runs), unless you see a major drawback to it? >> >> I also think that ">=" is a sufficient requirement. It'd be a >> bit painful to test if we had to cope with potential XID wraparound, >> but we know that these installations haven't been around nearly >> long enough for that, so a plain ">=" test ought to be good enough. >> (Replacing the simple "eq" code with something that can handle that >> doesn't look like much fun, though.) > > ...if these systems are hitting XID wraparound, we have another issue to > worry about. > > I started modifying the test to support this behavior, but thought that > because 1. we want to ensure the OID is still equal and 2. in the > examples you showed, both relfrozenxid or relminxid could increment, we > may want to have the individual checks on each column. > > I may be able to conjure something up that does the above, but it's been > a minute since I wrote anything in Perl. Please see attached patch that does the above. Tests pass on my local environment (though I did not trigger autovacuum). Jonathan
Attachment
On Tue, Aug 2, 2022 at 3:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > The test does look helpful and it would catch regressions. Loosely > > quoting Robert on a different point upthread, we don't want to turn off > > the alarm just because it's spuriously going off. > > I think the weakened check is OK (and possibly mimics the real-world > > where autovacuum runs), unless you see a major drawback to it? > > I also think that ">=" is a sufficient requirement. It'd be a > bit painful to test if we had to cope with potential XID wraparound, > but we know that these installations haven't been around nearly > long enough for that, so a plain ">=" test ought to be good enough. > (Replacing the simple "eq" code with something that can handle that > doesn't look like much fun, though.) I don't really like this approach. Imagine that the code got broken in such a way that relfrozenxid and relminmxid were set to a value chosen at random - say, the contents of 4 bytes of unallocated memory that contained random garbage. Well, right now, the chances that this would cause a test failure are nearly 100%. With this change, they'd be nearly 0%. -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Aug 2, 2022 at 3:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I also think that ">=" is a sufficient requirement. > I don't really like this approach. Imagine that the code got broken in > such a way that relfrozenxid and relminmxid were set to a value chosen > at random - say, the contents of 4 bytes of unallocated memory that > contained random garbage. Well, right now, the chances that this would > cause a test failure are nearly 100%. With this change, they'd be > nearly 0%. If you have a different solution that you can implement by, say, tomorrow, then go for it. But I want to see some fix in there within about 24 hours, because 15beta3 wraps on Monday and we will need at least a few days to see if the buildfarm is actually stable with whatever solution is applied. A possible compromise is to allow new values that are between old value and old-value-plus-a-few-dozen. regards, tom lane
> On Aug 3, 2022, at 10:14 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Robert Haas <robertmhaas@gmail.com> writes: >>> On Tue, Aug 2, 2022 at 3:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> I also think that ">=" is a sufficient requirement. > >> I don't really like this approach. Imagine that the code got broken in >> such a way that relfrozenxid and relminmxid were set to a value chosen >> at random - say, the contents of 4 bytes of unallocated memory that >> contained random garbage. Well, right now, the chances that this would >> cause a test failure are nearly 100%. With this change, they'd be >> nearly 0%. > > If you have a different solution that you can implement by, say, > tomorrow, then go for it. But I want to see some fix in there > within about 24 hours, because 15beta3 wraps on Monday and we > will need at least a few days to see if the buildfarm is actually > stable with whatever solution is applied. Yeah, I would argue that the current proposal guards against the false positives as they currently stand. I do think Robert raises a fair point, but I wonder if another test would catch that? I don’t want to say “this would never happen” because, well, it could happen. But AIUI this would probably manifest itself in other places too? > A possible compromise is to allow new values that are between > old value and old-value-plus-a-few-dozen. Well, that’s kind of deterministic :-) I’m OK with that tweak, where “OK” means not thrilled, but I don’t see a better way to get more granular details (at least through my phone searches). I can probably have a tweak for this in a couple of hours if and when I’m on plane wifi. Jonathan
On Wed, Aug 3, 2022 at 10:13 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > If you have a different solution that you can implement by, say, > tomorrow, then go for it. But I want to see some fix in there > within about 24 hours, because 15beta3 wraps on Monday and we > will need at least a few days to see if the buildfarm is actually > stable with whatever solution is applied. I doubt that I can come up with something that quickly, so I guess we need some stopgap for now. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Aug 2, 2022 at 12:32 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Hmmm ... now that you mention it, I see nothing in 002_pg_upgrade.pl > that attempts to turn off autovacuum on either the source server or > the destination. So one plausible theory is that autovac moved the > numbers since we checked. It's very easy to believe that my work in commit 0b018fab could make that happen, which is only a few months old. It's now completely routine for non-aggressive autovacuums to advance relfrozenxid by at least a small amount. For example, when autovacuum runs against either the tellers table or the branches table during a pgbench run, it will now advance relfrozenxid, every single time. And to a very recent value. This will happen in spite of the fact that no freezing ever takes place -- it's just a consequence of the oldest extant XID consistently being quite young each time, due to workload characteristics. -- Peter Geoghegan
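The effect is easy to see by hand on a v15/HEAD cluster; something like the following (a sketch using a pgbench table as in the example above; exact numbers will differ) will typically show relfrozenxid moving forward after a plain, non-aggressive VACUUM even though nothing was frozen:

    SELECT relfrozenxid FROM pg_class WHERE oid = 'pgbench_tellers'::regclass;
    VACUUM pgbench_tellers;
    SELECT relfrozenxid FROM pg_class WHERE oid = 'pgbench_tellers'::regclass;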
On Wed, Aug 3, 2022 at 6:59 AM Robert Haas <robertmhaas@gmail.com> wrote: > I don't really like this approach. Imagine that the code got broken in > such a way that relfrozenxid and relminmxid were set to a value chosen > at random - say, the contents of 4 bytes of unallocated memory that > contained random garbage. Well, right now, the chances that this would > cause a test failure are nearly 100%. With this change, they'd be > nearly 0%. If that kind of speculative bug existed, and somehow triggered before the concurrent autovacuum ran (which seems very likely to be the source of the test flappiness), then it would still be caught, most likely. VACUUM itself has the following defenses: * The defensive "can't happen" errors added to heap_prepare_freeze_tuple() and related freezing routines by commit 699bf7d0 in 2017, as hardening following the "freeze the dead" bug. That'll catch XIDs that are before the relfrozenxid at the start of the VACUUM (ditto for MXIDs/relminmxid). * The assertion added in my recent commit 0b018fab, which verifies that we're about to set relfrozenxid to something sane. * VACUUM now warns when it sees a *previous* relfrozenxid that's apparently "in the future", following recent commit e83ebfe6. This problem scenario is associated with several historic bugs in pg_upgrade, where for one reason or another it failed to carry forward correct relfrozenxid and/or relminmxid values for a table (see the commit message for references to those old pg_upgrade bugs). It might make sense to run a manual VACUUM right at the end of the test, so that you reliably get this kind of coverage, even without autovacuum. -- Peter Geoghegan
Hi, On 2022-08-03 09:59:40 -0400, Robert Haas wrote: > On Tue, Aug 2, 2022 at 3:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > The test does look helpful and it would catch regressions. Loosely > > > quoting Robert on a different point upthread, we don't want to turn off > > > the alarm just because it's spuriously going off. > > > I think the weakened check is OK (and possibly mimics the real-world > > > where autovacuum runs), unless you see a major drawback to it? > > > > I also think that ">=" is a sufficient requirement. It'd be a > > bit painful to test if we had to cope with potential XID wraparound, > > but we know that these installations haven't been around nearly > > long enough for that, so a plain ">=" test ought to be good enough. > > (Replacing the simple "eq" code with something that can handle that > > doesn't look like much fun, though.) > > I don't really like this approach. Imagine that the code got broken in > such a way that relfrozenxid and relminmxid were set to a value chosen > at random - say, the contents of 4 bytes of unallocated memory that > contained random garbage. Well, right now, the chances that this would > cause a test failure are nearly 100%. With this change, they'd be > nearly 0%. Can't that pretty easily be addressed by subsequently querying txid_current(), and checking that the value isn't newer than that? Greetings, Andres Freund
On Wed, Aug 3, 2022 at 1:20 PM Andres Freund <andres@anarazel.de> wrote: > > I don't really like this approach. Imagine that the code got broken in > > such a way that relfrozenxid and relminmxid were set to a value chosen > > at random - say, the contents of 4 bytes of unallocated memory that > > contained random garbage. Well, right now, the chances that this would > > cause a test failure are nearly 100%. With this change, they'd be > > nearly 0%. > > Can't that pretty easily be addressed by subsequently querying txid_current(), > and checking that the value isn't newer than that? It couldn't hurt to do that as well, in passing (at the same time as testing that newrelfrozenxid >= oldrelfrozenxid directly). But deliberately running VACUUM afterwards seems like a good idea. We really ought to expect VACUUM to catch cases where relfrozenxid/relminmxid is faulty, at least in cases where it can be proven wrong by noticing some kind of inconsistency. -- Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes: > It couldn't hurt to do that as well, in passing (at the same time as > testing that newrelfrozenxid >= oldrelfrozenxid directly). But > deliberately running VACUUM afterwards seems like a good idea. We > really ought to expect VACUUM to catch cases where > relfrozenxid/relminmxid is faulty, at least in cases where it can be > proven wrong by noticing some kind of inconsistency. That doesn't seem like it'd be all that thorough: we expect VACUUM to skip pages whenever possible. I'm also a bit concerned about the expense, though admittedly this test is ridiculously expensive already. regards, tom lane
On Wed, Aug 3, 2022 at 4:20 PM Andres Freund <andres@anarazel.de> wrote: > > I don't really like this approach. Imagine that the code got broken in > > such a way that relfrozenxid and relminmxid were set to a value chosen > > at random - say, the contents of 4 bytes of unallocated memory that > > contained random garbage. Well, right now, the chances that this would > > cause a test failure are nearly 100%. With this change, they'd be > > nearly 0%. > > Can't that pretty easily be addressed by subsequently querying txid_current(), > and checking that the value isn't newer than that? Hmm, maybe. The old cluster shouldn't have wrapped around ever, since we just created it. So the value in the new cluster should be >= that value and <= the result of txid_current() ignoring wraparound. Or we could disable autovacuum on the new cluster, which I think is a better solution. I like it when things match exactly; it makes me feel that the universe is well-ordered. -- Robert Haas EDB: http://www.enterprisedb.com
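Capturing that upper bound is cheap; roughly (a sketch -- txid_current() returns an epoch-qualified 64-bit value, so it has to be reduced mod 2^32 before comparing against a 32-bit relfrozenxid, and it assigns an XID of its own, which is fine for an upper bound):

    -- Run on the new cluster right after pg_upgrade; every preserved or
    -- vacuum-advanced relfrozenxid must be at or below this value.
    SELECT txid_current() % 4294967296 AS next_xid_low32;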
On Wed, Aug 3, 2022 at 1:34 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > That doesn't seem like it'd be all that thorough: we expect VACUUM > to skip pages whenever possible. I'm also a bit concerned about > the expense, though admittedly this test is ridiculously expensive > already. I bet the SKIP_PAGES_THRESHOLD stuff will be enough to make VACUUM visit every heap page in practice for a test case like this. That is all it takes to be able to safely advance relfrozenxid to whatever the oldest extant XID happened to be. However, I'm no fan of the SKIP_PAGES_THRESHOLD behavior, and already have plans to get rid of it -- so I wouldn't rely on that continuing to be true forever. It's probably not really necessary to have that kind of coverage in this particular test case. VACUUM will complain about weird relfrozenxid values in a large variety of contexts, even without assertions enabled. Mostly I was just saying: if we really do need test coverage of relfrozenxid in this context, then VACUUM is probably the way to go. -- Peter Geoghegan
Robert Haas <robertmhaas@gmail.com> writes: > Or we could disable autovacuum on the new cluster, which I think is a > better solution. I like it when things match exactly; it makes me feel > that the universe is well-ordered. Again, this seems to me to be breaking the test's real-world applicability for a (false?) sense of stability. regards, tom lane
On 2022-08-03 16:46:57 -0400, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > Or we could disable autovacuum on the new cluster, which I think is a > > better solution. I like it when things match exactly; it makes me feel > > that the universe is well-ordered. > > Again, this seems to me to be breaking the test's real-world applicability > for a (false?) sense of stability. Yea, that doesn't seem like an improvement. For example, I found the issues around relfilenode reuse in 15 due to autovacuum running in the pg_upgrade target cluster. And I recall other bugs in the area... Greetings, Andres Freund
On Wed, Aug 3, 2022 at 1:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Again, this seems to me to be breaking the test's real-world applicability > for a (false?) sense of stability. I agree. A lot of the VACUUM test flappiness issues we've had to deal with in the past now seem like problems with VACUUM itself, the test's design, or both. For example, why should we get a totally different pg_class.reltuples because we couldn't get a cleanup lock on some page? Why not just make sure to give the same answer either way, which happens to be the most useful behavior to the user? That way the test isn't just targeting implementation details. -- Peter Geoghegan
On 8/3/22 2:08 PM, Peter Geoghegan wrote: > On Wed, Aug 3, 2022 at 1:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Again, this seems to me to be breaking the test's real-world applicability >> for a (false?) sense of stability. > > I agree. > > A lot of the VACUUM test flappiness issues we've had to deal with in > the past now seem like problems with VACUUM itself, the test's design, > or both. For example, why should we get a totally different > pg_class.reltuples because we couldn't get a cleanup lock on some > page? Why not just make sure to give the same answer either way, > which happens to be the most useful behavior to the user? That way > the test isn't just targeting implementation details. After catching up (and reviewing approaches that could work while on poor wifi), it does make me wonder if we can have a useful test ready before beta 3. I did rule out wanting to do the "xid + $X" check after reviewing some of the output. I think that both $X could end up varying, and it really feels like a bandaid. Andres suggested upthread using "txid_current()" -- for the comparison, that's one thing I looked at. Would any of the XID info from "pg_control_checkpoint()" also serve for this test? If yes to the above, I should be able to modify this fairly quickly. Jonathan
"Jonathan S. Katz" <jkatz@postgresql.org> writes: > I did rule out wanting to do the "xid + $X" check after reviewing some > of the output. I think that both $X could end up varying, and it really > feels like a bandaid. It is that. I wouldn't feel comfortable with $X less than 100 or so, which is probably sloppy enough to draw Robert's ire. Still, realizing that what we want right now is a band-aid for 15beta3, I don't think it's an unreasonable short-term option. > Andres suggested upthread using "txid_current()" -- for the comparison, > that's one thing I looked at. Would any of the XID info from > "pg_control_checkpoint()" also serve for this test? I like the idea of txid_current(), but we have no comparable function for mxid do we? While you could get both numbers from pg_control_checkpoint(), I doubt that's sufficiently up-to-date. regards, tom lane
On 8/3/22 4:19 PM, Tom Lane wrote: > "Jonathan S. Katz" <jkatz@postgresql.org> writes: >> I did rule out wanting to do the "xid + $X" check after reviewing some >> of the output. I think that both $X could end up varying, and it really >> feels like a bandaid. > > It is that. I wouldn't feel comfortable with $X less than 100 or so, > which is probably sloppy enough to draw Robert's ire. Still, realizing > that what we want right now is a band-aid for 15beta3, I don't think > it's an unreasonable short-term option. Attached is the "band-aid / sloppy" version of the patch. Given that in the test examples I kept seeing deltas over 100 for relfrozenxid, I chose 1000. The mxid delta was less, but I kept it at 1000 for consistency (and because I hope this test is short-lived in this state), but I can be talked into otherwise. >> Andres suggested upthread using "txid_current()" -- for the comparison, >> that's one thing I looked at. Would any of the XID info from >> "pg_control_checkpoint()" also serve for this test? > > I like the idea of txid_current(), but we have no comparable > function for mxid do we? While you could get both numbers from > pg_control_checkpoint(), I doubt that's sufficiently up-to-date. ...unless we force a checkpoint in the test? Jonathan
Attachment
On Wed, Aug 3, 2022 at 7:19 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Jonathan S. Katz" <jkatz@postgresql.org> writes: > > I did rule out wanting to do the "xid + $X" check after reviewing some > > of the output. I think that both $X could end up varying, and it really > > feels like a bandaid. > > It is that. I wouldn't feel comfortable with $X less than 100 or so, > which is probably sloppy enough to draw Robert's ire. Still, realizing > that what we want right now is a band-aid for 15beta3, I don't think > it's an unreasonable short-term option. 100 << 2^32, so it's not terrible, but I'm honestly coming around to the view that we ought to just nuke this test case. From my point of view, the assertion that disabling autovacuum during this test case would make the test case useless seems to be incorrect. The original purpose of the test was to make sure that the pre-upgrade schema matched the post-upgrade schema. If having autovacuum running or not running affects that, we have a serious problem, but this test case isn't especially likely to find it, because whether autovacuum runs or not during the brief window where the test is running is totally unpredictable. Furthermore, if we do have such a problem, it would probably indicate that vacuum is using the wrong horizons to prune or test the visibility of the tuples. To find that out, we might want to compare values upon which the behavior of vacuum might depend, like relfrozenxid. But to do that, we have to disable autovacuum, so that the value can't change under us. From my point of view, that's making test coverage better, not worse, because any bugs in this area that can be found without explicit testing of relevant horizons are dependent on low-probability race conditions materializing in the buildfarm. If we disable autovacuum and then compare relfrozenxid and whatever else we care about explicitly, we can find bugs in that category reliably. However, if people don't accept that argument, then this new test case is kind of silly. It's not the worst idea in the world to use a threshold of 100 XIDs or something, but without disabling autovacuum, we're basically comparing two things that can't be expected to be equal, so we test and see if they're approximately equal and then call that good enough. I don't know that I believe we'll ever find a bug that way, though. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
On Thu, Aug 4, 2022 at 10:02 AM Jonathan S. Katz <jkatz@postgresql.org> wrote: > Attached is the "band-aid / sloppy" version of the patch. Given from the > test examples I kept seeing deltas over 100 for relfrozenxid, I chose > 1000. The mxid delta was less, but I kept it at 1000 for consistency > (and because I hope this test is short lived in this state), but can be > talked into otherwise. ISTM that you'd need to loop over the rows and do this for each row. Otherwise I think you're just comparing results for the first relation and ignoring all the rest. -- Robert Haas EDB: http://www.enterprisedb.com
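Something along these lines is what a per-relation version would have to look like (a sketch only, reusing the $old_horizons/$new_horizons captures from the earlier sketch and a plain numeric >=, i.e. still ignoring wraparound):

    my %old;
    foreach my $line (split /\n/, $old_horizons)
    {
        my ($relname, $relfrozenxid, $relminmxid) = split /\|/, $line;
        $old{$relname} = [ $relfrozenxid, $relminmxid ];
    }

    my $horizons_ok = 1;
    foreach my $line (split /\n/, $new_horizons)
    {
        my ($relname, $relfrozenxid, $relminmxid) = split /\|/, $line;
        $horizons_ok = 0
          unless defined $old{$relname}
          && $relfrozenxid >= $old{$relname}[0]
          && $relminmxid >= $old{$relname}[1];
    }
    ok($horizons_ok, 'no relation horizon moved backwards across pg_upgrade');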
"Jonathan S. Katz" <jkatz@postgresql.org> writes: > On 8/3/22 4:19 PM, Tom Lane wrote: >> I like the idea of txid_current(), but we have no comparable >> function for mxid do we? While you could get both numbers from >> pg_control_checkpoint(), I doubt that's sufficiently up-to-date. > ...unless we force a checkpoint in the test? Hmm ... maybe if you take a snapshot and hold that open while forcing the checkpoint and doing the subsequent checks. That seems messy though. Also, while that should serve to hold back global xmin, I'm not at all clear on whether that has a similar effect on minmxid. regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes: > 100 << 2^32, so it's not terrible, but I'm honestly coming around to > the view that we ought to just nuke this test case. I'd hesitated to suggest that, but I think that's a fine solution. Especially since we can always put it back in later if we think of a more robust way. regards, tom lane
On Thu, Aug 4, 2022 at 10:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > 100 << 2^32, so it's not terrible, but I'm honestly coming around to > > the view that we ought to just nuke this test case. > > I'd hesitated to suggest that, but I think that's a fine solution. > Especially since we can always put it back in later if we think > of a more robust way. IMHO it's 100% clear how to make it robust. If you want to check that two values are the same, you can't let one of them be overwritten by an unrelated event in the middle of the check. There are many specific things we could do here, a few of which I proposed in my previous email, but they all boil down to "don't let autovacuum screw up the results". But if you don't want to do that, and you also don't want to have random failures, the only alternatives are weakening the check and removing the test. It's kind of hard to say which is better, but I'm inclined to think that if we just weaken the test we're going to think we've got coverage for this kind of problem when we really don't. -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 2022-08-04 12:43:49 -0400, Robert Haas wrote: > On Thu, Aug 4, 2022 at 10:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Robert Haas <robertmhaas@gmail.com> writes: > > > 100 << 2^32, so it's not terrible, but I'm honestly coming around to > > > the view that we ought to just nuke this test case. > > > > I'd hesitated to suggest that, but I think that's a fine solution. > > Especially since we can always put it back in later if we think > > of a more robust way. > > IMHO it's 100% clear how to make it robust. If you want to check that > two values are the same, you can't let one of them be overwritten by > an unrelated event in the middle of the check. There are many specific > things we could do here, a few of which I proposed in my previous > email, but they all boil down to "don't let autovacuum screw up the > results". > > But if you don't want to do that, and you also don't want to have > random failures, the only alternatives are weakening the check and > removing the test. It's kind of hard to say which is better, but I'm > inclined to think that if we just weaken the test we're going to think > we've got coverage for this kind of problem when we really don't. Why do you think it's better not to have the test than to have a very limited amount of fuzziness (by using the next xid as an upper limit)? What's the bug that will reliably pass the nextxid fuzzy comparison, but not an exact comparison? Greetings, Andres Freund
Robert Haas <robertmhaas@gmail.com> writes: > IMHO it's 100% clear how to make it robust. If you want to check that > two values are the same, you can't let one of them be overwritten by > an unrelated event in the middle of the check. There are many specific > things we could do here, a few of which I proposed in my previous > email, but they all boil down to "don't let autovacuum screw up the > results". It doesn't really matter how robust a test case is, if it isn't testing the thing you need to have tested. So I remain unwilling to disable autovac in a way that won't match real-world usage. Note that the patch you proposed at [1] will not fix anything. It turns off autovac in the new node, but the buildfarm failures we've seen appear to be due to autovac running on the old node. (I believe that autovac in the new node is *also* a hazard, but it seems to be a lot less of one, presumably because of timing considerations.) To make it work, we'd have to shut off autovac in the old node before starting pg_upgrade, and that would make it unacceptably (IMHO) different from what real users will do. Conceivably, we could move all of this processing into pg_upgrade itself --- autovac disable/re-enable and capturing of the horizon data --- and that would address my complaint. I don't really want to go there though, especially when in the final analysis IT IS NOT A BUG if a rel's horizons advance a bit during pg_upgrade. It's only a bug if they become inconsistent with the rel's data, which is not what this test is testing for. regards, tom lane [1] https://www.postgresql.org/message-id/CA%2BTgmoZkBcMi%2BNikxfc54dgkWj41Q%3DZ4nuyHpheTcxA-qfS5Qg%40mail.gmail.com
On Thu, Aug 4, 2022 at 9:44 AM Robert Haas <robertmhaas@gmail.com> wrote: > But if you don't want to do that, and you also don't want to have > random failures, the only alternatives are weakening the check and > removing the test. It's kind of hard to say which is better, but I'm > inclined to think that if we just weaken the test we're going to think > we've got coverage for this kind of problem when we really don't. Perhaps amcheck's verify_heapam() function can be used here. What could be better than exhaustively verifying that the relfrozenxid (and relminmxid) invariants hold for every single tuple in the table? Those are the exact conditions that we care about, as far as relfrozenxid/relminmxid goes. My sense is that that has a much better chance of detecting a real bug at some point. This approach is arguably an example of property-based testing. -- Peter Geoghegan
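Concretely, that would be something like this against the upgraded cluster (the heap-checking side of amcheck exists as of PostgreSQL 14; the call below is a sketch with all optional arguments left at their defaults):

    CREATE EXTENSION IF NOT EXISTS amcheck;

    -- One row per problem found; an empty result means no tuple whose
    -- xmin/xmax is inconsistent with relfrozenxid/relminmxid was detected.
    SELECT * FROM verify_heapam('pg_catalog.pg_largeobject');

The pg_amcheck frontend (also new in 14) can run the same checks across a whole cluster.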
Peter Geoghegan <pg@bowt.ie> writes:
> Perhaps amcheck's verify_heapam() function can be used here. What
> could be better than exhaustively verifying that the relfrozenxid (and
> relminmxid) invariants hold for every single tuple in the table?

How much will that add to the test's runtime? I could get behind this
idea if it's not exorbitantly expensive.

			regards, tom lane
On Thu, Aug 4, 2022 at 11:07 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> How much will that add to the test's runtime? I could get behind this
> idea if it's not exorbitantly expensive.

I'm not sure offhand, but I suspect it wouldn't be too bad.

--
Peter Geoghegan
On Thu, Aug 4, 2022 at 1:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Note that the patch you proposed at [1] will not fix anything.
> It turns off autovac in the new node, but the buildfarm failures
> we've seen appear to be due to autovac running on the old node.
> (I believe that autovac in the new node is *also* a hazard, but
> it seems to be a lot less of one, presumably because of timing
> considerations.) To make it work, we'd have to shut off autovac
> in the old node before starting pg_upgrade,

Yeah, that's a fair point.

> and that would make it
> unacceptably (IMHO) different from what real users will do.

I don't agree with that, but as you say, it is a matter of opinion.

In any case, what exactly do you want to do now? Jonathan Katz has
proposed a patch to do the fuzzy comparison which I believe to be
incorrect, because I think it compares, at most, the horizons for one
table in the database. I could go work on a better version of that, or
he could, or you could, but it seems like we're running out of time
awfully quick here, given that you wanted to have this resolved today
and it's almost the end of today.

I think the most practical alternative is to put this file back to the
way it was before I started tinkering with it, and revisit this issue
after the release. If you want to do something else, that's fine, but
I'm not going to be available to work on this issue over the weekend,
so if you want to do something else, you or someone else is going to
have to take responsibility for whatever further stabilization that
other approach may require between now and the release.

--
Robert Haas
EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes:
> I think the most practical alternative is to put this file back to the
> way it was before I started tinkering with it, and revisit this issue
> after the release.

Yeah, that seems like the right thing. We are running too low on time
to have any confidence that a modified version of the test will be
reliable.

			regards, tom lane
On Thu, Aug 4, 2022 at 12:59 PM Andres Freund <andres@anarazel.de> wrote:
> Why do you think it's better to not have the test than to have a very
> limited amount of fuzziness (by using the next xid as an upper limit)?
> What's the bug that will reliably pass the nextxid fuzzy comparison,
> but not an exact comparison?

I don't know. I mean, I guess one possibility is that the nextXid value
could be wrong too, because I doubt we have any separate test that
would catch that. But more generally, I don't have a lot of confidence
in fuzzy tests. It's too easy for things to look like they're working
when they really aren't.

Let's say that the value in the old cluster was 100 and the nextXid in
the new cluster is 1000. Well, it's not like every value between 100
and 1000 is OK. The overwhelming majority of those values could never
be produced, neither from the old cluster nor from any subsequent
vacuum. Given that the old cluster is suffering no new write
transactions, there's probably exactly two values that are legal: one
being the value from the old cluster, which we know, and the other
being whatever a vacuum of that table would produce, which we don't
know, although we do know that it's somewhere in that range.

Let's flip the question on its head: why should some hypothetical
future bug that we have in this area produce a value outside that
range? If it's a failure to set the value at all, or if it generates a
value at random, we'd likely still catch it. And those are pretty
likely, so maybe the value of such a test is not zero. On the other
hand, subtle breakage might be more likely to survive developer testing
than complete breakage.

--
Robert Haas
EDB: http://www.enterprisedb.com
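[To make the shape of that bounded comparison concrete, here is a rough
sketch run on the new cluster. It assumes the old cluster's values were
saved into a table called old_horizons beforehand; that table and its
capture step are hypothetical, and the xid-to-bigint comparison ignores
wraparound, which is only safe for a freshly created test cluster:]

    -- Flag any relation whose new relfrozenxid falls outside the range
    -- [old value, nextXid). pg_control_checkpoint() reports nextXid as
    -- of the last checkpoint, in "epoch:xid" text form.
    WITH next_xid AS (
        SELECT split_part(next_xid, ':', 2)::bigint AS xid
        FROM pg_control_checkpoint()
    )
    SELECT c.relname,
           o.relfrozenxid AS old_frozenxid,
           c.relfrozenxid AS new_frozenxid
    FROM pg_class c
    JOIN old_horizons o USING (relname)
    CROSS JOIN next_xid n
    WHERE c.relfrozenxid::text <> '0'
      AND (c.relfrozenxid::text::bigint < o.relfrozenxid::text::bigint
           OR c.relfrozenxid::text::bigint >= n.xid);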
On Thu, Aug 4, 2022 at 12:15 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Given that the old cluster is suffering no new write
> transactions, there's probably exactly two values that are legal: one
> being the value from the old cluster, which we know, and the other
> being whatever a vacuum of that table would produce, which we don't
> know, although we do know that it's somewhere in that range.

What about autoanalyze?

--
Peter Geoghegan
On Thu, Aug 4, 2022 at 3:23 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Aug 4, 2022 at 12:15 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > Given that the old cluster is suffering no new write
> > transactions, there's probably exactly two values that are legal: one
> > being the value from the old cluster, which we know, and the other
> > being whatever a vacuum of that table would produce, which we don't
> > know, although we do know that it's somewhere in that range.
>
> What about autoanalyze?

What about it?

--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Aug 4, 2022 at 3:10 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > I think the most practical alternative is to put this file back to the
> > way it was before I started tinkering with it, and revisit this issue
> > after the release.
>
> Yeah, that seems like the right thing. We are running too low on time
> to have any confidence that a modified version of the test will be
> reliable.

Done.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Aug 4, 2022 at 12:31 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > What about autoanalyze?
>
> What about it?

It has a tendency to consume an XID, here or there, quite
unpredictably. I've noticed that this often involves an analyze of
pg_statistic. Have you accounted for that?

You said upthread that you don't like "fuzzy" tests, because it's too
easy for things to look like they're working when they really aren't.
I suppose that there may be some truth to that, but ISTM that there is
also a lot to be said for a test that can catch failures that weren't
specifically anticipated.

Users won't be running pg_upgrade with autovacuum disabled. And so ISTM
that just testing that relfrozenxid has been carried forward is more
precise about one particular detail (more precise than alternative
approaches to testing), but less precise about the thing that we
actually care about.

--
Peter Geoghegan
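[As an aside, whether autovacuum or autoanalyze has in fact been
touching the catalogs around the time of the check can be seen from the
standard statistics views; an illustrative query, not specific to the
pg_upgrade test:]

    -- Show which system catalogs (e.g. pg_statistic) autovacuum or
    -- autoanalyze has processed, and how often.
    SELECT relname, last_autovacuum, last_autoanalyze,
           autovacuum_count, autoanalyze_count
    FROM pg_stat_sys_tables
    WHERE schemaname = 'pg_catalog'
      AND (last_autovacuum IS NOT NULL OR last_autoanalyze IS NOT NULL)
    ORDER BY greatest(last_autovacuum, last_autoanalyze) DESC NULLS LAST;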
On Tue, Aug 2, 2022 at 03:32:05PM -0400, Tom Lane wrote:
> "Jonathan S. Katz" <jkatz@postgresql.org> writes:
> > On 8/2/22 1:12 PM, Tom Lane wrote:
> >> Sadly, we're still not out of the woods. I see three buildfarm
> >> failures in this test since Robert resolved the "-X" problem [1][2][3]:
>
> > Looking at the test code, is there anything that could have changed the
> > relfrozenxid or relminmxid independently of the test on these systems?
>
> Hmmm ... now that you mention it, I see nothing in 002_pg_upgrade.pl
> that attempts to turn off autovacuum on either the source server or
> the destination. So one plausible theory is that autovac moved the
> numbers since we checked.

Uh, pg_upgrade assumes autovacuum is not running, and tries to enforce
this:

start_postmaster()
...
    /*
     * Use -b to disable autovacuum.
     *
     * Turn off durability requirements to improve object creation speed, and
     * we only modify the new cluster, so only use it there. If there is a
     * crash, the new cluster has to be recreated anyway. fsync=off is a big
     * win on ext4.
     *
     * Force vacuum_defer_cleanup_age to 0 on the new cluster, so that
     * vacuumdb --freeze actually freezes the tuples.
     */
    snprintf(cmd, sizeof(cmd),
             "\"%s/pg_ctl\" -w -l \"%s/%s\" -D \"%s\" -o \"-p %d -b%s %s%s\" start",
             cluster->bindir, log_opts.logdir, SERVER_LOG_FILE,
             cluster->pgconfig, cluster->port,
             (cluster == &new_cluster) ?
             " -c synchronous_commit=off -c fsync=off -c full_page_writes=off -c vacuum_defer_cleanup_age=0" : "",
             cluster->pgopts ? cluster->pgopts : "",
             socket_string);

Perhaps the test script should do something similar, or this method
doesn't work anymore.

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson
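[For illustration only, "something similar" in the test script could
amount to the following SQL run against the old cluster before the
upgrade and reverted afterwards; note that this is exactly the approach
Tom pushes back on below, so it is shown as a sketch, not a proposal:]

    -- On the old cluster, before running pg_upgrade:
    ALTER SYSTEM SET autovacuum = off;
    SELECT pg_reload_conf();

    -- After the upgrade (or on test teardown), restore the default:
    ALTER SYSTEM RESET autovacuum;
    SELECT pg_reload_conf();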
Bruce Momjian <bruce@momjian.us> writes:
>> Hmmm ... now that you mention it, I see nothing in 002_pg_upgrade.pl
>> that attempts to turn off autovacuum on either the source server or
>> the destination. So one plausible theory is that autovac moved the
>> numbers since we checked.

> Uh, pg_upgrade assumes autovacuum is not running, and tries to enforce
> this:

The problems come from autovac running before or after pg_upgrade.

> Perhaps the test script should do something similar,

I'm not on board with that, for the reasons I gave upthread.

			regards, tom lane
On Mon, Aug 8, 2022 at 09:51:46PM -0400, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> >> Hmmm ... now that you mention it, I see nothing in 002_pg_upgrade.pl
> >> that attempts to turn off autovacuum on either the source server or
> >> the destination. So one plausible theory is that autovac moved the
> >> numbers since we checked.
>
> > Uh, pg_upgrade assumes autovacuum is not running, and tries to enforce
> > this:
>
> The problems come from autovac running before or after pg_upgrade.
>
> > Perhaps the test script should do something similar,
>
> I'm not on board with that, for the reasons I gave upthread.

Uh, I assume it is this paragraph:

> If that is the explanation, then it leaves us with few good options.
> I am not in favor of disabling autovacuum in the test: ordinary
> users are not going to do that while pg_upgrade'ing, so it'd make
> the test less representative of real-world usage, which seems like
> a bad idea. We could either drop this particular check again, or
> weaken it to allow new relfrozenxid >= old relfrozenxid, likewise
> relminmxid.

I thought the test was setting up a configuration that would never be
used by normal servers. Is that false?

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson
Bruce Momjian <bruce@momjian.us> writes:
> I thought the test was setting up a configuration that would never be
> used by normal servers. Is that false?

*If* we made it disable autovac before starting pg_upgrade, then that
would be a process not used by normal users. I don't care whether
pg_upgrade disables autovac during its run; that's not what's at issue
here.

			regards, tom lane