Re: pg15b2: large objects lost on upgrade - Mailing list pgsql-hackers

From Robert Haas
Subject Re: pg15b2: large objects lost on upgrade
Date
Msg-id CA+TgmoYYMXGUJO5GZk1-MByJGu_bB8CbOL6GJQC8=Bzt6x6vDg@mail.gmail.com
In response to Re: pg15b2: large objects lost on upgrade  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Tue, Jul 12, 2022 at 4:51 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I have a few more ideas to try here. It occurs to me that we could fix
> this more cleanly if we could get the dump itself to set the
> relfilenode for pg_largeobject to the desired value. Right now, it's
> just overwriting the relfilenode stored in the catalog without
> actually doing anything that would cause a change on disk. But if we
> could make it change the relfilenode in a more principled way that
> would actually cause an on-disk change, then the orphaned-file problem
> would be fixed, because we'd always be installing the new file over
> top of the old file. I'm going to investigate how hard it would be to
> make that work.

Well, it took a while to figure out how to make that work, but I
believe I've got it now. Attached please find a couple of patches that
should get the job done. They might need a bit of polish, but I think
the basic concepts are sound.

My first thought was to have the dump issue VACUUM FULL pg_largeobject
after first calling binary_upgrade_set_next_heap_relfilenode() and
binary_upgrade_set_next_index_relfilenode(), and have the VACUUM FULL
use the values configured by those calls as the new heap and index
relfilenodes. I got this working in standalone testing, only to find that this
doesn't work inside pg_upgrade. The complaint is "ERROR:  VACUUM
cannot run inside a transaction block", and I think that happens
because pg_restore sends the entire TOC entry for a single object to
the server as a single query, and here it contains multiple SQL
commands. That multi-command string ends up being treated like an
implicit transaction block.
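
Just to make the failure mode concrete, the TOC entry in question ends up
looking something like this (relfilenode values invented purely for
illustration, not the actual pg_dump output), and pg_restore ships the whole
thing to the server as one query string, so the VACUUM finds itself inside an
implicit transaction block:

    SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('16500'::pg_catalog.oid);
    SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('16501'::pg_catalog.oid);
    VACUUM FULL pg_catalog.pg_largeobject;  -- ERROR: VACUUM cannot run inside a transaction block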

So my second thought was to have the dump still call
binary_upgrade_set_next_heap_relfilenode() and
binary_upgrade_set_next_index_relfilenode(), but then afterwards call
TRUNCATE rather than VACUUM FULL. I found that a simple change to
RelationSetNewRelfilenumber() did the trick: I could then easily
change the heap and index relfilenodes for pg_largeobject to any new
values I liked. However, I realized that this approach had a problem:
what if the pg_largeobject relation had never been rewritten in the
old cluster? Then the heap and index relfilenodes would be unchanged.
It's also possible that someone might have run REINDEX in the old
cluster, in which case it might happen that the heap relfilenode was
unchanged, but the index relfilenode had changed. I spent some time
fumbling around with trying to get the non-transactional truncate path
to cover these cases, but the fact that we might need to change the
relfilenode for the index but not the heap makes this approach seem
pretty awful.
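
For reference, the dump side of this approach looks roughly like the sketch
above, except that the final command is a TRUNCATE, which is allowed inside a
transaction block, so the multi-command TOC entry is no longer a problem
(values again invented for illustration):

    SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('16500'::pg_catalog.oid);
    SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('16501'::pg_catalog.oid);
    TRUNCATE pg_catalog.pg_largeobject;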

But it occurred to me that in the case of a pg_upgrade, we don't
really need to keep the old storage around until commit time. We can
unlink it first, before creating the new storage, and then if the old
and new storage happen to be the same, it still works. I can think of
two possible objections to this way forward. First, it means that the
operation isn't properly transactional. However, if the upgrade fails
at this stage, the new cluster is going to have to be nuked and
recreated anyway, so the fact that things might be in an unclean state
after an ERROR is irrelevant. Second, one might wonder whether such a
fix is really sufficient. For example, what happens if the relfilenode
allocated to pg_largeobject in the old cluster is assigned to its index
in the new cluster, and vice versa? To make that work, we would need
to remove all storage for all relfilenodes first, and then recreate
them all afterward. However, I don't think we need to make that work.
If an object in the old cluster has a relfilenode < 16384, then that
should mean it's never been rewritten and therefore its relfilenode in
the new cluster should be the same. The only way this wouldn't be true
is if we shuffled around the initial relfilenode assignments in a new
version of PG so that the same values were used but now for different
objects, which would be a really dumb idea. And on the other hand, if
the object in the old cluster has a relfilenode >= 16384, then that
relfilenode value should be unused in the new cluster. If not, the
user has tinkered with the new cluster more than they ought.
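
As a quick sanity check of that reasoning, a query along these lines (just
illustrative, run against the old cluster) shows whether pg_largeobject or
its index has ever been given a post-initdb relfilenode:

    SELECT relname, relfilenode,
           relfilenode >= 16384 AS rewritten_since_initdb
    FROM pg_class
    WHERE relname IN ('pg_largeobject', 'pg_largeobject_loid_pn_index');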

So I tried implementing this but I didn't get it quite right the first
time. It's not enough to call smgrdounlinkall() instead of
RelationDropStorage(), because just as RelationDropStorage() does not
actually drop the storage but only schedules it to be dropped later,
so also smgrdounlinkall() does not in fact unlink all, but only some.
It leaves the first segment of the main relation fork around to guard
against the hazards discussed in the header comments for mdunlink().
But those hazards don't really matter here either, because, again, any
failure will necessarily require that the entire new cluster be
destroyed, and also because there shouldn't be any concurrent activity
in the new cluster while pg_upgrade is running. So I adjusted
smgrdounlinkall() to actually remove everything when IsBinaryUpgrade =
true. And then it all seems to work: pg_upgrade of a cluster that has
had a rewrite of pg_largeobject works, and pg_upgrade of a cluster
that has not had such a rewrite works too. Wa-hoo.
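
For anyone who wants to reproduce that testing, the rewrite case can be set
up in the old cluster with something along these lines before running
pg_upgrade (illustrative; the no-rewrite case is just a freshly-initdb'd
cluster containing a large object):

    SELECT lo_from_bytea(0, 'some large object data'::bytea);  -- give pg_largeobject some contents
    VACUUM FULL pg_largeobject;    -- heap and index both get new relfilenodes
    -- REINDEX TABLE pg_largeobject produces the index-only variant discussed above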

As to whether this is a good fix, I think someone could certainly
argue otherwise. This is all a bit grotty. However, I don't find it
all that bad. As long as we're moving files between one PG cluster
and another using an external tool rather than logic inside
the server itself, I think we're bound to have some hacks someplace to
make it all work. To me, extending them to a few more places to avoid
leaving files behind on disk seems like a good trade-off. Your mileage
may vary.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

