Re: pg15b2: large objects lost on upgrade - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: pg15b2: large objects lost on upgrade |
Date | |
Msg-id | CA+TgmoYYMXGUJO5GZk1-MByJGu_bB8CbOL6GJQC8=Bzt6x6vDg@mail.gmail.com |
In response to | Re: pg15b2: large objects lost on upgrade (Robert Haas <robertmhaas@gmail.com>) |
Responses | Re: pg15b2: large objects lost on upgrade |
List | pgsql-hackers |
On Tue, Jul 12, 2022 at 4:51 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I have a few more ideas to try here. It occurs to me that we could fix
> this more cleanly if we could get the dump itself to set the
> relfilenode for pg_largeobject to the desired value. Right now, it's
> just overwriting the relfilenode stored in the catalog without
> actually doing anything that would cause a change on disk. But if we
> could make it change the relfilenode in a more principled way that
> would actually cause an on-disk change, then the orphaned-file problem
> would be fixed, because we'd always be installing the new file over
> top of the old file. I'm going to investigate how hard it would be to
> make that work.

Well, it took a while to figure out how to make that work, but I believe I've got it now. Attached please find a couple of patches that should get the job done. They might need a bit of polish, but I think the basic concepts are sound.

My first thought was to have the dump issue VACUUM FULL pg_largeobject after first calling binary_upgrade_set_next_heap_relfilenode() and binary_upgrade_set_next_index_relfilenode(), and have the VACUUM FULL use the values configured by those calls for the new heap and index OID. I got this working in standalone testing, only to find that this doesn't work inside pg_upgrade. The complaint is "ERROR: VACUUM cannot run inside a transaction block", and I think that happens because pg_restore sends the entire TOC entry for a single object to the server as a single query, and here it contains multiple SQL commands. That multi-command string ends up being treated like an implicit transaction block.

So my second thought was to have the dump still call binary_upgrade_set_next_heap_relfilenode() and binary_upgrade_set_next_index_relfilenode(), but then afterwards call TRUNCATE rather than VACUUM FULL. I found that a simple change to RelationSetNewRelfilenumber() did the trick: I could then easily change the heap and index relfilenodes for pg_largeobject to any new values I liked. However, I realized that this approach had a problem: what if the pg_largeobject relation had never been rewritten in the old cluster? Then the heap and index relfilenodes would be unchanged. It's also possible that someone might have run REINDEX in the old cluster, in which case it might happen that the heap relfilenode was unchanged, but the index relfilenode had changed. I spent some time fumbling around with trying to get the non-transactional truncate path to cover these cases, but the fact that we might need to change the relfilenode for the index but not the heap makes this approach seem pretty awful.
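In dump terms, the idea is that the pg_largeobject TOC entry ends up containing something like this (a rough sketch for illustration only; the relfilenode values are placeholders, and the exact commands the attached patches emit may differ):

    -- hypothetical binary-upgrade dump fragment for pg_largeobject; 16402 and
    -- 16403 stand in for the relfilenodes carried over from the old cluster
    SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('16402'::pg_catalog.oid);
    SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('16403'::pg_catalog.oid);
    TRUNCATE pg_catalog.pg_largeobject;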
But it occurred to me that in the case of a pg_upgrade, we don't really need to keep the old storage around until commit time. We can unlink it first, before creating the new storage, and then if the old and new storage happen to be the same, it still works. I can think of two possible objections to this way forward. First, it means that the operation isn't properly transactional. However, if the upgrade fails at this stage, the new cluster is going to have to be nuked and recreated anyway, so the fact that things might be in an unclean state after an ERROR is irrelevant. Second, one might wonder whether such a fix is really sufficient. For example, what happens if the relfilenode allocated to pg_largeobject in the old cluster is assigned to its index in the new cluster, and vice versa?

To make that work, we would need to remove all storage for all relfilenodes first, and then recreate them all afterward. However, I don't think we need to make that work. If an object in the old cluster has a relfilenode < 16384, then that should mean it's never been rewritten, and therefore its relfilenode in the new cluster should be the same. The only way this wouldn't be true is if we shuffled around the initial relfilenode assignments in a new version of PG so that the same values were used but now for different objects, which would be a really dumb idea. And on the other hand, if the object in the old cluster has a relfilenode > 16384, then that relfilenode value should be unused in the new cluster. If not, the user has tinkered with the new cluster more than they ought to have.
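To make that invariant a bit more concrete, here is one way to eyeball it in a given cluster (illustration only; pg_largeobject_loid_pn_index is the index on pg_largeobject, and a relfilenode below 16384 means the relation still has its initial storage assignment):

    -- a relfilenode under 16384 matches the initial assignment, i.e. the
    -- relation has never been rewritten; a larger value means a rewrite (or,
    -- for the index, a REINDEX) has happened at some point
    SELECT relname, oid, relfilenode
    FROM pg_catalog.pg_class
    WHERE relname IN ('pg_largeobject', 'pg_largeobject_loid_pn_index');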
So I tried implementing this, but I didn't get it quite right the first time. It's not enough to call smgrdounlinkall() instead of RelationDropStorage(), because just as RelationDropStorage() does not actually drop the storage but only schedules it to be dropped later, so also smgrdounlinkall() does not in fact unlink all, but only some. It leaves the first segment of the main relation fork around to guard against the hazards discussed in the header comments for mdunlink(). But those hazards don't really matter here either, because, again, any failure will necessarily require that the entire new cluster be destroyed, and also because there shouldn't be any concurrent activity in the new cluster while pg_upgrade is running. So I adjusted smgrdounlinkall() to actually remove everything when IsBinaryUpgrade = true. And then it all seems to work: pg_upgrade of a cluster that has had a rewrite of pg_largeobject works, and pg_upgrade of a cluster that has not had such a rewrite works too. Wa-hoo.

As to whether this is a good fix, I think someone could certainly argue otherwise. This is all a bit grotty. However, I don't find it all that bad. As long as we're moving files between one PG cluster and another using an external tool rather than logic inside the server itself, I think we're bound to have some hacks someplace to make it all work. To me, extending them to a few more places to avoid leaving files behind on disk seems like a good trade-off. Your mileage may vary.

--
Robert Haas
EDB: http://www.enterprisedb.com