Re: pg15b2: large objects lost on upgrade - Mailing list pgsql-hackers

From Robert Haas
Subject Re: pg15b2: large objects lost on upgrade
Date
Msg-id CA+TgmoYwaXh_wRRa2CqL4XpM4r6YEbq1+ec=+8b7Dr7-T_tT+Q@mail.gmail.com
Whole thread Raw
In response to Re: pg15b2: large objects lost on upgrade  (Justin Pryzby <pryzby@telsasoft.com>)
Responses Re: pg15b2: large objects lost on upgrade
List pgsql-hackers
+               /* Keep track of whether a filenode matches the OID */
+               if (maps[mapnum].relfilenumber == LargeObjectRelationId)
+                       *has_lotable = true;
+               if (maps[mapnum].relfilenumber == LargeObjectLOidPNIndexId)
+                       *has_loindex = true;
On Thu, Jul 7, 2022 at 2:44 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> We're referring to the new cluster which should have been initdb'd more or less
> immediately before running pg_upgrade [0].
>
> It'd be weird to me if someone were to initdb a new cluster, then create some
> large objects, and then maybe delete them, and then run pg_upgrade.  Why
> wouldn't they do their work on the old cluster before upgrading, or on the new
> cluster afterwards ?
>
> Does anybody actually do anything significant between initdb and pg_upgrade ?
> Is that considered to be supported ?  It seems like pg_upgrade could itself run
> initdb, with the correct options for locale, checksum, block size, etc
> (although it probably has to support the existing way to handle "compatible
> encodings").
>
> Actually, I think check_new_cluster_is_empty() ought to prohibit doing work
> between initdb and pg_upgrade by checking that no objects have *ever* been
> created in the new cluster, by checking that NextOid == 16384.  But I have a
> separate thread about "pg-upgrade allows itself to be re-run", and this has
> more to do with that than about whether to check that the file is empty before
> removing it.

I think you're getting at a really good point here which is also my
point: we assume that nothing significant has happened between when
the cluster was created and when pg_upgrade is run, but we don't check
it. Either we shouldn't assume it, or we should check it.

So, is such activity ever legitimate? I think there are people doing
it. The motivation is that maybe you have a dump from the old database
that doesn't quite restore on the new version, but by doing something
to the new cluster, you can make it restore. For instance, maybe there
are some functions that used to be part of core and are now only
available in an extension. That's going to make pg_upgrade's
dump-and-restore workflow fail, but if you install that extension onto
the new cluster, perhaps you can work around the problem. It doesn't
have to be an extension, even. Maybe some function in core just got an
extra argument, and you're using it, so the calls to that function
cause dump-and-restore to fail. You might try overloading it in the
new database with the old calling sequence to get things working.

Now, are these kinds of things considered to be supported? Well, I
don't know that we've made any statement about that one way or the
other. Perhaps they are not. But I can see why people want to use
workarounds like this. The alternative is having to dump-and-restore
instead of an in-place upgrade, and that's painfully slow by
comparison. pg_upgrade itself doesn't give you any tools to deal with
this kind of situation, but the process is just loose enough to allow
people to insert their own workarounds, so they do. I'm sure I'd do
the same, in their shoes.

My view on this is that, while we probably don't want to make such
things officially supported, I don't think we should ban it outright,
either. We probably can't support an upgrade after the next cluster
has been subjected to arbitrary amounts of tinkering, but we're making
a mistake if we write code that has fragile assumptions for no really
good reason. I think we can do better than this excerpt from your
patch, for example:

+               /* Keep track of whether a filenode matches the OID */
+               if (maps[mapnum].relfilenumber == LargeObjectRelationId)
+                       *has_lotable = true;
+               if (maps[mapnum].relfilenumber == LargeObjectLOidPNIndexId)
+                       *has_loindex = true;

I spent a while struggling to understand this because it seems to me
that every database has an LO table and an LO index, so what's up with
these names? I think what these names are really tracking is whether
the relfilenumber of pg_largeobject and its index in the old database
had their default values. But this makes the assumption that the LO
table and LO index in the new database have never been subjected to
VACUUM FULL or CLUSTER and, while there's no real reason to do that, I
can't quite see what the point of such an unnecessary and fragile
assumption might be. If Dilip's patch to make relfilenodes 56 bits
gets committed, this is going to break anyway, because with that
patch, the relfilenode for a table or index doesn't start out being
equal to its OID.

Perhaps a better solution to this particular problem is to remove the
backing files for the large object table and index *before* restoring
the dump, deciding what files to remove by asking the running server
for the file path. It might seem funny to allow for dangling pg_class
entries, but we're going to create that situation for all other user
rels anyway, and pg_upgrade treats pg_largeobject as a user rel.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: pg15b2: large objects lost on upgrade
Next
From: Joe Conway
Date:
Subject: Re: pg_parameter_aclcheck() and trusted extensions