Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce) - Mailing list pgsql-hackers

On Fri, Aug 20, 2021 at 1:36 PM Shruthi Gowda <gowdashru@gmail.com> wrote:
> Thanks Robert for your comments.
> I have split the patch into two portions. One that handles DB OID and
> the other that
> handles tablespace OID and relfilenode OID.

It's pretty clear from the discussion, I think, that the database OID
one is going to need rework to be considered.

Regarding the other one:

- The comment in binary_upgrade_set_pg_class_oids() is still not
accurate. You removed the sentence which says "Indexes cannot have
toast tables, so we need not make this probe in the index code path"
but the immediately preceding sentence is still inaccurate in at least
two ways. First, it only talks about tables, but the code now applies
to indexes. Second, it only talks about OIDs, but now also deals with
refilenodes. It's really important to fully update every comment that
might be affected by your changes!

- The SQL query in that function isn't completely correct. There is a
left join from pg_class to pg_index whose ON clause includes
"c.reltoastrelid = i.indrelid AND i.indisvalid." The reason it's
likely that is because it is possible, in corner cases, for a TOAST
table to have multiple TOAST indexes. I forget exactly how that
happens, but I think it might be like if a REINDEX CONCURRENTLY on the
TOAST table fails midway through, or something of that sort. Now if
that happens, the LEFT JOIN you added is going to cause the output to
contain multiple rows, because you didn't replicate the i.indisvalid
condition into that ON clause. And then it will fail. Apparently we
don't have a pg_upgrade test case for this scenario; we probably
should. Actually what I think would be even better than putting
i.indisvalid into that ON clause would be to join off of i.indrelid
rather than c.reltoastrelid.

- The code that decodes the various columns of this query does so in a
slightly different order than the query itself. It would be better to
make it match. Perhaps put relkind first in both cases. I might also
think about trying to make the column naming a bit more consistent,
e.g. relkind, relfilenode, toast_oid, toast_relfilenode,
toast_index_oid, toast_index_relfilenode.

- In heap_create(), the wording of the error messages is not quite
consistent. You have "relfilenode value not set when in binary upgrade
mode", "toast relfilenode value not set when in binary upgrade mode",
and "pg_class index relfilenode value not set when in binary upgrade
mode". Why does the last one mention pg_class when the other two
don't?

- The code in heap_create() now has no comments whatsoever, which is a
shame, because it's actually kind of a tricky bit of logic. Someone
might wonder why we override the relfilenode inside that function
instead of doing it at the same places where we absorb
binary_upgrade_next_{heap,index,toast}_pg_class_oid and the passing
down the relfilenode. I think the answer is that passing down the
relfilenode from the caller would result in storage not actually being
created, whereas in this case we want it to be created but just with
the value we specify, and the reason we want that is because we need
later DDL that happens after these statements but before the old
cluster's relations are moved over to execute successfully, which it
won't if the storage is altogether absent.

However, that raises the question of whether this patch has even got
the basic design right. Maybe we ought to actually be absorbing the
relfilenode setting at the same places where we're doing so for the
OID, and then passing an additional parameter to heap_create() like
bool suppress_storage or something like that. Maybe, taking it even
further, we ought to be changing the signatures of
binary_upgrade_next_heap_pg_class_oid and friends to be two-argument
functions, and pass down the OID and the relfilenode in the same call,
rather than calling two separate functions. I'm not so much concerned
about the cost of calling two functions as the potential for
confusion. I'm not honestly sure that either of these changes are the
right thing to do, but I am pretty strongly inclined to do at least
the first part - trying to absorb reloid and relfilenode in the same
places. If we're not going to do that we certainly need to explain why
we're doing it the way we are in the comments.

It's not really this patch's fault, but it would sure be nice if we
had some better testing for this area. Suppose this patch somehow
changed nothing from the present behavior. How would we know? Or
suppose it managed to somehow set all the relfilenodes in the new
cluster to random values rather than the intended one? There's no
automated testing that would catch any of that, and it's not obvious
how it could be added to test.sh. I suppose what we really need to do
at some point is rewrite that as a TAP test, but that seems like a
separate project from this patch.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Stephen Frost
Date:
Subject: Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)
Next
From: Chapman Flack
Date:
Subject: Re: Mark all GUC variable as PGDLLIMPORT