Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce) - Mailing list pgsql-hackers

From Shruthi Gowda
Subject Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
Msg-id CAASxf_OjBEVrkWCvwKkoT8nyYQ6Uk1MHDhhVSyZ_EXCQG7J8Dg@mail.gmail.com
In response to Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Tue, Aug 24, 2021 at 2:27 AM Robert Haas <robertmhaas@gmail.com> wrote:
> It's pretty clear from the discussion, I think, that the database OID
> one is going to need rework to be considered.
>
> Regarding the other one:
>
> - The comment in binary_upgrade_set_pg_class_oids() is still not
> accurate. You removed the sentence which says "Indexes cannot have
> toast tables, so we need not make this probe in the index code path"
> but the immediately preceding sentence is still inaccurate in at least
> two ways. First, it only talks about tables, but the code now applies
> to indexes. Second, it only talks about OIDs, but now also deals with
> relfilenodes. It's really important to fully update every comment that
> might be affected by your changes!

The comment is updated.

> - The SQL query in that function isn't completely correct. There is a
> left join from pg_class to pg_index whose ON clause includes
> "c.reltoastrelid = i.indrelid AND i.indisvalid." The reason it's
> like that is because it is possible, in corner cases, for a TOAST
> table to have multiple TOAST indexes. I forget exactly how that
> happens, but I think it might be like if a REINDEX CONCURRENTLY on the
> TOAST table fails midway through, or something of that sort. Now if
> that happens, the LEFT JOIN you added is going to cause the output to
> contain multiple rows, because you didn't replicate the i.indisvalid
> condition into that ON clause. And then it will fail. Apparently we
> don't have a pg_upgrade test case for this scenario; we probably
> should. Actually what I think would be even better than putting
> i.indisvalid into that ON clause would be to join off of i.indrelid
> rather than c.reltoastrelid.

The SQL query will not produce duplicate rows, because the first join
already filters out any duplicates via the 'i.indisvalid' condition in
its ON clause. The result of that join is then left-joined with
pg_class, and pg_class cannot have more than one row for a given OID.
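
For illustration, the join shape under discussion looks roughly like
this (only a sketch; the aliases and the example OID are placeholders,
not the exact query from the patch):

  SELECT c.relkind, c.relfilenode,
         ct.oid AS toast_oid, ct.relfilenode AS toast_relfilenode,
         ci.oid AS toast_index_oid, ci.relfilenode AS toast_index_relfilenode
    FROM pg_catalog.pg_class c
         LEFT JOIN pg_catalog.pg_index i
                ON (c.reltoastrelid = i.indrelid AND i.indisvalid)
         LEFT JOIN pg_catalog.pg_class ct ON (c.reltoastrelid = ct.oid)
         LEFT JOIN pg_catalog.pg_class ci ON (i.indexrelid = ci.oid)
   WHERE c.oid = '16384'::pg_catalog.oid;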

> - The code that decodes the various columns of this query does so in a
> slightly different order than the query itself. It would be better to
> make it match. Perhaps put relkind first in both cases. I might also
> think about trying to make the column naming a bit more consistent,
> e.g. relkind, relfilenode, toast_oid, toast_relfilenode,
> toast_index_oid, toast_index_relfilenode.

Fixed.

> - In heap_create(), the wording of the error messages is not quite
> consistent. You have "relfilenode value not set when in binary upgrade
> mode", "toast relfilenode value not set when in binary upgrade mode",
> and "pg_class index relfilenode value not set when in binary upgrade
> mode". Why does the last one mention pg_class when the other two
> don't?

The error messages have been made consistent. This code chunk has been
moved to a different place as part of another review comment fix.

> - The code in heap_create() now has no comments whatsoever, which is a
> shame, because it's actually kind of a tricky bit of logic. Someone
> might wonder why we override the relfilenode inside that function
> instead of doing it at the same places where we absorb
> binary_upgrade_next_{heap,index,toast}_pg_class_oid and then passing
> down the relfilenode. I think the answer is that passing down the
> relfilenode from the caller would result in storage not actually being
> created, whereas in this case we want it to be created but just with
> the value we specify, and the reason we want that is because we need
> later DDL that happens after these statements but before the old
> cluster's relations are moved over to execute successfully, which it
> won't if the storage is altogether absent.

> However, that raises the question of whether this patch has even got
> the basic design right. Maybe we ought to actually be absorbing the
> relfilenode setting at the same places where we're doing so for the
> OID, and then passing an additional parameter to heap_create() like
> bool suppress_storage or something like that. Maybe, taking it even
> further, we ought to be changing the signatures of
> binary_upgrade_next_heap_pg_class_oid and friends to be two-argument
> functions, and pass down the OID and the relfilenode in the same call,
> rather than calling two separate functions. I'm not so much concerned
> about the cost of calling two functions as the potential for
> confusion. I'm not honestly sure that either of these changes are the
> right thing to do, but I am pretty strongly inclined to do at least
> the first part - trying to absorb reloid and relfilenode in the same
> places. If we're not going to do that we certainly need to explain why
> we're doing it the way we are in the comments.

As per your suggestion, the reloid and the relfilenode are now absorbed
in the same place. An additional parameter, 'suppress_storage', is passed
to heap_create() to indicate whether or not storage should be created
when the caller has passed a valid relfilenode.

I did not make the change to set the OID and the relfilenode in the same
call. I feel the uniformity with the other function signatures in
pg_upgrade_support.c would be lost, because currently each function sets
only one attribute.
Also, renaming the applicable functions to reflect that they set both the
OID and the relfilenode would make the function names even longer. We
could opt not to include the relfilenode in the function name and instead
use a generic name like binary_upgrade_set_next_xxx_pg_class_oid(), but
then we would end up with some functions that set two attributes and
others that set only one.

> It's not really this patch's fault, but it would sure be nice if we
> had some better testing for this area. Suppose this patch somehow
> changed nothing from the present behavior. How would we know? Or
> suppose it managed to somehow set all the relfilenodes in the new
> cluster to random values rather than the intended one? There's no
> automated testing that would catch any of that, and it's not obvious
> how it could be added to test.sh. I suppose what we really need to do
> at some point is rewrite that as a TAP test, but that seems like a
> separate project from this patch.

I have manually verified the relfilenodes of tables, indexes, TOAST
tables, and TOAST indexes, as well as the database OIDs, in the old and
new clusters, and they are preserved as expected.
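
For example, spot checks along these lines can be run in both clusters
and compared (the table name is just a placeholder):

  SELECT relname, oid, relfilenode
    FROM pg_catalog.pg_class
   WHERE relname = 'some_table';

  SELECT datname, oid
    FROM pg_catalog.pg_database;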

I have also attached the patch to preserve the DB OID. As discussed,
template0 will be created with a fixed OID during initdb. I am using
OID 2 for template0. Even though OID 2 is already in use in the pg_am
catalog, I see no harm in using it for the template0 database, because
OIDs do not have to be unique across the whole database - they only
have to be unique within a particular catalog table. Kindly let me know
if I am missing something.
If we did instead decide to pick an unused OID for template0, I see a
challenge in removing that OID from the unused-OID list, and I could
not come up with a feasible way of handling that.
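
For what it's worth, the coexistence of OID 2 in both catalogs is easy
to see; assuming the patch is applied, queries like the following should
show the pg_am row (the 'heap' table access method) and template0 side
by side:

  SELECT oid, amname FROM pg_catalog.pg_am WHERE oid = 2;
  SELECT oid, datname FROM pg_catalog.pg_database WHERE oid = 2;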

Regards,
Shruthi KC
EnterpriseDB: http://www.enterprisedb.com

