Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce) - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce) |
Date | |
Msg-id | CA+TgmoYHCzQChf2K4BfGcczXA6DNy=fwih+NcaidsWdDUkvN_A@mail.gmail.com Whole thread Raw |
In response to | Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce) (Shruthi Gowda <gowdashru@gmail.com>) |
Responses |
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce) Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce) |
List | pgsql-hackers |
On Fri, Aug 20, 2021 at 1:36 PM Shruthi Gowda <gowdashru@gmail.com> wrote: > Thanks Robert for your comments. > I have split the patch into two portions. One that handles DB OID and > the other that > handles tablespace OID and relfilenode OID. It's pretty clear from the discussion, I think, that the database OID one is going to need rework to be considered. Regarding the other one: - The comment in binary_upgrade_set_pg_class_oids() is still not accurate. You removed the sentence which says "Indexes cannot have toast tables, so we need not make this probe in the index code path" but the immediately preceding sentence is still inaccurate in at least two ways. First, it only talks about tables, but the code now applies to indexes. Second, it only talks about OIDs, but now also deals with refilenodes. It's really important to fully update every comment that might be affected by your changes! - The SQL query in that function isn't completely correct. There is a left join from pg_class to pg_index whose ON clause includes "c.reltoastrelid = i.indrelid AND i.indisvalid." The reason it's likely that is because it is possible, in corner cases, for a TOAST table to have multiple TOAST indexes. I forget exactly how that happens, but I think it might be like if a REINDEX CONCURRENTLY on the TOAST table fails midway through, or something of that sort. Now if that happens, the LEFT JOIN you added is going to cause the output to contain multiple rows, because you didn't replicate the i.indisvalid condition into that ON clause. And then it will fail. Apparently we don't have a pg_upgrade test case for this scenario; we probably should. Actually what I think would be even better than putting i.indisvalid into that ON clause would be to join off of i.indrelid rather than c.reltoastrelid. - The code that decodes the various columns of this query does so in a slightly different order than the query itself. It would be better to make it match. Perhaps put relkind first in both cases. I might also think about trying to make the column naming a bit more consistent, e.g. relkind, relfilenode, toast_oid, toast_relfilenode, toast_index_oid, toast_index_relfilenode. - In heap_create(), the wording of the error messages is not quite consistent. You have "relfilenode value not set when in binary upgrade mode", "toast relfilenode value not set when in binary upgrade mode", and "pg_class index relfilenode value not set when in binary upgrade mode". Why does the last one mention pg_class when the other two don't? - The code in heap_create() now has no comments whatsoever, which is a shame, because it's actually kind of a tricky bit of logic. Someone might wonder why we override the relfilenode inside that function instead of doing it at the same places where we absorb binary_upgrade_next_{heap,index,toast}_pg_class_oid and the passing down the relfilenode. I think the answer is that passing down the relfilenode from the caller would result in storage not actually being created, whereas in this case we want it to be created but just with the value we specify, and the reason we want that is because we need later DDL that happens after these statements but before the old cluster's relations are moved over to execute successfully, which it won't if the storage is altogether absent. However, that raises the question of whether this patch has even got the basic design right. Maybe we ought to actually be absorbing the relfilenode setting at the same places where we're doing so for the OID, and then passing an additional parameter to heap_create() like bool suppress_storage or something like that. Maybe, taking it even further, we ought to be changing the signatures of binary_upgrade_next_heap_pg_class_oid and friends to be two-argument functions, and pass down the OID and the relfilenode in the same call, rather than calling two separate functions. I'm not so much concerned about the cost of calling two functions as the potential for confusion. I'm not honestly sure that either of these changes are the right thing to do, but I am pretty strongly inclined to do at least the first part - trying to absorb reloid and relfilenode in the same places. If we're not going to do that we certainly need to explain why we're doing it the way we are in the comments. It's not really this patch's fault, but it would sure be nice if we had some better testing for this area. Suppose this patch somehow changed nothing from the present behavior. How would we know? Or suppose it managed to somehow set all the relfilenodes in the new cluster to random values rather than the intended one? There's no automated testing that would catch any of that, and it's not obvious how it could be added to test.sh. I suppose what we really need to do at some point is rewrite that as a TAP test, but that seems like a separate project from this patch. -- Robert Haas EDB: http://www.enterprisedb.com
pgsql-hackers by date: