Re: making relfilenodes 56 bits - Mailing list pgsql-hackers

From: Amit Kapila
Subject: Re: making relfilenodes 56 bits
Date:
Msg-id: CAA4eK1K1BmKN61MJTodsJprHvvmaN6thPNLHNXY6tEZxZ3KEUQ@mail.gmail.com
In response to: Re: making relfilenodes 56 bits (Dilip Kumar <dilipbalaut@gmail.com>)
Responses: Re: making relfilenodes 56 bits
List: pgsql-hackers
On Tue, Aug 23, 2022 at 11:36 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Aug 23, 2022 at 8:33 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, Aug 23, 2022 at 1:46 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > >
> > > On Mon, Aug 22, 2022 at 3:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > To solve that problem, how about rewriting the system table in the new
> > > > cluster which has a conflicting relfilenode? I think we can probably
> > > > do this conflict checking before processing the tables from the old
> > > > cluster.
> > >
> > > Thanks for chiming in.
> > >
> > > Right now, there are two parts to the relfilenumber preservation
> > > system, and this scheme doesn't quite fit into either of them. First,
> > > the dump includes commands to set pg_class.relfilenode in the new
> > > cluster to the same value that it had in the old cluster. The dump
> > > can't include any SQL commands that depend on what's happening in the
> > > new cluster because pg_dump(all) only connects to a single cluster,
> > > which in this case is the old cluster. Second, pg_upgrade itself
> > > copies the files from the old cluster to the new cluster. This doesn't
> > > involve a database connection at all. So there's no part of the
> > > current relfilenode preservation mechanism that can look at the old
> > > AND the new database and decide on some SQL to execute against the new
> > > database.
> > >
> > > I thought for a while that we could use the information that's already
> > > gathered by get_rel_infos() to do what you're suggesting here, but it
> > > doesn't quite work, because that function excludes system tables, and
> > > we can't afford to do that here. We'd either need to modify that query
> > > to include system tables - at least for the new cluster - or run a
> > > separate one to gather information about system tables in the new
> > > cluster.
> > > Then, we could put all the pg_class.relfilenode values we
> > > found in the new cluster into a hash table, loop over the list of rels
> > > this function found in the old cluster, and for each one, probe into
> > > the hash table. If we find a match, that's a system table that needs
> > > to be moved out of the way before calling create_new_objects(), or
> > > maybe inside that function but before it runs pg_restore.
> > >
> > > That doesn't seem too crazy, I think. It's a little bit of new
> > > mechanism, but it doesn't sound horrific. It's got the advantage of
> > > being significantly cheaper than my proposal of moving everything out
> > > of the way unconditionally, and at the same time it retains one of the
> > > key advantages of that proposal - IMV, anyway - which is that we don't
> > > need separate relfilenode ranges for user and system objects any more.
> > > So I guess on balance I kind of like it, but maybe I'm missing
> > > something.
> >
> > Okay, so this seems exactly the same as your previous proposal, but
> > instead of unconditionally rewriting all the system tables we will
> > rewrite only those that conflict with a user table or pg_largeobject
> > from the previous cluster. Although it might add some implementation
> > complexity on the pg_upgrade side, it seems cheaper than rewriting
> > everything.
>
> OTOH, if we keep two separate ranges for the user and system tables,
> then we don't need all this complex logic of conflict checking. From
> the old cluster, we can just remember the relfilenumber of
> pg_largeobject, and in the new cluster, before trying to restore, we
> can just query the new cluster's pg_class to find out whether that
> relfilenumber is used by any system table, and if so, we can just
> rewrite that system table. Before rewriting that system table, I think
> we need to set NextRelFileNumber to a number greater than the max
> relfilenumber from the old cluster; otherwise, it can later conflict
> with some user table.
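[Editor's note: a minimal sketch of the conflict check described above, in Python for brevity; the real implementation would live in pg_upgrade's C code, and every name below is hypothetical. The idea is to hash every relfilenode found in the new cluster, then probe with each relfilenode that must be preserved from the old cluster; any hit is a system table that has to be moved out of the way before pg_restore runs.]

```python
def find_conflicting_relfilenodes(new_cluster_rels, old_cluster_rels):
    """Return new-cluster relations whose relfilenode collides with a
    relfilenode that must be preserved from the old cluster.

    Each argument is a list of (relname, relfilenode) pairs, as might be
    gathered by a query similar to the one in get_rel_infos()."""
    # Hash all relfilenode values found in the new cluster.
    new_by_filenode = {filenode: name for name, filenode in new_cluster_rels}
    # Probe with each old-cluster relfilenode; a hit is a system table
    # that must be rewritten before create_new_objects() restores the dump.
    return [
        (new_by_filenode[filenode], filenode)
        for _, filenode in old_cluster_rels
        if filenode in new_by_filenode
    ]

# Hypothetical example: the new cluster's pg_largeobject happens to sit on
# relfilenode 16388, which an old-cluster user table also occupies.
new_rels = [("pg_largeobject", 16388), ("pg_class", 1259)]
old_rels = [("public.accounts", 16388), ("public.orders", 16390)]
print(find_conflicting_relfilenodes(new_rels, old_rels))
# → [('pg_largeobject', 16388)]
```

This probe is cheap relative to rewriting every system table unconditionally: only the relations that actually collide need to be moved.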
> And I think using two ranges might not be that complicated, because as
> soon as we are done with initdb we can just set NextRelFileNumber to
> the first user-range relfilenumber, so I think this could be the
> simplest solution. And I think what Amit is suggesting is something
> along this line?
>

Yeah, I had thought of checking only pg_largeobject. I think the
advantage of having separate ranges is that we have somewhat simpler
logic in the upgrade, but OTOH the other scheme has the advantage of
having a single allocation scheme. Do we see any other pros/cons of one
over the other?

One more thing we may want to think about is what happens if there are
tables created by an extension. For example, I think BDR creates some
tables like node_group, conflict_history, etc. Now, if such an extension
is created by default, both the old and new clusters will have those
tables. Isn't there a chance of a relfilenumber conflict in such cases?

--
With Regards,
Amit Kapila.
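[Editor's note: the two-range idea can be sketched the same way; this is illustrative Python, and the boundary constant and all names are invented for the example, not PostgreSQL's. System relations get relfilenumbers below a fixed boundary at initdb time, NextRelFileNumber then jumps to the boundary, and before any system-table rewrite during an upgrade it is advanced past the old cluster's maximum, so a preserved user-range relfilenumber can never collide.]

```python
# Hypothetical boundary between the system and user relfilenumber ranges.
FIRST_USER_RELFILENUMBER = 100_000

class RelFileNumberAllocator:
    """Toy model of the two-range allocation scheme."""

    def __init__(self):
        # Right after initdb, set NextRelFileNumber to the start of the
        # user range; everything below it is reserved for system tables.
        self.next_relfilenumber = FIRST_USER_RELFILENUMBER

    def allocate(self):
        n = self.next_relfilenumber
        self.next_relfilenumber += 1
        return n

    def prepare_for_upgrade(self, old_cluster_max):
        # Before rewriting any system table during pg_upgrade, advance
        # past everything the old cluster ever allocated, so the rewrite
        # cannot grab a number that a restored user table will need.
        self.next_relfilenumber = max(self.next_relfilenumber,
                                      old_cluster_max + 1)

alloc = RelFileNumberAllocator()
assert alloc.allocate() == FIRST_USER_RELFILENUMBER   # first user table
alloc.prepare_for_upgrade(old_cluster_max=250_000)
assert alloc.allocate() == 250_001                    # past the old range
```

Under this scheme the only old-cluster value pg_upgrade needs to remember is pg_largeobject's relfilenumber (plus the old maximum), rather than building a full conflict map.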