Re: pg_upgrade and logical replication - Mailing list pgsql-hackers
From | vignesh C |
---|---|
Subject | Re: pg_upgrade and logical replication |
Date | |
Msg-id | CALDaNm0jWM1N6U_=P75kxgWWVaLak8oxOu1XDQQhbe=1h2dy7w@mail.gmail.com Whole thread Raw |
In response to | Re: pg_upgrade and logical replication (vignesh C <vignesh21@gmail.com>) |
Responses |
Re: pg_upgrade and logical replication
|
List | pgsql-hackers |
On Fri, 27 Oct 2023 at 12:09, vignesh C <vignesh21@gmail.com> wrote: > > On Thu, 21 Sept 2023 at 11:27, Michael Paquier <michael@paquier.xyz> wrote: > > > > On Fri, Sep 15, 2023 at 03:08:21PM +0530, vignesh C wrote: > > > On Tue, 12 Sept 2023 at 14:25, Hayato Kuroda (Fujitsu) > > > <kuroda.hayato@fujitsu.com> wrote: > > >> Is there a possibility that apply worker on old cluster connects to the > > >> publisher during the upgrade? Regarding the pg_upgrade on publisher, the we > > >> refuse TCP/IP connections from remotes and port number is also changed, so we can > > >> assume that subscriber does not connect to. But IIUC such settings may not affect > > >> to the connection source, so that the apply worker may try to connect to the > > >> publisher. Also, is there any hazards if it happens? > > > > > > Yes, there is a possibility that the apply worker gets started and new > > > transaction data is being synced from the publisher. I have made a fix > > > not to start the launcher process in binary ugprade mode as we don't > > > want the launcher to start apply worker during upgrade. > > > > Hmm. I was wondering if 0001 is the right way to handle this case, > > but at the end I'm OK to paint one extra isBinaryUpgrade in the code > > path where apply launchers are registered. I don't think that the > > patch is complete, though. A comment should be added in pg_upgrade's > > server.c, exactly start_postmaster(), to tell that -b also stops apply > > workers. I am attaching a version updated as of the attached, that > > I'd be OK to apply. > > I have added comments > > > I don't really think that we need to worry about a subscriber > > connecting back to a publisher in this case, though? I mean, each > > postmaster instance started by pg_upgrade restricts the access to the > > instance with unix_socket_directories set to a custom path and > > permissions at 0700, and a subscription's connection string does not > > know the unix path used by pg_upgrade. I certainly agree that > > stopping these processes could lead to inconsistencies in the data the > > subscribers have been holding though, if we are not careful, so > > preventing them from running is a good practice anyway. > > I have made the fix similar to how upgrade publisher has done to keep > it consistent. > > > I have also reviewed 0002. As a whole, I think that I'm OK with the > > main approach of the patch in pg_dump to use a new type of dumpable > > object for subscription relations that are dumped with their upgrade > > functions after. This still needs more work, and more documentation. > > Added documentation > > > Also, perhaps we should really have an option to control if this part > > of the copy happens or not. With a --no-subscription-relations for > > pg_dump at least? > > Currently this is done by default in binary upgrade mode, I will add a > separate patch to skip dump of subscription relations from upgrade and > dump a little later. > > > > > +{ oid => '4551', descr => 'add a relation with the specified relation state to pg_subscription_rel table', > > > > During a development cycle, any new function added needs to use an OID > > in range 8000-9999. Running unused_oids will suggest new random OIDs. > > Modified > > > FWIW, I am not convinced that there is a need for two functions to add > > an entry to pg_subscription_rel, with sole difference between both the > > handling of a valid or invalid LSN. We should have only one function > > that's able to handle NULL for the LSN. So let's remove rel_state_a > > and rel_state_b, and have a single rel_state(). The description of > > the SQL functions is inconsistent with the other binary upgrade ones, > > I would suggest for the two functions > > "for use by pg_upgrade (relation for pg_subscription_rel)" > > "for use by pg_upgrade (remote_lsn for origin)" > > Removed rel_state_a and rel_state_b and updated the description accordingly > > > + i_srsublsn = PQfnumber(res, "srsublsn"); > > [...] > > + subrinfo[cur_rel].srsublsn = pg_strdup(PQgetvalue(res, i, i_srsublsn)); > > > > In getSubscriptionTables(), this should check for PQgetisnull() > > because we would have a NULL value for InvalidXLogRecPtr in the > > catalog. Using a char* for srsublsn is OK, but just assign NULL to > > it, then just pass a hardcoded NULL value to the function as we do in > > other places. So I don't quite get why this is not the same handling > > as suboriginremotelsn. > > Modified > > > > > getSubscriptionTables() is entirely skipped if we don't want any > > subscriptions, if we deal with a server of 9.6 or older or if we don't > > do binary upgrades, which is OK. > > > > +/* > > + * getSubscriptionTables > > + * get information about subscription membership for dumpable tables. > > + */ > > This commit is slightly misleading and should mention that this is an > > upgrade-only path? > > Modified > > > > > The code for dumpSubscriptionTable() is a copy-paste of > > dumpPublicationTable(), but a lot of what you are doing here is > > actually pointless if we are not in binary mode? Why should this code > > path not taken only under dataOnly? I mean, this is a code path we > > should never take except if we are in binary mode. This should have > > at least a cross-check to make sure that we never have a > > DO_SUBSCRIPTION_REL in this code path if we are in non-binary mode. > > I have added an assert in this case, as it is not expected to come > here in non binary mode > > > + if (dopt->binary_upgrade && subinfo->suboriginremotelsn) > > + { > > + appendPQExpBufferStr(query, > > + "SELECT pg_catalog.binary_upgrade_replorigin_advance("); > > + appendStringLiteralAH(query, subinfo->dobj.name, fout); > > + appendPQExpBuffer(query, ", '%s');\n", subinfo->suboriginremotelsn); > > + } > > > > Hmm.. Could it be actually useful even for debugging to still have > > this query if suboriginremotelsn is an InvalidXLogRecPtr? I think > > that this should have a comment of the kind "\n-- For binary upgrade, > > blah". At least it would not be a bad thing to enforce a correct > > state from the start, removing the NULL check for the second argument > > in binary_upgrade_replorigin_advance(). > > Modified > > > + /* We need to check for pg_replication_origin_status only once. */ > > Perhaps it would be better to explain why? > > This remote_lsn code change is actually not required, I have removed this now. > > > > > + "WHERE coalesce(remote_lsn, '0/0') = '0/0'" > > Why a COALESCE here? Cannot this stuff just use NULL? > > This remote_lsn code change is actually not required, I have removed this now. > > > + fprintf(script, "database:%s subscription:%s relation:%s in non-ready state\n", > > Could it be possible to include the schema of the relation in this log? > > Modified > > > +static void check_for_subscription_state(ClusterInfo *cluster); > > I'd be tempted to move that into a patch on its own, actually, for a > > cleaner history. > > As of now I have kept it together, I will change it later based on > more feedback from others > > > +# Copyright (c) 2022-2023, PostgreSQL Global Development Group > > New as of 2023. > > Modified > > > +# Check that after upgradation of the subscriber server, the incremental > > +# changes added to the publisher are replicated. > > [..] > > + For upgradation of the subscriptions, all the subscriptions on the old > > + cluster must have a valid <varname>remote_lsn</varname>, and all the > > > > Upgradation? I think that this should be reworded: > > "All the subscriptions of an old cluster require a valid remote_lsn > > during an upgrade." > > This remote_lsn code change is actually not required, I have removed this now. > > > > > A CI run is reporting the following compilation warnings: > > [04:21:15.290] pg_dump.c: In function ‘getSubscriptionTables’: > > [04:21:15.290] pg_dump.c:4655:29: error: ‘subinfo’ may be used > > uninitialized in this function [-Werror=maybe-uninitialized] > > [04:21:15.290] 4655 | subrinfo[cur_rel].subinfo = subinfo; > > I have initialized and checked with [-Werror=maybe-uninitialized], > let me check in the next cfbot run > > > > +ok(-d $new_sub->data_dir . "/pg_upgrade_output.d", > > + "pg_upgrade_output.d/ not removed after pg_upgrade failure"); > > Not sure that there's a need for this check. Okay, that's cheap. > > Modified > > > And, err. We are going to need an option to control if the slot data > > is copied, and a bit more documentation in pg_upgrade to explain how > > things happen when the copy happens. > Added documentation for this, we will copy the slot data by default, > we will add a separate patch to skip dump of subscription > relations/replication slot from upgrade and dump a little later. > > The attached v9 version patch has the changes for the same. > > Apart from this I'm still checking that the old cluster's subscription > relations states are READY state still, but there is a possibility > that SYNCDONE or FINISHEDCOPY could work, this needs more thought > before concluding which is the correct state to check. Let' handle > this in the upcoming version. The patch was not applying because of recent commits. Here is a rebased version of the patches. Regards, Vignesh
Attachment
pgsql-hackers by date: