Thread: allow online change primary_conninfo
Hello all Now we integrate recovery.conf into GUC system. So i would like to start discussion to make primary_conninfo and primary_slot_namePGC_SIGHUP. My work-in-progress patch attached. I think we have no futher reason to have a copy of conninfo and slot name in WalRcvData, so i propose remove these fieldsfrom pg_stat_get_wal_receiver() (and pg_stat_wal_receiver view). This data can be accesible now via regular GUC commands. Thank you for advance! regards, Sergei
Attachment
Hi, On 2018-11-26 00:25:43 +0300, Sergei Kornilov wrote: > Now we integrate recovery.conf into GUC system. So i would like to start discussion to make primary_conninfo and primary_slot_namePGC_SIGHUP. > My work-in-progress patch attached. Cool! > I think we have no futher reason to have a copy of conninfo and slot name in WalRcvData, so i propose remove these fieldsfrom pg_stat_get_wal_receiver() (and pg_stat_wal_receiver view). This data can be accesible now via regular GUC commands. Is that wise? It seems possible that wal receivers run for a while with the old connection information, and without this information that seems hard to debug. > diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml > index db1a2d4..faa8e17 100644 > --- a/doc/src/sgml/config.sgml > +++ b/doc/src/sgml/config.sgml > @@ -3797,7 +3797,6 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class=" > <varname>primary_conninfo</varname> string. > </para> > <para> > - This parameter can only be set at server start. > This setting has no effect if the server is not in standby mode. > </para> > </listitem> Should probably note something like "This parameter can only be set in the <filename>postgresql.conf</filename> file or on the server command line." > @@ -435,9 +420,33 @@ WalReceiverMain(void) > > if (got_SIGHUP) > { > + char *conninfo = pstrdup(PrimaryConnInfo); > + char *slotname = pstrdup(PrimarySlotName); > + > got_SIGHUP = false; > ProcessConfigFile(PGC_SIGHUP); > XLogWalRcvSendHSFeedback(true); > + > + /* > + * If primary_conninfo has been changed while walreceiver is running, > + * shut down walreceiver so that a new walreceiver is started and > + * initiates replication with the new connection information. > + */ > + if (strcmp(conninfo, PrimaryConnInfo) != 0) > + ereport(FATAL, > + (errcode(ERRCODE_ADMIN_SHUTDOWN), > + errmsg("closing replication connection because primary_conninfo was changed"))); > + > + /* > + * And the same for primary_slot_name. > + */ > + if (strcmp(slotname, PrimarySlotName) != 0) > + ereport(FATAL, > + (errcode(ERRCODE_ADMIN_SHUTDOWN), > + errmsg("closing replication connection because primary_slot_name was changed"))); > + > + pfree(conninfo); > + pfree(slotname); I'd strongly advocate moving this to a separate function. Greetings, Andres Freund
On Sun, Nov 25, 2018 at 01:43:13PM -0800, Andres Freund wrote: > On 2018-11-26 00:25:43 +0300, Sergei Kornilov wrote: >> I think we have no futher reason to have a copy of conninfo and slot >> name in WalRcvData, so i propose remove these fields from >> pg_stat_get_wal_receiver() (and pg_stat_wal_receiver view). This data >> can be accesible now via regular GUC commands. > > Is that wise? It seems possible that wal receivers run for a while with > the old connection information, and without this information that seems > hard to debug. No, that's unwise. One of the reasons why conninfo is around is that we know to which server a receiver is connected when specifying multiple host and/or port targets in the configuration. Please see 9a895462 for the details. Removing the slot does not look like an improvement to me either, following Andres' argument. -- Michael
Attachment
Hi >> I think we have no futher reason to have a copy of conninfo and slot name in WalRcvData, so i propose remove these fieldsfrom pg_stat_get_wal_receiver() (and pg_stat_wal_receiver view). This data can be accesible now via regular GUC commands. > > Is that wise? It seems possible that wal receivers run for a while with > the old connection information, and without this information that seems > hard to debug. Hmm... I considered SIGHUP processing was in fast loop and therefore shutdown should be fast. But i recheck code and founda possible long loop without processing SIGHUP (in case we receive new data faster than writes to disk). Ok, i willrevert back. How about write to WalRcvData only clobbered conninfo? > Should probably note something like > > "This parameter can only be set in the <filename>postgresql.conf</filename> > file or on the server command line." Seems other PGC_SIGHUP settings have such note, fixed, thank you. > I'd strongly advocate moving this to a separate function. Done regards, Sergei
Attachment
Hi, On 2018-11-26 12:30:06 +0300, Sergei Kornilov wrote: > Hmm... I considered SIGHUP processing was in fast loop and therefore shutdown should be fast. But i recheck code and founda possible long loop without processing SIGHUP (in case we receive new data faster than writes to disk). Ok, i willrevert back. > How about write to WalRcvData only clobbered conninfo? I'm not sure I understand what you mean? The way I'd solve this is that that only walreceiver, at startup, writes out its conninfo/slot_name, sourcing the values from the GUCs. That way there's no issue with values being overwritten early. - Andres
Hi >> Hmm... I considered SIGHUP processing was in fast loop and therefore shutdown should be fast. But i recheck code andfound a possible long loop without processing SIGHUP (in case we receive new data faster than writes to disk). Ok, i willrevert back. >> How about write to WalRcvData only clobbered conninfo? > > I'm not sure I understand what you mean? I am about my initial proposal with remove conninfo wrom WalRcvData - walreceiver may run some time with old conninfo and > without this information that seems hard to debug. Earlier i thought walreceiver will shutdown fast on SIGHUP. > The way I'd solve this is that > that only walreceiver, at startup, writes out its conninfo/slot_name, > sourcing the values from the GUCs. That way there's no issue with values > being overwritten early. In second patch i follow exactly this logic. regards, Sergei
On Mon, Nov 26, 2018 at 09:46:36AM -0800, Andres Freund wrote: > I'm not sure I understand what you mean? The way I'd solve this is that > that only walreceiver, at startup, writes out its conninfo/slot_name, > sourcing the values from the GUCs. That way there's no issue with values > being overwritten early. WalRcv->conninfo, as stored in the WAL receiver, clobbers some of its fields and includes a set of default parameters with its values after establishing the connection so as no sensible data shows up in pg_stat_wal_receiver so you cannot simply compare what is stored in the WAL receiver with the GUCs to do the decision-making. For the password, you'd want to do again a connection anyway even if the cloberred string is the same. What's proposed in the patch to issue an ERROR if the parameter has changed does not look that bad to me as it depends on the process context, but I think that this patch should document clearly that if primary_conninfo or primary_slot_name are changed then the WAL receiver is stopped immediately. /* - * replication slot name; is also used for walreceiver to connect with the - * primary + * replication slot name used by runned walreceiver */ Not sure that there is much point in updating those comments, because it still has the same meaning in the new context. -RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo, - const char *slotname) +RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr) That's perhaps a matter of taste, but keeping the current API as-is looks cleaner to me, particularly because those fields get copied into WalRcv, so I'd rather not change what is done in WaitForWALToBecomeAvailable and keep the patch at its simplest shape. +$node_standby_2->reload; # should have effect without restart This does not wait for the change to be effective, so I think that you introduce a race condition for slow machines with this test, making it unstable. -- Michael
Attachment
Hello thank you for review! > but I think that this patch should document clearly that if > primary_conninfo or primary_slot_name are changed then the WAL receiver > is stopped immediately. Good idea, will change. > /* > - * replication slot name; is also used for walreceiver to connect with the > - * primary > + * replication slot name used by runned walreceiver > */ > Not sure that there is much point in updating those comments, because > it still has the same meaning in the new context. "is also used" seems outdated for me. With proposed patch this field means only currently active slot, not "also". Well, depends on RequestXLogStreaming discussion > -RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo, > - const char *slotname) > +RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr) > That's perhaps a matter of taste, but keeping the current API as-is > looks cleaner to me, particularly because those fields get copied into > WalRcv, so I'd rather not change what is done in > WaitForWALToBecomeAvailable and keep the patch at its simplest shape. Unfortunally here is race condition if i leave RequestXLogStreaming infrastructure as-is and walreceiver will use this structfor startup. This case: - startup process calls RequestXLogStreaming and fill WalRcv conninfo/slotname - reload primary_conninnfo by SIGHUP - walreceiver start with new GUC but trying conninfo/slotname from WalRcv struct. If walreceiver is able to connect - itwill not restart till another error or primary_conninfo/slotname will be changed again (SIGHUP without change is not enough). I already had such failures while testing this patch. If walreceiver will use primary_conninfo GUC at startup - i see no reason to leave *conninfo in RequestXLogStreaming. Theseparameters will be misleading, we request them, but not using for anything. > +$node_standby_2->reload; # should have effect without restart > This does not wait for the change to be effective, so I think that you > introduce a race condition for slow machines with this test, making it > unstable. No, here is no race condition because just after this line we wait this replication slot in upstream pg_catalog.pg_replication_slots(get_slot_xmins routine). Before reload we have no replication slot and should reconnect here with replication slot. I can add some comment here. regards, Sergei
Hello After some thinking i can rewrite this patch in another way. This is better or worse? regards, Sergei
oops, forgot attach patch 11.12.2018, 13:14, "Sergei Kornilov" <sk@zsrv.org>: > Hello > > After some thinking i can rewrite this patch in another way. > > This is better or worse? > > regards, Sergei
Attachment
On Tue, Dec 11, 2018 at 11:59:08AM +0300, Sergei Kornilov wrote: >> but I think that this patch should document clearly that if >> primary_conninfo or primary_slot_name are changed then the WAL receiver >> is stopped immediately. > > Good idea, will change. + <para> + <application>walreceiver</application> will be restarted after + <varname>primary_conninfo</varname> was changed. + </para> Hm. It may be cleaner to use "WAL receiver" here for clarity. Perhaps that's a matter of state but that seems cleaner when referring to the actual process. The docs share both grammar, depending on the context. > If walreceiver will use primary_conninfo GUC at startup - i see no > reason to leave *conninfo in RequestXLogStreaming. These parameters > will be misleading, we request them, but not using for anything. Oh, indeed. My apologies for being confused here, I can see now your point. It seems to me that this is an extra cleanup caused by the recovery parameters switched to be GUCs, and not something that we should be changed as an effect to SIGHUP for primary_conninfo and primary_slot_name. So let's do this cleanup first. For this purpose, I am going to post a new thread with a proper patch, with in CC the folks who moved the recovery parameters to be GUCs. >> +$node_standby_2->reload; # should have effect without restart >> This does not wait for the change to be effective, so I think that you >> introduce a race condition for slow machines with this test, making it >> unstable. > > No, here is no race condition because just after this line we wait > this replication slot in upstream pg_catalog.pg_replication_slots > (get_slot_xmins routine). > Before reload we have no replication slot and should reconnect here > with replication slot. I can add some comment here. Good point, I have forgotten the call to get_slot_xmins() below. -- Michael
Attachment
At Tue, 11 Dec 2018 13:16:23 +0300, Sergei Kornilov <sk@zsrv.org> wrote in <9653601544523383@iva8-37fc2ad204cd.qloud-c.yandex.net> sk> oops, forgot attach patch sk> sk> 11.12.2018, 13:14, "Sergei Kornilov" <sk@zsrv.org>: sk> > Hello sk> > sk> > After some thinking i can rewrite this patch in another way. sk> > sk> > This is better or worse? As the whole the new version looks better for me. === Do we no longer need static version of conninfo/slotname in walreceiver.c? We can detect a change of the variables without them (as you did in the previous version.). === I don't think it is acceptable that raw conninfo is exposed into log file. > LOG: parameter "primary_conninfo" changed to "host=/tmp port=5432 password=hoge" === > errmsg("closing replication connection because primary_conninfo was changed"))); Maybe it is better being as follows according to similar messages: > errmsg("terminating walreceiver process due to change of primary_conninfo"))); And it *might* be good to have detail. > errdetail("In a moment starts streaming WAL with new configuration."); regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Hello > Do we no longer need static version of conninfo/slotname in > walreceiver.c? We can detect a change of the variables without > them (as you did in the previous version.). Depends on this discussion: https://www.postgresql.org/message-id/20181212053042.GK17695@paquier.xyz If walreceiver will use GUC conninfo for startup - then yes, we can remove such static variables. If walreceiver will still use conninfo from WalRcv - we have race condition between RequestXLogStreaming from startup, configreload, and walreceiver start. In this case i need save conninfo from WalRcv and compare new GUC value with them. > I don't think it is acceptable that raw conninfo is exposed into > log file. > >> LOG: parameter "primary_conninfo" changed to "host=/tmp port=5432 password=hoge" Hmm. We have infrastructure to hide such values? I need implement something like flag GUC_HIDE_VALUE_FROM_LOG near GUC_SUPERUSER_ONLY in separate thread? Log message willbe changed to "LOG: parameter "primary_conninfo" changed" with this flag. I also think that this is not a big problem. We may already have secret data in the logs. For example, in query text fromapplication. >> errmsg("closing replication connection because primary_conninfo was changed"))); > > Maybe it is better being as follows according to similar messages: > >> errmsg("terminating walreceiver process due to change of primary_conninfo"))); > > And it *might* be good to have detail. > >> errdetail("In a moment starts streaming WAL with new configuration."); Agreed, will change. regards, Sergei
On Thu, Dec 13, 2018 at 01:51:48PM +0300, Sergei Kornilov wrote: > Depends on this discussion: https://www.postgresql.org/message-id/20181212053042.GK17695@paquier.xyz > If walreceiver will use GUC conninfo for startup - then yes, we can > remove such static variables. If walreceiver will still use conninfo > from WalRcv - we have race condition between RequestXLogStreaming from > startup, config reload, and walreceiver start. In this case I need > save conninfo from WalRcv and compare new GUC value with them. I would recommend waiting for the conclusion of other thread before moving on with this one. There are arguments for letting the startup process deciding which connection string the WAL receiver should use as well as letting the WAL receiver use directly the connection string from the GUC. > Hmm. We have infrastructure to hide such values? > I need implement something like flag GUC_HIDE_VALUE_FROM_LOG near > GUC_SUPERUSER_ONLY in separate thread? Log message will be changed to > "LOG: parameter "primary_conninfo" changed" with this flag. Good point. Passwords in logs can be considered as a security issue. Having such an option may be interesting for other things, though I am not sure if just having an option to hide logs related to a given parameter are the correct way to shape it. We could also consider using the show hook function of a parameter to print its value in such logs. -- Michael
Attachment
Hi > I would recommend waiting for the conclusion of other thread before > moving on with this one. Agreed. I mark this patch as Waiting on Author. Not quite correct for waiting another discussion, but most suitable. > We could also consider using > the show hook function of a parameter to print its value in such logs. But show hook also affects "show primary_conninfo;" command and pg_settings.value I already marked primary_conninfo as GUC_SUPERUSER_ONLY. I doubt if we need hide some connection info even from superuser. Also this hook would be a bit complicated, i found suitable code only in libpqrcv_get_conninfo after establishing connect regards, Sergei
Hi, On 2018-12-14 16:55:42 +0900, Michael Paquier wrote: > On Thu, Dec 13, 2018 at 01:51:48PM +0300, Sergei Kornilov wrote: > > Depends on this discussion: https://www.postgresql.org/message-id/20181212053042.GK17695@paquier.xyz > > If walreceiver will use GUC conninfo for startup - then yes, we can > > remove such static variables. If walreceiver will still use conninfo > > from WalRcv - we have race condition between RequestXLogStreaming from > > startup, config reload, and walreceiver start. In this case I need > > save conninfo from WalRcv and compare new GUC value with them. > > I would recommend waiting for the conclusion of other thread before > moving on with this one. There are arguments for letting the startup > process deciding which connection string the WAL receiver should use as > well as letting the WAL receiver use directly the connection string from > the GUC. I suggest continuing the development of the patch without relying on that refactoring. Greetings, Andres Freund
Hello Yeah, we have no consensus. Another open question is about logging new primary_conninfo: > LOG: parameter "primary_conninfo" changed to "host=/tmp port=5432 password=hoge" I my opinion this is not issue, database logs can have sensitive data. User queries, for example. If we not want expose such info - it is ok just hide new value from logs with new GUC flag? Or i need implement masked conninfofor this purpose? regards, Sergei
On Thu, Jan 31, 2019 at 04:13:22PM +0300, Sergei Kornilov wrote: > I my opinion this is not issue, database logs can have sensitive > data. User queries, for example. If we not want expose such info - > it is ok just hide new value from logs with new GUC flag? Or i need > implement masked conninfo for this purpose? You have problems with things in this area for any commands logged and able to show a connection string or a password, which can go down as well to CREATE/ALTER ROLE or FDWs. So for the purpose of what's discussed on this thread it does not sound like a requirement to be able to hide that. Role DDLs can take an already-hashed input to avoid that, still knowing the MD5 hash is sufficient for connection (not for SCRAM!). Now for FDWs.. -- Michael
Attachment
Hi, On 2019-01-31 16:13:22 +0300, Sergei Kornilov wrote: > Hello > > Yeah, we have no consensus. > Are you planning to update the patch? Given there's not been much progress here, I think we ough tot mark the CF entry as returned with feedback for now. > Another open question is about logging new primary_conninfo: > > LOG: parameter "primary_conninfo" changed to "host=/tmp port=5432 password=hoge" > > I my opinion this is not issue, database logs can have sensitive data. User queries, for example. > If we not want expose such info - it is ok just hide new value from logs with new GUC flag? Or i need implement maskedconninfo for this purpose? I agree that this doesn't need to be solved as part of this patch. Given the config is in the conf file, I don't think it's meaningful to hide this from the log. If necessary one can use client certs, service files, etc. Greetings, Andres Freund
Hi > I agree that this doesn't need to be solved as part of this patch. Thank you! > Are you planning to update the patch? Sorry, i was busy last month. But, well, i already did such update: https://www.postgresql.org/message-id/9653601544523383%40iva8-37fc2ad204cd.qloud-c.yandex.net v003 version does not change RequestXLogStreaming, not require "Making WAL receiver startup rely on GUC context" patch anddoes not hide new value from logs. Here is updated version (from v003 patch) with few wording changes suggested by Kyotaro HORIGUCHI and Michael Paquier regards, Sergei
Attachment
Hi, On 2019-02-03 15:33:38 +0300, Sergei Kornilov wrote: > +/* > + * Actual processing SIGHUP signal > + */ > +static void > +ProcessWalRcvSigHup(void) > +{ > + ProcessConfigFile(PGC_SIGHUP); > + > + /* > + * If primary_conninfo has been changed while walreceiver is running, > + * shut down walreceiver so that a new walreceiver is started and > + * initiates replication with the new connection information. > + */ > + if (strcmp(current_conninfo, PrimaryConnInfo) != 0) > + ereport(FATAL, > + (errcode(ERRCODE_ADMIN_SHUTDOWN), > + errmsg("terminating walreceiver process due to change of primary_conninfo"), > + errdetail("In a moment starts streaming WAL with new configuration."))); > + > + /* > + * And the same for primary_slot_name. > + */ > + if (strcmp(current_slotname, PrimarySlotName) != 0) > + ereport(FATAL, > + (errcode(ERRCODE_ADMIN_SHUTDOWN), > + errmsg("terminating walreceiver process due to change of primary_slot_name"), > + errdetail("In a moment starts streaming WAL with new configuration."))); > + > + XLogWalRcvSendHSFeedback(true); > +} I don't quite think this is the right design. IMO the startup process should signal the walreceiver to shut down, and the wal receiver should never look at the config. Otherwise there's chances for knowledge of pg.conf to differ between the processes. Greetings, Andres Freund
Hi > I don't quite think this is the right design. IMO the startup process > should signal the walreceiver to shut down, and the wal receiver should > never look at the config. Ok, i can rewrite such way. Please check attached version. > Otherwise there's chances for knowledge of > pg.conf to differ between the processes. I still not understood why this: - is not issue for startup process with all primary_conninfo logic - but issue for walreceiver with Michael Paquier patch; therefore without any primary_conninfo change logic in startup. In both cases primary_conninfo change logic is only in one process. regards, Sergei
Attachment
On Sat, Feb 16, 2019 at 02:50:35PM +0300, Sergei Kornilov wrote: > + if ((strcmp(conninfo, PrimaryConnInfo) != 0 || > + strcmp(slotname, PrimarySlotName) != 0) && > + WalRcvRunning()) > + { > + ereport(LOG, > + (errmsg("terminating walreceiver process due to change of %s", > + strcmp(conninfo, PrimaryConnInfo) != 0 ? "primary_conninfo" : "primary_slot_name"), > + errdetail("In a moment starts streaming WAL with new configuration."))); > + > + ShutdownWalRcv(); > + lastSourceFailed = true; > + } If you do that, the startup process would try to jump to a different source to fetch WAL if archiving is enabled. Is that really what we want? It would be nice to have one message for primary_conninfo being updated, and one for primary_slot_name so as if both are changed at the same time the user gets the correct information. "In a moment starts streaming WAL with new configuration." sounds strange. It would be more natural to have something like "The WAL receiver is going to be restarted with the new configuration", just a suggestion. -- Michael
Attachment
Hi > If you do that, the startup process would try to jump to a different > source to fetch WAL if archiving is enabled. Is that really what we > want? Yes, similar behavior with exit walreceiver by any reason. Same as in all previous patch versions. (not sure this can be changed in some small and elegant way) > It would be nice to have one message for primary_conninfo being > updated, and one for primary_slot_name so as if both are changed at > the same time the user gets the correct information. Hm, maybe even better would be separate message if both settings are changed? Doing this in attached patch. > "In a moment starts streaming WAL with new configuration." sounds > strange. It would be more natural to have something like "The WAL > receiver is going to be restarted with the new configuration", just a > suggestion. This phrase was proposed here: https://www.postgresql.org/message-id/20181213.184233.215171075.horiguchi.kyotaro%40lab.ntt.co.jp My english is poor, so i added message just as proposed. I replaced to your variant. regards, Sergei
Attachment
Hi, On 2019-02-16 14:50:35 +0300, Sergei Kornilov wrote: > > I don't quite think this is the right design. IMO the startup process > > should signal the walreceiver to shut down, and the wal receiver should > > never look at the config. > > Ok, i can rewrite such way. Please check attached version. > > > Otherwise there's chances for knowledge of > > pg.conf to differ between the processes. > > I still not understood why this: > - is not issue for startup process with all primary_conninfo logic > - but issue for walreceiver with Michael Paquier patch; therefore without any primary_conninfo change logic in startup. > In both cases primary_conninfo change logic is only in one process. > > regards, Sergei For one, it's not ok to just let the startup process think that the walreceiver failed - that'll make it change source of WAL to the next available method. Something we definitely do not want, as restore_command is very commonly slower. But just as importantly, I think, is that we ought to automate cluster-wide tasks like failing over more inside postgres. And that's much harder if different parts of PG, say the startup process and walreceiver, have different assumptions about the current configuration. > +void > +ProcessStartupSigHup(void) > +{ > + char *conninfo = pstrdup(PrimaryConnInfo); > + char *slotname = pstrdup(PrimarySlotName); > + > + ProcessConfigFile(PGC_SIGHUP); > + > + /* > + * we need shutdown running walreceiver if replication settings was > + * changed. walreceiver will be started with new settings in next > + * WaitForWALToBecomeAvailable iteration > + */ > + if ((strcmp(conninfo, PrimaryConnInfo) != 0 || > + strcmp(slotname, PrimarySlotName) != 0) && > + WalRcvRunning()) > + { > + ereport(LOG, > + (errmsg("terminating walreceiver process due to change of %s", > + strcmp(conninfo, PrimaryConnInfo) != 0 ? "primary_conninfo" : "primary_slot_name"), > + errdetail("In a moment starts streaming WAL with new configuration."))); > + > + ShutdownWalRcv(); > + lastSourceFailed = true; > + } > + > + pfree(conninfo); > + pfree(slotname); > +} That'd still switch to a different method, something we do not want... Greetings, Andres Freund
Hi > For one, it's not ok to just let the startup process think that the > walreceiver failed - that'll make it change source of WAL to the next > available method. Something we definitely do not want, as > restore_command is very commonly slower. Ok. This was not mentioned before Michael response yesterday. restore_command is much faster compared to database restart,also switch to a different method was proposed few years ago by Simon Riggs in original change recovery.conf proposal( https://www.postgresql.org/message-id/flat/CANP8+jLO5fmfudbB1b1iw3pTdOK1HBM=xMTaRfOa5zpDVcqzew@mail.gmail.com/). I assumedwe can start with this. Sorry for your wasted time. > That'd still switch to a different method, something we do not want... Ok, do not want means we do not want. Will try change behavior. regards, Sergei
On Mon, Feb 18, 2019 at 12:06:05AM +0300, Sergei Kornilov wrote: > Ok. This was not mentioned before Michael response > yesterday. restore_command is much faster compared to database > restart [...] The startup process would not switch back to streaming with archiving enabled until it has consumed all the segments available. I recall David Steele mentioning me that a perl command invocation can take easily 100ms. On a system which generates more than 10 segments per segment, you can guess what happens.. I think that this argument is a non-starter if we want to make the WAL receiver a maximum available. >> That'd still switch to a different method, something we do not want... > > Ok, do not want means we do not want. Will try change behavior. Thanks Sergei for considering it. The code paths touched are sensitive, so we had better be careful here. -- Michael
Attachment
Hello This might be not the right way, but I can't think of a better way to not switch to a different method than split of lastSourceFailedprocessing and starting new source. Something like refactoring in first attached patch. I move RequestXLogStreaminglogic from lastSourceFailed processing and add new pendingRestartSource indicator. Second patch is mostly the same as previous version but uses new pendingRestartSource mechanism instead of lastSourceFailed. Thanks in advance! regards, Sergei
Attachment
On Sat, Mar 02, 2019 at 01:49:51PM +0300, Sergei Kornilov wrote: > This might be not the right way, but I can't think of a better way > to not switch to a different method than split of lastSourceFailed > processing and starting new source. Something like refactoring in > first attached patch. I move RequestXLogStreaming logic from > lastSourceFailed processing and add new pendingRestartSource > indicator. OK. This needs a rather close lookup, and I am not actually sure that you even need pendingRestartSource. Please let me a couple of days before coming back to you on 0001. > Second patch is mostly the same as previous version but uses new > pendingRestartSource mechanism instead of lastSourceFailed. This, on the other hand, looks like the implementation we are looking for based on the discussions which happened until now to have the startup process handle the shutdown and restart of the WAL receiver. -- Michael
Attachment
Hi Michael, On 3/4/19 7:19 AM, Michael Paquier wrote: > On Sat, Mar 02, 2019 at 01:49:51PM +0300, Sergei Kornilov wrote: >> This might be not the right way, but I can't think of a better way >> to not switch to a different method than split of lastSourceFailed >> processing and starting new source. Something like refactoring in >> first attached patch. I move RequestXLogStreaming logic from >> lastSourceFailed processing and add new pendingRestartSource >> indicator. > > OK. This needs a rather close lookup, and I am not actually sure that > you even need pendingRestartSource. Please let me a couple of days > before coming back to you on 0001. > >> Second patch is mostly the same as previous version but uses new >> pendingRestartSource mechanism instead of lastSourceFailed. > > This, on the other hand, looks like the implementation we are looking > for based on the discussions which happened until now to have the > startup process handle the shutdown and restart of the WAL receiver. Do you know when you'll have a chance to revisit this? Regards, -- -David david@pgmasters.net
Hi, On 2019-03-25 12:32:21 +0400, David Steele wrote: > On 3/4/19 7:19 AM, Michael Paquier wrote: > > On Sat, Mar 02, 2019 at 01:49:51PM +0300, Sergei Kornilov wrote: > > > This might be not the right way, but I can't think of a better way > > > to not switch to a different method than split of lastSourceFailed > > > processing and starting new source. Something like refactoring in > > > first attached patch. I move RequestXLogStreaming logic from > > > lastSourceFailed processing and add new pendingRestartSource > > > indicator. > > > > OK. This needs a rather close lookup, and I am not actually sure that > > you even need pendingRestartSource. Please let me a couple of days > > before coming back to you on 0001. > > > > > Second patch is mostly the same as previous version but uses new > > > pendingRestartSource mechanism instead of lastSourceFailed. > > > > This, on the other hand, looks like the implementation we are looking > > for based on the discussions which happened until now to have the > > startup process handle the shutdown and restart of the WAL receiver. > > Do you know when you'll have a chance to revisit this? I think we unfortunately got to mark this as returned with feedback. I've not done so, but just switched the entry to waiting on author. Greetings, Andres Freund
Hi > I think we unfortunately got to mark this as returned with > feedback. I've not done so, but just switched the entry to waiting on > author. Why returned with feedback? Why waiting on author? I didn't receive a feedback for latest published patch version. What canI do as author? Patch still applied (thanks cf bot) Obviously too late for pg12, but why can not be target pg13 and therefore be moved to next CF? regards, Sergei
Hi, On 2019-04-03 23:49:54 +0300, Sergei Kornilov wrote: > > I think we unfortunately got to mark this as returned with > > feedback. I've not done so, but just switched the entry to waiting on > > author. > > Why returned with feedback? Why waiting on author? I didn't receive a > feedback for latest published patch version. What can I do as author? > Patch still applied (thanks cf bot) Obviously too late for pg12, but > why can not be target pg13 and therefore be moved to next CF? Well, my impression was that the patch didn't yet really address the feedback. And thus should have been marked as waiting on author for a while. Greetings, Andres Freund
Hi >> > I think we unfortunately got to mark this as returned with >> > feedback. I've not done so, but just switched the entry to waiting on >> > author. >> >> Why returned with feedback? Why waiting on author? I didn't receive a >> feedback for latest published patch version. What can I do as author? >> Patch still applied (thanks cf bot) Obviously too late for pg12, but >> why can not be target pg13 and therefore be moved to next CF? > > Well, my impression was that the patch didn't yet really address the > feedback. And thus should have been marked as waiting on author for a > while. Not agree. Latest patch version perform walreceiver restart without switch to a different method as discussed. Here is norace condition between startup process and walreceiver because conninfo passed via WalRcvData struct as currently. I misssomething important? Michael Paquier had no possibility to review latest implementation, but did not say this is totally wrong, just asked waita rather close lookup. Of cource we can close this cf entry. I would be happy if someone else post proper implementation. And I can rework my implementationagain, but I don’t know how the correct implementation should look or why latest implementation is wrong. regards, Sergei
On Thu, Apr 04, 2019 at 12:27:55AM +0300, Sergei Kornilov wrote: > Not agree. Latest patch version perform walreceiver restart without > switch to a different method as discussed. Here is no race condition > between startup process and walreceiver because conninfo passed via > WalRcvData struct as currently. I miss something important? > Michael Paquier had no possibility to review latest implementation, > but did not say this is totally wrong, just asked wait a rather > close lookup. I agree with Sergei's statement here. He has sent a patch for review, which I have looked up a bit, but not enough to provide a full review unfortunately, and this even if I was listed as a reviewer of it. So if somebody is to blame here, that's me. > Of cource we can close this cf entry. I would be happy if someone > else post proper implementation. And I can rework my implementation > again, but I don’t know how the correct implementation should look > or why latest implementation is wrong. Moving this patch to next CF is fine. -- Michael
Attachment
On 2019-04-04 11:06:05 +0900, Michael Paquier wrote: > On Thu, Apr 04, 2019 at 12:27:55AM +0300, Sergei Kornilov wrote: > > Not agree. Latest patch version perform walreceiver restart without > > switch to a different method as discussed. Here is no race condition > > between startup process and walreceiver because conninfo passed via > > WalRcvData struct as currently. I miss something important? > > Michael Paquier had no possibility to review latest implementation, > > but did not say this is totally wrong, just asked wait a rather > > close lookup. > > I agree with Sergei's statement here. He has sent a patch for review, > which I have looked up a bit, but not enough to provide a full review > unfortunately, and this even if I was listed as a reviewer of it. So > if somebody is to blame here, that's me. > > Of cource we can close this cf entry. I would be happy if someone > > else post proper implementation. And I can rework my implementation > > again, but I don’t know how the correct implementation should look > > or why latest implementation is wrong. > > Moving this patch to next CF is fine. Cool, sorry for the misunderstanding then.
Hello Updated version attached. Merge conflict was about tests count in 001_stream_rep.pl. Nothing else was changed. My approachcan be still incorrect, any redesign ideas are welcome. Thanks in advance! regards, Sergei
Attachment
On Mon, Jul 01, 2019 at 02:33:39PM +0300, Sergei Kornilov wrote: > Updated version attached. Merge conflict was about tests count in > 001_stream_rep.pl. Nothing else was changed. My approach can be > still incorrect, any redesign ideas are welcome. Thanks in advance! It has been some time, and I am finally catching up with this patch. + <para> + WAL receiver will be restarted after <varname>primary_slot_name</varname> + was changed. </para> The sentence sounds strange. Here is a suggestion: The WAL receiver is restarted after an update of primary_slot_name (or primary_conninfo). The comment at the top of the call of ProcessStartupSigHup() in HandleStartupProcInterrupts() needs to be updated as it mentions a configuration file re-read, but that's not the case anymore. pendingRestartSource's name does not represent what it does, as it is only associated with the restart of a WAL receiver when in streaming state, and that's a no-op for the archive mode and the local mode. So, the patch splits the steps taken when checking for a WAL source by adding an extra step after the failure handling that you are calling the restart step. When a failure happens for the stream mode (shutdown of WAL receiver, promotion. etc), there is a switch to the archive mode, and nothing is changed in this case in your patch. So when shutting down the WAL receiver after a parameter change, what happens is that the startup process waits for retrieve_retry_interval before moving back to the archive mode. Only after scanning again the archives do we restart a new WAL receiver. However, if the restart of the WAL receiver is planned because of an update of primary_conninfo (or slot), shouldn't the follow-up mode be XLOG_FROM_STREAM without waiting for wal_retrieve_retry_interval ms for extra WAL to become available? -- Michael
Attachment
On Tue, Jul 30, 2019 at 6:42 PM Michael Paquier <michael@paquier.xyz> wrote: > On Mon, Jul 01, 2019 at 02:33:39PM +0300, Sergei Kornilov wrote: > > Updated version attached. Merge conflict was about tests count in > > 001_stream_rep.pl. Nothing else was changed. My approach can be > > still incorrect, any redesign ideas are welcome. Thanks in advance! > > [review] Unfortunately this comes right that the end of the CF, but the good news is that there is another one just around the corner. Moved to September. -- Thomas Munro https://enterprisedb.com
On Thu, Aug 01, 2019 at 09:06:59PM +1200, Thomas Munro wrote: > Unfortunately this comes right that the end of the CF, but the good > news is that there is another one just around the corner. Moved to > September. Moving it to the next CF is fine, thanks. The author had no time to reply since my last lookup. -- Michael
Attachment
Hello > It has been some time, and I am finally catching up with this patch. Thank you! > + <para> > + WAL receiver will be restarted after <varname>primary_slot_name</varname> > + was changed. > </para> > The sentence sounds strange. Here is a suggestion: > The WAL receiver is restarted after an update of primary_slot_name (or > primary_conninfo). > > The comment at the top of the call of ProcessStartupSigHup() in > HandleStartupProcInterrupts() needs to be updated as it mentions a > configuration file re-read, but that's not the case anymore. I agree. But I want to know if this is an acceptable idea in general, before spending a few more months polishing the texton something we don’t want. > pendingRestartSource's name does not represent what it does, as it is > only associated with the restart of a WAL receiver when in streaming > state, and that's a no-op for the archive mode and the local mode. Currently we have only one source with such operation, so I can name variable pendingRestartWalreceiver and replace correspondingswitch to compact condition. If someday we need additional actions for another wal source - we can replace thiscode. Looks like some form of premature optimization from me > However, if the restart of > the WAL receiver is planned because of an update of primary_conninfo > (or slot), shouldn't the follow-up mode be XLOG_FROM_STREAM without > waiting for wal_retrieve_retry_interval ms for extra WAL to become > available? Hmm. Standby with log_min_messages = debug2 and default wal_retrieve_retry_interval = 5s gives me: 2019-08-01 12:38:31.255 MSK 15707 @ from [vxid: txid:0] [] DEBUG: sendtime 2019-08-01 12:38:31.254905+03 receipttime 2019-08-0112:38:31.254986+03 replication apply delay 0 ms transfer latency 0 ms 2019-08-01 12:38:31.255 MSK 15707 @ from [vxid: txid:0] [] DEBUG: sending write 0/3023FA8 flush 0/3023F38 apply 0/3023F38 2019-08-01 12:38:31.262 MSK 15707 @ from [vxid: txid:0] [] DEBUG: sending write 0/3023FA8 flush 0/3023FA8 apply 0/3023F38 2019-08-01 12:38:31.262 MSK 15707 @ from [vxid: txid:0] [] DEBUG: sending write 0/3023FA8 flush 0/3023FA8 apply 0/3023FA8 2019-08-01 12:38:31.457 MSK 15699 @ from [vxid: txid:0] [] LOG: received SIGHUP, reloading configuration files 2019-08-01 12:38:31.458 MSK 15699 @ from [vxid: txid:0] [] LOG: parameter "primary_conninfo" changed to " host='/tmp' port=5555" 2019-08-01 12:38:31.459 MSK 15700 @ from [vxid:1/0 txid:0] [] LOG: The WAL receiver is going to be restarted due to changeof primary_conninfo 2019-08-01 12:38:31.459 MSK 15703 @ from [vxid: txid:0] [] DEBUG: checkpointer updated shared memory configuration values 2019-08-01 12:38:31.459 MSK 15707 @ from [vxid: txid:0] [] FATAL: terminating walreceiver process due to administratorcommand 2019-08-01 12:38:31.463 MSK 15724 @ from [vxid: txid:0] [] LOG: started streaming WAL from primary at 0/3000000 on timeline1 2019-08-01 12:38:31.464 MSK 15724 @ from [vxid: txid:0] [] DEBUG: sendtime 2019-08-01 12:38:31.463954+03 receipttime 2019-08-0112:38:31.464228+03 replication apply delay 0 ms transfer latency 0 ms 2019-08-01 12:38:31.464 MSK 15724 @ from [vxid: txid:0] [] DEBUG: sendtime 2019-08-01 12:38:31.464068+03 receipttime 2019-08-0112:38:31.464375+03 replication apply delay 0 ms transfer latency 0 ms 2019-08-01 12:38:31.464 MSK 15724 @ from [vxid: txid:0] [] DEBUG: sending write 0/3023FA8 flush 0/3023FA8 apply 0/3023FA8 No source switch, no wal_retrieve_retry_interval waiting. (wal writes on primary by \watch 0.3 query) RequestXLogStreaming set walRcvState to WALRCV_STARTING - correct state for WalRcvStreaming and therefore we have no lastSourceFailed. regards, Sergei
Hello Updated patch attached. (also I merged into one file) > + <para> > + WAL receiver will be restarted after <varname>primary_slot_name</varname> > + was changed. > </para> > The sentence sounds strange. Here is a suggestion: > The WAL receiver is restarted after an update of primary_slot_name (or > primary_conninfo). Changed. > The comment at the top of the call of ProcessStartupSigHup() in > HandleStartupProcInterrupts() needs to be updated as it mentions a > configuration file re-read, but that's not the case anymore. Changed to "Process any requests or signals received recently." like in other places, e.g. syslogger > pendingRestartSource's name does not represent what it does, as it is > only associated with the restart of a WAL receiver when in streaming > state, and that's a no-op for the archive mode and the local mode. I renamed to pendingWalRcvRestart and replaced switch with simple condition. > So when shutting down the WAL receiver after a parameter change, what > happens is that the startup process waits for retrieve_retry_interval > before moving back to the archive mode. Only after scanning again the > archives do we restart a new WAL receiver. As I answered earlier, here is no switch to archive or wal_retrieve_retry_interval waiting in my patch. I recheck on currentrevision too: 2019-08-28 12:16:27.295 MSK 11180 @ from [vxid: txid:0] [] DEBUG: sending write 0/30346C8 flush 0/30346C8 apply 0/30346C8 2019-08-28 12:16:27.493 MSK 11172 @ from [vxid: txid:0] [] LOG: received SIGHUP, reloading configuration files 2019-08-28 12:16:27.494 MSK 11172 @ from [vxid: txid:0] [] LOG: parameter "primary_conninfo" changed to "host='/tmp' port=5555sslmode=disable sslcompression=0 gssencmode=disable target_session_attrs=any" 2019-08-28 12:16:27.496 MSK 11173 @ from [vxid:1/0 txid:0] [] LOG: The WAL receiver is going to be restarted due to changeof primary_conninfo 2019-08-28 12:16:27.496 MSK 11176 @ from [vxid: txid:0] [] DEBUG: checkpointer updated shared memory configuration values 2019-08-28 12:16:27.496 MSK 11180 @ from [vxid: txid:0] [] FATAL: terminating walreceiver process due to administratorcommand 2019-08-28 12:16:27.500 MSK 11335 @ from [vxid: txid:0] [] LOG: started streaming WAL from primary at 0/3000000 on timeline1 2019-08-28 12:16:27.500 MSK 11335 @ from [vxid: txid:0] [] DEBUG: sendtime 2019-08-28 12:16:27.50037+03 receipttime 2019-08-2812:16:27.500821+03 replication apply delay 0 ms transfer latency 0 ms No "DEBUG: switched WAL source from stream to archive after failure" messages, no time difference (wal_retrieve_retry_interval= 5s). regards, Sergei
Attachment
On Wed, Aug 28, 2019 at 6:50 PM Sergei Kornilov <sk@zsrv.org> wrote: > > Hello > > Updated patch attached. (also I merged into one file) Thanks for updating the patch! Here are some comments from me. #primary_conninfo = '' # connection string to sending server # (change requires restart) #primary_slot_name = '' # replication slot on sending server # (change requires restart) ISTM that you need to update the above parts in postgresql.conf.sample. /* we don't expect primary_conninfo to change */ ISTM that you need to update the above comment in walreceiver.c. + <para> + The WAL receiver is restarted after an update of <varname>primary_conninfo</varname>. + </para> If primary_conninfo is set to an empty string, walreceiver just shuts down, not restarts. The above description in the doc is not always true. So I'm thinking something like the following description is better. Thought? If primary_conninfo is changed while WAL receiver is running, the WAL receiver shuts down and then restarts with new setting, except when primary_conninfo is an empty string. There is the case where walreceiver doesn't shut down immediately after the change of primary_conninfo. If the change happens while the startup process in paused state (e.g., by pg_wal_replay_pause(), recovery_min_apply_delay, recovery conflict, etc), the startup process tries to terminate walreceiver after it gets out of such state. Shouldn't this fact be documented as a note? Regards, -- Fujii Masao
Hello Thank you for review! > ISTM that you need to update the above parts in postgresql.conf.sample. Good catch, I forgot about conf sample. > ISTM that you need to update the above comment in walreceiver.c. Changed > If primary_conninfo is set to an empty string, walreceiver just shuts down, > not restarts. The above description in the doc is not always true. > So I'm thinking something like the following description is better. > Thought? > > If primary_conninfo is changed while WAL receiver is running, > the WAL receiver shuts down and then restarts with new setting, > except when primary_conninfo is an empty string. Ok, changed. I leave primary_slot_name description as before. > There is the case where walreceiver doesn't shut down immediately > after the change of primary_conninfo. If the change happens while > the startup process in paused state (e.g., by pg_wal_replay_pause(), > recovery_min_apply_delay, recovery conflict, etc), the startup > process tries to terminate walreceiver after it gets out of such state. > Shouldn't this fact be documented as a note? Hmm. Is somewhere documented that walreceiver will receive WAL while the startup process in paused state? (didn't add such note in current version) regards, Sergei
Attachment
Hello. I looked this and have some comments. At Wed, 28 Aug 2019 12:49:46 +0300, Sergei Kornilov <sk@zsrv.org> wrote in <34084371566985786@sas1-0a6c2e2b59d7.qloud-c.yandex.net> sk> Hello sk> sk> Updated patch attached. (also I merged into one file) sk> sk> > + <para> sk> > + WAL receiver will be restarted after <varname>primary_slot_name</varname> sk> > + was changed. sk> > </para> sk> > The sentence sounds strange. Here is a suggestion: sk> > The WAL receiver is restarted after an update of primary_slot_name (or sk> > primary_conninfo). sk> sk> Changed. - This parameter can only be set at server start. + This parameter can only be set in the <filename>postgresql.conf</filename> + file or on the server command line. I'm not sure it's good to change the context of this description. This was mentioning that changing of this parameter requires server (re)start. So if we want to be on the same context after rewriting, it would be like "This parameter can be set any time and causes WAL receiver restart with the new setting if the server is in standby mode." And If I'm not missing something, I don't find an (explict) paramter of postmaster for setting primary_conninfo. sk> > The comment at the top of the call of ProcessStartupSigHup() in sk> > HandleStartupProcInterrupts() needs to be updated as it mentions a sk> > configuration file re-read, but that's not the case anymore. sk> sk> Changed to "Process any requests or signals received recently." like in other places, e.g. syslogger sk> sk> > pendingRestartSource's name does not represent what it does, as it is sk> > only associated with the restart of a WAL receiver when in streaming sk> > state, and that's a no-op for the archive mode and the local mode. sk> sk> I renamed to pendingWalRcvRestart and replaced switch with simple condition. sk> sk> > So when shutting down the WAL receiver after a parameter change, what sk> > happens is that the startup process waits for retrieve_retry_interval sk> > before moving back to the archive mode. Only after scanning again the sk> > archives do we restart a new WAL receiver. sk> sk> As I answered earlier, here is no switch to archive or wal_retrieve_retry_interval waiting in my patch. I recheck oncurrent revision too: @@ -11787,48 +11793,6 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, if (!StandbyMode) return false; - /* - * If primary_conninfo is set, launch walreceiver to try - * to stream the missing WAL. - * - * If fetching_ckpt is true, RecPtr points to the initial - * checkpoint location. In that case, we use RedoStartLSN Mentioning code, the movement of the code block makes the surrounding swtich almost useless and the structure and the result looks somewhat strange. Couldn't we do the same thing by just skipping the wait and setting lastSourceFailed to true in the case of intentional walreceiver restart? The attached is an incomplete (fraction) patch to sketch what is in my mind. With something like this and making ProcessStartupSigHup kill walreceiver would work. regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 6876537b62..61b235f8f7 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -11881,10 +11881,12 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, * obtaining the requested WAL. We're going to loop back * and retry from the archive, but if it hasn't been long * since last attempt, sleep wal_retrieve_retry_interval - * milliseconds to avoid busy-waiting. + * milliseconds to avoid busy-waiting. We don't wait if + * explicitly requested to restart. */ now = GetCurrentTimestamp(); - if (!TimestampDifferenceExceeds(last_fail_time, now, + if (!pendingWalRcvRestart && + !TimestampDifferenceExceeds(last_fail_time, now, wal_retrieve_retry_interval)) { long secs, @@ -11905,6 +11907,17 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, } last_fail_time = now; currentSource = XLOG_FROM_ARCHIVE; + + /* + * If wal receiver is requested to restart, we skip the + * next XLOG_FROM_ARCHIVE to immediately starting it. + */ + if (pendingWalRcvRestart) + { + lastSourceFailed = true; + continue; + } + break; default:
Hello Thank you for review! > - This parameter can only be set at server start. > + This parameter can only be set in the <filename>postgresql.conf</filename> > + file or on the server command line. > > I'm not sure it's good to change the context of this > description. This was mentioning that changing of this parameter > requires server (re)start. So if we want to be on the same > context after rewriting, it would be like "This parameter can be > set any time and causes WAL receiver restart with the new setting > if the server is in standby mode." Was written such way after this review: https://www.postgresql.org/message-id/20181125214313.lydvmrraqjfrb3s2%40alap3.anarazel.de > And If I'm not missing something, I don't find an (explict) > paramter of postmaster for setting primary_conninfo. Well, we have common -c option: -c name=value > Couldn't we do the same thing by just skipping the wait and > setting lastSourceFailed to true in the case of intentional > walreceiver restart? Yes, it's possible. Let's see... Done in attached variant. We need check pendingWalRcvRestart before rescanLatestTimeLine lines. regards, Sergei
Attachment
Hi, On 2019-09-19 17:46:06 +0300, Sergei Kornilov wrote: > <para> > - This parameter can only be set at server start. > + This parameter can only be set in the <filename>postgresql.conf</filename> > + file or on the server command line. > This setting has no effect if the server is not in standby mode. > </para> > + <para> > + If <varname>primary_conninfo</varname> is changed while WAL receiver is running, > + the WAL receiver shuts down and then restarts with new setting, > + except when primary_conninfo is an empty string. > + </para> From the sentence structure it's not clear that "except when primary_conninfo is an empty string" only applies to "and then restarts with new setting". > +/* > + * Need for restart running WalReceiver due the configuration change. > + * Suitable only for XLOG_FROM_STREAM source > + */ > +static bool pendingWalRcvRestart = false; s/due the/due to a/ (or even "due to the"). > @@ -11862,6 +11869,17 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, > if (WalRcvStreaming()) > ShutdownWalRcv(); > > + /* > + * If wal receiver is requested to restart, we skip the > + * next XLOG_FROM_ARCHIVE to immediately starting it. > + */ > + if (pendingWalRcvRestart) > + { > + lastSourceFailed = true; > + currentSource = XLOG_FROM_ARCHIVE; > + continue; > + } I can't parse that comment. What does "skipping to starting" mean? I assume it's just about avoiding wal_retrieve_retry_interval, but I think the comment ought to be rephrased. Also, should we really check this before scanning for new timelines? Why is it the right thing to change to XLOG_FROM_ARCHIVE when we're just restarting walreceiver? The server might unnecessarily get stuck in archive based recovery for a long time this way? It seems to me that we'd need to actually trigger RequestXLogStreaming() in this case. Greetings, Andres Freund
Hello Thank you for review! Can you please also check v4 version? v5 implements design suggested by Kyotaro Horiguchi-san, whilev4 has another design. Which one do you prefer? Or are both wrong? > I can't parse that comment. What does "skipping to starting" mean? I > assume it's just about avoiding wal_retrieve_retry_interval, but I think > the comment ought to be rephrased. Design idea is to rewrite current state from working XLOG_FROM_STREAM to failed XLOG_FROM_ARCHIVE (without actually try thismethod on this iteration) and immediately go to next iteration to advance the state machine. Next iteration after failedarchive recovery is walreceiver. So walreceiver will be stopped just before this lines and started on next iteration.Walreceiver will be restarted, we do not call restore_command > Also, should we really check this before scanning for new timelines? Hmm, on the next day... No, this is not really necessary. > Why is it the right thing to change to XLOG_FROM_ARCHIVE when we're just > restarting walreceiver? The server might unnecessarily get stuck in > archive based recovery for a long time this way? It seems to me that > we'd need to actually trigger RequestXLogStreaming() in this case. I hope I clarified this in design idea description above. Thank you! regards, Sergei
Hello. At Sat, 21 Sep 2019 13:06:25 +0300, Sergei Kornilov <sk@zsrv.org> wrote in <41171569060385@myt5-b646bde4b8f3.qloud-c.yandex.net> > Hello > > Thank you for review! Can you please also check v4 version? v5 implements design suggested by Kyotaro Horiguchi-san, whilev4 has another design. Which one do you prefer? Or are both wrong? > > > I can't parse that comment. What does "skipping to starting" mean? I > > assume it's just about avoiding wal_retrieve_retry_interval, but I think > > the comment ought to be rephrased. > > Design idea is to rewrite current state from working XLOG_FROM_STREAM to failed XLOG_FROM_ARCHIVE (without actually trythis method on this iteration) and immediately go to next iteration to advance the state machine. Next iteration afterfailed archive recovery is walreceiver. So walreceiver will be stopped just before this lines and started on next iteration.Walreceiver will be restarted, we do not call restore_command Sorry, it's my bad. It meant "If wal receiver is requested to restart, change state to XLOG_FROM_STREAM immediately skipping the next XLOG_FROM_ARCHIVE state.". > > Also, should we really check this before scanning for new timelines? > > Hmm, on the next day... No, this is not really necessary. No problem when timeline is not changed when explicit restart of wal receiver. But if it happened just after the standby received new hisotry file, we suffer an extra restart of wal receiver. It seems better that we check that in the case. > > Why is it the right thing to change to XLOG_FROM_ARCHIVE when we're just > > restarting walreceiver? The server might unnecessarily get stuck in > > archive based recovery for a long time this way? It seems to me that > > we'd need to actually trigger RequestXLogStreaming() in this case. > > I hope I clarified this in design idea description above. My suggestion was just to pretend that the next XLOG_FROM_STREAM failed, but the outcome actually looks wierd as Andres commented. I think that comes from the structure of the state machine. WAL receiver is started not at the beginning of XLOG_FROM_STREAM state but at the end of XLOG_FROM_ARCHIVE. So we need to switch once to XLOG_FROM_ARCHIVE in order to start wal receiver. So, I'd like to propose to move the stuff to the second switch(). (See the attached incomplete patch.) This is rather similar to Sergei's previous proposal, but the structure of the state machine is kept. Some other comments are below. In ProcessStartupSigHup, conninfo and slotname don't need to be retained until the end of the function. The log message in the function seems to be too detailed. On the other hand, if we changed primary_conninfo to '' (stop) or vise versa (start), the message (restart) looks strange. regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 6c69eb6dd7..dba0042db6 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -11755,12 +11755,15 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, for (;;) { int oldSource = currentSource; + bool start_wal_receiver = false; /* - * First check if we failed to read from the current source, and + * First check if we failed to read from the current source or if + * we want to restart wal receiver, and * advance the state machine if so. The failure to read might've * happened outside this function, e.g when a CRC check fails on a * record, or within this loop. + * Restart wal receiver if explicitly requested, too. */ if (lastSourceFailed) { @@ -11788,53 +11791,12 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, if (!StandbyMode) return false; - /* - * If primary_conninfo is set, launch walreceiver to try - * to stream the missing WAL. - * - * If fetching_ckpt is true, RecPtr points to the initial - * checkpoint location. In that case, we use RedoStartLSN - * as the streaming start position instead of RecPtr, so - * that when we later jump backwards to start redo at - * RedoStartLSN, we will have the logs streamed already. - */ - if (PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0) - { - XLogRecPtr ptr; - TimeLineID tli; - - if (fetching_ckpt) - { - ptr = RedoStartLSN; - tli = ControlFile->checkPointCopy.ThisTimeLineID; - } - else - { - ptr = RecPtr; - - /* - * Use the record begin position to determine the - * TLI, rather than the position we're reading. - */ - tli = tliOfPointInHistory(tliRecPtr, expectedTLEs); - - if (curFileTLI > 0 && tli < curFileTLI) - elog(ERROR, "according to history file, WAL location %X/%X belongs to timeline %u, but previousrecovered WAL file came from timeline %u", - (uint32) (tliRecPtr >> 32), - (uint32) tliRecPtr, - tli, curFileTLI); - } - curFileTLI = tli; - RequestXLogStreaming(tli, ptr, PrimaryConnInfo, - PrimarySlotName); - receivedUpto = 0; - } - /* * Move to XLOG_FROM_STREAM state in either case. We'll * get immediate failure if we didn't launch walreceiver, * and move on to the next state. */ + start_wal_receiver = true; currentSource = XLOG_FROM_STREAM; break; @@ -11864,8 +11826,8 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, ShutdownWalRcv(); /* - * Before we sleep, re-scan for possible new timelines if - * we were requested to recover to the latest timeline. + * Re-scan for possible new timelines if we were requested + * to recover to the latest timeline. */ if (recoveryTargetTimeLineGoal == RECOVERY_TARGET_TIMELINE_LATEST) { @@ -11929,8 +11891,8 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, lastSourceFailed ? "failure" : "success"); /* - * We've now handled possible failure. Try to read from the chosen - * source. + * We've now handled possible failure and configuration change. Try to + * read from the chosen source. */ lastSourceFailed = false; @@ -11969,9 +11931,71 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, bool havedata; /* - * Check if WAL receiver is still active. + * shutdown wal receiver if restart is requested. */ - if (!WalRcvStreaming()) + if (!start_wal_receiver && pendingWalRcvRestart) + { + if (WalRcvRunning()) + ShutdownWalRcv(); + + /* + * Re-scan for possible new timelines if we were + * requested to recover to the latest timeline. + */ + if (recoveryTargetTimeLineGoal == + RECOVERY_TARGET_TIMELINE_LATEST) + rescanLatestTimeLine(); + + start_wal_receiver = true; + } + pendingWalRcvRestart = false; + + /* + * Launch walreceiver if needed. + * + * If fetching_ckpt is true, RecPtr points to the initial + * checkpoint location. In that case, we use RedoStartLSN + * as the streaming start position instead of RecPtr, so + * that when we later jump backwards to start redo at + * RedoStartLSN, we will have the logs streamed already. + */ + if (start_wal_receiver && + PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0) + { + XLogRecPtr ptr; + TimeLineID tli; + + if (fetching_ckpt) + { + ptr = RedoStartLSN; + tli = ControlFile->checkPointCopy.ThisTimeLineID; + } + else + { + ptr = RecPtr; + + /* + * Use the record begin position to determine the + * TLI, rather than the position we're reading. + */ + tli = tliOfPointInHistory(tliRecPtr, expectedTLEs); + + if (curFileTLI > 0 && tli < curFileTLI) + elog(ERROR, "according to history file, WAL location %X/%X belongs to timeline %u, but previousrecovered WAL file came from timeline %u", + (uint32) (tliRecPtr >> 32), + (uint32) tliRecPtr, + tli, curFileTLI); + } + curFileTLI = tli; + RequestXLogStreaming(tli, ptr, PrimaryConnInfo, + PrimarySlotName); + receivedUpto = 0; + } + + /* + * Else, check if WAL receiver is still active. + */ + else if (!WalRcvStreaming()) { lastSourceFailed = true; break;
Hello > So, I'd like to propose to move the stuff to the second switch(). > (See the attached incomplete patch.) This is rather similar to > Sergei's previous proposal, but the structure of the state > machine is kept. Very similar to my v4 proposal (also move RequestXLogStreaming call), but closer to currentSource reading. No objectionsfrom me, attached patch is changed this way. I renamed start_wal_receiver to startWalReceiver - this style looks more consistent to near code. > + /* > + * Else, check if WAL receiver is still active. > + */ > + else if (!WalRcvStreaming()) I think we still need wait WalRcvStreaming after RequestXLogStreaming call. So I remove else branch and leave separate condition. > In ProcessStartupSigHup, conninfo and slotname don't need to be > retained until the end of the function. Agreed, I move pfree > The log message in the function seems to be too detailed. On the > other hand, if we changed primary_conninfo to '' (stop) or vise > versa (start), the message (restart) looks strange. I have no strong opinion here. These messages was changed many times during this thread lifetime, can be changed anytime.I think this is not issue since we have no consensus about overall design. Write detailed messages was proposed here: https://www.postgresql.org/message-id/20190216151025.GJ2240%40paquier.xyz > or vise versa (start) I explicitly check currentSource and WalRcvRunning, so we have no such messages if user had no walreceiver before. regards, Sergei
Attachment
Hi, I'm a bit confused by the status of this patch - it's marked as Waiting on Author (since last week), but that status was set by Sergei Kornilov who is also the patch author. And there have been no messages since the beginning of 2019-11 CF. So what's the current status of this patch? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello "Waiting on Author" is the right status. I found logical conflict with recently added wal_receiver_create_temp_slot GUC andmy tests are currently broken. Will fix it and post new version. PS: also, I surprised why it's ok for wal_receiver_create_temp_slot to be PGC_SIGHUP and ignore change of this setting untilwalreceiver will reconnect by unrelated reason. I means walreceiver does nothing special on SIGHUP. In common case changeof wal_receiver_create_temp_slot setting will have effect only during restart of walreceiver process. And thereforewe will switch to archive recovery. But such design was strongly rejected for my patch year ago. regards, Sergei
On Tue, Jan 21, 2020 at 06:03:18PM +0300, Sergei Kornilov wrote: > PS: also, I surprised why it's ok for wal_receiver_create_temp_slot > to be PGC_SIGHUP and ignore change of this setting until walreceiver > will reconnect by unrelated reason. I means walreceiver does nothing > special on SIGHUP. In common case change of > wal_receiver_create_temp_slot setting will have effect only during > restart of walreceiver process. And therefore we will switch to > archive recovery. But such design was strongly rejected for my patch > year ago. [ Looks at 3297308... ] Yeah, you are right. I was not paying much attention but something does not stick here. My understanding is that we should have the WAL receiver receive the value it needs to use from the startup process (aka via RequestXLogStreaming from xlog.c), and that we ought to make this new parameter PGC_POSTMASTER instead of PGC_SIGHUP. HEAD is inconsistent here. -- Michael
Attachment
Hello > Yeah, you are right. I was not paying much attention but something > does not stick here. My understanding is that we should have the WAL > receiver receive the value it needs to use from the startup process > (aka via RequestXLogStreaming from xlog.c), and that we ought to make > this new parameter PGC_POSTMASTER instead of PGC_SIGHUP. HEAD is > inconsistent here. Thank you! I attached two patches: - first changes wal_receiver_create_temp_slot to PGC_POSTMASTER and moved the logic to RequestXLogStreaming - second is based on last published v6 version of main patch. It changes wal_receiver_create_temp_slot back to PGC_SIGHUPalong with primary_conninfo and primary_slot_name and will restart walreceiver if need. Regarding the main patch: we have several possibilities for moving RequestXLogStreaming call. We need to choose one. And, of course, changes in the text... regards, Sergei
Attachment
Hello Small rebase due to merge conflict of the tests. No functional changes since v7. PS: also it is end of current CF, I will mark patch entry as moved to the next CF.
Attachment
On 2020-Jan-22, Michael Paquier wrote: > On Tue, Jan 21, 2020 at 06:03:18PM +0300, Sergei Kornilov wrote: > > PS: also, I surprised why it's ok for wal_receiver_create_temp_slot > > to be PGC_SIGHUP and ignore change of this setting until walreceiver > > will reconnect by unrelated reason. I means walreceiver does nothing > > special on SIGHUP. In common case change of > > wal_receiver_create_temp_slot setting will have effect only during > > restart of walreceiver process. And therefore we will switch to > > archive recovery. But such design was strongly rejected for my patch > > year ago. > > [ Looks at 3297308... ] > Yeah, you are right. I was not paying much attention but something > does not stick here. My understanding is that we should have the WAL > receiver receive the value it needs to use from the startup process > (aka via RequestXLogStreaming from xlog.c), and that we ought to make > this new parameter PGC_POSTMASTER instead of PGC_SIGHUP. HEAD is > inconsistent here. I'm CCing Peter as committer of 329730827848. What are the downsides of changing wal_receiver_create_temp_slot to PGC_POSTMASTER? It seems pretty nasty to requires a full server restart. Maybe we can signal all walreceivers at that point so that they restart with the correct setting? That's much less problematic, I would think. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020-Mar-13, Alvaro Herrera wrote: > What are the downsides of changing wal_receiver_create_temp_slot to > PGC_POSTMASTER? It seems pretty nasty to requires a full server > restart. Maybe we can signal all walreceivers at that point so that > they restart with the correct setting? That's much less problematic, I > would think. ... oh, that's exactly what 0002 does. So patch 0001 is only about making a temporary change to the create_temp_slot to be consistent with existing policy, before changing the policy. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Mar 13, 2020 at 11:17:38AM -0300, Alvaro Herrera wrote: > ... oh, that's exactly what 0002 does. So patch 0001 is only about > making a temporary change to the create_temp_slot to be consistent with > existing policy, before changing the policy. Yes. In my opinion, patch 0002 should not change the GUC mode of wal_receiver_create_temp_slot as the discussion here is about primary_conninfo, even if both may share some logic regarding WAL receiver shutdown and its restart triggered by the startup process. Patch 0001 has actually been presented on this thread first: https://www.postgresql.org/message-id/753391579708726@iva3-77ae5995f07f.qloud-c.yandex.net And there is an independent patch registered in this CF: https://commitfest.postgresql.org/27/2456/ Should we add patch 0001 as an open item for v13 as there is a risk of forgetting this issue? I have created a wiki blank page a couple of weeks back: https://wiki.postgresql.org/wiki/PostgreSQL_13_Open_Items -- Michael
Attachment
Hello Sorry for late replies. > Yes. In my opinion, patch 0002 should not change the GUC mode of > wal_receiver_create_temp_slot as the discussion here is about > primary_conninfo, even if both may share some logic regarding WAL > receiver shutdown and its restart triggered by the startup process. Ok, I removed related changes from main patch. Along with minor merge conflict. > Patch 0001 has actually been presented on this thread first: > https://www.postgresql.org/message-id/753391579708726@iva3-77ae5995f07f.qloud-c.yandex.net > And there is an independent patch registered in this CF: > https://commitfest.postgresql.org/27/2456/ Yep, 0001 is separate patch. I will post copy of this patch here to make cfbot works. Main patch 0002 requires resettingof is_temp_slot in RequestXLogStreaming to works properly. > Should we add patch 0001 as an open item for v13 as there is a risk of > forgetting this issue? I think yes. Well, it seems better to move this patch to next commitfest? regards, Sergei
Attachment
On 2020-Mar-17, Sergei Kornilov wrote: > Well, it seems better to move this patch to next commitfest? What? You want to make wal_receiver_create_temp_slot unchangeable and default to off in pg13, and delay the patch that fixes those things to pg14? That makes no sense to me. Please keep them both here so that we can get things to a usable state. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello >> Well, it seems better to move this patch to next commitfest? > > What? You want to make wal_receiver_create_temp_slot unchangeable and > default to off in pg13, and delay the patch that fixes those things to > pg14? That makes no sense to me. I want to handle similar things in a similar way. wal_receiver_create_temp_slot has good design? I will change my patch in same way in this case. But something like that wasstrongly rejected a year ago. > Please keep them both here so that we can get things to a usable state. Yes, of course. Here I attached 3 patches: 0001 is copy from https://commitfest.postgresql.org/27/2456/ It changes wal_receiver_create_temp_slot to PGC_POSTMASTER,changes the default value to off, and moves the logic to the startup process. 0002 changes primary_conninfo and primary_slot_name to be PGC_SIGHUP 0003 changes wal_receiver_create_temp_slot back to be PGC_SIGHUP. Michael Paquier asks to remove this from 0002, you askto leave it in this thread. So, I made separate patch on top of 0002. Thank you regards, Sergei
Attachment
I think the startup sighup handler should be in startup.c, not xlog.c, which has enough random code already. We can add an accessor in xlog.c to let changing the walrcv status flag, to be called from the signal handler. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020-Mar-24, Alvaro Herrera wrote: > I think the startup sighup handler should be in startup.c, not xlog.c, > which has enough random code already. We can add an accessor in xlog.c > to let changing the walrcv status flag, to be called from the signal > handler. ... as in the attached version. Sergei included LOG messages to indicate which setting has been changed. I put these in "#if 0" for now, but I'm thinking to remove these altogether; we already have LOG messages when a setting changes value, per ProcessConfigFileInternal(); the log sequence looks like this, taken from tmp_check/log/001_stream_rep_standby_2.log after running the tests: 2020-03-25 20:54:19.413 -03 [17493] LOG: received SIGHUP, reloading configuration files 2020-03-25 20:54:19.415 -03 [17493] LOG: parameter "primary_slot_name" changed to "standby_2" 2020-03-25 20:54:19.415 -03 [17493] LOG: parameter "wal_receiver_status_interval" changed to "1" 2020-03-25 20:54:19.422 -03 [17569] LOG: started streaming WAL from primary at 0/3000000 on timeline 1 2020-03-25 20:54:19.426 -03 [17494] LOG: wal receiver process shutdown requested 2020-03-25 20:54:19.426 -03 [17569] FATAL: terminating walreceiver process due to administrator command 2020-03-25 20:54:19.433 -03 [17572] LOG: started streaming WAL from primary at 0/3000000 on timeline 1 which looks sufficient. Maybe we can reword that new message, say "wal receiver process shutdown forced by parameter change". Not sure if we can or should adjust the FATAL line; probably not worth the trouble. Thoughts welcome. I'm thinking on getting this applied shortly. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Wed, Mar 25, 2020 at 09:08:12PM -0300, Alvaro Herrera wrote: > ... as in the attached version. Patch 0001 is one that has already been discussed previously (thanks for keeping the extra comments about GUCs and WAL receivers). That looks fine to me. > Sergei included LOG messages to indicate which setting has been changed. > I put these in "#if 0" for now, but I'm thinking to remove these > altogether; we already have LOG messages when a setting changes value, > per ProcessConfigFileInternal(); the log sequence looks like this, taken > from tmp_check/log/001_stream_rep_standby_2.log after running the tests: Yes, it makes sense to remove any knowledge of GUC updates from StartupRequestWalReceiverRestart(). I would add a description at the top of this routine by the way. +extern void ProcessStartupSigHup(void); This is declared, but used nowhere in patch 0002. Wouldn't it be better to be more careful about the NULL-ness of the string parameters in StartupRereadConfig()? + if (!slotnameChanged && strcmp(PrimarySlotName, "") == 0) + tempSlotChanged = tempSlot != wal_receiver_create_temp_slot; I would add parens here for clarity. > which looks sufficient. Maybe we can reword that new message, say "wal > receiver process shutdown forced by parameter change". Not sure if we > can or should adjust the FATAL line; probably not worth the trouble. There is merit to keep the LOG in StartupRequestWalReceiverRestart() unchanged, as the routine may get called in the future in some other context. -- Michael
Attachment
Hello Thank you! You were ahead of me. I wanted to double-check my changes this morning before posting. > Sergei included LOG messages to indicate which setting has been changed. > I put these in "#if 0" for now, but I'm thinking to remove these > altogether; No objections. > Not sure if we can or should adjust the FATAL line; probably not worth the trouble. I think it's ok. walreceiver is terminated exactly due to administrator command. regards, Sergei
Now, would anyone be too upset if I push these in this other order? I realized that the reason the tests broke after Sergei's patch is that recovery/t/001_stream_rep.pl's get_slot_xmins() is broken for temp walreceiver slots, since it's using the non-temp name it tries to give to the slot, rather than the temp name under which it is actually created. The workaround proposed by 0002 is to edit standby_1's config to set walreceiver's slot to be non-temp. Thanks to Justin Pryzby for offlist typo corrections. The reason is that I think I would like to get Sergei's patch pushed right away (0001+0002, as a single commit); but that I think there's more to attack in the walreceiver temp slot code than 0003 here, and I don't want to delay the new feature any longer while I figure out the fix for that. (The thing is: if I specify primary_slot_name in the config, why is the temp walreceiver slot code not obeying that name? I think walreceiver should create a temp slot, sure, but using the given name rather than coming up with a random name.) (The other reason is that I would like to push one patch to fix walreceiver tmp slot rather than two, setting the thing first this way and then the opposite way.) -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Thu, Mar 26, 2020 at 09:39:17PM -0300, Alvaro Herrera wrote: > Now, would anyone be too upset if I push these in this other order? I > realized that the reason the tests broke after Sergei's patch is that > recovery/t/001_stream_rep.pl's get_slot_xmins() is broken for temp > walreceiver slots, since it's using the non-temp name it tries to give > to the slot, rather than the temp name under which it is actually > created. The workaround proposed by 0002 is to edit standby_1's config > to set walreceiver's slot to be non-temp. FWIW, I find a bit strange that the change introduced in 001_stream_rep.pl as of patch 0002 is basically undone in 0003 by changing the default value of wal_receiver_create_temp_slot. > The reason is that I think I would like to get Sergei's patch pushed > right away (0001+0002, as a single commit); but that I think there's > more to attack in the walreceiver temp slot code than 0003 here, and I > don't want to delay the new feature any longer while I figure out the > fix for that. Not sure I agree with this approach. I'd rather fix all the existing known problems first, and then introduce the new features on top of what we consider to be a clean state. If we lack of time between the first and second patches, we may have a risk of keeping the code with the new feature but without the fixes discussed for wal_receiver_create_temp_slot. > (The thing is: if I specify primary_slot_name in the config, why is the > temp walreceiver slot code not obeying that name? I think walreceiver > should create a temp slot, sure, but using the given name rather than > coming up with a random name.) Good point. I am not sure either why the current logic has been chosen. The discussion related to the original patch is in this area: https://www.postgresql.org/message-id/4792e456-d75f-8e6a-2d47-34b8f78c266f@2ndquadrant.com > (The other reason is that I would like to push one patch to fix > walreceiver tmp slot rather than two, setting the thing first this way > and then the opposite way.) So your problem here is that by applying first 0003 and then 0001-0002 you would finish by switching wal_receiver_create_temp_slot to PGC_POSTMASTER, and then back to PGC_SIGHUP again? -- Michael
Attachment
Hello > I realized that the reason the tests broke after Sergei's patch is that > recovery/t/001_stream_rep.pl's get_slot_xmins() is broken for temp > walreceiver slots, since it's using the non-temp name it tries to give > to the slot, rather than the temp name under which it is actually > created. The workaround proposed by 0002 is to edit standby_1's config > to set walreceiver's slot to be non-temp. This is bug in behavior, not in tests. We need walrcv->is_temp_slot = false; somewhere in RequestXLogStreaming to works correctly. HEAD is not affected since primary_slot_name cannot be changed online. > (The thing is: if I specify primary_slot_name in the config, why is the > temp walreceiver slot code not obeying that name? I think walreceiver > should create a temp slot, sure, but using the given name rather than > coming up with a random name.) Hm, interesting idea. regards, Sergei
Hello In other words, patches in reverse order: 0001 will change primary_conninfo and primary_slot_name to reloadable 0002 will move wal_receiver_create_temp_slot logic to startup process (without changing to PGC_POSTMASTER) 0003 is new draft patch: wal_receiver_create_temp_slot will use the given name or generate new one. regards, Sergei
Attachment
On 2020-Mar-27, Sergei Kornilov wrote: > Hello > > > I realized that the reason the tests broke after Sergei's patch is that > > recovery/t/001_stream_rep.pl's get_slot_xmins() is broken for temp > > walreceiver slots, since it's using the non-temp name it tries to give > > to the slot, rather than the temp name under which it is actually > > created. The workaround proposed by 0002 is to edit standby_1's config > > to set walreceiver's slot to be non-temp. > > This is bug in behavior, not in tests. > We need walrcv->is_temp_slot = false; somewhere in RequestXLogStreaming to works correctly. > > HEAD is not affected since primary_slot_name cannot be changed online. I pushed the wal_receiver_create_temp_slot bugfix, because I realized after looking for long enough at WalReceiverMain() that the code was beyond saving. I'll be pushing the rest of this later today. Thanks, -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020-Mar-27, Alvaro Herrera wrote: > I pushed the wal_receiver_create_temp_slot bugfix, because I realized > after looking for long enough at WalReceiverMain() that the code was > beyond saving. I'll be pushing the rest of this later today. So here's the next one. I still have to go over the doc changes here, but at least the tests are passing for me. I think I should set aside your new draft for now, but I think we should still get it in pg13 to avoid having the change the semantics of the walreceiver temp slot in the next release. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Hello Thank you! > I think I should set aside your new draft for now I agree, this patch definitely needs a bit more time to review. (currently it applies on top of v13 patch cleanly) > but I think we should still get it in pg13 to avoid having the change the semantics of the > walreceiver temp slot in the next release. Good point. regards, Sergei
On 2020-Mar-27, Alvaro Herrera wrote: > On 2020-Mar-27, Alvaro Herrera wrote: > > > I pushed the wal_receiver_create_temp_slot bugfix, because I realized > > after looking for long enough at WalReceiverMain() that the code was > > beyond saving. I'll be pushing the rest of this later today. > > So here's the next one. I still have to go over the doc changes here, > but at least the tests are passing for me. Pushed with some documentation tweaks. Many thanks for working on this! -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello Thank you very much! I attached updated patch: walreceiver will use configured primary_slot_name as temporary slot name if wal_receiver_create_temp_slotis enabled. regards, Sergei
Attachment
On 2020-03-28 11:49, Sergei Kornilov wrote: > I attached updated patch: walreceiver will use configured primary_slot_name as temporary slot name if wal_receiver_create_temp_slotis enabled. The original setup worked consistently with pg_basebackup. In pg_basebackup, if you specify -S/--slot, then it uses a permanent slot with the given name. This is the same behavior as primary_slot_name has now. We should be careful that we don't create two sets of options/settings that look similar but combine in different ways. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020-03-28 02:39, Alvaro Herrera wrote: > On 2020-Mar-27, Alvaro Herrera wrote: > >> On 2020-Mar-27, Alvaro Herrera wrote: >> >> > I pushed the wal_receiver_create_temp_slot bugfix, because I realized >> > after looking for long enough at WalReceiverMain() that the code was >> > beyond saving. I'll be pushing the rest of this later today. >> >> So here's the next one. I still have to go over the doc changes here, >> but at least the tests are passing for me. > > Pushed with some documentation tweaks. > > Many thanks for working on this! Nice work! What do you think of possibility to add restore_command? Is it a good idea to make a restore_command to be reloadable via SUGHUP too? On the one hand it looks reasonable since primary_conninfo got such ability. On the other hand we don't exactly know whether the actual process after reload config would use "new" command or "old" since it may take a long time to finish running old command (in case of scp or smth). Also setting faulty restore_command lead to server termination, which is rather unexpectedly. If anyone makes some minor typo in restore_command, say write restore_command='cp1 /mnt/srv/%f %p' instead of restore_command='cp /mnt/srv/%f %p' and do SELECT pg_reload_conf() then server will terminate. Do you consider this feature useful? Any suggestions? --- Best regards, Maxim Orlov.
Attachment
Hello Yep, I think it's useful and I already posted patch in this thread: https://www.postgresql.org/message-id/flat/3090621585393698%40myt5-1466095fe4e5.qloud-c.yandex.net#ee6574e93982b5628d140f15ffffcb44 Currently without consensus regards, Sergei