Thread: allow online change primary_conninfo

allow online change primary_conninfo

From
Sergei Kornilov
Date:
Hello all
Now we integrate recovery.conf into GUC system. So i would like to start discussion to make primary_conninfo and
primary_slot_namePGC_SIGHUP.
 
My work-in-progress patch attached.

I think we have no futher reason to have a copy of conninfo and slot name in WalRcvData, so i propose remove these
fieldsfrom pg_stat_get_wal_receiver() (and pg_stat_wal_receiver view). This data can be accesible now via regular GUC
commands.

Thank you for advance!

regards, Sergei
Attachment

Re: allow online change primary_conninfo

From
Andres Freund
Date:
Hi,

On 2018-11-26 00:25:43 +0300, Sergei Kornilov wrote:
> Now we integrate recovery.conf into GUC system. So i would like to start discussion to make primary_conninfo and
primary_slot_namePGC_SIGHUP.
 
> My work-in-progress patch attached.

Cool!


> I think we have no futher reason to have a copy of conninfo and slot name in WalRcvData, so i propose remove these
fieldsfrom pg_stat_get_wal_receiver() (and pg_stat_wal_receiver view). This data can be accesible now via regular GUC
commands.

Is that wise? It seems possible that wal receivers run for a while with
the old connection information, and without this information that seems
hard to debug.


> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
> index db1a2d4..faa8e17 100644
> --- a/doc/src/sgml/config.sgml
> +++ b/doc/src/sgml/config.sgml
> @@ -3797,7 +3797,6 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
>            <varname>primary_conninfo</varname> string.
>           </para>
>           <para>
> -          This parameter can only be set at server start.
>            This setting has no effect if the server is not in standby mode.
>           </para>
>          </listitem>

Should probably note something like

"This parameter can only be set in the <filename>postgresql.conf</filename>
file or on the server command line."


> @@ -435,9 +420,33 @@ WalReceiverMain(void)
>  
>                  if (got_SIGHUP)
>                  {
> +                    char    *conninfo = pstrdup(PrimaryConnInfo);
> +                    char    *slotname = pstrdup(PrimarySlotName);
> +
>                      got_SIGHUP = false;
>                      ProcessConfigFile(PGC_SIGHUP);
>                      XLogWalRcvSendHSFeedback(true);
> +
> +                    /*
> +                     * If primary_conninfo has been changed while walreceiver is running,
> +                     * shut down walreceiver so that a new walreceiver is started and
> +                     * initiates replication with the new connection information.
> +                     */
> +                    if (strcmp(conninfo, PrimaryConnInfo) != 0)
> +                        ereport(FATAL,
> +                                (errcode(ERRCODE_ADMIN_SHUTDOWN),
> +                                 errmsg("closing replication connection because primary_conninfo was changed")));
> +
> +                    /*
> +                     * And the same for primary_slot_name.
> +                     */
> +                    if (strcmp(slotname, PrimarySlotName) != 0)
> +                        ereport(FATAL,
> +                                (errcode(ERRCODE_ADMIN_SHUTDOWN),
> +                                 errmsg("closing replication connection because primary_slot_name was changed")));
> +
> +                    pfree(conninfo);
> +                    pfree(slotname);

I'd strongly advocate moving this to a separate function.


Greetings,

Andres Freund


Re: allow online change primary_conninfo

From
Michael Paquier
Date:
On Sun, Nov 25, 2018 at 01:43:13PM -0800, Andres Freund wrote:
> On 2018-11-26 00:25:43 +0300, Sergei Kornilov wrote:
>> I think we have no futher reason to have a copy of conninfo and slot
>> name in WalRcvData, so i propose remove these fields from
>> pg_stat_get_wal_receiver() (and pg_stat_wal_receiver view). This data
>> can be accesible now via regular GUC commands.
>
> Is that wise? It seems possible that wal receivers run for a while with
> the old connection information, and without this information that seems
> hard to debug.

No, that's unwise.  One of the reasons why conninfo is around is that we
know to which server a receiver is connected when specifying multiple
host and/or port targets in the configuration.  Please see 9a895462 for
the details.

Removing the slot does not look like an improvement to me either,
following Andres' argument.
--
Michael

Attachment

Re: allow online change primary_conninfo

From
Sergei Kornilov
Date:
Hi

>>  I think we have no futher reason to have a copy of conninfo and slot name in WalRcvData, so i propose remove these
fieldsfrom pg_stat_get_wal_receiver() (and pg_stat_wal_receiver view). This data can be accesible now via regular GUC
commands.
>
> Is that wise? It seems possible that wal receivers run for a while with
> the old connection information, and without this information that seems
> hard to debug.
Hmm... I considered SIGHUP processing was in fast loop and therefore shutdown should be fast. But i recheck code and
founda possible long loop without processing SIGHUP (in case we receive new data faster than writes to disk). Ok, i
willrevert back.
 
How about write to WalRcvData only clobbered conninfo?

> Should probably note something like
>
> "This parameter can only be set in the <filename>postgresql.conf</filename>
> file or on the server command line."
Seems other PGC_SIGHUP settings have such note, fixed, thank you.

> I'd strongly advocate moving this to a separate function.
Done

regards, Sergei
Attachment

Re: allow online change primary_conninfo

From
Andres Freund
Date:
Hi,

On 2018-11-26 12:30:06 +0300, Sergei Kornilov wrote:
> Hmm... I considered SIGHUP processing was in fast loop and therefore shutdown should be fast. But i recheck code and
founda possible long loop without processing SIGHUP (in case we receive new data faster than writes to disk). Ok, i
willrevert back.
 
> How about write to WalRcvData only clobbered conninfo?

I'm not sure I understand what you mean?  The way I'd solve this is that
that only walreceiver, at startup, writes out its conninfo/slot_name,
sourcing the values from the GUCs. That way there's no issue with values
being overwritten early.

- Andres


Re: allow online change primary_conninfo

From
Sergei Kornilov
Date:
Hi

>>  Hmm... I considered SIGHUP processing was in fast loop and therefore shutdown should be fast. But i recheck code
andfound a possible long loop without processing SIGHUP (in case we receive new data faster than writes to disk). Ok, i
willrevert back.
 
>>  How about write to WalRcvData only clobbered conninfo?
>
> I'm not sure I understand what you mean?
I am about my initial proposal with remove conninfo wrom WalRcvData - walreceiver may run some time with old conninfo
and
 
> without this information that seems hard to debug.
Earlier i thought walreceiver will shutdown fast on SIGHUP.

> The way I'd solve this is that
> that only walreceiver, at startup, writes out its conninfo/slot_name,
> sourcing the values from the GUCs. That way there's no issue with values
> being overwritten early.
In second patch i follow exactly this logic.

regards, Sergei


Re: allow online change primary_conninfo

From
Michael Paquier
Date:
On Mon, Nov 26, 2018 at 09:46:36AM -0800, Andres Freund wrote:
> I'm not sure I understand what you mean?  The way I'd solve this is that
> that only walreceiver, at startup, writes out its conninfo/slot_name,
> sourcing the values from the GUCs. That way there's no issue with values
> being overwritten early.

WalRcv->conninfo, as stored in the WAL receiver, clobbers some of its
fields and includes a set of default parameters with its values after
establishing the connection so as no sensible data shows up in
pg_stat_wal_receiver so you cannot simply compare what is stored in the
WAL receiver with the GUCs to do the decision-making.  For the password,
you'd want to do again a connection anyway even if the cloberred string
is the same.

What's proposed in the patch to issue an ERROR if the parameter has
changed does not look that bad to me as it depends on the process
context, but I think that this patch should document clearly that if
primary_conninfo or primary_slot_name are changed then the WAL receiver
is stopped immediately.

    /*
-    * replication slot name; is also used for walreceiver to connect with the
-    * primary
+    * replication slot name used by runned walreceiver
     */
Not sure that there is much point in updating those comments, because
it still has the same meaning in the new context.

-RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
-                    const char *slotname)
+RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr)
That's perhaps a matter of taste, but keeping the current API as-is
looks cleaner to me, particularly because those fields get copied into
WalRcv, so I'd rather not change what is done in
WaitForWALToBecomeAvailable and keep the patch at its simplest shape.

+$node_standby_2->reload; # should have effect without restart
This does not wait for the change to be effective, so I think that you
introduce a race condition for slow machines with this test, making it
unstable.
--
Michael

Attachment

Re: allow online change primary_conninfo

From
Sergei Kornilov
Date:
Hello

thank you for review!

> but I think that this patch should document clearly that if
> primary_conninfo or primary_slot_name are changed then the WAL receiver
> is stopped immediately.
Good idea, will change.

>     /*
> - * replication slot name; is also used for walreceiver to connect with the
> - * primary
> + * replication slot name used by runned walreceiver
>      */
> Not sure that there is much point in updating those comments, because
> it still has the same meaning in the new context.
"is also used" seems outdated for me. With proposed patch this field means only currently active slot, not "also".
Well, depends on RequestXLogStreaming discussion

> -RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
> - const char *slotname)
> +RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr)
> That's perhaps a matter of taste, but keeping the current API as-is
> looks cleaner to me, particularly because those fields get copied into
> WalRcv, so I'd rather not change what is done in
> WaitForWALToBecomeAvailable and keep the patch at its simplest shape.
Unfortunally here is race condition if i leave RequestXLogStreaming infrastructure as-is and walreceiver will use this
structfor startup.
 
This case:
- startup process calls RequestXLogStreaming and fill WalRcv conninfo/slotname
- reload primary_conninnfo by SIGHUP
- walreceiver start with new GUC but trying conninfo/slotname from WalRcv struct. If walreceiver is able to connect -
itwill not restart till another error or primary_conninfo/slotname will be changed again (SIGHUP without change is not
enough).

I already had such failures while testing this patch.

If walreceiver will use primary_conninfo GUC at startup - i see no reason to leave *conninfo in RequestXLogStreaming.
Theseparameters will be misleading, we request them, but not using for anything.
 

> +$node_standby_2->reload; # should have effect without restart
> This does not wait for the change to be effective, so I think that you
> introduce a race condition for slow machines with this test, making it
> unstable.
No, here is no race condition because just after this line we wait this replication slot in upstream
pg_catalog.pg_replication_slots(get_slot_xmins routine).
 
Before reload we have no replication slot and should reconnect here with replication slot. I can add some comment
here.

regards, Sergei


Re: allow online change primary_conninfo

From
Sergei Kornilov
Date:
Hello

After some thinking i can rewrite this patch in another way.

This is better or worse?

regards, Sergei


Re: allow online change primary_conninfo

From
Sergei Kornilov
Date:
oops, forgot attach patch

11.12.2018, 13:14, "Sergei Kornilov" <sk@zsrv.org>:
> Hello
>
> After some thinking i can rewrite this patch in another way.
>
> This is better or worse?
>
> regards, Sergei
Attachment

Re: allow online change primary_conninfo

From
Michael Paquier
Date:
On Tue, Dec 11, 2018 at 11:59:08AM +0300, Sergei Kornilov wrote:
>> but I think that this patch should document clearly that if
>> primary_conninfo or primary_slot_name are changed then the WAL receiver
>> is stopped immediately.
>
> Good idea, will change.

+         <para>
+          <application>walreceiver</application> will be restarted after
+          <varname>primary_conninfo</varname> was changed.
+         </para>
Hm.  It may be cleaner to use "WAL receiver" here for clarity.  Perhaps
that's a matter of state but that seems cleaner when referring to the
actual process.  The docs share both grammar, depending on the context.

> If walreceiver will use primary_conninfo GUC at startup - i see no
> reason to leave *conninfo in RequestXLogStreaming. These parameters
> will be misleading, we request them, but not using for anything.

Oh, indeed.  My apologies for being confused here, I can see now your
point.  It seems to me that this is an extra cleanup caused by the
recovery parameters switched to be GUCs, and not something that we
should be changed as an effect to SIGHUP for primary_conninfo and
primary_slot_name.  So let's do this cleanup first.  For this purpose, I
am going to post a new thread with a proper patch, with in CC the folks
who moved the recovery parameters to be GUCs.

>> +$node_standby_2->reload; # should have effect without restart
>> This does not wait for the change to be effective, so I think that you
>> introduce a race condition for slow machines with this test, making it
>> unstable.
>
> No, here is no race condition because just after this line we wait
> this replication slot in upstream pg_catalog.pg_replication_slots
> (get_slot_xmins routine).
> Before reload we have no replication slot and should reconnect here
> with replication slot. I can add some comment here.

Good point, I have forgotten the call to get_slot_xmins() below.
--
Michael

Attachment

Re: allow online change primary_conninfo

From
Kyotaro HORIGUCHI
Date:
At Tue, 11 Dec 2018 13:16:23 +0300, Sergei Kornilov <sk@zsrv.org> wrote in
<9653601544523383@iva8-37fc2ad204cd.qloud-c.yandex.net>
sk> oops, forgot attach patch
sk> 
sk> 11.12.2018, 13:14, "Sergei Kornilov" <sk@zsrv.org>:
sk> > Hello
sk> >
sk> > After some thinking i can rewrite this patch in another way.
sk> >
sk> > This is better or worse?

As the whole the new version looks better for me.

===
Do we no longer need static version of conninfo/slotname in
walreceiver.c? We can detect a change of the variables without
them (as you did in the previous version.).


===
I don't think it is acceptable that raw conninfo is exposed into
log file.

>  LOG:  parameter "primary_conninfo" changed to "host=/tmp port=5432 password=hoge"


===
> errmsg("closing replication connection because primary_conninfo was changed")));

Maybe it is better being as follows according to similar messages:

> errmsg("terminating walreceiver process due to change of primary_conninfo")));

And it *might* be good to have detail.

> errdetail("In a moment starts streaming WAL with new configuration.");


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: allow online change primary_conninfo

From
Sergei Kornilov
Date:
Hello

> Do we no longer need static version of conninfo/slotname in
> walreceiver.c? We can detect a change of the variables without
> them (as you did in the previous version.).

Depends on this discussion: https://www.postgresql.org/message-id/20181212053042.GK17695@paquier.xyz
If walreceiver will use GUC conninfo for startup - then yes, we can remove such static variables.
If walreceiver will still use conninfo from WalRcv - we have race condition between RequestXLogStreaming from startup,
configreload, and walreceiver start. In this case i need save conninfo from WalRcv and compare new GUC value with
them.

> I don't think it is acceptable that raw conninfo is exposed into
> log file.
>
>>   LOG: parameter "primary_conninfo" changed to "host=/tmp port=5432 password=hoge"

Hmm. We have infrastructure to hide such values?
I need implement something like flag GUC_HIDE_VALUE_FROM_LOG near GUC_SUPERUSER_ONLY in separate thread? Log message
willbe changed to "LOG: parameter "primary_conninfo" changed" with this flag.
 

I also think that this is not a big problem. We may already have secret data in the logs. For example, in query text
fromapplication.
 

>>  errmsg("closing replication connection because primary_conninfo was changed")));
>
> Maybe it is better being as follows according to similar messages:
>
>>  errmsg("terminating walreceiver process due to change of primary_conninfo")));
>
> And it *might* be good to have detail.
>
>>  errdetail("In a moment starts streaming WAL with new configuration.");

Agreed, will change.

regards, Sergei


Re: allow online change primary_conninfo

From
Michael Paquier
Date:
On Thu, Dec 13, 2018 at 01:51:48PM +0300, Sergei Kornilov wrote:
> Depends on this discussion: https://www.postgresql.org/message-id/20181212053042.GK17695@paquier.xyz
> If walreceiver will use GUC conninfo for startup - then yes, we can
> remove such static variables.  If walreceiver will still use conninfo
> from WalRcv - we have race condition between RequestXLogStreaming from
> startup, config reload, and walreceiver start. In this case I need
> save conninfo from WalRcv and compare new GUC value with them.

I would recommend waiting for the conclusion of other thread before
moving on with this one.  There are arguments for letting the startup
process deciding which connection string the WAL receiver should use as
well as letting the WAL receiver use directly the connection string from
the GUC.

> Hmm. We have infrastructure to hide such values?
> I need implement something like flag GUC_HIDE_VALUE_FROM_LOG near
> GUC_SUPERUSER_ONLY in separate thread? Log message will be changed to
> "LOG: parameter "primary_conninfo" changed" with this flag.

Good point.  Passwords in logs can be considered as a security issue.
Having such an option may be interesting for other things, though I am
not sure if just having an option to hide logs related to a given
parameter are the correct way to shape it.  We could also consider using
the show hook function of a parameter to print its value in such logs.
--
Michael

Attachment

Re: allow online change primary_conninfo

From
Sergei Kornilov
Date:
Hi

> I would recommend waiting for the conclusion of other thread before
> moving on with this one.
Agreed.
I mark this patch as Waiting on Author. Not quite correct for waiting another discussion, but most suitable.

> We could also consider using
> the show hook function of a parameter to print its value in such logs.
But show hook also affects "show primary_conninfo;" command and pg_settings.value
I already marked primary_conninfo as GUC_SUPERUSER_ONLY. I doubt if we need hide some connection info even from
superuser.
Also this hook would be a bit complicated, i found suitable code only in libpqrcv_get_conninfo after establishing
connect

regards, Sergei


Re: allow online change primary_conninfo

From
"andres@anarazel.de"
Date:
Hi,

On 2018-12-14 16:55:42 +0900, Michael Paquier wrote:
> On Thu, Dec 13, 2018 at 01:51:48PM +0300, Sergei Kornilov wrote:
> > Depends on this discussion: https://www.postgresql.org/message-id/20181212053042.GK17695@paquier.xyz
> > If walreceiver will use GUC conninfo for startup - then yes, we can
> > remove such static variables.  If walreceiver will still use conninfo
> > from WalRcv - we have race condition between RequestXLogStreaming from
> > startup, config reload, and walreceiver start. In this case I need
> > save conninfo from WalRcv and compare new GUC value with them.
> 
> I would recommend waiting for the conclusion of other thread before
> moving on with this one.  There are arguments for letting the startup
> process deciding which connection string the WAL receiver should use as
> well as letting the WAL receiver use directly the connection string from
> the GUC.

I suggest continuing the development of the patch without relying on
that refactoring.

Greetings,

Andres Freund


Re: allow online change primary_conninfo

From
Sergei Kornilov
Date:
Hello

Yeah, we have no consensus.

Another open question is about logging new primary_conninfo:

> LOG: parameter "primary_conninfo" changed to "host=/tmp port=5432 password=hoge"

I my opinion this is not issue, database logs can have sensitive data. User queries, for example.
If we not want expose such info - it is ok just hide new value from logs with new GUC flag? Or i need implement masked
conninfofor this purpose?
 

regards, Sergei


Re: allow online change primary_conninfo

From
Michael Paquier
Date:
On Thu, Jan 31, 2019 at 04:13:22PM +0300, Sergei Kornilov wrote:
> I my opinion this is not issue, database logs can have sensitive
> data. User queries, for example.  If we not want expose such info -
> it is ok just hide new value from logs with new GUC flag? Or i need
> implement masked conninfo for this purpose?

You have problems with things in this area for any commands logged and
able to show a connection string or a password, which can go down as
well to CREATE/ALTER ROLE or FDWs.  So for the purpose of what's
discussed on this thread it does not sound like a requirement to be
able to hide that.  Role DDLs can take an already-hashed input to
avoid that, still knowing the MD5 hash is sufficient for connection
(not for SCRAM!).  Now for FDWs..
--
Michael

Attachment

Re: allow online change primary_conninfo

From
"andres@anarazel.de"
Date:
Hi,

On 2019-01-31 16:13:22 +0300, Sergei Kornilov wrote:
> Hello
> 
> Yeah, we have no consensus.
> 

Are you planning to update the patch?  Given there's not been much
progress here, I think we ough tot mark the CF entry as returned with
feedback for now.


> Another open question is about logging new primary_conninfo:

> > LOG: parameter "primary_conninfo" changed to "host=/tmp port=5432 password=hoge"
> 
> I my opinion this is not issue, database logs can have sensitive data. User queries, for example.
> If we not want expose such info - it is ok just hide new value from logs with new GUC flag? Or i need implement
maskedconninfo for this purpose?
 

I agree that this doesn't need to be solved as part of this patch. Given
the config is in the conf file, I don't think it's meaningful to hide
this from the log. If necessary one can use client certs, service files,
etc.

Greetings,

Andres Freund


Re: allow online change primary_conninfo

From
Sergei Kornilov
Date:
Hi

> I agree that this doesn't need to be solved as part of this patch.

Thank you!

> Are you planning to update the patch?

Sorry, i was busy last month. But, well, i already did such update:
https://www.postgresql.org/message-id/9653601544523383%40iva8-37fc2ad204cd.qloud-c.yandex.net
v003 version does not change RequestXLogStreaming, not require "Making WAL receiver startup rely on GUC context" patch
anddoes not hide new value from logs.
 

Here is updated version (from v003 patch) with few wording changes suggested by Kyotaro HORIGUCHI and Michael Paquier

regards, Sergei
Attachment

Re: allow online change primary_conninfo

From
Andres Freund
Date:
Hi,

On 2019-02-03 15:33:38 +0300, Sergei Kornilov wrote:
> +/*
> + * Actual processing SIGHUP signal
> + */
> +static void
> +ProcessWalRcvSigHup(void)
> +{
> +    ProcessConfigFile(PGC_SIGHUP);
> +
> +    /*
> +     * If primary_conninfo has been changed while walreceiver is running,
> +     * shut down walreceiver so that a new walreceiver is started and
> +     * initiates replication with the new connection information.
> +     */
> +    if (strcmp(current_conninfo, PrimaryConnInfo) != 0)
> +        ereport(FATAL,
> +                (errcode(ERRCODE_ADMIN_SHUTDOWN),
> +                 errmsg("terminating walreceiver process due to change of primary_conninfo"),
> +                 errdetail("In a moment starts streaming WAL with new configuration.")));
> +
> +    /*
> +     * And the same for primary_slot_name.
> +     */
> +    if (strcmp(current_slotname, PrimarySlotName) != 0)
> +        ereport(FATAL,
> +                (errcode(ERRCODE_ADMIN_SHUTDOWN),
> +                 errmsg("terminating walreceiver process due to change of primary_slot_name"),
> +                 errdetail("In a moment starts streaming WAL with new configuration.")));
> +
> +    XLogWalRcvSendHSFeedback(true);
> +}

I don't quite think this is the right design. IMO the startup process
should signal the walreceiver to shut down, and the wal receiver should
never look at the config. Otherwise there's chances for knowledge of
pg.conf to differ between the processes.

Greetings,

Andres Freund


Re: allow online change primary_conninfo

From
Sergei Kornilov
Date:
Hi

> I don't quite think this is the right design. IMO the startup process
> should signal the walreceiver to shut down, and the wal receiver should
> never look at the config.

Ok, i can rewrite such way. Please check attached version.

> Otherwise there's chances for knowledge of
> pg.conf to differ between the processes.

I still not understood why this:
- is not issue for startup process with all primary_conninfo logic
- but issue for walreceiver with Michael Paquier patch; therefore without any primary_conninfo change logic in
startup.
In both cases primary_conninfo change logic is only in one process.

regards, Sergei
Attachment

Re: allow online change primary_conninfo

From
Michael Paquier
Date:
On Sat, Feb 16, 2019 at 02:50:35PM +0300, Sergei Kornilov wrote:
> +    if ((strcmp(conninfo, PrimaryConnInfo) != 0 ||
> +         strcmp(slotname, PrimarySlotName) != 0) &&
> +        WalRcvRunning())
> +    {
> +        ereport(LOG,
> +                (errmsg("terminating walreceiver process due to change of %s",
> +                        strcmp(conninfo, PrimaryConnInfo) != 0 ? "primary_conninfo" : "primary_slot_name"),
> +                 errdetail("In a moment starts streaming WAL with new configuration.")));
> +
> +        ShutdownWalRcv();
> +        lastSourceFailed = true;
> +    }

If you do that, the startup process would try to jump to a different
source to fetch WAL if archiving is enabled.  Is that really what we
want?  It would be nice to have one message for primary_conninfo being
updated, and one for primary_slot_name so as if both are changed at
the same time the user gets the correct information.

"In a moment starts streaming WAL with new configuration." sounds
strange.  It would be more natural to have something like "The WAL
receiver is going to be restarted with the new configuration", just a
suggestion.
--
Michael

Attachment

Re: allow online change primary_conninfo

From
Sergei Kornilov
Date:
Hi

> If you do that, the startup process would try to jump to a different
> source to fetch WAL if archiving is enabled. Is that really what we
> want?

Yes, similar behavior with exit walreceiver by any reason. Same as in all previous patch versions.
(not sure this can be changed in some small and elegant way)

> It would be nice to have one message for primary_conninfo being
> updated, and one for primary_slot_name so as if both are changed at
> the same time the user gets the correct information.

Hm, maybe even better would be separate message if both settings are changed? Doing this in attached patch.

> "In a moment starts streaming WAL with new configuration." sounds
> strange. It would be more natural to have something like "The WAL
> receiver is going to be restarted with the new configuration", just a
> suggestion.

This phrase was proposed here:
https://www.postgresql.org/message-id/20181213.184233.215171075.horiguchi.kyotaro%40lab.ntt.co.jp
My english is poor, so i added message just as proposed. I replaced to your variant.

regards, Sergei
Attachment

Re: allow online change primary_conninfo

From
Andres Freund
Date:
Hi,

On 2019-02-16 14:50:35 +0300, Sergei Kornilov wrote:
> > I don't quite think this is the right design. IMO the startup process
> > should signal the walreceiver to shut down, and the wal receiver should
> > never look at the config.
> 
> Ok, i can rewrite such way. Please check attached version.
> 
> > Otherwise there's chances for knowledge of
> > pg.conf to differ between the processes.
> 
> I still not understood why this:
> - is not issue for startup process with all primary_conninfo logic
> - but issue for walreceiver with Michael Paquier patch; therefore without any primary_conninfo change logic in
startup.
> In both cases primary_conninfo change logic is only in one process.
> 
> regards, Sergei

For one, it's not ok to just let the startup process think that the
walreceiver failed - that'll make it change source of WAL to the next
available method. Something we definitely do not want, as
restore_command is very commonly slower.

But just as importantly, I think, is that we ought to automate
cluster-wide tasks like failing over more inside postgres. And that's
much harder if different parts of PG, say the startup process and
walreceiver, have different assumptions about the current configuration.



> +void
> +ProcessStartupSigHup(void)
> +{
> +    char       *conninfo = pstrdup(PrimaryConnInfo);
> +    char       *slotname = pstrdup(PrimarySlotName);
> +
> +    ProcessConfigFile(PGC_SIGHUP);
> +
> +    /*
> +     * we need shutdown running walreceiver if replication settings was
> +     * changed. walreceiver will be started with new settings in next
> +     * WaitForWALToBecomeAvailable iteration
> +     */
> +    if ((strcmp(conninfo, PrimaryConnInfo) != 0 ||
> +         strcmp(slotname, PrimarySlotName) != 0) &&
> +        WalRcvRunning())
> +    {
> +        ereport(LOG,
> +                (errmsg("terminating walreceiver process due to change of %s",
> +                        strcmp(conninfo, PrimaryConnInfo) != 0 ? "primary_conninfo" : "primary_slot_name"),
> +                 errdetail("In a moment starts streaming WAL with new configuration.")));
> +
> +        ShutdownWalRcv();
> +        lastSourceFailed = true;
> +    }
> +
> +    pfree(conninfo);
> +    pfree(slotname);
> +}

That'd still switch to a different method, something we do not want...


Greetings,

Andres Freund


Re: allow online change primary_conninfo

From
Sergei Kornilov
Date:
Hi

> For one, it's not ok to just let the startup process think that the
> walreceiver failed - that'll make it change source of WAL to the next
> available method. Something we definitely do not want, as
> restore_command is very commonly slower.

Ok. This was not mentioned before Michael response yesterday. restore_command is much faster compared to database
restart,also switch to a different method was proposed few years ago by Simon Riggs in original change recovery.conf
proposal(
https://www.postgresql.org/message-id/flat/CANP8+jLO5fmfudbB1b1iw3pTdOK1HBM=xMTaRfOa5zpDVcqzew@mail.gmail.com/). I
assumedwe can start with this. Sorry for your wasted time.
 

> That'd still switch to a different method, something we do not want...

Ok, do not want means we do not want. Will try change behavior.

regards, Sergei


Re: allow online change primary_conninfo

From
Michael Paquier
Date:
On Mon, Feb 18, 2019 at 12:06:05AM +0300, Sergei Kornilov wrote:
> Ok. This was not mentioned before Michael response
> yesterday. restore_command is much faster compared to database
> restart [...]

The startup process would not switch back to streaming with archiving
enabled until it has consumed all the segments available.  I recall
David Steele mentioning me that a perl command invocation can take
easily 100ms.  On a system which generates more than 10 segments per
segment, you can guess what happens..  I think that this argument is a
non-starter if we want to make the WAL receiver a maximum available.

>> That'd still switch to a different method, something we do not want...
>
> Ok, do not want means we do not want. Will try change behavior.

Thanks Sergei for considering it.  The code paths touched are
sensitive, so we had better be careful here.
--
Michael

Attachment

Re: allow online change primary_conninfo

From
Sergei Kornilov
Date:
Hello

This might be not the right way, but I can't think of a better way to not switch to a different method than split of
lastSourceFailedprocessing and starting new source. Something like refactoring in first attached patch. I move
RequestXLogStreaminglogic from lastSourceFailed processing and add new pendingRestartSource indicator.
 
Second patch is mostly the same as previous version but uses new pendingRestartSource mechanism instead of
lastSourceFailed.

Thanks in advance!

regards, Sergei
Attachment

Re: allow online change primary_conninfo

From
Michael Paquier
Date:
On Sat, Mar 02, 2019 at 01:49:51PM +0300, Sergei Kornilov wrote:
> This might be not the right way, but I can't think of a better way
> to not switch to a different method than split of lastSourceFailed
> processing and starting new source. Something like refactoring in
> first attached patch. I move RequestXLogStreaming logic from
> lastSourceFailed processing and add new pendingRestartSource
> indicator.

OK.  This needs a rather close lookup, and I am not actually sure that
you even need pendingRestartSource.  Please let me a couple of days
before coming back to you on 0001.

> Second patch is mostly the same as previous version but uses new
> pendingRestartSource mechanism instead of lastSourceFailed.

This, on the other hand, looks like the implementation we are looking
for based on the discussions which happened until now to have the
startup process handle the shutdown and restart of the WAL receiver.
--
Michael

Attachment

Re: Re: allow online change primary_conninfo

From
David Steele
Date:
Hi Michael,

On 3/4/19 7:19 AM, Michael Paquier wrote:
> On Sat, Mar 02, 2019 at 01:49:51PM +0300, Sergei Kornilov wrote:
>> This might be not the right way, but I can't think of a better way
>> to not switch to a different method than split of lastSourceFailed
>> processing and starting new source. Something like refactoring in
>> first attached patch. I move RequestXLogStreaming logic from
>> lastSourceFailed processing and add new pendingRestartSource
>> indicator.
> 
> OK.  This needs a rather close lookup, and I am not actually sure that
> you even need pendingRestartSource.  Please let me a couple of days
> before coming back to you on 0001.
> 
>> Second patch is mostly the same as previous version but uses new
>> pendingRestartSource mechanism instead of lastSourceFailed.
> 
> This, on the other hand, looks like the implementation we are looking
> for based on the discussions which happened until now to have the
> startup process handle the shutdown and restart of the WAL receiver.

Do you know when you'll have a chance to revisit this?

Regards,
-- 
-David
david@pgmasters.net


Re: allow online change primary_conninfo

From
Andres Freund
Date:
Hi,

On 2019-03-25 12:32:21 +0400, David Steele wrote:
> On 3/4/19 7:19 AM, Michael Paquier wrote:
> > On Sat, Mar 02, 2019 at 01:49:51PM +0300, Sergei Kornilov wrote:
> > > This might be not the right way, but I can't think of a better way
> > > to not switch to a different method than split of lastSourceFailed
> > > processing and starting new source. Something like refactoring in
> > > first attached patch. I move RequestXLogStreaming logic from
> > > lastSourceFailed processing and add new pendingRestartSource
> > > indicator.
> > 
> > OK.  This needs a rather close lookup, and I am not actually sure that
> > you even need pendingRestartSource.  Please let me a couple of days
> > before coming back to you on 0001.
> > 
> > > Second patch is mostly the same as previous version but uses new
> > > pendingRestartSource mechanism instead of lastSourceFailed.
> > 
> > This, on the other hand, looks like the implementation we are looking
> > for based on the discussions which happened until now to have the
> > startup process handle the shutdown and restart of the WAL receiver.
> 
> Do you know when you'll have a chance to revisit this?

I think we unfortunately got to mark this as returned with
feedback. I've not done so, but just switched the entry to waiting on
author.

Greetings,

Andres Freund



Re: allow online change primary_conninfo

From
Sergei Kornilov
Date:
Hi

> I think we unfortunately got to mark this as returned with
> feedback. I've not done so, but just switched the entry to waiting on
> author.

Why returned with feedback? Why waiting on author? I didn't receive a feedback for latest published patch version. What
canI do as author? Patch still applied (thanks cf bot)
 
Obviously too late for pg12, but why can not be target pg13 and therefore be moved to next CF?

regards, Sergei



Re: allow online change primary_conninfo

From
Andres Freund
Date:
Hi,

On 2019-04-03 23:49:54 +0300, Sergei Kornilov wrote:
> > I think we unfortunately got to mark this as returned with
> > feedback. I've not done so, but just switched the entry to waiting on
> > author.
> 
> Why returned with feedback? Why waiting on author? I didn't receive a
> feedback for latest published patch version. What can I do as author?
> Patch still applied (thanks cf bot) Obviously too late for pg12, but
> why can not be target pg13 and therefore be moved to next CF?

Well, my impression was that the patch didn't yet really address the
feedback. And thus should have been marked as waiting on author for a
while.

Greetings,

Andres Freund



Re: allow online change primary_conninfo

From
Sergei Kornilov
Date:
Hi 

>>  > I think we unfortunately got to mark this as returned with
>>  > feedback. I've not done so, but just switched the entry to waiting on
>>  > author.
>>
>>  Why returned with feedback? Why waiting on author? I didn't receive a
>>  feedback for latest published patch version. What can I do as author?
>>  Patch still applied (thanks cf bot) Obviously too late for pg12, but
>>  why can not be target pg13 and therefore be moved to next CF?
>
> Well, my impression was that the patch didn't yet really address the
> feedback. And thus should have been marked as waiting on author for a
> while.

Not agree. Latest patch version perform walreceiver restart without switch to a different method as discussed. Here is
norace condition between startup process and walreceiver because conninfo passed via WalRcvData struct as currently. I
misssomething important?
 
Michael Paquier had no possibility to review latest implementation, but did not say this is totally wrong, just asked
waita rather close lookup.
 

Of cource we can close this cf entry. I would be happy if someone else post proper implementation. And I can rework my
implementationagain, but I don’t know how the correct implementation should look or why latest implementation is
wrong.

regards, Sergei



Re: allow online change primary_conninfo

From
Michael Paquier
Date:
On Thu, Apr 04, 2019 at 12:27:55AM +0300, Sergei Kornilov wrote:
> Not agree. Latest patch version perform walreceiver restart without
> switch to a different method as discussed. Here is no race condition
> between startup process and walreceiver because conninfo passed via
> WalRcvData struct as currently. I miss something important?
> Michael Paquier had no possibility to review latest implementation,
> but did not say this is totally wrong, just asked wait a rather
> close lookup.

I agree with Sergei's statement here.  He has sent a patch for review,
which I have looked up a bit, but not enough to provide a full review
unfortunately, and this even if I was listed as a reviewer of it.  So
if somebody is to blame here, that's me.

> Of cource we can close this cf entry. I would be happy if someone
> else post proper implementation. And I can rework my implementation
> again, but I don’t know how the correct implementation should look
> or why latest implementation is wrong.

Moving this patch to next CF is fine.
--
Michael

Attachment

Re: allow online change primary_conninfo

From
Andres Freund
Date:
On 2019-04-04 11:06:05 +0900, Michael Paquier wrote:
> On Thu, Apr 04, 2019 at 12:27:55AM +0300, Sergei Kornilov wrote:
> > Not agree. Latest patch version perform walreceiver restart without
> > switch to a different method as discussed. Here is no race condition
> > between startup process and walreceiver because conninfo passed via
> > WalRcvData struct as currently. I miss something important? 
> > Michael Paquier had no possibility to review latest implementation,
> > but did not say this is totally wrong, just asked wait a rather
> > close lookup.
> 
> I agree with Sergei's statement here.  He has sent a patch for review,
> which I have looked up a bit, but not enough to provide a full review
> unfortunately, and this even if I was listed as a reviewer of it.  So
> if somebody is to blame here, that's me.

> > Of cource we can close this cf entry. I would be happy if someone
> > else post proper implementation. And I can rework my implementation
> > again, but I don’t know how the correct implementation should look
> > or why latest implementation is wrong. 
> 
> Moving this patch to next CF is fine.

Cool, sorry for the misunderstanding then.



Re: allow online change primary_conninfo

From
Sergei Kornilov
Date:
Hello

Updated version attached. Merge conflict was about tests count in 001_stream_rep.pl. Nothing else was changed. My
approachcan be still incorrect, any redesign ideas are welcome. Thanks in advance!
 

regards, Sergei
Attachment

Re: allow online change primary_conninfo

From
Michael Paquier
Date:
On Mon, Jul 01, 2019 at 02:33:39PM +0300, Sergei Kornilov wrote:
> Updated version attached. Merge conflict was about tests count in
> 001_stream_rep.pl. Nothing else was changed. My approach can be
> still incorrect, any redesign ideas are welcome. Thanks in advance!

It has been some time, and I am finally catching up with this patch.

+         <para>
+          WAL receiver will be restarted after <varname>primary_slot_name</varname>
+          was changed.
          </para>
The sentence sounds strange.  Here is a suggestion:
The WAL receiver is restarted after an update of primary_slot_name (or
primary_conninfo).

The comment at the top of the call of ProcessStartupSigHup() in
HandleStartupProcInterrupts() needs to be updated as it mentions a
configuration file re-read, but that's not the case anymore.

pendingRestartSource's name does not represent what it does, as it is
only associated with the restart of a WAL receiver when in streaming
state, and that's a no-op for the archive mode and the local mode.

So, the patch splits the steps taken when checking for a WAL source by
adding an extra step after the failure handling that you are calling
the restart step.  When a failure happens for the stream mode
(shutdown of WAL receiver, promotion. etc), there is a switch to the
archive mode, and nothing is changed in this case in your patch.  So
when shutting down the WAL receiver after a parameter change, what
happens is that the startup process waits for retrieve_retry_interval
before moving back to the archive mode.  Only after scanning again the
archives do we restart a new WAL receiver.  However, if the restart of
the WAL receiver is planned because of an update of primary_conninfo
(or slot), shouldn't the follow-up mode be XLOG_FROM_STREAM without
waiting for wal_retrieve_retry_interval ms for extra WAL to become
available?
--
Michael

Attachment

Re: allow online change primary_conninfo

From
Thomas Munro
Date:
On Tue, Jul 30, 2019 at 6:42 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Mon, Jul 01, 2019 at 02:33:39PM +0300, Sergei Kornilov wrote:
> > Updated version attached. Merge conflict was about tests count in
> > 001_stream_rep.pl. Nothing else was changed. My approach can be
> > still incorrect, any redesign ideas are welcome. Thanks in advance!
>
> [review]

Unfortunately this comes right that the end of the CF, but the good
news is that there is another one just around the corner.  Moved to
September.

-- 
Thomas Munro
https://enterprisedb.com



Re: allow online change primary_conninfo

From
Michael Paquier
Date:
On Thu, Aug 01, 2019 at 09:06:59PM +1200, Thomas Munro wrote:
> Unfortunately this comes right that the end of the CF, but the good
> news is that there is another one just around the corner.  Moved to
> September.

Moving it to the next CF is fine, thanks.  The author had no time to
reply since my last lookup.
--
Michael

Attachment

Re: allow online change primary_conninfo

From
Sergei Kornilov
Date:
Hello

> It has been some time, and I am finally catching up with this patch.

Thank you!

> + <para>
> + WAL receiver will be restarted after <varname>primary_slot_name</varname>
> + was changed.
>           </para>
> The sentence sounds strange. Here is a suggestion:
> The WAL receiver is restarted after an update of primary_slot_name (or
> primary_conninfo).
>
> The comment at the top of the call of ProcessStartupSigHup() in
> HandleStartupProcInterrupts() needs to be updated as it mentions a
> configuration file re-read, but that's not the case anymore.

I agree. But I want to know if this is an acceptable idea in general, before spending a few more months polishing the
texton something we don’t want.
 

> pendingRestartSource's name does not represent what it does, as it is
> only associated with the restart of a WAL receiver when in streaming
> state, and that's a no-op for the archive mode and the local mode.

Currently we have only one source with such operation, so I can name variable pendingRestartWalreceiver and replace
correspondingswitch to compact condition. If someday we need additional actions for another wal source - we can replace
thiscode. Looks like some form of premature optimization from me
 

> However, if the restart of
> the WAL receiver is planned because of an update of primary_conninfo
> (or slot), shouldn't the follow-up mode be XLOG_FROM_STREAM without
> waiting for wal_retrieve_retry_interval ms for extra WAL to become
> available?

Hmm. Standby with log_min_messages = debug2 and default wal_retrieve_retry_interval = 5s gives me:

2019-08-01 12:38:31.255 MSK 15707 @ from  [vxid: txid:0] [] DEBUG:  sendtime 2019-08-01 12:38:31.254905+03 receipttime
2019-08-0112:38:31.254986+03 replication apply delay 0 ms transfer latency 0 ms
 
2019-08-01 12:38:31.255 MSK 15707 @ from  [vxid: txid:0] [] DEBUG:  sending write 0/3023FA8 flush 0/3023F38 apply
0/3023F38
2019-08-01 12:38:31.262 MSK 15707 @ from  [vxid: txid:0] [] DEBUG:  sending write 0/3023FA8 flush 0/3023FA8 apply
0/3023F38
2019-08-01 12:38:31.262 MSK 15707 @ from  [vxid: txid:0] [] DEBUG:  sending write 0/3023FA8 flush 0/3023FA8 apply
0/3023FA8
2019-08-01 12:38:31.457 MSK 15699 @ from  [vxid: txid:0] [] LOG:  received SIGHUP, reloading configuration files
2019-08-01 12:38:31.458 MSK 15699 @ from  [vxid: txid:0] [] LOG:  parameter "primary_conninfo" changed to " host='/tmp'
port=5555"
2019-08-01 12:38:31.459 MSK 15700 @ from  [vxid:1/0 txid:0] [] LOG:  The WAL receiver is going to be restarted due to
changeof primary_conninfo
 
2019-08-01 12:38:31.459 MSK 15703 @ from  [vxid: txid:0] [] DEBUG:  checkpointer updated shared memory configuration
values
2019-08-01 12:38:31.459 MSK 15707 @ from  [vxid: txid:0] [] FATAL:  terminating walreceiver process due to
administratorcommand
 
2019-08-01 12:38:31.463 MSK 15724 @ from  [vxid: txid:0] [] LOG:  started streaming WAL from primary at 0/3000000 on
timeline1
 
2019-08-01 12:38:31.464 MSK 15724 @ from  [vxid: txid:0] [] DEBUG:  sendtime 2019-08-01 12:38:31.463954+03 receipttime
2019-08-0112:38:31.464228+03 replication apply delay 0 ms transfer latency 0 ms
 
2019-08-01 12:38:31.464 MSK 15724 @ from  [vxid: txid:0] [] DEBUG:  sendtime 2019-08-01 12:38:31.464068+03 receipttime
2019-08-0112:38:31.464375+03 replication apply delay 0 ms transfer latency 0 ms
 
2019-08-01 12:38:31.464 MSK 15724 @ from  [vxid: txid:0] [] DEBUG:  sending write 0/3023FA8 flush 0/3023FA8 apply
0/3023FA8

No source switch, no wal_retrieve_retry_interval waiting. (wal writes on primary by \watch 0.3 query)
RequestXLogStreaming set walRcvState to WALRCV_STARTING - correct state for WalRcvStreaming and therefore we have no
lastSourceFailed.

regards, Sergei



Re: allow online change primary_conninfo

From
Sergei Kornilov
Date:
Hello

Updated patch attached. (also I merged into one file)

> + <para>
> + WAL receiver will be restarted after <varname>primary_slot_name</varname>
> + was changed.
>           </para>
> The sentence sounds strange. Here is a suggestion:
> The WAL receiver is restarted after an update of primary_slot_name (or
> primary_conninfo).

Changed.

> The comment at the top of the call of ProcessStartupSigHup() in
> HandleStartupProcInterrupts() needs to be updated as it mentions a
> configuration file re-read, but that's not the case anymore.

Changed to "Process any requests or signals received recently." like in other places, e.g. syslogger

> pendingRestartSource's name does not represent what it does, as it is
> only associated with the restart of a WAL receiver when in streaming
> state, and that's a no-op for the archive mode and the local mode.

I renamed to pendingWalRcvRestart and replaced switch with simple condition.

> So when shutting down the WAL receiver after a parameter change, what
> happens is that the startup process waits for retrieve_retry_interval
> before moving back to the archive mode. Only after scanning again the
> archives do we restart a new WAL receiver.

As I answered earlier, here is no switch to archive or wal_retrieve_retry_interval waiting in my patch. I recheck on
currentrevision too:
 

2019-08-28 12:16:27.295 MSK 11180 @ from  [vxid: txid:0] [] DEBUG:  sending write 0/30346C8 flush 0/30346C8 apply
0/30346C8
2019-08-28 12:16:27.493 MSK 11172 @ from  [vxid: txid:0] [] LOG:  received SIGHUP, reloading configuration files
2019-08-28 12:16:27.494 MSK 11172 @ from  [vxid: txid:0] [] LOG:  parameter "primary_conninfo" changed to "host='/tmp'
port=5555sslmode=disable sslcompression=0 gssencmode=disable target_session_attrs=any"
 
2019-08-28 12:16:27.496 MSK 11173 @ from  [vxid:1/0 txid:0] [] LOG:  The WAL receiver is going to be restarted due to
changeof primary_conninfo
 
2019-08-28 12:16:27.496 MSK 11176 @ from  [vxid: txid:0] [] DEBUG:  checkpointer updated shared memory configuration
values
2019-08-28 12:16:27.496 MSK 11180 @ from  [vxid: txid:0] [] FATAL:  terminating walreceiver process due to
administratorcommand
 
2019-08-28 12:16:27.500 MSK 11335 @ from  [vxid: txid:0] [] LOG:  started streaming WAL from primary at 0/3000000 on
timeline1
 
2019-08-28 12:16:27.500 MSK 11335 @ from  [vxid: txid:0] [] DEBUG:  sendtime 2019-08-28 12:16:27.50037+03 receipttime
2019-08-2812:16:27.500821+03 replication apply delay 0 ms transfer latency 0 ms
 

No "DEBUG:  switched WAL source from stream to archive after failure" messages, no time difference
(wal_retrieve_retry_interval= 5s).
 

regards, Sergei
Attachment

Re: allow online change primary_conninfo

From
Fujii Masao
Date:
On Wed, Aug 28, 2019 at 6:50 PM Sergei Kornilov <sk@zsrv.org> wrote:
>
> Hello
>
> Updated patch attached. (also I merged into one file)

Thanks for updating the patch! Here are some comments from me.

    #primary_conninfo = '' # connection string to sending server
    # (change requires restart)
    #primary_slot_name = '' # replication slot on sending server
    # (change requires restart)

ISTM that you need to update the above parts in postgresql.conf.sample.


    /* we don't expect primary_conninfo to change */

ISTM that you need to update the above comment in walreceiver.c.


+         <para>
+          The WAL receiver is restarted after an update of
<varname>primary_conninfo</varname>.
+         </para>

If primary_conninfo is set to an empty string, walreceiver just shuts down,
not restarts. The above description in the doc is not always true.
So I'm thinking something like the following description is better.
Thought?

    If primary_conninfo is changed while WAL receiver is running,
    the WAL receiver shuts down and then restarts with new setting,
    except when primary_conninfo is an empty string.


There is the case where walreceiver doesn't shut down immediately
after the change of primary_conninfo. If the change happens while
the startup process in paused state (e.g., by pg_wal_replay_pause(),
recovery_min_apply_delay, recovery conflict, etc), the startup
process tries to terminate walreceiver after it gets out of such state.
Shouldn't this fact be documented as a note?

Regards,

-- 
Fujii Masao



Re: allow online change primary_conninfo

From
Sergei Kornilov
Date:
Hello

Thank you for review!

> ISTM that you need to update the above parts in postgresql.conf.sample.

Good catch, I forgot about conf sample.

> ISTM that you need to update the above comment in walreceiver.c.

Changed

> If primary_conninfo is set to an empty string, walreceiver just shuts down,
> not restarts. The above description in the doc is not always true.
> So I'm thinking something like the following description is better.
> Thought?
>
>     If primary_conninfo is changed while WAL receiver is running,
>     the WAL receiver shuts down and then restarts with new setting,
>     except when primary_conninfo is an empty string.

Ok, changed.
I leave primary_slot_name description as before.

> There is the case where walreceiver doesn't shut down immediately
> after the change of primary_conninfo. If the change happens while
> the startup process in paused state (e.g., by pg_wal_replay_pause(),
> recovery_min_apply_delay, recovery conflict, etc), the startup
> process tries to terminate walreceiver after it gets out of such state.
> Shouldn't this fact be documented as a note?

Hmm. Is somewhere documented that walreceiver will receive WAL while the startup process in paused state?
(didn't add such note in current version)

regards, Sergei
Attachment

Re: allow online change primary_conninfo

From
Kyotaro Horiguchi
Date:
Hello. I looked this and have some comments.

At Wed, 28 Aug 2019 12:49:46 +0300, Sergei Kornilov <sk@zsrv.org> wrote in
<34084371566985786@sas1-0a6c2e2b59d7.qloud-c.yandex.net>
sk> Hello
sk>
sk> Updated patch attached. (also I merged into one file)
sk>
sk> > + <para>
sk> > + WAL receiver will be restarted after <varname>primary_slot_name</varname>
sk> > + was changed.
sk> >           </para>
sk> > The sentence sounds strange. Here is a suggestion:
sk> > The WAL receiver is restarted after an update of primary_slot_name (or
sk> > primary_conninfo).
sk>
sk> Changed.

-          This parameter can only be set at server start.
+          This parameter can only be set in the <filename>postgresql.conf</filename>
+          file or on the server command line.

I'm not sure it's good to change the context of this
description. This was mentioning that changing of this parameter
requires server (re)start. So if we want to be on the same
context after rewriting, it would be like "This parameter can be
set any time and causes WAL receiver restart with the new setting
if the server is in standby mode."

And If I'm not missing something, I don't find an (explict)
paramter of postmaster for setting primary_conninfo.



sk> > The comment at the top of the call of ProcessStartupSigHup() in
sk> > HandleStartupProcInterrupts() needs to be updated as it mentions a
sk> > configuration file re-read, but that's not the case anymore.
sk>
sk> Changed to "Process any requests or signals received recently." like in other places, e.g. syslogger
sk>
sk> > pendingRestartSource's name does not represent what it does, as it is
sk> > only associated with the restart of a WAL receiver when in streaming
sk> > state, and that's a no-op for the archive mode and the local mode.
sk>
sk> I renamed to pendingWalRcvRestart and replaced switch with simple condition.
sk>
sk> > So when shutting down the WAL receiver after a parameter change, what
sk> > happens is that the startup process waits for retrieve_retry_interval
sk> > before moving back to the archive mode. Only after scanning again the
sk> > archives do we restart a new WAL receiver.
sk>
sk> As I answered earlier, here is no switch to archive or wal_retrieve_retry_interval waiting in my patch. I recheck
oncurrent revision too: 

@@ -11787,48 +11793,6 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
                     if (!StandbyMode)
                         return false;

-                    /*
-                     * If primary_conninfo is set, launch walreceiver to try
-                     * to stream the missing WAL.
-                     *
-                     * If fetching_ckpt is true, RecPtr points to the initial
-                     * checkpoint location. In that case, we use RedoStartLSN

Mentioning code, the movement of the code block makes the
surrounding swtich almost useless and the structure and the
result looks somewhat strange.

Couldn't we do the same thing by just skipping the wait and
setting lastSourceFailed to true in the case of intentional
walreceiver restart?

The attached is an incomplete (fraction) patch to sketch what is
in my mind. With something like this and making
ProcessStartupSigHup kill walreceiver would work.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6876537b62..61b235f8f7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -11881,10 +11881,12 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
                      * obtaining the requested WAL. We're going to loop back
                      * and retry from the archive, but if it hasn't been long
                      * since last attempt, sleep wal_retrieve_retry_interval
-                     * milliseconds to avoid busy-waiting.
+                     * milliseconds to avoid busy-waiting. We don't wait if
+                     * explicitly requested to restart.
                      */
                     now = GetCurrentTimestamp();
-                    if (!TimestampDifferenceExceeds(last_fail_time, now,
+                    if (!pendingWalRcvRestart &&
+                        !TimestampDifferenceExceeds(last_fail_time, now,
                                                     wal_retrieve_retry_interval))
                     {
                         long        secs,
@@ -11905,6 +11907,17 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
                     }
                     last_fail_time = now;
                     currentSource = XLOG_FROM_ARCHIVE;
+
+                    /*
+                     * If wal receiver is requested to restart, we skip the
+                     * next XLOG_FROM_ARCHIVE to immediately starting it.
+                     */
+                    if (pendingWalRcvRestart)
+                    {
+                        lastSourceFailed = true;
+                        continue;
+                    }
+
                     break;
 
                 default:

Re: allow online change primary_conninfo

From
Sergei Kornilov
Date:
Hello

Thank you for review!

> - This parameter can only be set at server start.
> + This parameter can only be set in the <filename>postgresql.conf</filename>
> + file or on the server command line.
>
> I'm not sure it's good to change the context of this
> description. This was mentioning that changing of this parameter
> requires server (re)start. So if we want to be on the same
> context after rewriting, it would be like "This parameter can be
> set any time and causes WAL receiver restart with the new setting
> if the server is in standby mode."

Was written such way after this review:
https://www.postgresql.org/message-id/20181125214313.lydvmrraqjfrb3s2%40alap3.anarazel.de

> And If I'm not missing something, I don't find an (explict)
> paramter of postmaster for setting primary_conninfo.

Well, we have common -c option: -c name=value

> Couldn't we do the same thing by just skipping the wait and
> setting lastSourceFailed to true in the case of intentional
> walreceiver restart?

Yes, it's possible. Let's see... Done in attached variant.
We need check pendingWalRcvRestart before rescanLatestTimeLine lines.

regards, Sergei
Attachment

Re: allow online change primary_conninfo

From
Andres Freund
Date:
Hi,

On 2019-09-19 17:46:06 +0300, Sergei Kornilov wrote:

>           <para>
> -          This parameter can only be set at server start.
> +          This parameter can only be set in the <filename>postgresql.conf</filename>
> +          file or on the server command line.
>            This setting has no effect if the server is not in standby mode.
>           </para>
> +         <para>
> +          If <varname>primary_conninfo</varname> is changed while WAL receiver is running,
> +          the WAL receiver shuts down and then restarts with new setting,
> +          except when primary_conninfo is an empty string.
> +         </para>

From the sentence structure it's not clear that "except when
primary_conninfo is an empty string" only applies to "and then restarts
with new setting".


> +/*
> + * Need for restart running WalReceiver due the configuration change.
> + * Suitable only for XLOG_FROM_STREAM source
> + */
> +static bool pendingWalRcvRestart = false;

s/due the/due to a/ (or even "due to the").


> @@ -11862,6 +11869,17 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
>                      if (WalRcvStreaming())
>                          ShutdownWalRcv();
>  
> +                    /*
> +                     * If wal receiver is requested to restart, we skip the
> +                     * next XLOG_FROM_ARCHIVE to immediately starting it.
> +                     */
> +                    if (pendingWalRcvRestart)
> +                    {
> +                        lastSourceFailed = true;
> +                        currentSource = XLOG_FROM_ARCHIVE;
> +                        continue;
> +                    }

I can't parse that comment. What does "skipping to starting" mean? I
assume it's just about avoiding wal_retrieve_retry_interval, but I think
the comment ought to be rephrased.

Also, should we really check this before scanning for new timelines?

Why is it the right thing to change to XLOG_FROM_ARCHIVE when we're just
restarting walreceiver? The server might unnecessarily get stuck in
archive based recovery for a long time this way? It seems to me that
we'd need to actually trigger RequestXLogStreaming() in this case.

Greetings,

Andres Freund



Re: allow online change primary_conninfo

From
Sergei Kornilov
Date:
Hello

Thank you for review! Can you please also check v4 version? v5 implements design suggested by Kyotaro Horiguchi-san,
whilev4 has another design. Which one do you prefer? Or are both wrong?
 

> I can't parse that comment. What does "skipping to starting" mean? I
> assume it's just about avoiding wal_retrieve_retry_interval, but I think
> the comment ought to be rephrased.

Design idea is to rewrite current state from working XLOG_FROM_STREAM to failed XLOG_FROM_ARCHIVE (without actually try
thismethod on this iteration) and immediately go to next iteration to advance the state machine. Next iteration after
failedarchive recovery is walreceiver. So walreceiver will be stopped just before this lines and started on next
iteration.Walreceiver will be restarted, we do not call restore_command
 

> Also, should we really check this before scanning for new timelines?

Hmm, on the next day... No, this is not really necessary.

> Why is it the right thing to change to XLOG_FROM_ARCHIVE when we're just
> restarting walreceiver? The server might unnecessarily get stuck in
> archive based recovery for a long time this way? It seems to me that
> we'd need to actually trigger RequestXLogStreaming() in this case.

I hope I clarified this in design idea description above.

Thank you!

regards, Sergei



Re: allow online change primary_conninfo

From
Kyotaro Horiguchi
Date:
Hello.

At Sat, 21 Sep 2019 13:06:25 +0300, Sergei Kornilov <sk@zsrv.org> wrote in
<41171569060385@myt5-b646bde4b8f3.qloud-c.yandex.net>
> Hello
> 
> Thank you for review! Can you please also check v4 version? v5 implements design suggested by Kyotaro Horiguchi-san,
whilev4 has another design. Which one do you prefer? Or are both wrong?
 
> 
> > I can't parse that comment. What does "skipping to starting" mean? I
> > assume it's just about avoiding wal_retrieve_retry_interval, but I think
> > the comment ought to be rephrased.
> 
> Design idea is to rewrite current state from working XLOG_FROM_STREAM to failed XLOG_FROM_ARCHIVE (without actually
trythis method on this iteration) and immediately go to next iteration to advance the state machine. Next iteration
afterfailed archive recovery is walreceiver. So walreceiver will be stopped just before this lines and started on next
iteration.Walreceiver will be restarted, we do not call restore_command
 

Sorry, it's my bad. It meant "If wal receiver is requested to
restart, change state to XLOG_FROM_STREAM immediately skipping
the next XLOG_FROM_ARCHIVE state.".

> > Also, should we really check this before scanning for new timelines?
> 
> Hmm, on the next day... No, this is not really necessary.

No problem when timeline is not changed when explicit restart of
wal receiver.  But if it happened just after the standby received
new hisotry file, we suffer an extra restart of wal receiver. It
seems better that we check that in the case.

> > Why is it the right thing to change to XLOG_FROM_ARCHIVE when we're just
> > restarting walreceiver? The server might unnecessarily get stuck in
> > archive based recovery for a long time this way? It seems to me that
> > we'd need to actually trigger RequestXLogStreaming() in this case.
> 
> I hope I clarified this in design idea description above.

My suggestion was just to pretend that the next XLOG_FROM_STREAM
failed, but the outcome actually looks wierd as Andres commented.

I think that comes from the structure of the state machine. WAL
receiver is started not at the beginning of XLOG_FROM_STREAM
state but at the end of XLOG_FROM_ARCHIVE. So we need to switch
once to XLOG_FROM_ARCHIVE in order to start wal receiver.

So, I'd like to propose to move the stuff to the second switch().
(See the attached incomplete patch.) This is rather similar to
Sergei's previous proposal, but the structure of the state
machine is kept.

Some other comments are below.

In ProcessStartupSigHup, conninfo and slotname don't need to be
retained until the end of the function.

The log message in the function seems to be too detailed.  On the
other hand, if we changed primary_conninfo to '' (stop) or vise
versa (start), the message (restart) looks strange.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6c69eb6dd7..dba0042db6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -11755,12 +11755,15 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
     for (;;)
     {
         int            oldSource = currentSource;
+        bool        start_wal_receiver = false;
 
         /*
-         * First check if we failed to read from the current source, and
+         * First check if we failed to read from the current source or if
+         * we want to restart wal receiver, and
          * advance the state machine if so. The failure to read might've
          * happened outside this function, e.g when a CRC check fails on a
          * record, or within this loop.
+         * Restart wal receiver if explicitly requested, too.
          */
         if (lastSourceFailed)
         {
@@ -11788,53 +11791,12 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
                     if (!StandbyMode)
                         return false;
 
-                    /*
-                     * If primary_conninfo is set, launch walreceiver to try
-                     * to stream the missing WAL.
-                     *
-                     * If fetching_ckpt is true, RecPtr points to the initial
-                     * checkpoint location. In that case, we use RedoStartLSN
-                     * as the streaming start position instead of RecPtr, so
-                     * that when we later jump backwards to start redo at
-                     * RedoStartLSN, we will have the logs streamed already.
-                     */
-                    if (PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0)
-                    {
-                        XLogRecPtr    ptr;
-                        TimeLineID    tli;
-
-                        if (fetching_ckpt)
-                        {
-                            ptr = RedoStartLSN;
-                            tli = ControlFile->checkPointCopy.ThisTimeLineID;
-                        }
-                        else
-                        {
-                            ptr = RecPtr;
-
-                            /*
-                             * Use the record begin position to determine the
-                             * TLI, rather than the position we're reading.
-                             */
-                            tli = tliOfPointInHistory(tliRecPtr, expectedTLEs);
-
-                            if (curFileTLI > 0 && tli < curFileTLI)
-                                elog(ERROR, "according to history file, WAL location %X/%X belongs to timeline %u, but
previousrecovered WAL file came from timeline %u",
 
-                                     (uint32) (tliRecPtr >> 32),
-                                     (uint32) tliRecPtr,
-                                     tli, curFileTLI);
-                        }
-                        curFileTLI = tli;
-                        RequestXLogStreaming(tli, ptr, PrimaryConnInfo,
-                                             PrimarySlotName);
-                        receivedUpto = 0;
-                    }
-
                     /*
                      * Move to XLOG_FROM_STREAM state in either case. We'll
                      * get immediate failure if we didn't launch walreceiver,
                      * and move on to the next state.
                      */
+                    start_wal_receiver = true;
                     currentSource = XLOG_FROM_STREAM;
                     break;
 
@@ -11864,8 +11826,8 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
                         ShutdownWalRcv();
 
                     /*
-                     * Before we sleep, re-scan for possible new timelines if
-                     * we were requested to recover to the latest timeline.
+                     * Re-scan for possible new timelines if we were requested
+                     * to recover to the latest timeline.
                      */
                     if (recoveryTargetTimeLineGoal == RECOVERY_TARGET_TIMELINE_LATEST)
                     {
@@ -11929,8 +11891,8 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
                  lastSourceFailed ? "failure" : "success");
 
         /*
-         * We've now handled possible failure. Try to read from the chosen
-         * source.
+         * We've now handled possible failure and configuration change. Try to
+         * read from the chosen source.
          */
         lastSourceFailed = false;
 
@@ -11969,9 +11931,71 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
                     bool        havedata;
 
                     /*
-                     * Check if WAL receiver is still active.
+                     * shutdown wal receiver if restart is requested.
                      */
-                    if (!WalRcvStreaming())
+                    if (!start_wal_receiver && pendingWalRcvRestart)
+                    {
+                        if (WalRcvRunning())
+                            ShutdownWalRcv();
+
+                        /*
+                         * Re-scan for possible new timelines if we were
+                         * requested to recover to the latest timeline.
+                         */
+                        if (recoveryTargetTimeLineGoal ==
+                            RECOVERY_TARGET_TIMELINE_LATEST)
+                            rescanLatestTimeLine();
+                        
+                        start_wal_receiver = true;
+                    }
+                    pendingWalRcvRestart = false;
+
+                    /*
+                     * Launch walreceiver if needed.
+                     *
+                     * If fetching_ckpt is true, RecPtr points to the initial
+                     * checkpoint location. In that case, we use RedoStartLSN
+                     * as the streaming start position instead of RecPtr, so
+                     * that when we later jump backwards to start redo at
+                     * RedoStartLSN, we will have the logs streamed already.
+                     */
+                    if (start_wal_receiver &&
+                        PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0)
+                    {
+                        XLogRecPtr    ptr;
+                        TimeLineID    tli;
+
+                        if (fetching_ckpt)
+                        {
+                            ptr = RedoStartLSN;
+                            tli = ControlFile->checkPointCopy.ThisTimeLineID;
+                        }
+                        else
+                        {
+                            ptr = RecPtr;
+
+                            /*
+                             * Use the record begin position to determine the
+                             * TLI, rather than the position we're reading.
+                             */
+                            tli = tliOfPointInHistory(tliRecPtr, expectedTLEs);
+
+                            if (curFileTLI > 0 && tli < curFileTLI)
+                                elog(ERROR, "according to history file, WAL location %X/%X belongs to timeline %u, but
previousrecovered WAL file came from timeline %u",
 
+                                     (uint32) (tliRecPtr >> 32),
+                                     (uint32) tliRecPtr,
+                                     tli, curFileTLI);
+                        }
+                        curFileTLI = tli;
+                        RequestXLogStreaming(tli, ptr, PrimaryConnInfo,
+                                             PrimarySlotName);
+                        receivedUpto = 0;
+                    }
+
+                    /*
+                     * Else, check if WAL receiver is still active.
+                     */
+                    else if (!WalRcvStreaming())
                     {
                         lastSourceFailed = true;
                         break;

Re: allow online change primary_conninfo

From
Sergei Kornilov
Date:
Hello

> So, I'd like to propose to move the stuff to the second switch().
> (See the attached incomplete patch.) This is rather similar to
> Sergei's previous proposal, but the structure of the state
> machine is kept.

Very similar to my v4 proposal (also move RequestXLogStreaming call), but closer to currentSource reading. No
objectionsfrom me, attached patch is changed this way.
 
I renamed start_wal_receiver to startWalReceiver - this style looks more consistent to near code.

> +                    /*
> +                     * Else, check if WAL receiver is still active.
> +                     */
> +                    else if (!WalRcvStreaming())

I think we still need wait WalRcvStreaming after RequestXLogStreaming call. So I remove else branch and leave separate
condition.

> In ProcessStartupSigHup, conninfo and slotname don't need to be
> retained until the end of the function.

Agreed, I move pfree

> The log message in the function seems to be too detailed. On the
> other hand, if we changed primary_conninfo to '' (stop) or vise
> versa (start), the message (restart) looks strange.

I have no strong opinion here. These messages was changed many times during this thread lifetime, can be changed
anytime.I think this is not issue since we have no consensus about overall design.
 
Write detailed messages was proposed here: https://www.postgresql.org/message-id/20190216151025.GJ2240%40paquier.xyz

> or vise versa (start)

I explicitly check currentSource and WalRcvRunning, so we have no such messages if user had no walreceiver before.

regards, Sergei
Attachment

Re: allow online change primary_conninfo

From
Tomas Vondra
Date:
Hi,

I'm a bit confused by the status of this patch - it's marked as Waiting
on Author (since last week), but that status was set by Sergei Kornilov
who is also the patch author. And there have been no messages since the
beginning of 2019-11 CF.

So what's the current status of this patch?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: allow online change primary_conninfo

From
Sergei Kornilov
Date:
Hello

"Waiting on Author" is the right status. I found logical conflict with recently added wal_receiver_create_temp_slot GUC
andmy tests are currently broken. Will fix it and post new version.
 

PS: also, I surprised why it's ok for wal_receiver_create_temp_slot to be PGC_SIGHUP and ignore change of this setting
untilwalreceiver will reconnect by unrelated reason. I means walreceiver does nothing special on SIGHUP. In common case
changeof wal_receiver_create_temp_slot setting will have effect only during restart of walreceiver process. And
thereforewe will switch to archive recovery. But such design was strongly rejected for my patch year ago.
 

regards, Sergei



Re: allow online change primary_conninfo

From
Michael Paquier
Date:
On Tue, Jan 21, 2020 at 06:03:18PM +0300, Sergei Kornilov wrote:
> PS: also, I surprised why it's ok for wal_receiver_create_temp_slot
> to be PGC_SIGHUP and ignore change of this setting until walreceiver
> will reconnect by unrelated reason. I means walreceiver does nothing
> special on SIGHUP. In common case change of
> wal_receiver_create_temp_slot setting will have effect only during
> restart of walreceiver process. And therefore we will switch to
> archive recovery. But such design was strongly rejected for my patch
> year ago.

[ Looks at 3297308... ]
Yeah, you are right.  I was not paying much attention but something
does not stick here.  My understanding is that we should have the WAL
receiver receive the value it needs to use from the startup process
(aka via RequestXLogStreaming from xlog.c), and that we ought to make
this new parameter PGC_POSTMASTER instead of PGC_SIGHUP.  HEAD is
inconsistent here.
--
Michael

Attachment

Re: allow online change primary_conninfo

From
Sergei Kornilov
Date:
Hello

> Yeah, you are right. I was not paying much attention but something
> does not stick here. My understanding is that we should have the WAL
> receiver receive the value it needs to use from the startup process
> (aka via RequestXLogStreaming from xlog.c), and that we ought to make
> this new parameter PGC_POSTMASTER instead of PGC_SIGHUP. HEAD is
> inconsistent here.

Thank you!

I attached two patches:
- first changes wal_receiver_create_temp_slot to PGC_POSTMASTER and moved the logic to RequestXLogStreaming
- second is based on last published v6 version of main patch. It changes wal_receiver_create_temp_slot back to
PGC_SIGHUPalong with primary_conninfo and primary_slot_name and will restart walreceiver if need.
 

Regarding the main patch: we have several possibilities for moving RequestXLogStreaming call. We need to choose one.
And, of course, changes in the text...

regards, Sergei
Attachment

Re: allow online change primary_conninfo

From
Sergei Kornilov
Date:
Hello
Small rebase due to merge conflict of the tests. No functional changes since v7.

PS: also it is end of current CF, I will mark patch entry as moved to the next CF.
Attachment

Re: allow online change primary_conninfo

From
Alvaro Herrera
Date:
On 2020-Jan-22, Michael Paquier wrote:

> On Tue, Jan 21, 2020 at 06:03:18PM +0300, Sergei Kornilov wrote:
> > PS: also, I surprised why it's ok for wal_receiver_create_temp_slot
> > to be PGC_SIGHUP and ignore change of this setting until walreceiver
> > will reconnect by unrelated reason. I means walreceiver does nothing
> > special on SIGHUP. In common case change of
> > wal_receiver_create_temp_slot setting will have effect only during
> > restart of walreceiver process. And therefore we will switch to
> > archive recovery. But such design was strongly rejected for my patch
> > year ago. 
> 
> [ Looks at 3297308... ]
> Yeah, you are right.  I was not paying much attention but something
> does not stick here.  My understanding is that we should have the WAL
> receiver receive the value it needs to use from the startup process
> (aka via RequestXLogStreaming from xlog.c), and that we ought to make
> this new parameter PGC_POSTMASTER instead of PGC_SIGHUP.  HEAD is
> inconsistent here.

I'm CCing Peter as committer of 329730827848.

What are the downsides of changing wal_receiver_create_temp_slot to
PGC_POSTMASTER?  It seems pretty nasty to requires a full server
restart.  Maybe we can signal all walreceivers at that point so that
they restart with the correct setting?  That's much less problematic, I
would think.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: allow online change primary_conninfo

From
Alvaro Herrera
Date:
On 2020-Mar-13, Alvaro Herrera wrote:

> What are the downsides of changing wal_receiver_create_temp_slot to
> PGC_POSTMASTER?  It seems pretty nasty to requires a full server
> restart.  Maybe we can signal all walreceivers at that point so that
> they restart with the correct setting?  That's much less problematic, I
> would think.

... oh, that's exactly what 0002 does.  So patch 0001 is only about
making a temporary change to the create_temp_slot to be consistent with
existing policy, before changing the policy.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: allow online change primary_conninfo

From
Michael Paquier
Date:
On Fri, Mar 13, 2020 at 11:17:38AM -0300, Alvaro Herrera wrote:
> ... oh, that's exactly what 0002 does.  So patch 0001 is only about
> making a temporary change to the create_temp_slot to be consistent with
> existing policy, before changing the policy.

Yes.  In my opinion, patch 0002 should not change the GUC mode of
wal_receiver_create_temp_slot as the discussion here is about
primary_conninfo, even if both may share some logic regarding WAL
receiver shutdown and its restart triggered by the startup process.

Patch 0001 has actually been presented on this thread first:
https://www.postgresql.org/message-id/753391579708726@iva3-77ae5995f07f.qloud-c.yandex.net
And there is an independent patch registered in this CF:
https://commitfest.postgresql.org/27/2456/

Should we add patch 0001 as an open item for v13 as there is a risk of
forgetting this issue?  I have created a wiki blank page a couple of
weeks back:
https://wiki.postgresql.org/wiki/PostgreSQL_13_Open_Items
--
Michael

Attachment

Re: allow online change primary_conninfo

From
Sergei Kornilov
Date:
Hello

Sorry for late replies.

> Yes. In my opinion, patch 0002 should not change the GUC mode of
> wal_receiver_create_temp_slot as the discussion here is about
> primary_conninfo, even if both may share some logic regarding WAL
> receiver shutdown and its restart triggered by the startup process.

Ok, I removed related changes from main patch. Along with minor merge conflict.

> Patch 0001 has actually been presented on this thread first:
> https://www.postgresql.org/message-id/753391579708726@iva3-77ae5995f07f.qloud-c.yandex.net
> And there is an independent patch registered in this CF:
> https://commitfest.postgresql.org/27/2456/

Yep, 0001 is separate patch. I will post copy of this patch here to make cfbot works.  Main patch 0002 requires
resettingof is_temp_slot in RequestXLogStreaming to works properly.
 

> Should we add patch 0001 as an open item for v13 as there is a risk of
> forgetting this issue?

I think yes.

Well, it seems better to move this patch to next commitfest?

regards, Sergei
Attachment

Re: allow online change primary_conninfo

From
Alvaro Herrera
Date:
On 2020-Mar-17, Sergei Kornilov wrote:

> Well, it seems better to move this patch to next commitfest?

What?  You want to make wal_receiver_create_temp_slot unchangeable and
default to off in pg13, and delay the patch that fixes those things to
pg14?  That makes no sense to me.  Please keep them both here so that we
can get things to a usable state.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: allow online change primary_conninfo

From
Sergei Kornilov
Date:
Hello

>>  Well, it seems better to move this patch to next commitfest?
>
> What? You want to make wal_receiver_create_temp_slot unchangeable and
> default to off in pg13, and delay the patch that fixes those things to
> pg14? That makes no sense to me.

I want to handle similar things in a similar way.
wal_receiver_create_temp_slot has good design? I will change my patch in same way in this case. But something like that
wasstrongly rejected a year ago.
 

> Please keep them both here so that we can get things to a usable state.

Yes, of course.

Here I attached 3 patches:
0001 is copy from https://commitfest.postgresql.org/27/2456/ It changes wal_receiver_create_temp_slot to
PGC_POSTMASTER,changes the default value to off, and moves the logic to the startup process.
 
0002 changes primary_conninfo and primary_slot_name to be PGC_SIGHUP
0003 changes wal_receiver_create_temp_slot back to be PGC_SIGHUP. Michael Paquier asks to remove this from 0002, you
askto leave it in this thread. So, I made separate patch on top of 0002.
 

Thank you

regards, Sergei
Attachment

Re: allow online change primary_conninfo

From
Alvaro Herrera
Date:
I think the startup sighup handler should be in startup.c, not xlog.c,
which has enough random code already.  We can add an accessor in xlog.c
to let changing the walrcv status flag, to be called from the signal
handler.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: allow online change primary_conninfo

From
Alvaro Herrera
Date:
On 2020-Mar-24, Alvaro Herrera wrote:

> I think the startup sighup handler should be in startup.c, not xlog.c,
> which has enough random code already.  We can add an accessor in xlog.c
> to let changing the walrcv status flag, to be called from the signal
> handler.

... as in the attached version.

Sergei included LOG messages to indicate which setting has been changed.
I put these in "#if 0" for now, but I'm thinking to remove these
altogether; we already have LOG messages when a setting changes value,
per ProcessConfigFileInternal(); the log sequence looks like this, taken
from tmp_check/log/001_stream_rep_standby_2.log after running the tests:

2020-03-25 20:54:19.413 -03 [17493] LOG:  received SIGHUP, reloading configuration files
2020-03-25 20:54:19.415 -03 [17493] LOG:  parameter "primary_slot_name" changed to "standby_2"
2020-03-25 20:54:19.415 -03 [17493] LOG:  parameter "wal_receiver_status_interval" changed to "1"
2020-03-25 20:54:19.422 -03 [17569] LOG:  started streaming WAL from primary at 0/3000000 on timeline 1
2020-03-25 20:54:19.426 -03 [17494] LOG:  wal receiver process shutdown requested
2020-03-25 20:54:19.426 -03 [17569] FATAL:  terminating walreceiver process due to administrator command
2020-03-25 20:54:19.433 -03 [17572] LOG:  started streaming WAL from primary at 0/3000000 on timeline 1

which looks sufficient.  Maybe we can reword that new message, say "wal
receiver process shutdown forced by parameter change".  Not sure if we
can or should adjust the FATAL line; probably not worth the trouble.

Thoughts welcome.  I'm thinking on getting this applied shortly.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: allow online change primary_conninfo

From
Michael Paquier
Date:
On Wed, Mar 25, 2020 at 09:08:12PM -0300, Alvaro Herrera wrote:
> ... as in the attached version.

Patch 0001 is one that has already been discussed previously (thanks
for keeping the extra comments about GUCs and WAL receivers).  That
looks fine to me.

> Sergei included LOG messages to indicate which setting has been changed.
> I put these in "#if 0" for now, but I'm thinking to remove these
> altogether; we already have LOG messages when a setting changes value,
> per ProcessConfigFileInternal(); the log sequence looks like this, taken
> from tmp_check/log/001_stream_rep_standby_2.log after running the tests:

Yes, it makes sense to remove any knowledge of GUC updates from
StartupRequestWalReceiverRestart().  I would add a description at the
top of this routine by the way.

+extern void ProcessStartupSigHup(void);
This is declared, but used nowhere in patch 0002.

Wouldn't it be better to be more careful about the NULL-ness of the
string parameters in StartupRereadConfig()?

+   if (!slotnameChanged && strcmp(PrimarySlotName, "") == 0)
+       tempSlotChanged = tempSlot != wal_receiver_create_temp_slot;
I would add parens here for clarity.

> which looks sufficient.  Maybe we can reword that new message, say "wal
> receiver process shutdown forced by parameter change".  Not sure if we
> can or should adjust the FATAL line; probably not worth the trouble.

There is merit to keep the LOG in StartupRequestWalReceiverRestart()
unchanged, as the routine may get called in the future in some other
context.
--
Michael

Attachment

Re: allow online change primary_conninfo

From
Sergei Kornilov
Date:
Hello

Thank you! You were ahead of me. I wanted to double-check my changes this morning before posting.

> Sergei included LOG messages to indicate which setting has been changed.
> I put these in "#if 0" for now, but I'm thinking to remove these
> altogether;

No objections.

> Not sure if we can or should adjust the FATAL line; probably not worth the trouble.

I think it's ok. walreceiver is terminated exactly due to administrator command.

regards, Sergei



Re: allow online change primary_conninfo

From
Alvaro Herrera
Date:
Now, would anyone be too upset if I push these in this other order?  I
realized that the reason the tests broke after Sergei's patch is that
recovery/t/001_stream_rep.pl's get_slot_xmins() is broken for temp
walreceiver slots, since it's using the non-temp name it tries to give
to the slot, rather than the temp name under which it is actually
created.  The workaround proposed by 0002 is to edit standby_1's config
to set walreceiver's slot to be non-temp.

Thanks to Justin Pryzby for offlist typo corrections.

The reason is that I think I would like to get Sergei's patch pushed
right away (0001+0002, as a single commit); but that I think there's
more to attack in the walreceiver temp slot code than 0003 here, and I
don't want to delay the new feature any longer while I figure out the
fix for that.


(The thing is: if I specify primary_slot_name in the config, why is the
temp walreceiver slot code not obeying that name?  I think walreceiver
should create a temp slot, sure, but using the given name rather than
coming up with a random name.)

(The other reason is that I would like to push one patch to fix
walreceiver tmp slot rather than two, setting the thing first this way
and then the opposite way.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: allow online change primary_conninfo

From
Michael Paquier
Date:
On Thu, Mar 26, 2020 at 09:39:17PM -0300, Alvaro Herrera wrote:
> Now, would anyone be too upset if I push these in this other order?  I
> realized that the reason the tests broke after Sergei's patch is that
> recovery/t/001_stream_rep.pl's get_slot_xmins() is broken for temp
> walreceiver slots, since it's using the non-temp name it tries to give
> to the slot, rather than the temp name under which it is actually
> created.  The workaround proposed by 0002 is to edit standby_1's config
> to set walreceiver's slot to be non-temp.

FWIW, I find a bit strange that the change introduced in
001_stream_rep.pl as of patch 0002 is basically undone in 0003 by
changing the default value of wal_receiver_create_temp_slot.

> The reason is that I think I would like to get Sergei's patch pushed
> right away (0001+0002, as a single commit); but that I think there's
> more to attack in the walreceiver temp slot code than 0003 here, and I
> don't want to delay the new feature any longer while I figure out the
> fix for that.

Not sure I agree with this approach.  I'd rather fix all the existing
known problems first, and then introduce the new features on top of
what we consider to be a clean state.  If we lack of time between the
first and second patches, we may have a risk of keeping the code with
the new feature but without the fixes discussed for
wal_receiver_create_temp_slot.

> (The thing is: if I specify primary_slot_name in the config, why is the
> temp walreceiver slot code not obeying that name?  I think walreceiver
> should create a temp slot, sure, but using the given name rather than
> coming up with a random name.)

Good point.  I am not sure either why the current logic has been
chosen.  The discussion related to the original patch is in this area:
https://www.postgresql.org/message-id/4792e456-d75f-8e6a-2d47-34b8f78c266f@2ndquadrant.com

> (The other reason is that I would like to push one patch to fix
> walreceiver tmp slot rather than two, setting the thing first this way
> and then the opposite way.)

So your problem here is that by applying first 0003 and then 0001-0002
you would finish by switching wal_receiver_create_temp_slot to
PGC_POSTMASTER, and then back to PGC_SIGHUP again?
--
Michael

Attachment

Re: allow online change primary_conninfo

From
Sergei Kornilov
Date:
Hello

> I realized that the reason the tests broke after Sergei's patch is that
> recovery/t/001_stream_rep.pl's get_slot_xmins() is broken for temp
> walreceiver slots, since it's using the non-temp name it tries to give
> to the slot, rather than the temp name under which it is actually
> created. The workaround proposed by 0002 is to edit standby_1's config
> to set walreceiver's slot to be non-temp.

This is bug in behavior, not in tests.
We need walrcv->is_temp_slot = false; somewhere in RequestXLogStreaming to works correctly.

HEAD is not affected since primary_slot_name cannot be changed online.

> (The thing is: if I specify primary_slot_name in the config, why is the
> temp walreceiver slot code not obeying that name? I think walreceiver
> should create a temp slot, sure, but using the given name rather than
> coming up with a random name.)

Hm, interesting idea.

regards, Sergei



Re: allow online change primary_conninfo

From
Sergei Kornilov
Date:
Hello

In other words, patches in reverse order:
0001 will change primary_conninfo and primary_slot_name to reloadable
0002 will move wal_receiver_create_temp_slot logic to startup process (without changing to PGC_POSTMASTER)
0003 is new draft patch: wal_receiver_create_temp_slot will use the given name or generate new one.

regards, Sergei
Attachment

Re: allow online change primary_conninfo

From
Alvaro Herrera
Date:
On 2020-Mar-27, Sergei Kornilov wrote:

> Hello
> 
> > I realized that the reason the tests broke after Sergei's patch is that
> > recovery/t/001_stream_rep.pl's get_slot_xmins() is broken for temp
> > walreceiver slots, since it's using the non-temp name it tries to give
> > to the slot, rather than the temp name under which it is actually
> > created. The workaround proposed by 0002 is to edit standby_1's config
> > to set walreceiver's slot to be non-temp.
> 
> This is bug in behavior, not in tests.
> We need walrcv->is_temp_slot = false; somewhere in RequestXLogStreaming to works correctly.
> 
> HEAD is not affected since primary_slot_name cannot be changed online.

I pushed the wal_receiver_create_temp_slot bugfix, because I realized
after looking for long enough at WalReceiverMain() that the code was
beyond saving.  I'll be pushing the rest of this later today.

Thanks,

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: allow online change primary_conninfo

From
Alvaro Herrera
Date:
On 2020-Mar-27, Alvaro Herrera wrote:

> I pushed the wal_receiver_create_temp_slot bugfix, because I realized
> after looking for long enough at WalReceiverMain() that the code was
> beyond saving.  I'll be pushing the rest of this later today.

So here's the next one.  I still have to go over the doc changes here,
but at least the tests are passing for me.

I think I should set aside your new draft for now, but I think we should
still get it in pg13 to avoid having the change the semantics of the
walreceiver temp slot in the next release.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: allow online change primary_conninfo

From
Sergei Kornilov
Date:
Hello

Thank you!

> I think I should set aside your new draft for now

I agree, this patch definitely needs a bit more time to review. (currently it applies on top of v13 patch cleanly)

> but I think we should still get it in pg13 to avoid having the change the semantics of the
> walreceiver temp slot in the next release.

Good point.

regards, Sergei



Re: allow online change primary_conninfo

From
Alvaro Herrera
Date:
On 2020-Mar-27, Alvaro Herrera wrote:

> On 2020-Mar-27, Alvaro Herrera wrote:
> 
> > I pushed the wal_receiver_create_temp_slot bugfix, because I realized
> > after looking for long enough at WalReceiverMain() that the code was
> > beyond saving.  I'll be pushing the rest of this later today.
> 
> So here's the next one.  I still have to go over the doc changes here,
> but at least the tests are passing for me.

Pushed with some documentation tweaks.

Many thanks for working on this!

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: allow online change primary_conninfo

From
Sergei Kornilov
Date:
Hello

Thank you very much!

I attached updated patch: walreceiver will use configured primary_slot_name as temporary slot name if
wal_receiver_create_temp_slotis enabled.
 

regards, Sergei
Attachment

Re: allow online change primary_conninfo

From
Peter Eisentraut
Date:
On 2020-03-28 11:49, Sergei Kornilov wrote:
> I attached updated patch: walreceiver will use configured primary_slot_name as temporary slot name if
wal_receiver_create_temp_slotis enabled.
 

The original setup worked consistently with pg_basebackup.  In 
pg_basebackup, if you specify -S/--slot, then it uses a permanent slot 
with the given name.  This is the same behavior as primary_slot_name has 
now.  We should be careful that we don't create two sets of 
options/settings that look similar but combine in different ways.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: allow online change primary_conninfo

From
Maxim Orlov
Date:
On 2020-03-28 02:39, Alvaro Herrera wrote:
> On 2020-Mar-27, Alvaro Herrera wrote:
> 
>> On 2020-Mar-27, Alvaro Herrera wrote:
>> 
>> > I pushed the wal_receiver_create_temp_slot bugfix, because I realized
>> > after looking for long enough at WalReceiverMain() that the code was
>> > beyond saving.  I'll be pushing the rest of this later today.
>> 
>> So here's the next one.  I still have to go over the doc changes here,
>> but at least the tests are passing for me.
> 
> Pushed with some documentation tweaks.
> 
> Many thanks for working on this!

Nice work! What do you think of possibility to add restore_command?

Is it a good idea to make a restore_command to be reloadable via SUGHUP 
too?

On the one hand it looks reasonable since primary_conninfo got such 
ability. On the other hand we don't exactly know whether the actual 
process after reload config would use "new" command or "old"  since it 
may take a long time to finish running old command (in case of scp or 
smth).

Also setting faulty restore_command lead to server termination, which is 
rather unexpectedly.
If anyone makes some minor typo in restore_command, say write 
restore_command='cp1 /mnt/srv/%f %p' instead of restore_command='cp 
/mnt/srv/%f %p' and do SELECT pg_reload_conf() then server will 
terminate.

Do you consider this feature useful? Any suggestions?

---
Best regards,
Maxim Orlov.
Attachment

Re: allow online change primary_conninfo

From
Sergei Kornilov
Date:
Hello

Yep, I think it's useful and I already posted patch in this thread:
https://www.postgresql.org/message-id/flat/3090621585393698%40myt5-1466095fe4e5.qloud-c.yandex.net#ee6574e93982b5628d140f15ffffcb44
Currently without consensus

regards, Sergei