Thread: Making WAL receiver startup rely on GUC context for primary_conninfoand primary_slot_name

Making WAL receiver startup rely on GUC context for primary_conninfoand primary_slot_name

From

Michael Paquier

Date:

12 December 2018, 08:30:42

Hi all,

Since 2dedf4d, recovery.conf is dead and all recovery parameters are now
GUCs.  While looking at a patch to switch primary_conninfo and
primary_slot_name to be reloadable, Sergei Kornilov had a very good
point that the WAL receiver uses a connection string and a physical slot
name based on what the startup process wants the WAL receiver to use:
https://www.postgresql.org/message-id/20181212043208.GI17695@paquier.xyz

It seems to me that doing so is now strange as the WAL receiver knows
about the GUC context, and actually it knows the parameters it should
use, so there is no point to pass down the values when requesting the
WAL receiver to start.

What do you think about the attached to simplify the logic?  Even if
primary_conninfo and primary_slot_name are not switched to SIGHUP this
cleanup looks like a good thing to me.

Thoughts?
--
Michael

Attachment

Re: Making WAL receiver startup rely on GUC context for primary_conninfo and primary_slot_name

From

Andres Freund

Date:

12 December 2018, 08:34:58


On December 11, 2018 9:30:42 PM PST, Michael Paquier <michael@paquier.xyz> wrote:
>Hi all,
>
>Since 2dedf4d, recovery.conf is dead and all recovery parameters are
>now
>GUCs.  While looking at a patch to switch primary_conninfo and
>primary_slot_name to be reloadable, Sergei Kornilov had a very good
>point that the WAL receiver uses a connection string and a physical
>slot
>name based on what the startup process wants the WAL receiver to use:
>https://www.postgresql.org/message-id/20181212043208.GI17695@paquier.xyz
>
>It seems to me that doing so is now strange as the WAL receiver knows
>about the GUC context, and actually it knows the parameters it should
>use, so there is no point to pass down the values when requesting the
>WAL receiver to start.
>
>What do you think about the attached to simplify the logic?  Even if
>primary_conninfo and primary_slot_name are not switched to SIGHUP this
>cleanup looks like a good thing to me.

I am not convinced this is a good idea. This allows the state of walrcv and startup to diverge, they could e.g. have
differentconfiguration, due to differently time config reloads.  And they need to communicate via shmem anyway, so
there'snot a lot of complexity avoided.  And I think it's good to be able to show the active connection via functions,
ratherthan the one currently in pg.conf, which might be different. 

Andres

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Re: Making WAL receiver startup rely on GUC context forprimary_conninfo and primary_slot_name

From

Michael Paquier

Date:

12 December 2018, 08:55:17

On Tue, Dec 11, 2018 at 09:34:58PM -0800, Andres Freund wrote:
> I am not convinced this is a good idea. This allows the state of
> walrcv and startup to diverge, they could e.g. have different
> configuration, due to differently time config reloads.  And they need
> to communicate via shmem anyway, so there's not a lot of complexity
> avoided.  And I think it's good to be able to show the active
> connection via functions, rather than the one currently in pg.conf,
> which might be different.

Well, the conninfo and slot name accessible to the user are the values
available only once the information of the WAL receiver in shared memory
is ready to be displayed.  Relying more on the GUC context has the
advantage to never have in shared memory the original string and only
store the clobbered one, which actually simplifies what's stored in
shared memory because you can entirely remove ready_to_display (I forgot
to remove that in the patch actually).  If those parameters become
reloadable, you actually rely only on what the param context has, and
not on what the shared memory context may have or not.

Removing entirely ready_to_display is quite appealing actually..
--
Michael

Attachment

signature.asc

Re: Making WAL receiver startup rely on GUC context forprimary_conninfo and primary_slot_name

From

Simon Riggs

Date:

12 December 2018, 11:50:16

On Wed, 12 Dec 2018 at 05:35, Andres Freund <andres@anarazel.de> wrote:

>What do you think about the attached to simplify the logic? Even if
>primary_conninfo and primary_slot_name are not switched to SIGHUP this
>cleanup looks like a good thing to me.

I am not convinced this is a good idea. This allows the state of walrcv and startup to diverge, they could e.g. have different configuration, due to differently time config reloads.

That sounds bad, but most parameters apply to one or the other, not both.

If there are some that apply to both, then yes, coordination would be important.

It does seem likely that the new scheme will require us to look carefully at when parameters are reloaded, since the timing of reloads was never taken into account in the previous coding.

Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Making WAL receiver startup rely on GUC context for primary_conninfo and primary_slot_name

From

Sergei Kornilov

Date:

12 December 2018, 18:55:04

Hello

> This allows the state of walrcv and startup to diverge, they could e.g. have different configuration, due to
differentlytime config reloads.
 
So we have exactly same problem with, for example, hot_standby_feedback, right?
We change hot_standby_feedback value, reload it and we can have 'show hot_standby_feedback' different to currently
runningwalreceiver.
 
But why we have nothing about hot_standby_feedback in pg_stat_get_wal_receiver()?
Where is difference?

regards, Sergei

Re: Making WAL receiver startup rely on GUC context forprimary_conninfo and primary_slot_name

From

Michael Paquier

Date:

13 December 2018, 03:06:26

On Wed, Dec 12, 2018 at 06:55:04PM +0300, Sergei Kornilov wrote:
>> This allows the state of walrcv and startup to diverge, they could
>> e.g. have different configuration, due to differently time config
>> reloads.
>
> So we have exactly same problem with, for example, hot_standby_feedback, right?
> We change hot_standby_feedback value, reload it and we can have 'show
> hot_standby_feedback' different to currently running walreceiver.  But
> why we have nothing about hot_standby_feedback in pg_stat_get_wal_receiver()?
> Where is difference?

The difference is in the fact that a WAL receiver uses the connection
string and the slot name when establishing the connection when using
START_STREAMING through the replication protocol, and
hot_standby_feedback gets used depending on the context of the messages
exchanged.
--
Michael

Attachment

signature.asc

Re: Making WAL receiver startup rely on GUC context forprimary_conninfo and primary_slot_name

From

Michael Paquier

Date:

13 December 2018, 06:22:58

On Wed, Dec 12, 2018 at 02:55:17PM +0900, Michael Paquier wrote:
> Well, the conninfo and slot name accessible to the user are the values
> available only once the information of the WAL receiver in shared memory
> is ready to be displayed.  Relying more on the GUC context has the
> advantage to never have in shared memory the original string and only
> store the clobbered one, which actually simplifies what's stored in
> shared memory because you can entirely remove ready_to_display (I forgot
> to remove that in the patch actually).  If those parameters become
> reloadable, you actually rely only on what the param context has, and
> not on what the shared memory context may have or not.
>
> Removing entirely ready_to_display is quite appealing actually..

I have been looking at that stuff this morning, and indeed
ready_for_display can be entirely removed.  I think that's quite an
advantage as WalRcv->conninfo only stores the connection string
cloberred to hide security-sensitive fields and it never has to save the
initial state which could have sensitive data, simplifying how the WAL
receiver data is filled.  I am adding that to the next commit fest.
--
Michael

Attachment

Re: Making WAL receiver startup rely on GUC context for primary_conninfo and primary_slot_name

From

Sergei Kornilov

Date:

14 December 2018, 12:23:09

Hi

Not sure i can be reviewer since this was initially my proposal.
I vote +1 for this patch. Code looks good and i think we have no reason to leave RequestXLogStreaming as-is.

regards, Sergei

Re: Making WAL receiver startup rely on GUC context forprimary_conninfo and primary_slot_name

From

Michael Paquier

Date:

14 December 2018, 15:18:26

On Fri, Dec 14, 2018 at 12:23:09PM +0300, Sergei Kornilov wrote:
> Not sure i can be reviewer since this was initially my proposal.

No issues from me if you wish to review the code.  In my opinion, it is
not a problem if you are a reviewer as I wrote the code.

> I vote +1 for this patch. Code looks good and i think we have no
> reason to leave RequestXLogStreaming as-is.

Thanks for the input.  The take on my side is to be able to remove
ready_to_display from WalRcv so let's see what others think.
--
Michael

Attachment

signature.asc

Re: Making WAL receiver startup rely on GUC context forprimary_conninfo and primary_slot_name

From

Donald Dong

Date:

09 January 2019, 17:00:23

Hi Michael,

Thank you for the information!

> On Dec 11, 2018, at 9:55 PM, Michael Paquier <michael@paquier.xyz> wrote:
>
> Well, the conninfo and slot name accessible to the user are the values
> available only once the information of the WAL receiver in shared memory
> is ready to be displayed.  Relying more on the GUC context has the
> advantage to never have in shared memory the original string and only
> store the clobbered one, which actually simplifies what's stored in
> shared memory because you can entirely remove ready_to_display (I forgot
> to remove that in the patch actually).  If those parameters become
> reloadable, you actually rely only on what the param context has, and
> not on what the shared memory context may have or not.

I wonder why do we need to have this addition?

src/backend/replication/walreceiver.c
+       memset(walrcv->slotname, 0, NAMEDATALEN);
+       if (PrimarySlotName)
+               strlcpy((char *) walrcv->slotname, PrimarySlotName, NAMEDATALEN);


src/backend/replication/walreceiver.c
288: PrimaryConnInfo == NULL || PrimaryConnInfo[0] == '\0'
291: errmsg("cannot connect to the primary server as \"primary_conninfo\" is not defined")));
392: PrimarySlotName && PrimarySlotName[0] != '\0'

src/backend/access/transam/xlog.c
11824: * If primary_conninfo is set, launch walreceiver to try
11833: PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0

I notice these patterns appear in different places. Maybe it will be better to have a common method such as
pg_str_empty()?

Regards,
Donald Dong

Re: Making WAL receiver startup rely on GUC context forprimary_conninfo and primary_slot_name

From

Michael Paquier

Date:

10 January 2019, 03:47:52

On Wed, Jan 09, 2019 at 06:00:23AM -0800, Donald Dong wrote:
> I wonder why do we need to have this addition?
>
> src/backend/replication/walreceiver.c
> +       memset(walrcv->slotname, 0, NAMEDATALEN);
> +       if (PrimarySlotName)
> +               strlcpy((char *) walrcv->slotname, PrimarySlotName, NAMEDATALEN);
>

That's much cleaner to me to clean up the field for safety before
starting the process.  When requesting a WAL receiver to start,
slotname and conninfo gets zeroed before doing anything, you are right
that we could do without it actually.

> I notice these patterns appear in different places. Maybe it will be
> better to have a common method such as pg_str_empty()?

That's a pattern used quite a lot for many GUCs, so that's quite
separate from this patch.  Perhaps it would make sense to have a macro
for that purpose for GUCs, I am not sure if that's worth it though.
--
Michael

Attachment

signature.asc

Re: Making WAL receiver startup rely on GUC context forprimary_conninfo and primary_slot_name

From

Donald Dong

Date:

10 January 2019, 04:38:53

On Jan 9, 2019, at 4:47 PM, Michael Paquier <michael@paquier.xyz> wrote:
> That's much cleaner to me to clean up the field for safety before
> starting the process.  When requesting a WAL receiver to start,
> slotname and conninfo gets zeroed before doing anything, you are right
> that we could do without it actually.

Cool! I agree that cleaning up the field would make the logic cleaner.

> That's a pattern used quite a lot for many GUCs, so that's quite
> separate from this patch.  Perhaps it would make sense to have a macro
> for that purpose for GUCs, I am not sure if that's worth it though.

Yeah. I think having such macro would make the code a bit cleaner. Tho the
duplication of logic is not too significant.

Re: Making WAL receiver startup rely on GUC context forprimary_conninfo and primary_slot_name

From

Michael Paquier

Date:

10 January 2019, 04:46:51

On Wed, Jan 09, 2019 at 05:38:53PM -0800, Donald Dong wrote:
> On Jan 9, 2019, at 4:47 PM, Michael Paquier <michael@paquier.xyz> wrote:
>> That's much cleaner to me to clean up the field for safety before
>> starting the process.  When requesting a WAL receiver to start,
>> slotname and conninfo gets zeroed before doing anything, you are right
>> that we could do without it actually.
>
> Cool! I agree that cleaning up the field would make the logic cleaner.

I was just double-checking the code, and it is possible to remove the
part from RequestXLogStreaming() which sets slotname and conninfo to
'\0', so I removed that part from my local branch to simplify the
code.
--
Michael

Attachment

signature.asc

Re: Making WAL receiver startup rely on GUC context for primary_conninfo and primary_slot_name

From

Sergei Kornilov

Date:

10 January 2019, 13:10:16

Hello

> I was just double-checking the code, and it is possible to remove the
> part from RequestXLogStreaming() which sets slotname and conninfo to
> '\0', so I removed that part from my local branch to simplify the
> code.

Without both ready_to_display and cleaning in RequestXLogStreaming we do not show outdated PrimaryConnInfo in case
walreceiverjust started, but does not connected yet? While waiting walrcv_connect for example.
 

regards, Sergei

Re: Making WAL receiver startup rely on GUC context forprimary_conninfo and primary_slot_name

From

Michael Paquier

Date:

10 January 2019, 13:57:20

On Thu, Jan 10, 2019 at 01:10:16PM +0300, Sergei Kornilov wrote:
> Without both ready_to_display and cleaning in RequestXLogStreaming
> we do not show outdated PrimaryConnInfo in case walreceiver just
> started, but does not connected yet? While waiting walrcv_connect
> for example.

Good point.  There is a gap between the moment the WAL receiver PID is
set when the WAL receiver starts up and the moment where the different
fields are reset to 0 which is not good as it could be possible that
the user sees the information from the previous WAL receiver, so we
should move the memset calls when the PID is set, reaching a state
where the PID is alive but where there is no connection yet.  The port
number needs a similar treatment.

Looking closer at the code, it seems to me that it would be a good
thing if we update the shared memory state when the WAL receiver dies,
so as not only the PID is set to 0, but all connection-related
information gets the call.

With all those comments I am finishing with the updated attached.
Thoughts?
--
Michael

Attachment

Re: Making WAL receiver startup rely on GUC context for primary_conninfo and primary_slot_name

From

Sergei Kornilov

Date:

10 January 2019, 14:20:27

Hi

Thank you, patch looks good for me.

regards, Sergei

Re: Making WAL receiver startup rely on GUC context forprimary_conninfo and primary_slot_name

From

Michael Paquier

Date:

11 January 2019, 03:34:28

On Thu, Jan 10, 2019 at 02:20:27PM +0300, Sergei Kornilov wrote:
> Thank you, patch looks good for me.

Thanks Sergei for the review.  I have been looking at the patch again
this morning with a fresh set of eyes and the thing looks in a
committable shape (the GUC values for NULL checks are a bit
inconsistent in the last patch by the way, so I have fixed them
locally).

I'll do an extra pass on it in the next couple of days and commit.
Then let's move on with the reloadable portions.
--
Michael

Attachment

signature.asc

Re: Making WAL receiver startup rely on GUC context forprimary_conninfo and primary_slot_name

From

Andres Freund

Date:

11 January 2019, 03:41:47

Hi,

On 2019-01-11 09:34:28 +0900, Michael Paquier wrote:
> On Thu, Jan 10, 2019 at 02:20:27PM +0300, Sergei Kornilov wrote:
> > Thank you, patch looks good for me.
> 
> Thanks Sergei for the review.  I have been looking at the patch again
> this morning with a fresh set of eyes and the thing looks in a
> committable shape (the GUC values for NULL checks are a bit
> inconsistent in the last patch by the way, so I have fixed them
> locally).
> 
> I'll do an extra pass on it in the next couple of days and commit.
> Then let's move on with the reloadable portions.

I still think this whole direction of accessing the GUC in walreceiver
is a bad idea and shouldn't be pursued further.  There's definite
potential for startup process and WAL receiver having different states
of GUCs, the code doesn't get meaningfully simpler, the GUC value checks
in walreceiver make for horrible reporting up the chain.

Greetings,

Andres Freund

Re: Making WAL receiver startup rely on GUC context forprimary_conninfo and primary_slot_name

From

Michael Paquier

Date:

11 January 2019, 03:50:49

On Thu, Jan 10, 2019 at 04:41:47PM -0800, Andres Freund wrote:
> I still think this whole direction of accessing the GUC in walreceiver
> is a bad idea and shouldn't be pursued further.  There's definite
> potential for startup process and WAL receiver having different states
> of GUCs, the code doesn't get meaningfully simpler, the GUC value checks
> in walreceiver make for horrible reporting up the chain.

Did you notice the set of messages from upthread?  The code *gets*
simpler by removing ready_to_display and the need to manipulate the
non-clobbered connection string sent directly from the startup
process.  In my opinion that's a clear gain.  We gain also the
possibility to track down that a WAL receiver is started but not
connected yet for monitoring tools.
--
Michael

Attachment

signature.asc

Re: Making WAL receiver startup rely on GUC context forprimary_conninfo and primary_slot_name

From

Andres Freund

Date:

11 January 2019, 03:52:53

On 2019-01-11 09:50:49 +0900, Michael Paquier wrote:
> On Thu, Jan 10, 2019 at 04:41:47PM -0800, Andres Freund wrote:
> > I still think this whole direction of accessing the GUC in walreceiver
> > is a bad idea and shouldn't be pursued further.  There's definite
> > potential for startup process and WAL receiver having different states
> > of GUCs, the code doesn't get meaningfully simpler, the GUC value checks
> > in walreceiver make for horrible reporting up the chain.
> 
> Did you notice the set of messages from upthread?  The code *gets*
> simpler by removing ready_to_display and the need to manipulate the
> non-clobbered connection string sent directly from the startup
> process.

It's a minor simplification for code that's existed for quite a
while. Not worth it.

Re: Making WAL receiver startup rely on GUC context forprimary_conninfo and primary_slot_name

From

Michael Paquier

Date:

11 January 2019, 04:09:04

On Thu, Jan 10, 2019 at 04:52:53PM -0800, Andres Freund wrote:
> It's a minor simplification for code that's existed for quite a
> while. Not worth it.

Meh, I disagree for the ready_to_display part as it has been around
for a long time:
commit: 9ed551e0a4fdb046d8e5ea779adfd3e184d5dc0d
author: Alvaro Herrera <alvherre@alvh.no-ip.org>
date: Wed, 29 Jun 2016 16:57:17 -0400
Add conninfo to pg_stat_wal_receiver

I was honestly unhappy from the start with it because there was no
actual way to have the WAL receiver *not* save the original connection
string.  In my opinion, getting rid of it is worth it because this
really simplifies the way the WAL receiver data visibility is handled
at SQL level and we don't finish by using again and again the same
field for different purposes, consolidating the code.

For the reloading part, we also make the WAL receiver rely on the GUC
context and stop it if there is a change in primary_conninfo and
primary_slot_name.  It would be inconsistent to rely on the GUC
context for one thing, and the startup process for the other one.
Adding a safeguard to fail WAL receiver startup is actually more of
sanity check that anything else (that could be an assertion but for
forks a failure is of better benefit).

At this stage, I would like to hear more opinions before doing or not
doing something.  Alvaro, you got involved in the introduction of
ready_to_siplay.  Peter, you have committed the merge of the recovery
params.  Perhaps you have an opinion to share?
--
Michael

Attachment

signature.asc

Re: Making WAL receiver startup rely on GUC context forprimary_conninfo and primary_slot_name

From

Michael Paquier

Date:

11 January 2019, 04:14:13

On Fri, Jan 11, 2019 at 10:09:04AM +0900, Michael Paquier wrote:
> On Thu, Jan 10, 2019 at 04:52:53PM -0800, Andres Freund wrote:
> > It's a minor simplification for code that's existed for quite a
> > while. Not worth it.
>
> Meh, I disagree for the ready_to_display part as it has been around
> for a long time:
> commit: 9ed551e0a4fdb046d8e5ea779adfd3e184d5dc0d
> author: Alvaro Herrera <alvherre@alvh.no-ip.org>
> date: Wed, 29 Jun 2016 16:57:17 -0400
> Add conninfo to pg_stat_wal_receiver

This should read "ready_to_display has not been around for a long
time" :)
--
Michael

Attachment

signature.asc

Re: Making WAL receiver startup rely on GUC context forprimary_conninfo and primary_slot_name

From

Robert Haas

Date:

11 January 2019, 20:52:08

On Thu, Jan 10, 2019 at 7:42 PM Andres Freund <andres@anarazel.de> wrote:
> I still think this whole direction of accessing the GUC in walreceiver
> is a bad idea and shouldn't be pursued further.  There's definite
> potential for startup process and WAL receiver having different states
> of GUCs, the code doesn't get meaningfully simpler, the GUC value checks
> in walreceiver make for horrible reporting up the chain.

I'm trying to assess the validity of this complaint.  It seems to me
that it might advance the discussion to be more clear about what the
problem is here.  When pg_ctl reload is executed (or the equivalent),
different backends reload the file at different times.  Currently,
because the values are explicitly passed down from the startup process
to the walreceiver, it's guaranteed that they will both use the same
value.  With this change, one might see an older value and the other
the current value.  So there's room to be nervous about code like
this:

                    if (PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0)
                    {
... assorted other code ...
                        RequestXLogStreaming(tli, ptr, PrimaryConnInfo,
                                             PrimarySlotName);
                        receivedUpto = 0;
                    }

With the patch, the PrimaryConnInfo and PrimarySlotName arguments are
removed from RequestXLogStreaming.  That means that the new
walreceiver could come to a different conclusion about the values of
those arguments than the startup process.  In particular, it could end
up thinking that primary_conninfo is empty when, if the startup
process had thought that, the walreceiver would never have been
launched in the first place.  But it's not obvious that you've added
any logic in WALReceiverMain or elsewhere to compensate for this
possibility -- what would happen in that case?  Would we crash?
Connect to the wrong server?

I might be wrong here, but I'm inclined to think that this scenario
hasn't really been contemplated carefully by the patch authors.  There
are no added TAP tests for the scenario where the values differ
between the two processes, no code comments which explain why it's OK
if that happens, really no mention of it in the patch at all.  And on
that basis I'm inclined to think that Andres is really quite correct
to be worried about this.  The problem he's talking about here is very
low-probability because the race condition is narrow, but it's real,
and it surely needs to be handled somehow.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Making WAL receiver startup rely on GUC context forprimary_conninfo and primary_slot_name

From

Andres Freund

Date:

11 January 2019, 21:04:51

Hi,

Thanks for chiming in.

On 2019-01-11 12:52:08 -0500, Robert Haas wrote:
> And on that basis I'm inclined to think that Andres is really quite
> correct to be worried about this.  The problem he's talking about here
> is very low-probability because the race condition is narrow, but it's
> real, and it surely needs to be handled somehow.

I think it's also going to get more likely with projects we really got
to tackle like providing more builtin support for failing over to
different systems and such.

Greetings,

Andres Freund

Re: Making WAL receiver startup rely on GUC context forprimary_conninfo and primary_slot_name

From

Andres Freund

Date:

12 January 2019, 01:57:01

Hi,

On 2019-01-12 01:55:23 +0300, Sergei Kornilov wrote:
> > pg_ctl reload is executed (or the equivalent),
> > different backends reload the file at different times.
> > There are no added TAP tests for the scenario where the values differ between
> the two processes, no code comments which explain why it's OK
> 
> But wait, connection string can not be changed via reload, only during cluster
> start. How it can differs between processes?

One of the major reasons for moving recovery parameters from
recovery.conf to GUCs was to make them changable at runtime in the
future.

Greetings,

Andres Freund

Re: Making WAL receiver startup rely on GUC context forprimary_conninfo and primary_slot_name

From

Michael Paquier

Date:

12 January 2019, 02:10:07

On Fri, Jan 11, 2019 at 12:52:08PM -0500, Robert Haas wrote:
> With the patch, the PrimaryConnInfo and PrimarySlotName arguments are
> removed from RequestXLogStreaming.  That means that the new
> walreceiver could come to a different conclusion about the values of
> those arguments than the startup process.  In particular, it could end
> up thinking that primary_conninfo is empty when, if the startup
> process had thought that, the walreceiver would never have been
> launched in the first place.  But it's not obvious that you've added
> any logic in WALReceiverMain or elsewhere to compensate for this
> possibility -- what would happen in that case?  Would we crash?
> Connect to the wrong server?

If I contemplate the patch this morning there is this bit:
@@ -291,32 +295,40 @@ WalReceiverMain(void)
    /* Unblock signals (they were blocked when the postmaster forked
    us) */
        PG_SETMASK(&UnBlockSig);

+   /*
+    * Fail immediately if primary_conninfo goes missing, better safe than
+    * sorry.
+    */
+   if (PrimaryConnInfo == NULL || strcmp(PrimaryConnInfo, "") == 0)
+       ereport(ERROR,
+               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                errmsg("cannot connect to the primary server as \"primary_conninfo\" is not defined")));

So the answer to your question is: the WAL receiver fails to start.

> I might be wrong here, but I'm inclined to think that this scenario
> hasn't really been contemplated carefully by the patch authors.  There
> are no added TAP tests for the scenario where the values differ
> between the two processes, no code comments which explain why it's OK
> if that happens, really no mention of it in the patch at all.  And on
> that basis I'm inclined to think that Andres is really quite correct
> to be worried about this.  The problem he's talking about here is very
> low-probability because the race condition is narrow, but it's real,
> and it surely needs to be handled somehow.

primary_conninfo and primary_slot_name are PGC_POSTMASTER now, so
adding tests now don't really make sense.
--
Michael

Attachment

signature.asc

Re: Making WAL receiver startup rely on GUC context forprimary_conninfo and primary_slot_name

From

Michael Paquier

Date:

16 January 2019, 04:48:24

On Sat, Jan 12, 2019 at 08:10:07AM +0900, Michael Paquier wrote:
> On Fri, Jan 11, 2019 at 12:52:08PM -0500, Robert Haas wrote:
>> With the patch, the PrimaryConnInfo and PrimarySlotName arguments are
>> removed from RequestXLogStreaming.  That means that the new
>> walreceiver could come to a different conclusion about the values of
>> those arguments than the startup process.  In particular, it could end
>> up thinking that primary_conninfo is empty when, if the startup
>> process had thought that, the walreceiver would never have been
>> launched in the first place.  But it's not obvious that you've added
>> any logic in WALReceiverMain or elsewhere to compensate for this
>> possibility -- what would happen in that case?  Would we crash?
>> Connect to the wrong server?
>
> If I contemplate the patch this morning there is this bit:
> @@ -291,32 +295,40 @@ WalReceiverMain(void)
>     /* Unblock signals (they were blocked when the postmaster forked
>     us) */
>         PG_SETMASK(&UnBlockSig);
>
> +   /*
> +    * Fail immediately if primary_conninfo goes missing, better safe than
> +    * sorry.
> +    */
> +   if (PrimaryConnInfo == NULL || strcmp(PrimaryConnInfo, "") == 0)
> +       ereport(ERROR,
> +               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +                errmsg("cannot connect to the primary server as \"primary_conninfo\" is not defined")));
>
> So the answer to your question is: the WAL receiver fails to start.

Robert, does this answer your question?  I agree that depending on the
timing, we could finish by having the startup process spawning a WAL
receiver with a given connection string, and that the WAL receiver
could use a different one, but as long as we fail the WAL receiver
startup this does not seem like an issue to me, so I disagree with the
upthread statement that the patch author has not thought about such
cases :)

It seems to me that making the WAL receiver relying more on the GUC
context makes it more robust when it comes to reloading the parameters
which would be an upcoming change as we need to rely on the WAL
receiver check the GUC context itself and FATAL (or we would need to
have the startup process check for a change in
primary_conninfo/primary_slot_name, and then have the startup process
kill the WAL receiver by itself).  In short, handling all that in the
WAL receiver would be more robust in the long term in my opinion as we
reduce interactions between the WAL receiver and the startup process.
And on top of it we get rid of ready_to_display, which is something I
am unhappy with since we had to introduce it.
--
Michael

Attachment

signature.asc

Re: Making WAL receiver startup rely on GUC context forprimary_conninfo and primary_slot_name

From

Robert Haas

Date:

16 January 2019, 19:02:48

On Tue, Jan 15, 2019 at 8:48 PM Michael Paquier <michael@paquier.xyz> wrote:
> > So the answer to your question is: the WAL receiver fails to start.
>
> Robert, does this answer your question?  I agree that depending on the
> timing, we could finish by having the startup process spawning a WAL
> receiver with a given connection string, and that the WAL receiver
> could use a different one, but as long as we fail the WAL receiver
> startup this does not seem like an issue to me, so I disagree with the
> upthread statement that the patch author has not thought about such
> cases :)

OK, if that was in the patch, I dunno why I didn't see it.  I really
didn't think it was there, but if I'm wrong, then I am.

> It seems to me that making the WAL receiver relying more on the GUC
> context makes it more robust when it comes to reloading the parameters
> which would be an upcoming change as we need to rely on the WAL
> receiver check the GUC context itself and FATAL (or we would need to
> have the startup process check for a change in
> primary_conninfo/primary_slot_name, and then have the startup process
> kill the WAL receiver by itself).  In short, handling all that in the
> WAL receiver would be more robust in the long term in my opinion as we
> reduce interactions between the WAL receiver and the startup process.
> And on top of it we get rid of ready_to_display, which is something I
> am unhappy with since we had to introduce it.

I think you have some valid points here, but I also think that the
patch as written doesn't really seem to have any user-visible
benefits.  Sure, ready_to_display is a crock, but getting rid of it
doesn't immediately help anybody.  And on the flip side your patch
might have bugs, in which case we'll be worse off.  I'm not going to
stand on my soapbox and say that committing this patch is a terrible
idea, because as far as I can tell, it isn't.  But it feels like it
might be a commit for the sake of making a commit, and I can't get too
excited about that either.  Since the logic will have to be further
revised anyway to make primary_conninfo PGC_SIGHUP, why not just wait
until that's done and then commit all the changes together instead of
guessing that this might make that easier?

If this does get committed now, and count me as about -0.25 for that,
then I at least think it should have some comments clearly addressing
the synchronization issues.  Unless those are also in the patch and I
missed them, too, but I hope I'm not that dense.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Making WAL receiver startup rely on GUC context forprimary_conninfo and primary_slot_name

From

Michael Paquier

Date:

17 January 2019, 03:54:16

On Wed, Jan 16, 2019 at 11:02:48AM -0500, Robert Haas wrote:
> I think you have some valid points here, but I also think that the
> patch as written doesn't really seem to have any user-visible
> benefits.  Sure, ready_to_display is a crock, but getting rid of it
> doesn't immediately help anybody.  And on the flip side your patch
> might have bugs, in which case we'll be worse off.  I'm not going to
> stand on my soapbox and say that committing this patch is a terrible
> idea, because as far as I can tell, it isn't.  But it feels like it
> might be a commit for the sake of making a commit, and I can't get too
> excited about that either.  Since the logic will have to be further
> revised anyway to make primary_conninfo PGC_SIGHUP, why not just wait
> until that's done and then commit all the changes together instead of
> guessing that this might make that easier?

You can say that about any patch which gets committed, even
refactoring patches have a risk of introducing bugs, and even subtle
ones understood only after release.

I was justifying the existence of this thread exactly for opposite
reasons.  After reviewing the other patch switch to reload the
parameters we could do more, and spawning a new thread to attract the
correct audience looked rather right (it still does):
https://www.postgresql.org/message-id/20181212043208.GI17695@paquier.xyz

And this refactoring seemed to be justified as part of a different
thread.  I don't mind dropping this patch and this thread and just go
back there, but that seems like taking steps backward, and what's
proposed here is a bit different than just making the parameters
reloadable.  Still the refactoring looks justified to me for
correctness with the parameter handling.

Anyway, based on the opinions gathered until now, I don't mind just
dropping this patch and move on to the other thread, though we will
likely finish with the same discussion as what's done here :)

A patch on which two committers are expressing concerns about is as
good as going to the waste bin.  That's of course fine by me.
--
Michael

Attachment

signature.asc