Thread: Improve handling of parameter differences in physical replication

Improve handling of parameter differences in physical replication

From
Peter Eisentraut
Date:
When certain parameters are changed on a physical replication primary, 
   this is communicated to standbys using the XLOG_PARAMETER_CHANGE WAL 
record.  The standby then checks whether its own settings are at least 
as big as the ones on the primary.  If not, the standby shuts down with 
a fatal error.

The correspondence of settings between primary and standby is required 
because those settings influence certain shared memory sizings that are 
required for processing WAL records that the primary might send.  For 
example, if the primary sends a prepared transaction, the standby must 
have had max_prepared_transactions set appropriately or it won't be able 
to process those WAL records.
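
For reference, the check is essentially a per-parameter comparison of 
this shape (a simplified sketch of the current code in xlog.c, not an 
exact quote):

/* sketch: called while processing XLOG_PARAMETER_CHANGE and at startup */
static void
RecoveryRequiresIntParameter(const char *param_name, int currValue, int minValue)
{
    if (currValue < minValue)
        ereport(FATAL,
                (errmsg("hot standby is not possible because %s = %d is a lower setting than on the master server (its value was %d)",
                        param_name, currValue, minValue)));
}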

However, fatally shutting down the standby immediately upon receipt of 
the parameter change record might be a bit of an overreaction.  The 
resources related to those settings are not required immediately at that 
point, and might never be required if the activity on the primary does 
not exhaust all those resources.  An extreme example is raising 
max_prepared_transactions on the primary but never actually using 
prepared transactions.

Where this becomes a serious problem is if you have many standbys and 
you do a failover.  If the newly promoted standby happens to have a 
higher setting for one of the relevant parameters, all the other 
standbys that have followed it then shut down immediately and won't be 
able to continue until you change all their settings.

If we didn't do the hard shutdown and just let the standby roll on with 
recovery, nothing bad would happen; it would eventually produce an 
appropriate error when those resources are required (e.g., "maximum 
number of prepared transactions reached").

So I think there are better ways to handle this.  It might be reasonable 
to provide options.  The attached patch doesn't do that but it would be 
pretty easy.  What the attached patch does is:

Upon receipt of XLOG_PARAMETER_CHANGE, we still check the settings but 
only issue a warning and set a global flag if there is a problem.  Then 
when we actually hit the resource issue and the flag was set, we issue 
another warning message with relevant information.  Additionally, at 
that point we pause recovery instead of shutting down, so a hot standby 
remains usable.  (That could certainly be configurable.)
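
In outline, the mechanism looks like this (the flag and helper names 
here are only illustrative, not the patch's actual identifiers):

/* set during XLOG_PARAMETER_CHANGE redo when a local setting is too low */
static bool insufficient_settings_seen = false;

/* called where the resource-exhaustion error would otherwise be raised */
static void
pause_recovery_on_insufficient_settings(void)
{
    if (!insufficient_settings_seen)
        return;

    ereport(WARNING,
            (errmsg("recovery paused because of insufficient parameter settings"),
             errhint("Increase the relevant settings on the standby and restart it.")));
    SetRecoveryPause(true);
    while (RecoveryIsPaused())
    {
        HandleStartupProcInterrupts();
        pg_usleep(1000000L);    /* 1 s */
    }
}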

Btw., I think the current setup is slightly buggy.  The MaxBackends 
value that is used to size shared memory is computed as MaxConnections + 
autovacuum_max_workers + 1 + max_worker_processes + max_wal_senders, but 
we don't track autovacuum_max_workers in WAL.
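
For reference, the computation (as in InitializeMaxBackends()) is

    MaxBackends = MaxConnections + autovacuum_max_workers + 1 +
                  max_worker_processes + max_wal_senders;

whereas the xl_parameter_change record only carries max_connections, 
max_worker_processes, max_wal_senders, max_prepared_xacts, 
max_locks_per_xact, wal_level, wal_log_hints, and 
track_commit_timestamp -- nothing for autovacuum_max_workers.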

(This patch was developed together with Simon Riggs.)

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Improve handling of parameter differences in physical replication

From
Sergei Kornilov
Date:
Hello

Thank you for working on this!

> Where this becomes a serious problem is if you have many standbys and you do a failover.

+1
Several times my team would have liked to pause recovery instead of 
panicking after settings were changed on the primary.  (The same for 
create_tablespace_directories replay errors too...)

Have we documented anywhere (outside the code) that we shut down the 
standby immediately upon receipt of the parameter change? 
doc/src/sgml/high-availability.sgml only talks about "refuse to start".

regards, Sergei



Re: Improve handling of parameter differences in physical replication

From
Fujii Masao
Date:

On 2020/02/27 17:23, Peter Eisentraut wrote:
> When certain parameters are changed on a physical replication primary,
> this is communicated to standbys using the XLOG_PARAMETER_CHANGE WAL
> record.  The standby then checks whether its own settings are at least
> as big as the ones on the primary.  If not, the standby shuts down with
> a fatal error.
> 
> The correspondence of settings between primary and standby is required
> because those settings influence certain shared memory sizings that are
> required for processing WAL records that the primary might send.  For
> example, if the primary sends a prepared transaction, the standby must
> have had max_prepared_transactions set appropriately or it won't be able
> to process those WAL records.
> 
> However, fatally shutting down the standby immediately upon receipt of
> the parameter change record might be a bit of an overreaction.  The
> resources related to those settings are not required immediately at that
> point, and might never be required if the activity on the primary does
> not exhaust all those resources.  An extreme example is raising
> max_prepared_transactions on the primary but never actually using
> prepared transactions.
> 
> Where this becomes a serious problem is if you have many standbys and
> you do a failover.  If the newly promoted standby happens to have a
> higher setting for one of the relevant parameters, all the other
> standbys that have followed it then shut down immediately and won't be
> able to continue until you change all their settings.
> 
> If we didn't do the hard shutdown and just let the standby roll on with
> recovery, nothing bad would happen; it would eventually produce an
> appropriate error when those resources are required (e.g., "maximum
> number of prepared transactions reached").
> 
> So I think there are better ways to handle this.  It might be reasonable
> to provide options.  The attached patch doesn't do that but it would be
> pretty easy.  What the attached patch does is:
> 
> Upon receipt of XLOG_PARAMETER_CHANGE, we still check the settings but
> only issue a warning and set a global flag if there is a problem.  Then
> when we actually hit the resource issue and the flag was set, we issue
> another warning message with relevant information.  Additionally, at
> that point we pause recovery instead of shutting down, so a hot standby
> remains usable.  (That could certainly be configurable.)

+1
> Btw., I think the current setup is slightly buggy.  The MaxBackends
> value that is used to size shared memory is computed as MaxConnections +
> autovacuum_max_workers + 1 + max_worker_processes + max_wal_senders, but
> we don't track autovacuum_max_workers in WAL.

Maybe this is because autovacuum doesn't work during recovery?

Regards,

-- 
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters



Re: Improve handling of parameter differences in physical replication

From
Peter Eisentraut
Date:
On 2020-02-27 11:13, Fujii Masao wrote:
>> Btw., I think the current setup is slightly buggy.  The MaxBackends
>> value that is used to size shared memory is computed as MaxConnections +
>> autovacuum_max_workers + 1 + max_worker_processes + max_wal_senders, but
>> we don't track autovacuum_max_workers in WAL.
> Maybe this is because autovacuum doesn't work during recovery?

Autovacuum on the primary can use locks or xids, and so it's possible 
that the standby when processing WAL encounters more of those than it 
has locally allocated shared memory to handle.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Improve handling of parameter differences in physical replication

From
Michael Paquier
Date:
On Thu, Feb 27, 2020 at 02:37:24PM +0100, Peter Eisentraut wrote:
> On 2020-02-27 11:13, Fujii Masao wrote:
>>> Btw., I think the current setup is slightly buggy.  The MaxBackends
>>> value that is used to size shared memory is computed as MaxConnections +
>>> autovacuum_max_workers + 1 + max_worker_processes + max_wal_senders,
>>> but we don't track autovacuum_max_workers in WAL.
>> Maybe this is because autovacuum doesn't work during recovery?
>
> Autovacuum on the primary can use locks or xids, and so it's possible that
> the standby when processing WAL encounters more of those than it has locally
> allocated shared memory to handle.

Putting aside your patch because that sounds like a separate issue..
Doesn't this mean that autovacuum_max_workers should be added to the
control file, that we need to record in WAL any updates done to it and
that CheckRequiredParameterValues() is wrong?
--
Michael

Attachment

Re: Improve handling of parameter differences in physical replication

From
Peter Eisentraut
Date:
On 2020-02-28 08:45, Michael Paquier wrote:
> On Thu, Feb 27, 2020 at 02:37:24PM +0100, Peter Eisentraut wrote:
>> On 2020-02-27 11:13, Fujii Masao wrote:
>>>> Btw., I think the current setup is slightly buggy.  The MaxBackends
>>>> value that is used to size shared memory is computed as MaxConnections +
>>>> autovacuum_max_workers + 1 + max_worker_processes + max_wal_senders,
>>>> but we don't track autovacuum_max_workers in WAL.
>>>
>>> Maybe this is because autovacuum doesn't work during recovery?
>>
>> Autovacuum on the primary can use locks or xids, and so it's possible that
>> the standby when processing WAL encounters more of those than it has locally
>> allocated shared memory to handle.
> 
> Putting aside your patch because that sounds like a separate issue..
> Doesn't this mean that autovacuum_max_workers should be added to the
> control file, that we need to record in WAL any updates done to it and
> that CheckRequiredParameterValues() is wrong?

That would be a direct fix, yes.

Perhaps it might be better to track the combined MaxBackends instead, 
however.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Improve handling of parameter differences in physical replication

From
Michael Paquier
Date:
On Fri, Feb 28, 2020 at 08:49:08AM +0100, Peter Eisentraut wrote:
> Perhaps it might be better to track the combined MaxBackends instead,
> however.

Not sure about that.  I think that we should keep them separated, as
that's more useful for debugging and more verbose for error reporting.

(Worth noting that max_prepared_xacts is separate because of its dummy
PGPROC entries created by PREPARE TRANSACTION, so it cannot be
included in the set.)
--
Michael

Attachment

Re: Improve handling of parameter differences in physical replication

From
Alvaro Herrera
Date:
On 2020-Feb-27, Peter Eisentraut wrote:

> So this patch relaxes this a bit.  Upon receipt of
> XLOG_PARAMETER_CHANGE, we still check the settings but only issue a
> warning and set a global flag if there is a problem.  Then when we
> actually hit the resource issue and the flag was set, we issue another
> warning message with relevant information.  Additionally, at that
> point we pause recovery, so a hot standby remains usable.

Hmm, so what is the actual end-user behavior?  As I read the code, we
first send the WARNING, then pause recovery until the user resumes
replication; at that point we raise the original error.  Presumably, at
that point the startup process terminates and is relaunched, and replay
continues normally.  Is that it?

I think if the startup process terminates because of the original error 
after it is unpaused, the postmaster will take that as a signal to do a 
crash-recovery cycle, closing all existing connections.  Is that right?
If so, it would be worth improving that (possibly by adding a
sigsetjmp() block) to avoid the disruption.

Thanks,

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Improve handling of parameter differences in physical replication

From
Masahiko Sawada
Date:
On Sat, 29 Feb 2020 at 06:39, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> On 2020-Feb-27, Peter Eisentraut wrote:
>
> > So this patch relaxes this a bit.  Upon receipt of
> > XLOG_PARAMETER_CHANGE, we still check the settings but only issue a
> > warning and set a global flag if there is a problem.  Then when we
> > actually hit the resource issue and the flag was set, we issue another
> > warning message with relevant information.  Additionally, at that
> > point we pause recovery, so a hot standby remains usable.
>
> Hmm, so what is the actual end-user behavior?  As I read the code, we
> first send the WARNING, then pause recovery until the user resumes
> replication; at that point we raise the original error.

I think that after recovery is paused, users would do better to restart
the server rather than resume recovery.  I agree with this idea, but
I'm slightly concerned that users might not realize that recovery is
paused until they look at that line in the server log or at
pg_stat_replication, because the standby server is still functional.
So I think we could periodically send a WARNING to inform users that
we're still waiting for a parameter change and restart.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Improve handling of parameter differences in physical replication

From
Peter Eisentraut
Date:
On 2020-02-28 16:33, Alvaro Herrera wrote:
> Hmm, so what is the actual end-user behavior?  As I read the code, we
> first send the WARNING, then pause recovery until the user resumes
> replication; at that point we raise the original error.  Presumably, at
> that point the startup process terminates and is relaunched, and replay
> continues normally.  Is that it?

No, at that point you get the original, current behavior: the server 
instance shuts down with a fatal error.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Improve handling of parameter differences in physical replication

From
Peter Eisentraut
Date:
On 2020-03-09 09:11, Masahiko Sawada wrote:
> I think that after recovery is paused, users would do better to restart
> the server rather than resume recovery.  I agree with this idea, but
> I'm slightly concerned that users might not realize that recovery is
> paused until they look at that line in the server log or at
> pg_stat_replication, because the standby server is still functional.
> So I think we could periodically send a WARNING to inform users that
> we're still waiting for a parameter change and restart.

I think that would be annoying, unless you create a system for 
configuring those periodic warnings.

I imagine in a case like having set max_prepared_transactions but never 
actually using prepared transactions, people will just ignore the 
warning until they have their next restart, so it could be months of 
periodic warnings.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Improve handling of parameter differences in physical replication

From
Masahiko Sawada
Date:
On Mon, 9 Mar 2020 at 18:45, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
>
> On 2020-03-09 09:11, Masahiko Sawada wrote:
> > I think that after recovery is paused, users would do better to restart
> > the server rather than resume recovery.  I agree with this idea, but
> > I'm slightly concerned that users might not realize that recovery is
> > paused until they look at that line in the server log or at
> > pg_stat_replication, because the standby server is still functional.
> > So I think we could periodically send a WARNING to inform users that
> > we're still waiting for a parameter change and restart.
>
> I think that would be annoying, unless you create a system for
> configuring those periodic warnings.
>
> I imagine in a case like having set max_prepared_transactions but never
> actually using prepared transactions, people will just ignore the
> warning until they have their next restart, so it could be months of
> periodic warnings.

Well, I meant to periodically send warning messages while waiting for
the parameter change, that is, after exhausting resources and stopping
recovery.  In this situation the user needs to notice that as soon as
possible.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Improve handling of parameter differences in physical replication

From
Kyotaro Horiguchi
Date:
At Mon, 9 Mar 2020 21:13:38 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in 
> On Mon, 9 Mar 2020 at 18:45, Peter Eisentraut
> <peter.eisentraut@2ndquadrant.com> wrote:
> >
> > On 2020-03-09 09:11, Masahiko Sawada wrote:
> > > I think that after recovery is paused, users would do better to restart
> > > the server rather than resume recovery.  I agree with this idea, but
> > > I'm slightly concerned that users might not realize that recovery is
> > > paused until they look at that line in the server log or at
> > > pg_stat_replication, because the standby server is still functional.
> > > So I think we could periodically send a WARNING to inform users that
> > > we're still waiting for a parameter change and restart.
> >
> > I think that would be annoying, unless you create a system for
> > configuring those periodic warnings.
> >
> > I imagine in a case like having set max_prepared_transactions but never
> > actually using prepared transactions, people will just ignore the
> > warning until they have their next restart, so it could be months of
> > periodic warnings.
> 
> Well, I meant to periodically send warning messages while waiting for
> the parameter change, that is, after exhausting resources and stopping
> recovery.  In this situation the user needs to notice that as soon as
> possible.

If we lose the connection, the standby continues to complain about the
lost connection every 5 seconds.  This is a situation of that kind.

By the way, when I reduced max_connections only on the master and then
took exclusive locks until the standby complained about lock exhaustion,
I saw a WARNING mentioning max_locks_per_transaction instead of
max_connections.


WARNING:  insufficient setting for parameter max_connections
DETAIL:  max_connections = 2 is a lower setting than on the master server (where its value was 3).
HINT:  Change parameters and restart the server, or there may be resource exhaustion errors sooner or later.
CONTEXT:  WAL redo at 0/60000A0 for XLOG/PARAMETER_CHANGE: max_connections=3 max_worker_processes=8 max_wal_senders=2 max_prepared_xacts=0 max_locks_per_xact=10 wal_level=replica wal_log_hints=off track_commit_timestamp=off

WARNING:  recovery paused because of insufficient setting of parameter max_locks_per_transaction (currently 10)
DETAIL:  The value must be at least as high as on the primary server.
HINT:  Recovery cannot continue unless the parameter is changed and the server restarted.
CONTEXT:  WAL redo at 0/6004A80 for Standb


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Improve handling of parameter differences in physical replication

From
Peter Eisentraut
Date:
On 2020-03-10 09:57, Kyotaro Horiguchi wrote:
>> Well, I meant to periodically send warning messages while waiting for
>> the parameter change, that is, after exhausting resources and stopping
>> recovery.  In this situation the user needs to notice that as soon as
>> possible.
> 
> If we lose the connection, the standby continues to complain about the
> lost connection every 5 seconds.  This is a situation of that kind.

My argument is that it's not really the same.  If a standby is 
disconnected for more than a few minutes, it's really not going to be a 
good standby anymore after a while.  In this case, however, having 
certain parameter discrepancies is really harmless and you can run with 
it for a long time.  I'm not strictly opposed to a periodic warning, but 
it's unclear to me how we would find a good interval.

> By the way, when I reduced max_connections only on the master and then
> took exclusive locks until the standby complained about lock exhaustion,
> I saw a WARNING mentioning max_locks_per_transaction instead of
> max_connections.
> 
> 
> WARNING:  insufficient setting for parameter max_connections
> DETAIL:  max_connections = 2 is a lower setting than on the master server (where its value was 3).
> HINT:  Change parameters and restart the server, or there may be resource exhaustion errors sooner or later.
> CONTEXT:  WAL redo at 0/60000A0 for XLOG/PARAMETER_CHANGE: max_connections=3 max_worker_processes=8 max_wal_senders=2 max_prepared_xacts=0 max_locks_per_xact=10 wal_level=replica wal_log_hints=off track_commit_timestamp=off
>
> WARNING:  recovery paused because of insufficient setting of parameter max_locks_per_transaction (currently 10)
> DETAIL:  The value must be at least as high as on the primary server.
> HINT:  Recovery cannot continue unless the parameter is changed and the server restarted.
> CONTEXT:  WAL redo at 0/6004A80 for Standb

This is all a web of half-truths.  The lock tables are sized based on 
max_locks_per_xact * (MaxBackends + max_prepared_xacts).  So if you run 
out of lock space, we currently recommend (in the single-server case), 
that you raise max_locks_per_xact, but you could also raise 
max_prepared_xacts or something else.  So this is now the opposite case 
where the lock table on the master was bigger because of max_connections.
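
Concretely, the sizing is roughly this (per lock.c, quoted from memory, 
so check the source):

/* number of entries the shared lock tables are sized to hold */
#define NLOCKENTS() \
    mul_size(max_locks_per_xact, add_size(MaxBackends, max_prepared_xacts))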

We could make the advice less specific and just say, in essence, you 
need to make some parameter changes; see earlier for some hints.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Improve handling of parameter differences in physical replication

From
Kyotaro Horiguchi
Date:
At Tue, 10 Mar 2020 14:47:47 +0100, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote in 
> On 2020-03-10 09:57, Kyotaro Horiguchi wrote:
> >> Well, I meant to periodically send warning messages while waiting for
> >> the parameter change, that is, after exhausting resources and stopping
> >> recovery.  In this situation the user needs to notice that as soon as
> >> possible.
> > If we lose the connection, the standby continues to complain about the
> > lost connection every 5 seconds.  This is a situation of that kind.
> 
> My argument is that it's not really the same.  If a standby is
> disconnected for more than a few minutes, it's really not going to be
> a good standby anymore after a while.  In this case, however, having
> certain parameter discrepancies is really harmless and you can run
> with it for a long time.  I'm not strictly opposed to a periodic
> warning, but it's unclear to me how we would find a good interval.

I meant the behavior after streaming is paused.  That situation leads
to loss of WAL or to running out of WAL storage on the master.  Actually,
5 seconds as an interval would be too frequent, but maybe we need at
least one message per WAL segment?

> > By the way, when I reduced max_connections only on the master and then
> > took exclusive locks until the standby complained about lock exhaustion,
> > I saw a WARNING mentioning max_locks_per_transaction instead of
> > max_connections.
...
> > WARNING: recovery paused because of insufficient setting of parameter
> > max_locks_per_transaction (currently 10)
> > DETAIL:  The value must be at least as high as on the primary server.
> > HINT: Recovery cannot continue unless the parameter is changed and the
> > server restarted.
> > CONTEXT:  WAL redo at 0/6004A80 for Standb
> 
> This is all a web of half-truths.  The lock tables are sized based on
> max_locks_per_xact * (MaxBackends + max_prepared_xacts).  So if you
> run out of lock space, we currently recommend (in the single-server
> case), that you raise max_locks_per_xact, but you could also raise
> max_prepared_xacts or something else.  So this is now the opposite
> case where the lock table on the master was bigger because of
> max_connections.

Yeah, I know.  So I'm not sure whether the checks on individual GUC
variables (other than wal_level) make sense.  We might not even need
the WARNING on parameter change.

> We could make the advice less specific and just say, in essence, you
> need to make some parameter changes; see earlier for some hints.

In that sense the direction mentioned above seems sensible.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Improve handling of parameter differences in physical replication

From
Peter Eisentraut
Date:
Here is an updated patch that incorporates some of the suggestions.  In 
particular, some of the warning messages have been rephrased to be more 
accurate (but also less specific), the warning message at recovery pause 
repeats every 1 minute, and the documentation has been updated.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Improve handling of parameter differences in physical replication

From
Masahiko Sawada
Date:
On Thu, 12 Mar 2020 at 04:34, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
>
> Here is an updated patch that incorporates some of the suggestions.  In
> particular, some of the warning messages have been rephrased to be more
> accurate (but also less specific), the warning message at recovery pause
> repeats every 1 minute, and the documentation has been updated.
>

Thank you for updating the patch. I have one comment on the latest
version patch:

+   do
+   {
+       TimestampTz now = GetCurrentTimestamp();
+
+       if (TimestampDifferenceExceeds(last_warning, now, 60000))
+       {
+           ereport(WARNING,
+                   (errmsg("recovery paused because of insufficient parameter settings"),
+                    errdetail("See earlier in the log about which settings are insufficient."),
+                    errhint("Recovery cannot continue unless the configuration is changed and the server restarted.")));
+           last_warning = now;
+       }
+
+       pg_usleep(1000000L);    /* 1000 ms */
+       HandleStartupProcInterrupts();
+   }

I think we can set wait event WAIT_EVENT_RECOVERY_PAUSE here.
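
E.g. around the sleep, something like this (untested sketch):

+       pgstat_report_wait_start(WAIT_EVENT_RECOVERY_PAUSE);
+       pg_usleep(1000000L);    /* 1000 ms */
+       pgstat_report_wait_end();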

The others look good to me.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Improve handling of parameter differences in physical replication

From
Sergei Kornilov
Date:
Hello

> I think we can set wait event WAIT_EVENT_RECOVERY_PAUSE here.

+1, since we added this in recoveryPausesHere.

PS: do we need to add a prototype for the RecoveryRequiredIntParameter function at the top of xlog.c?

regards, Sergei



Re: Improve handling of parameter differences in physical replication

From
Peter Eisentraut
Date:
On 2020-03-27 20:15, Sergei Kornilov wrote:
>> I think we can set wait event WAIT_EVENT_RECOVERY_PAUSE here.
> 
> +1, since we added this in recoveryPausesHere.

committed with that addition

> PS: do we need to add a prototype for the RecoveryRequiredIntParameter function at the top of xlog.c?

There is no consistent style, I think, but I usually only add prototypes 
for static functions if they are required because of the ordering in the 
file.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Improve handling of parameter differences in physical replication

From
Peter Eisentraut
Date:
Here is another stab at this subject.

This is a much simplified variant: when encountering a parameter change 
in the WAL that is higher than the standby's current setting, we log a 
warning (instead of an error, as until now) and pause recovery.  If you 
resume (unpause) recovery, the instance shuts down as before.

This allows you to keep your standbys running for a bit (depending on 
lag requirements) and schedule the required restart more deliberately.
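
The core of it is a loop of roughly this shape (a sketch, not the exact 
patch; currValue/minValue stand for the standby's setting and the value 
from the WAL record):

if (currValue < minValue)
{
    ereport(WARNING,
            (errmsg("hot standby is not possible because %s = %d is a lower setting than on the primary server (its value was %d)",
                    param_name, currValue, minValue)));
    SetRecoveryPause(true);
    while (RecoveryIsPaused())
    {
        HandleStartupProcInterrupts();
        pg_usleep(1000000L);    /* 1 s */
    }
    ereport(FATAL,
            (errmsg("recovery aborted because of insufficient parameter settings"),
             errhint("You can restart the server after making the necessary configuration changes.")));
}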

I had previously suggested making this new behavior configurable, but 
there didn't seem to be much interest in that, so I have not included 
it here.

The documentation changes are mostly carried over from previous patch 
versions (but adjusted for the actual behavior of the patch).

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Improve handling of parameter differences in physical replication

From
Peter Eisentraut
Date:
Here is a minimally updated new patch version to resolve a merge conflict.

On 2020-06-24 10:00, Peter Eisentraut wrote:
> Here is another stab at this subject.
> 
> This is a much simplified variant: when encountering a parameter change
> in the WAL that is higher than the standby's current setting, we log a
> warning (instead of an error, as until now) and pause recovery.  If you
> resume (unpause) recovery, the instance shuts down as before.
> 
> This allows you to keep your standbys running for a bit (depending on
> lag requirements) and schedule the required restart more deliberately.
> 
> I had previously suggested making this new behavior configurable, but
> there didn't seem to be much interest in that, so I have not included
> it here.
> 
> The documentation changes are mostly carried over from previous patch
> versions (but adjusted for the actual behavior of the patch).

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Improve handling of parameter differences in physical replication

From
Sergei Kornilov
Date:
Hello

Thank you! I'm on vacation, so I was finally able to review the patch.

Seems the WAIT_EVENT_RECOVERY_PAUSE addition was lost during the patch simplification.

>         ereport(FATAL,
>                (errmsg("recovery aborted because of insufficient parameter settings"),
>                 errhint("You can restart the server after making the necessary configuration changes.")));

I think we should repeat the conflicting param_name and minValue here.
pg_wal_replay_resume can be called days after recovery was paused, and
the initial message can be difficult to find.
 
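For instance (untested; currValue stands for whatever the check function
compares against minValue):

ereport(FATAL,
        (errmsg("recovery aborted because of insufficient parameter settings"),
         errdetail("%s = %d is a lower setting than on the primary server, where its value was %d.",
                   param_name, currValue, minValue),
         errhint("You can restart the server after making the necessary configuration changes.")));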

> errmsg("recovery will be paused")

Maybe use the same "recovery has paused" as in recoveryPausesHere?  It
doesn't seem to make any difference, since we set pause right after that,
but it will be a little less work for translators.

Not sure about "If recovery is unpaused".  The word "resumed" seems to be what the docs usually use.

regards, Sergei



Re: Improve handling of parameter differences in physical replication

From
Peter Eisentraut
Date:
On 2020-11-19 20:17, Sergei Kornilov wrote:
> Seems the WAIT_EVENT_RECOVERY_PAUSE addition was lost during the patch simplification.

added

>>         ereport(FATAL,
>>                 (errmsg("recovery aborted because of insufficient parameter settings"),
>>                  errhint("You can restart the server after making the necessary configuration changes.")));
> 
> I think we should repeat the conflicting param_name and minValue here.
> pg_wal_replay_resume can be called days after recovery was paused, and
> the initial message can be difficult to find.

done

> 
>> errmsg("recovery will be paused")
> 
> Maybe use the same "recovery has paused" as in recoveryPausesHere?  It
> doesn't seem to make any difference, since we set pause right after
> that, but it will be a little less work for translators.

done

> Not sure about "If recovery is unpaused".  The word "resumed" seems to be what the docs usually use.

I think I like "unpaused" better here, because "resumed" would seem to 
imply that recovery can actually continue.

One thing that has not been added to my patch is the equivalent of 
496ee647ecd2917369ffcf1eaa0b2cdca07c8730, which allows promotion while 
recovery is paused.  I'm not sure that would be necessary, and it 
doesn't look easy to add either.

-- 
Peter Eisentraut
2ndQuadrant, an EDB company
https://www.2ndquadrant.com/

Attachment

Re: Improve handling of parameter differences in physical replication

From
Sergei Kornilov
Date:
Hello

> I think I like "unpaused" better here, because "resumed" would seem to
> imply that recovery can actually continue.

Good, I agree.

> One thing that has not been added to my patch is the equivalent of
> 496ee647ecd2917369ffcf1eaa0b2cdca07c8730, which allows promotion while
> recovery is paused. I'm not sure that would be necessary, and it
> doesn't look easy to add either.

Hmm... Good question.  How about putting CheckForStandbyTrigger() in the
wait loop, but reporting FATAL with an appropriate message, such as
"promotion is not possible because of insufficient parameter settings"?
 
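Roughly (untested sketch):

while (RecoveryIsPaused())
{
    HandleStartupProcInterrupts();
    if (CheckForStandbyTrigger())
        ereport(FATAL,
                (errmsg("promotion is not possible because of insufficient parameter settings"),
                 errhint("Restart the server after making the necessary configuration changes.")));
    pg_usleep(1000000L);    /* 1 s */
}
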
It also suits me if we only document that we ignore promote here.  I
don't think this is an important case.  And yes, it's not easy to allow
promotion, since we have already updated the control file.
 

Probably we need to pause only after we have reached consistency?

2020-11-20 18:10:23.617 MSK 19722 @ from  [vxid: txid:0] [] LOG:  entering standby mode
2020-11-20 18:10:23.632 MSK 19722 @ from  [vxid: txid:0] [] WARNING:  hot standby is not possible because of insufficient parameter settings
2020-11-20 18:10:23.632 MSK 19722 @ from  [vxid: txid:0] [] DETAIL:  max_connections = 100 is a lower setting than on the primary server, where its value was 150.
2020-11-20 18:10:23.632 MSK 19722 @ from  [vxid: txid:0] [] LOG:  recovery has paused
2020-11-20 18:10:23.632 MSK 19722 @ from  [vxid: txid:0] [] DETAIL:  If recovery is unpaused, the server will shut down.
2020-11-20 18:10:23.632 MSK 19722 @ from  [vxid: txid:0] [] HINT:  You can then restart the server after making the necessary configuration changes.
2020-11-20 18:13:09.767 MSK 19755 melkij@postgres from [local] [vxid: txid:0] [] FATAL:  the database system is starting up

regards, Sergei



Re: Improve handling of parameter differences in physical replication

From
Peter Eisentraut
Date:
On 2020-11-20 16:47, Sergei Kornilov wrote:
> Hmm... Good question.  How about putting CheckForStandbyTrigger() in the
> wait loop, but reporting FATAL with an appropriate message, such as
> "promotion is not possible because of insufficient parameter settings"?
>
> It also suits me if we only document that we ignore promote here.  I
> don't think this is an important case.  And yes, it's not easy to allow
> promotion, since we have already updated the control file.
>
> Probably we need to pause only after we have reached consistency?

Here is an updated patch that implements both of these points.  I have 
used "hot standby active" instead of "reached consistency".  I guess 
arguments could be made either way, but the original use case really 
cared about hot standby.

-- 
Peter Eisentraut
2ndQuadrant, an EDB company
https://www.2ndquadrant.com/

Attachment

Re: Improve handling of parameter differences in physical replication

From
Sergei Kornilov
Date:
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           not tested
Documentation:            tested, passed

Hello
Looks good to me.  I think the patch is ready for committer.

The new status of this patch is: Ready for Committer

Re: Improve handling of parameter differences in physical replication

From
Peter Eisentraut
Date:
On 2021-01-15 12:28, Sergei Kornilov wrote:
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:       tested, passed
> Spec compliant:           not tested
> Documentation:            tested, passed
> 
> Hello
> Looks good to me.  I think the patch is ready for committer.
> 
> The new status of this patch is: Ready for Committer

committed

-- 
Peter Eisentraut
2ndQuadrant, an EDB company
https://www.2ndquadrant.com/