Thread: Re: pgsql: Improve handling of parameter differences in physical replication

Re: pgsql: Improve handling of parameter differences in physical replication

From: Andres Freund

Hi,

On 2020-03-30 19:41:43 +0900, Fujii Masao wrote:
> On 2020/03/30 16:58, Peter Eisentraut wrote:
> > Improve handling of parameter differences in physical replication
> > 
> > When certain parameters are changed on a physical replication primary,
> > this is communicated to standbys using the XLOG_PARAMETER_CHANGE WAL
> > record.  The standby then checks whether its own settings are at least
> > as big as the ones on the primary.  If not, the standby shuts down
> > with a fatal error.
> > 
> > The correspondence of settings between primary and standby is required
> > because those settings influence certain shared memory sizings that
> > are required for processing WAL records that the primary might send.
> > For example, if the primary sends a prepared transaction, the standby
> > must have had max_prepared_transactions set appropriately or it won't
> > be able to process those WAL records.
> > 
> > However, fatally shutting down the standby immediately upon receipt of
> > the parameter change record might be a bit of an overreaction.  The
> > resources related to those settings are not required immediately at
> > that point, and might never be required if the activity on the primary
> > does not exhaust all those resources.  If we just let the standby roll
> > on with recovery, it will eventually produce an appropriate error when
> > those resources are used.
> > 
> > So this patch relaxes this a bit.  Upon receipt of
> > XLOG_PARAMETER_CHANGE, we still check the settings but only issue a
> > warning and set a global flag if there is a problem.  Then when we
> > actually hit the resource issue and the flag was set, we issue another
> > warning message with relevant information.

I find it somewhat hostile that we don't display the actual resource
error once the problem is hit - we just pause. Sure, there's going to be
some previous log entry explaining what the actual parameter difference
is - but that could have been weeks ago. So either hard to find, or even
rotated out.


> > At that point we pause recovery, so a hot standby remains usable.
> > We also repeat the last warning message once a minute so it is
> > harder to miss or ignore.


I can't really imagine that the adjustments made in this patch are
sufficient.

One important issue seems to me to be the size of the array that
TransactionIdIsInProgress() allocates:
    /*
     * If first time through, get workspace to remember main XIDs in. We
     * malloc it permanently to avoid repeated palloc/pfree overhead.
     */
    if (xids == NULL)
    {
        /*
         * In hot standby mode, reserve enough space to hold all xids in the
         * known-assigned list. If we later finish recovery, we no longer need
         * the bigger array, but we don't bother to shrink it.
         */
        int            maxxids = RecoveryInProgress() ? TOTAL_MAX_CACHED_SUBXIDS : arrayP->maxProcs;

        xids = (TransactionId *) malloc(maxxids * sizeof(TransactionId));
        if (xids == NULL)
            ereport(ERROR,
                    (errcode(ERRCODE_OUT_OF_MEMORY),
                     errmsg("out of memory")));
    }

Which I think means we'll just overrun the xids array in some cases,
e.g. if KnownAssignedXids overflowed.  Obviously we should have a
crosscheck in the code (which we don't), but it was previously a
supposedly unreachable path.
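A cross-check of the kind mentioned here might look roughly like this (a hypothetical sketch, not code from the tree; `copy_known_xids` and its signature are invented for illustration):

```c
#include <stddef.h>

typedef unsigned int TransactionId;

/*
 * Hypothetical guard: refuse to copy more XIDs than the caller's
 * workspace can hold, instead of silently overrunning the array.
 * Returns the number of XIDs copied, or -1 on overflow, which the
 * caller would have to turn into a hard error (ereport).
 */
static int
copy_known_xids(TransactionId *dst, int dst_capacity,
                const TransactionId *src, int nsrc)
{
    if (nsrc > dst_capacity)
        return -1;
    for (int i = 0; i < nsrc; i++)
        dst[i] = src[i];
    return nsrc;
}
```

The point being that the overflow case has to be detected and reported, not treated as unreachable.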

Similarly, the allocation in GetSnapshotData() will be too small, I
think:
    if (snapshot->xip == NULL)
    {
        /*
         * First call for this snapshot. Snapshot is same size whether or not
         * we are in recovery, see later comments.
         */
        snapshot->xip = (TransactionId *)
            malloc(GetMaxSnapshotXidCount() * sizeof(TransactionId));
        if (snapshot->xip == NULL)
            ereport(ERROR,
                    (errcode(ERRCODE_OUT_OF_MEMORY),
                     errmsg("out of memory")));
        Assert(snapshot->subxip == NULL);
        snapshot->subxip = (TransactionId *)
            malloc(GetMaxSnapshotSubxidCount() * sizeof(TransactionId));
        if (snapshot->subxip == NULL)
            ereport(ERROR,
                    (errcode(ERRCODE_OUT_OF_MEMORY),
                     errmsg("out of memory")));
    }

I think basically all code using TOTAL_MAX_CACHED_SUBXIDS,
GetMaxSnapshotSubxidCount(), PROCARRAY_MAXPROCS needs to be reviewed
much more carefully than done here.
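To see why these limits are load-bearing: they are all derived from max_connections (plus related settings), so a standby started with a smaller max_connections sizes these arrays for less than the primary's WAL can legitimately contain. A deliberately simplified model of the dependency (the real PROCARRAY_MAXPROCS in procarray.c also counts autovacuum workers, background workers, and prepared transactions):

```c
/*
 * PGPROC_MAX_CACHED_SUBXIDS is a real PostgreSQL constant (64); the
 * backend count below is simplified relative to the real formula.
 * One subxid cache per backend, plus one top-level XID each.
 */
#define PGPROC_MAX_CACHED_SUBXIDS 64

static int
toy_total_max_cached_subxids(int max_backends)
{
    return (PGPROC_MAX_CACHED_SUBXIDS + 1) * max_backends;
}
```

With max_connections=100 on the primary and 1 on the standby, the standby's arrays are sized two orders of magnitude smaller than what replay may need.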


Also, shouldn't dynahash be adjusted as well? There's e.g. the
following HASH_ENTER path:
                /* report a generic message */
                if (hashp->isshared)
                    ereport(ERROR,
                            (errcode(ERRCODE_OUT_OF_MEMORY),
                             errmsg("out of shared memory")));


I'm also not sure it's ok to not have the waiting in
ProcArrayAdd(). Is it guaranteed that can't be hit now, due to the WAL
replay path sometimes adding procs?


How is it safe to just block in StandbyParamErrorPauseRecovery() before
raising an error? The error would have released lwlocks, but now we
don't? If we e.g. block in PrepareRedoAdd() we'll continue to hold
TwoPhaseStateLock(), but shutting the database down might also acquire
that (CheckPointTwoPhase()). Similar with the error in
KnownAssignedXidsAdd().
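The hazard is easiest to see in miniature: raising an error releases the LWLocks the erroring backend holds, but pausing does not, so any later process needing one of those locks waits forever. A toy model (illustrative names only, not PostgreSQL APIs):

```c
#include <stdbool.h>

typedef struct { bool held; } ToyLock;

static bool
toy_try_acquire(ToyLock *l)
{
    if (l->held)
        return false;           /* the real system would block here */
    l->held = true;
    return true;
}

static void
toy_release(ToyLock *l)
{
    l->held = false;
}
```

If redo pauses while still holding the lock, a shutdown checkpoint that needs the same lock can never acquire it; erroring out would have released it first.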


This does not seem ready.


> I encountered the trouble maybe related to this commit.
> 
> Firstly I set up the master and the standby with max_connections=100 (default value).
> Then I decreased max_connections to 1 only in the standby and restarted
> the server. Thanks to the commit, I saw the following warning message
> in the standby.
> 
>     WARNING:  insufficient setting for parameter max_connections
>     DETAIL:  max_connections = 1 is a lower setting than on the master server (where its value was 100).
>     HINT:  Change parameters and restart the server, or there may be resource exhaustion errors sooner or later.
> 
> Then I made the script that inserted 1,000,000 rows in one transaction,
> and ran it 30 times at the same time. That is, 30 transactions inserting
> lots of rows were running at the same time.
> 
> I confirmed that the expected number of rows was present on the master,
> but unexpectedly found 0 rows on the standby.

Have the relevant records actually been replicated?


> Also I suspected that issue
> happened because recovery is paused, but pg_is_wal_replay_paused()
> returned false in the standby.

As far as I understand the commit it shouldn't cause a pause until
there's an actual resource exhaustion - in which case there should be
another message first (WARNING: recovery paused because of insufficient
parameter settings).


Greetings,

Andres Freund



Re: pgsql: Improve handling of parameter differences in physical replication

From: Fujii Masao

On 2020/03/31 3:10, Andres Freund wrote:
> Hi,
> 
> On 2020-03-30 19:41:43 +0900, Fujii Masao wrote:
>> On 2020/03/30 16:58, Peter Eisentraut wrote:
>>> Improve handling of parameter differences in physical replication
>>>
>>> When certain parameters are changed on a physical replication primary,
>>> this is communicated to standbys using the XLOG_PARAMETER_CHANGE WAL
>>> record.  The standby then checks whether its own settings are at least
>>> as big as the ones on the primary.  If not, the standby shuts down
>>> with a fatal error.
>>>
>>> The correspondence of settings between primary and standby is required
>>> because those settings influence certain shared memory sizings that
>>> are required for processing WAL records that the primary might send.
>>> For example, if the primary sends a prepared transaction, the standby
>>> must have had max_prepared_transactions set appropriately or it won't
>>> be able to process those WAL records.
>>>
>>> However, fatally shutting down the standby immediately upon receipt of
>>> the parameter change record might be a bit of an overreaction.  The
>>> resources related to those settings are not required immediately at
>>> that point, and might never be required if the activity on the primary
>>> does not exhaust all those resources.  If we just let the standby roll
>>> on with recovery, it will eventually produce an appropriate error when
>>> those resources are used.
>>>
>>> So this patch relaxes this a bit.  Upon receipt of
>>> XLOG_PARAMETER_CHANGE, we still check the settings but only issue a
>>> warning and set a global flag if there is a problem.  Then when we
>>> actually hit the resource issue and the flag was set, we issue another
>>> warning message with relevant information.
> 
> I find it somewhat hostile that we don't display the actual resource
> error once the problem is hit - we just pause. Sure, there's going to be
> some previous log entry explaining what the actual parameter difference
> is - but that could have been weeks ago. So either hard to find, or even
> rotated out.
> 
> 
>>> At that point we pause recovery, so a hot standby remains usable.
>>> We also repeat the last warning message once a minute so it is
>>> harder to miss or ignore.
> 
> 
> I can't really imagine that the adjustments made in this patch are
> sufficient.
> 
> One important issue seems to me to be the size of the array that
> TransactionIdIsInProgress() allocates:
>     /*
>      * If first time through, get workspace to remember main XIDs in. We
>      * malloc it permanently to avoid repeated palloc/pfree overhead.
>      */
>     if (xids == NULL)
>     {
>         /*
>          * In hot standby mode, reserve enough space to hold all xids in the
>          * known-assigned list. If we later finish recovery, we no longer need
>          * the bigger array, but we don't bother to shrink it.
>          */
>         int            maxxids = RecoveryInProgress() ? TOTAL_MAX_CACHED_SUBXIDS : arrayP->maxProcs;
> 
>         xids = (TransactionId *) malloc(maxxids * sizeof(TransactionId));
>         if (xids == NULL)
>             ereport(ERROR,
>                     (errcode(ERRCODE_OUT_OF_MEMORY),
>                      errmsg("out of memory")));
>     }
> 
> Which I think means we'll just overrun the xids array in some cases,
> e.g. if KnownAssignedXids overflowed.  Obviously we should have a
> crosscheck in the code (which we don't), but it was previously a
> supposedly unreachable path.
> 
> Similarly, the allocation in GetSnapshotData() will be too small, I
> think:
>     if (snapshot->xip == NULL)
>     {
>         /*
>          * First call for this snapshot. Snapshot is same size whether or not
>          * we are in recovery, see later comments.
>          */
>         snapshot->xip = (TransactionId *)
>             malloc(GetMaxSnapshotXidCount() * sizeof(TransactionId));
>         if (snapshot->xip == NULL)
>             ereport(ERROR,
>                     (errcode(ERRCODE_OUT_OF_MEMORY),
>                      errmsg("out of memory")));
>         Assert(snapshot->subxip == NULL);
>         snapshot->subxip = (TransactionId *)
>             malloc(GetMaxSnapshotSubxidCount() * sizeof(TransactionId));
>         if (snapshot->subxip == NULL)
>             ereport(ERROR,
>                     (errcode(ERRCODE_OUT_OF_MEMORY),
>                      errmsg("out of memory")));
>     }
> 
> I think basically all code using TOTAL_MAX_CACHED_SUBXIDS,
> GetMaxSnapshotSubxidCount(), PROCARRAY_MAXPROCS needs to be reviewed
> much more carefully than done here.
> 
> 
> Also, shouldn't dynahash be adjusted as well? There's e.g. the
> following HASH_ENTER path:
>                 /* report a generic message */
>                 if (hashp->isshared)
>                     ereport(ERROR,
>                             (errcode(ERRCODE_OUT_OF_MEMORY),
>                              errmsg("out of shared memory")));
> 
> 
> I'm also not sure it's ok to not have the waiting in
> ProcArrayAdd(). Is it guaranteed that can't be hit now, due to the WAL
> replay path sometimes adding procs?
> 
> 
> How is it safe to just block in StandbyParamErrorPauseRecovery() before
> raising an error? The error would have released lwlocks, but now we
> don't? If we e.g. block in PrepareRedoAdd() we'll continue to hold
> TwoPhaseStateLock(), but shutting the database down might also acquire
> that (CheckPointTwoPhase()). Similar with the error in
> KnownAssignedXidsAdd().
> 
> 
> This does not seem ready.
> 
> 
>> I encountered the trouble maybe related to this commit.
>>
>> Firstly I set up the master and the standby with max_connections=100 (default value).
>> Then I decreased max_connections to 1 only in the standby and restarted
>> the server. Thanks to the commit, I saw the following warning message
>> in the standby.
>>
>>      WARNING:  insufficient setting for parameter max_connections
>>      DETAIL:  max_connections = 1 is a lower setting than on the master server (where its value was 100).
>>      HINT:  Change parameters and restart the server, or there may be resource exhaustion errors sooner or later.
>>
>> Then I made the script that inserted 1,000,000 rows in one transaction,
>> and ran it 30 times at the same time. That is, 30 transactions inserting
>> lots of rows were running at the same time.
>>
>> I confirmed that the expected number of rows was present on the master,
>> but unexpectedly found 0 rows on the standby.
> 
> Have the relevant records actually been replicated?

Yeah, I thought I had confirmed that, but when I tried to reproduce
the issue in a clean environment, I failed to do so. Which means
I made a mistake... :( Sorry for the noise.

Regards,

-- 
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters



Re: pgsql: Improve handling of parameter differences in physical replication

From: Robert Haas

On Mon, Mar 30, 2020 at 2:10 PM Andres Freund <andres@anarazel.de> wrote:
> One important issue seems to me to be the size of the array that
> TransactionIdIsInProgress() allocates:
>         /*
>          * If first time through, get workspace to remember main XIDs in. We
>          * malloc it permanently to avoid repeated palloc/pfree overhead.
>          */
>         if (xids == NULL)
>         {
>                 /*
>                  * In hot standby mode, reserve enough space to hold all xids in the
>                  * known-assigned list. If we later finish recovery, we no longer need
>                  * the bigger array, but we don't bother to shrink it.
>                  */
>                 int                     maxxids = RecoveryInProgress() ? TOTAL_MAX_CACHED_SUBXIDS : arrayP->maxProcs;
>
>                 xids = (TransactionId *) malloc(maxxids * sizeof(TransactionId));
>                 if (xids == NULL)
>                         ereport(ERROR,
>                                         (errcode(ERRCODE_OUT_OF_MEMORY),
>                                          errmsg("out of memory")));
>         }
>
> Which I think means we'll just overrun the xids array in some cases,
> e.g. if KnownAssignedXids overflowed.  Obviously we should have a
> crosscheck in the code (which we don't), but it was previously a
> supposedly unreachable path.

I think this patch needs to be reverted. The only places where it
changes anything are places where we were about to throw some error
anyway. But as Andres's analysis shows, that's not nearly good enough.
I am kind of surprised that Peter thought that would be good enough.
It is necessary, for something like this, to investigate all the
places where the code may be relying on a certain assumption, not just
assume that there's an error check everywhere that we rely on that
assumption and change only those places.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pgsql: Improve handling of parameter differences in physical replication

From: Peter Eisentraut

On 2020-04-03 19:55, Robert Haas wrote:
> I think this patch needs to be reverted. The only places where it
> changes anything are places where we were about to throw some error
> anyway. But as Andres's analysis shows, that's not nearly good enough.

OK, reverted.



Re: pgsql: Improve handling of parameter differences in physical replication

From: Peter Eisentraut

On 2020-03-30 20:10, Andres Freund wrote:
> Also, shouldn't dynahash be adjusted as well? There's e.g. the
> following HASH_ENTER path:
>                 /* report a generic message */
>                 if (hashp->isshared)
>                     ereport(ERROR,
>                             (errcode(ERRCODE_OUT_OF_MEMORY),
>                              errmsg("out of shared memory")));

Could you explain further what you mean by this?  I don't understand how 
this is related.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services