Thread: conflict with recovery when delay is gone

conflict with recovery when delay is gone

From
Radoslav Nedyalkov
Date:
Hi Forum, all,
On a very busy master-standby setup that runs typical OLAP processing
(long-running statements with massive writes), we're getting on the standby:

 ERROR:  canceling statement due to conflict with recovery
 FATAL:  terminating connection due to conflict with recovery

The weird thing is that the cancellations usually happen after the standby has experienced
a huge delay (2h), still below the allowed maximum (3h). Even recently started statements
get cancelled when the delay is already back at zero.

Sometimes the situation relaxes after an hour or so.
Restarting the server helps instantly.

It is PostgreSQL 11.8 on CentOS 7, with huge pages and shared_buffers of 196G out of 748G.

What phenomenon could we be facing?

Thank you,
Rado

Re: conflict with recovery when delay is gone

From
Laurenz Albe
Date:
On Fri, 2020-11-13 at 15:24 +0200, Radoslav Nedyalkov wrote:
> On a very busy master-standby setup that runs typical OLAP processing
> (long-running statements with massive writes), we're getting on the standby:
> 
>  ERROR:  canceling statement due to conflict with recovery
>  FATAL:  terminating connection due to conflict with recovery
> 
> The weird thing is that the cancellations usually happen after the standby has experienced
> a huge delay (2h), still below the allowed maximum (3h). Even recently started statements
> get cancelled when the delay is already back at zero.
> 
> Sometimes the situation relaxes after an hour or so.
> Restarting the server helps instantly.
> 
> It is PostgreSQL 11.8 on CentOS 7, with huge pages and shared_buffers of 196G out of 748G.
> 
> What phenomenon could we be facing?

Hard to say.  Perhaps an unusual kind of replication conflict?

What is in "pg_stat_database_conflicts" on the standby server?

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com




Re: conflict with recovery when delay is gone

From
Radoslav Nedyalkov
Date:


On Fri, Nov 13, 2020 at 7:37 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> Hard to say.  Perhaps an unusual kind of replication conflict?
>
> What is in "pg_stat_database_conflicts" on the standby server?

db01=# select * from pg_stat_database_conflicts;
 datid |  datname  | confl_tablespace | confl_lock | confl_snapshot | confl_bufferpin | confl_deadlock
-------+-----------+------------------+------------+----------------+-----------------+----------------
 13877 | template0 |                0 |          0 |              0 |               0 |              0
 16400 | template1 |                0 |          0 |              0 |               0 |              0
 16402 | postgres  |                0 |          0 |              0 |               0 |              0
 16401 | db01      |                0 |          0 |             51 |               0 |              0
(4 rows)

On a freshly restarted standby we've just seen similar behaviour after a 2-hour delay and a slow catch-up.
confl_snapshot is 51, and exactly the same number of statements were cancelled.
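
For anyone reproducing this, here is a minimal sketch of how the replay delay and the relevant standby settings could be checked alongside the conflict counters. It assumes that the "allowed maximum (3h)" mentioned earlier refers to max_standby_streaming_delay (or max_standby_archive_delay), which the thread does not state explicitly. Both queries are meant to be run on the standby.

```sql
-- Approximate replay delay: time elapsed since the last replayed commit.
-- Caveat: on an idle master this value keeps growing even when the
-- standby is fully caught up, so read it together with the WAL positions.
select now() - pg_last_xact_replay_timestamp() as replay_delay;

-- Settings that govern how long WAL replay waits for conflicting queries,
-- and whether the standby reports its xmin back to the master.
select name, setting, unit
from pg_settings
where name in ('max_standby_streaming_delay',
               'max_standby_archive_delay',
               'hot_standby_feedback');
```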


Re: conflict with recovery when delay is gone

From
Radoslav Nedyalkov
Date:


On Fri, Nov 13, 2020 at 8:13 PM Radoslav Nedyalkov <rnedyalkov@gmail.com> wrote:


> On a freshly restarted standby we've just seen similar behaviour after a 2-hour delay and a slow catch-up.
> confl_snapshot is 51, and exactly the same number of statements were cancelled.


No luck so far. Searching for an explanation, I found that we fall into the unexplained case where
snapshot conflicts happen even though hot_standby_feedback is on.
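
One way to confirm that the feedback actually reaches the master is to look at the xmin it has recorded for the standby. This is only a sketch: whether a replication slot is in use here is not stated in the thread, so both places are queried.

```sql
-- Run on the master. backend_xmin is the standby's xmin horizon as reported
-- through hot_standby_feedback; NULL suggests no feedback is recorded here.
select application_name, state, backend_xmin
from pg_stat_replication;

-- If the standby streams through a replication slot (an assumption),
-- the reported xmin may show up on the slot instead.
select slot_name, active, xmin
from pg_replication_slots;
```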

Thanks,
Rado
 

Re: conflict with recovery when delay is gone

From
Mohamed Wael Khobalatte
Date:


On Sat, Nov 14, 2020 at 2:46 PM Radoslav Nedyalkov <rnedyalkov@gmail.com> wrote:



> No luck so far. Searching for an explanation, I found that we fall into the unexplained case where
> snapshot conflicts happen even though hot_standby_feedback is on.
>
> Thanks,
> Rado
 

Perhaps you have a value set for old_snapshot_threshold? If not, do the walreceiver connections drop out? 

Re: conflict with recovery when delay is gone

From
Radoslav Nedyalkov
Date:


On Sun, Nov 15, 2020 at 12:48 AM Mohamed Wael Khobalatte <mkhobalatte@grubhub.com> wrote:



> Perhaps you have a value set for old_snapshot_threshold? If not, do the walreceiver connections drop out?
 
old_snapshot_threshold is -1 on both master and replica.
The walreceiver connection does not drop.
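
For reference, the two checks above could be done roughly as follows. This is a sketch for PostgreSQL 11, not necessarily the exact commands that were used.

```sql
-- Run on both master and replica: -1 means the feature is disabled.
show old_snapshot_threshold;

-- Run on the replica: a single row with status 'streaming' is expected
-- while the walreceiver is connected; a changing pid over time would
-- indicate that the walreceiver has been restarted.
select pid, status, last_msg_receipt_time
from pg_stat_wal_receiver;
```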