Thread: conflict with recovery when delay is gone

conflict with recovery when delay is gone

From

Radoslav Nedyalkov

Date:

13 November 2020, 13:24:14

Hi Forum, all,

On a very busy master-standby setup that runs typical OLAP processing (long-running statements with massive writes), we are getting the following on the standby:

ERROR: canceling statement due to conflict with recovery
FATAL: terminating connection due to conflict with recovery

The weird thing is that the cancellations usually happen after the standby has experienced a huge delay (2h), which is still below the allowed maximum (3h). Even recently started statements get cancelled when the delay is already back at zero.

Sometimes the situation relaxes after an hour or so. Restarting the server helps instantly.

It is PG 11.8 on CentOS 7 with huge pages, shared_buffers 196G out of 748G.

What phenomenon could we be facing?

Thank you,

Rado

Re: conflict with recovery when delay is gone

From

Laurenz Albe

Date:

13 November 2020, 17:37:36

On Fri, 2020-11-13 at 15:24 +0200, Radoslav Nedyalkov wrote:
> On a very busy master-standby setup which runs typical olap processing -
> long living , massive writes statements,  we're getting on the standby:
> 
>  ERROR:  canceling statement due to conflict with recovery
>  FATAL:  terminating connection due to conflict with recovery
> 
> The weird thing is that cancellations happen usually after standby has experienced 
> some huge delay(2h), still not at the allowed maximum(3h). Even recently run statements
> got cancelled when the delay is already at zero.
> 
> Sometimes the situation got relaxed after an hour or so.
> Restarting the server instantly helps.
> 
> It is pg11.8, centos7, hugepages, shared_buffers 196G from 748G.
> 
> What phenomenon could we be facing?

Hard to say.  Perhaps an unusual kind of replication conflict?

What is in "pg_stat_database_conflicts" on the standby server?

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com

Re: conflict with recovery when delay is gone

From

Radoslav Nedyalkov

Date:

13 November 2020, 18:13:42

On Fri, Nov 13, 2020 at 7:37 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> Hard to say. Perhaps an unusual kind of replication conflict?
>
> What is in "pg_stat_database_conflicts" on the standby server?

db01=# select * from pg_stat_database_conflicts;
 datid |  datname  | confl_tablespace | confl_lock | confl_snapshot | confl_bufferpin | confl_deadlock
-------+-----------+------------------+------------+----------------+-----------------+----------------
 13877 | template0 |                0 |          0 |              0 |               0 |              0
 16400 | template1 |                0 |          0 |              0 |               0 |              0
 16402 | postgres  |                0 |          0 |              0 |               0 |              0
 16401 | db01      |                0 |          0 |             51 |               0 |              0
(4 rows)

On a freshly restarted standby we have just seen similar behaviour after a 2-hour delay and a slow catch-up.

confl_snapshot is 51, and we have exactly the same number of cancelled statements.

Re: conflict with recovery when delay is gone

From

Radoslav Nedyalkov

Date:

14 November 2020, 19:45:35

On Fri, Nov 13, 2020 at 8:13 PM Radoslav Nedyalkov <rnedyalkov@gmail.com> wrote:
> On a freshly restarted standby we have just seen similar behaviour after a 2-hour delay and a slow catch-up.
> confl_snapshot is 51, and we have exactly the same number of cancelled statements.

No luck so far. Searching for an explanation, I found that we fall into the unexplained case where snapshot conflicts happen even though hot_standby_feedback is on.

Thanks,

Rado
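
For readers following the thread, here is a minimal sketch of one way to check whether the standby's feedback is actually reaching the primary. It is not taken from the thread. It assumes PostgreSQL 11 catalog names and that the standby connects via streaming replication. Whether a replication slot is in use is not stated in the thread, so both places where the reported xmin can appear are queried.

```sql
-- On the standby: confirm the setting is in effect and look at the replay delay.
-- Note: the delay figure also grows while the primary is idle, so read it with care.
SHOW hot_standby_feedback;
SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay;

-- On the primary: without a replication slot, the xmin reported by the
-- standby's feedback shows up as backend_xmin for its walsender.
SELECT application_name, state, backend_xmin
FROM pg_stat_replication;

-- On the primary: with a physical replication slot, the reported xmin is
-- held by the slot instead (assumption: a slot may or may not exist here).
SELECT slot_name, active, xmin
FROM pg_replication_slots;
```

If neither xmin is set while queries are running on the standby, feedback is probably not arriving, which would be one possible reason for snapshot conflicts despite hot_standby_feedback being on.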

Re: conflict with recovery when delay is gone

From

Mohamed Wael Khobalatte

Date:

14 November 2020, 22:48:38

On Sat, Nov 14, 2020 at 2:46 PM Radoslav Nedyalkov <rnedyalkov@gmail.com> wrote:
> No luck so far. Searching for an explanation, I found that we fall into the unexplained case where
> snapshot conflicts happen even though hot_standby_feedback is on.

Perhaps you have a value set for old_snapshot_threshold? If not, do the walreceiver connections drop out?

Re: conflict with recovery when delay is gone

From

Radoslav Nedyalkov

Date:

15 November 2020, 11:49:33

On Sun, Nov 15, 2020 at 12:48 AM Mohamed Wael Khobalatte <mkhobalatte@grubhub.com> wrote:
> Perhaps you have a value set for old_snapshot_threshold? If not, do the walreceiver connections drop out?

old_snapshot_threshold is -1 on both master and replica.

The walreceiver connection does not drop.
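
The two checks answered above can be reproduced roughly as follows. This is a sketch rather than the commands actually used in the thread. It assumes PostgreSQL 11 view and column names, and it assumes that the "allowed maximum (3h)" mentioned earlier refers to max_standby_streaming_delay, which the thread does not state.

```sql
-- On the standby: settings relevant to query cancellation during recovery.
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('old_snapshot_threshold',
               'hot_standby_feedback',
               'max_standby_streaming_delay',
               'max_standby_archive_delay');

-- On the standby: walreceiver state. A row with status 'streaming' and a
-- recent last_msg_receipt_time suggests the connection is currently up.
SELECT status, last_msg_receipt_time, latest_end_time
FROM pg_stat_wal_receiver;
```

An empty result from pg_stat_wal_receiver would mean no walreceiver process is running at that moment. Brief drops in the past are easier to find in the standby's server log than in this view.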