Re: conflict with recovery when delay is gone - Mailing list pgsql-general

From	Radoslav Nedyalkov
Subject	Re: conflict with recovery when delay is gone
Date	November 15, 2020 14:49:33
Msg-id	CANhtRia0Gu+qVVHoUWtj59pDN8yqowSC4qjmqCrLMgMPR-=pHQ@mail.gmail.com
In response to	Re: conflict with recovery when delay is gone (Mohamed Wael Khobalatte <mkhobalatte@grubhub.com>)
List	pgsql-general

On Sun, Nov 15, 2020 at 12:48 AM Mohamed Wael Khobalatte <mkhobalatte@grubhub.com> wrote:

On Sat, Nov 14, 2020 at 2:46 PM Radoslav Nedyalkov <rnedyalkov@gmail.com> wrote:

On Fri, Nov 13, 2020 at 8:13 PM Radoslav Nedyalkov <rnedyalkov@gmail.com> wrote:

On Fri, Nov 13, 2020 at 7:37 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
On Fri, 2020-11-13 at 15:24 +0200, Radoslav Nedyalkov wrote:
> On a very busy master-standby setup that runs typical OLAP processing
> (long-running statements with massive writes), we're getting on the standby:
>
> ERROR: canceling statement due to conflict with recovery
> FATAL: terminating connection due to conflict with recovery
>
> The weird thing is that cancellations usually happen after the standby has experienced
> a huge delay (2h), still below the allowed maximum (3h). Even recently started statements
> get cancelled when the delay is already back at zero.
>
> Sometimes the situation eases after an hour or so.
> Restarting the server instantly helps.
>
> It is PostgreSQL 11.8 on CentOS 7 with huge pages, shared_buffers 196G out of 748G.
>
> What phenomenon could we be facing?

Hard to say. Perhaps an unusual kind of replication conflict?

What is in "pg_stat_database_conflicts" on the standby server?

db01=# select * from pg_stat_database_conflicts;
 datid |  datname  | confl_tablespace | confl_lock | confl_snapshot | confl_bufferpin | confl_deadlock
-------+-----------+------------------+------------+----------------+-----------------+----------------
 13877 | template0 |                0 |          0 |              0 |               0 |              0
 16400 | template1 |                0 |          0 |              0 |               0 |              0
 16402 | postgres  |                0 |          0 |              0 |               0 |              0
 16401 | db01      |                0 |          0 |             51 |               0 |              0
(4 rows)

On a freshly restarted standby we've just seen similar behaviour after a 2-hour delay and a slow catch-up.
confl_snapshot is 51, and we have exactly the same number of cancelled statements.

No luck so far. Searching for an explanation, I found that we fall into the unexplained case where
snapshot conflicts happen even though hot_standby_feedback is on.

Thanks,
Rado
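
For readers following along, one way to check whether hot_standby_feedback is actually reaching the primary is to look at the xmin the standby reports. This is only a sketch, not something posted in the thread. It assumes PostgreSQL 11 catalog views and that the standby may or may not use a replication slot (the thread does not say).

```sql
-- Run on the standby: confirm the settings that govern recovery conflicts.
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('hot_standby_feedback',
               'max_standby_streaming_delay',
               'max_standby_archive_delay',
               'wal_receiver_status_interval');

-- Run on the primary: backend_xmin is the xmin horizon reported by the
-- standby through hot_standby_feedback. It may be NULL when a replication
-- slot carries the xmin instead.
SELECT application_name, state, backend_xmin
FROM pg_stat_replication;

-- Run on the primary: when a replication slot is in use, the feedback
-- xmin is held by the slot.
SELECT slot_name, active, xmin, catalog_xmin
FROM pg_replication_slots;
```

If neither backend_xmin nor the slot's xmin is set while long queries run on the standby, the feedback is probably not arriving, and snapshot conflicts would be expected despite the setting.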

Perhaps you have a value set for old_snapshot_threshold? If not, do the walreceiver connections drop out?

old_snapshot_threshold is -1 on both master and replica.

walreceiver does not drop.
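
The two checks discussed above can be reproduced roughly as follows. This is a sketch assuming PostgreSQL 11, not output from the thread; the selected columns were chosen for illustration.

```sql
-- Run on both primary and standby: -1 means the feature is disabled.
SHOW old_snapshot_threshold;

-- Run on the standby: a single row with status 'streaming' and a recent
-- last_msg_receipt_time suggests the walreceiver connection is alive.
SELECT status, last_msg_receipt_time, latest_end_time
FROM pg_stat_wal_receiver;
```

pg_stat_wal_receiver shows only the current state, so a reconnect between two samples would not be visible here. The standby's server log is the more reliable place to look for walreceiver restarts.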
