Thread: Conflict with recovery on PG version 11.6

From
Toomas Kristin
Date:
Hi!

Basically, after an upgrade from version 10.6 to 11.5, I see error messages on the streaming replica host: “FATAL:  terminating connection due to conflict with recovery” and “ERROR: canceling statement due to conflict with recovery”. There are no changes to vacuuming on the master, nor to max_standby_streaming_delay on the replica. I tried to correlate the errors with the vacuuming process on the master, but according to the logs there is no link between them. Somehow I have the feeling that when a query runs longer than the value of max_standby_streaming_delay, the query is terminated regardless of any vacuuming on the master.

Are there any changes in version 11.5 that may cause this?

Is there any good solution other than setting max_standby_streaming_delay=-1 or enabling hot_standby_feedback?

BR,
Toomas

Re: Conflict with recovery on PG version 11.6

From
Laurenz Albe
Date:
On Wed, 2020-06-17 at 08:42 +0300, Toomas Kristin wrote:
> Basically after upgrade to version 11.5 from 10.6 I experience error messages on streaming
> replica host “FATAL:  terminating connection due to conflict with recovery” and
> “ERROR: canceling statement due to conflict with recovery”. There is no changes for
> vacuuming on master nor max_standby_streaming_delay for replica. I tried to correlate
> errors with vacuuming process on master but according to logs there is no link between
> them. Somehow I have feeling that when query runs longer than value for parameter
> max_standby_streaming_delay the query will be terminated regardless vacuuming process on master.
> 
> Is there any changes on version 11.5 what may cause it?
> 
> Is there any good solution without setting max_standby_streaming_delay=-1 or enabling hot_standby_feedback?

The basic behavior shouldn't have changed since v10.

Check "pg_stat_database_conflicts" to see what kinds of conflicts those are.
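For instance, a query along these lines, run on the standby (where the view's counters are maintained), breaks the canceled queries down by conflict cause:

```sql
-- Per-database counts of queries canceled due to recovery conflicts
SELECT datname,
       confl_tablespace,  -- dropped tablespaces
       confl_lock,        -- lock timeouts (e.g. replicated ACCESS EXCLUSIVE locks)
       confl_snapshot,    -- old snapshots (VACUUM removed row versions still needed)
       confl_bufferpin,   -- pinned buffers
       confl_deadlock     -- deadlocks with recovery
FROM pg_stat_database_conflicts
WHERE datname = current_database();
```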

The only solutions to avoid queries being canceled due to replication conflicts are:

1. avoid that such conflicts happen:
   - set "hot_standby_feedback = on" on the standby and/or
     "vacuum_defer_cleanup_age" on the primary to avoid VACUUM conflicts
   - Don't lock tables in access exclusive mode

2. set "max_standby_streaming_delay" to -1

Note that it can be quite hard to completely avoid replication conflicts.
Trying to have both no delay in applying changes and no cancelled queries
is often not possible without seriously crippling autovacuum.
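In postgresql.conf terms, the options above would look something like this (the numeric value is purely illustrative):

```
# On the standby: report the oldest snapshot in use back to the primary,
# so VACUUM does not remove row versions the standby still needs
hot_standby_feedback = on

# On the primary (alternative): defer cleanup by this many transactions
vacuum_defer_cleanup_age = 10000    # illustrative value, tune for your workload

# On the standby: or let WAL replay wait indefinitely for conflicting queries
max_standby_streaming_delay = -1
```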

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com




Re: Conflict with recovery on PG version 11.6

From
Toomas Kristin
Date:
Hi Laurenz,

Thank you!

What are the reasons for conflicts? Based on the documentation, it seems that the only reason can be that vacuum removed
dead tuples that are still in use on the standby host, so the standby cannot apply modifications until the blocking query
either finishes or is terminated. Isn't that so? Or can there be other reasons?

I am just wondering what the impact would be if I increased the value of autovacuum_vacuum_scale_factor in order to
force the vacuuming process to postpone cleanup.

BR,
Toomas

> On 17. Jun 2020, at 12:42, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>
> On Wed, 2020-06-17 at 08:42 +0300, Toomas Kristin wrote:
>> Basically after upgrade to version 11.5 from 10.6 I experience error messages on streaming
>> replica host “FATAL:  terminating connection due to conflict with recovery” and
>> “ERROR: canceling statement due to conflict with recovery”. There is no changes for
>> vacuuming on master nor max_standby_streaming_delay for replica. I tried to correlate
>> errors with vacuuming process on master but according to logs there is no link between
>> them. Somehow I have feeling that when query runs longer than value for parameter
>> max_standby_streaming_delay the query will be terminated regardless vacuuming process on master.
>>
>> Is there any changes on version 11.5 what may cause it?
>>
>> Is there any good solution without setting max_standby_streaming_delay=-1 or enabling hot_standby_feedback?
>
> The basic behavior shouldn't have changed since v10.
>
> Check "pg_stat_database_conflicts" to see what kinds of conflicts that are.
>
> The only solutions to avoid queries being canceled due to replication conflicts are:
>
> 1. avoid that such conflicts happen:
>   - set "hot_standby_feedback = on" on the standby and/or
>     "vacuum_defer_cleanup_age" on the primary to avoid VACUUM conflicts
>   - Don't lock tables in access exclusive mode
>
> 2. set "max_standby_streaming_delay" to -1
>
> Note that it can be quite hard to completely avoid replication conflicts.
> Trying to have both no delay in applying changes and no cancelled queries
> is often not possible without seriously crippling autovacuum.
>
> Yours,
> Laurenz Albe
> --
> Cybertec | https://www.cybertec-postgresql.com
>




Re: Conflict with recovery on PG version 11.6

From
Toomas Kristin
Date:
Hi!

What is the difference between the following error messages?

ERROR:  canceling statement due to conflict with recovery
FATAL:  terminating connection due to conflict with recovery

I tried to correlate process termination on standby and vacuuming on master.

Here is a sample timeline
1) 2020-06-17 22:45:42 - Last vacuuming before session
2) 2020-06-18 01:01:12 - Session was started at standby
3) 2020-06-18 01:27:53 - Session was terminated at standby
4) 2020-06-18 02:59:32 - Next vacuuming

Based on that I would say there is no correlation between vacuuming and process termination.

How can I identify the conflict that stops replication and forces a query to be terminated on the standby host?

BR,
Toomas

> On 17. Jun 2020, at 16:38, Toomas Kristin <toomas.kristin@gmail.com> wrote:
>
> Hi Laurenz,
>
> Thank you!
>
> What are reasons for conflicts? Based on documentation seems that the only reason can be that vacuum removed unused
> tuples that are in use at standby host and due to that standby host cannot apply modifications while blocking query
> either finishes or will be terminated. isnt it? Or there can be some other reasons?
>
> I just wondering what would be impact when I increase value for autovacuum_vacuum_scale_factor in order force
> vacuuming process postpone the clean up process.
>
> BR,
> Toomas
>
>> On 17. Jun 2020, at 12:42, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>>
>> On Wed, 2020-06-17 at 08:42 +0300, Toomas Kristin wrote:
>>> Basically after upgrade to version 11.5 from 10.6 I experience error messages on streaming
>>> replica host “FATAL:  terminating connection due to conflict with recovery” and
>>> “ERROR: canceling statement due to conflict with recovery”. There is no changes for
>>> vacuuming on master nor max_standby_streaming_delay for replica. I tried to correlate
>>> errors with vacuuming process on master but according to logs there is no link between
>>> them. Somehow I have feeling that when query runs longer than value for parameter
>>> max_standby_streaming_delay the query will be terminated regardless vacuuming process on master.
>>>
>>> Is there any changes on version 11.5 what may cause it?
>>>
>>> Is there any good solution without setting max_standby_streaming_delay=-1 or enabling hot_standby_feedback?
>>
>> The basic behavior shouldn't have changed since v10.
>>
>> Check "pg_stat_database_conflicts" to see what kinds of conflicts that are.
>>
>> The only solutions to avoid queries being canceled due to replication conflicts are:
>>
>> 1. avoid that such conflicts happen:
>>  - set "hot_standby_feedback = on" on the standby and/or
>>    "vacuum_defer_cleanup_age" on the primary to avoid VACUUM conflicts
>>  - Don't lock tables in access exclusive mode
>>
>> 2. set "max_standby_streaming_delay" to -1
>>
>> Note that it can be quite hard to completely avoid replication conflicts.
>> Trying to have both no delay in applying changes and no cancelled queries
>> is often not possible without seriously crippling autovacuum.
>>
>> Yours,
>> Laurenz Albe
>> --
>> Cybertec | https://www.cybertec-postgresql.com
>>
>




Re: Conflict with recovery on PG version 11.6

From
Laurenz Albe
Date:
Toomas Kristin wrote:
> > > Basically after upgrade to version 11.5 from 10.6 I experience error messages on streaming
> > > replica host “FATAL:  terminating connection due to conflict with recovery” and
> > > “ERROR: canceling statement due to conflict with recovery”. There is no changes for
> > > vacuuming on master nor max_standby_streaming_delay for replica. I tried to correlate
> > > errors with vacuuming process on master but according to logs there is no link between
> > > them. Somehow I have feeling that when query runs longer than value for parameter
> > > max_standby_streaming_delay the query will be terminated regardless vacuuming process on master.
> > > 
> > > Is there any changes on version 11.5 what may cause it?
> > > 
> > > Is there any good solution without setting max_standby_streaming_delay=-1 or enabling hot_standby_feedback?
> > 
> > The basic behavior shouldn't have changed since v10.
> > 
> > Check "pg_stat_database_conflicts" to see what kinds of conflicts that are.
> > 
> > The only solutions to avoid queries being canceled due to replication conflicts are:
> > 
> > 1. avoid that such conflicts happen:
> >    - set "hot_standby_feedback = on" on the standby and/or
> >      "vacuum_defer_cleanup_age" on the primary to avoid VACUUM conflicts
> >    - Don't lock tables in access exclusive mode
> > 
> > 2. set "max_standby_streaming_delay" to -1
> > 
> > Note that it can be quite hard to completely avoid replication conflicts.
> > Trying to have both no delay in applying changes and no cancelled queries
> > is often not possible without seriously crippling autovacuum.
>
> What are reasons for conflicts? Based on documentation seems that the only reason can be that vacuum removed
> unused tuples that are in use at standby host and due to that standby host cannot apply
> modifications while blocking query either finishes or will be terminated. isnt it? Or there can be some other
> reasons?

There can be other reasons:

- replicated ACCESS EXCLUSIVE locks that conflict with queries
- replicated ACCESS EXCLUSIVE locks that cause deadlocks
- buffer pins that are needed for replication but held by a query
- dropped tablespaces that hold temporary files on the standby

> I just wondering what would be impact when I increase value for autovacuum_vacuum_scale_factor
> in order force vacuuming process postpone the clean up process.

That won't help, it will just get your primary bloated.

I told you the remedies above, why don't you like them?

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com




Re: Conflict with recovery on PG version 11.6

From
Laurenz Albe
Date:
On Thu, 2020-06-18 at 10:09 +0300, Toomas Kristin wrote:
> What is difference between following error messages?
> 
> ERROR:  canceling statement due to conflict with recovery
> FATAL:  terminating connection due to conflict with recovery

One kills your database session, the other doesn't.

> I tried to correlate process termination on standby and vacuuming on master.
> 
> Here is a sample timeline
> 1) 2020-06-17 22:45:42 - Last vacuuming before session
> 2) 2020-06-18 01:01:12 - Session was started at standby
> 3) 2020-06-18 01:27:53 - Session was terminated at standby
> 4) 2020-06-18 02:59:32 - Next vacuuming
> 
> Based on that I would say there is no correlation between vacuuming and process termination.
> 
> How can I identify where is the conflict that stops replication and forces to terminate a query on standby host?

It need not be caused by VACUUM; look which counter in
"pg_stat_database_conflicts" has increased.

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com




Re: Conflict with recovery on PG version 11.6

From
Toomas Kristin
Date:
Hi,

> There can be other reasons:
>
> - replicated ACCESS EXCLUSIVE locks that conflict with queries
> - replicated ACCESS EXCLUSIVE locks that cause deadlocks
> - buffer pins that are needed for replication but held by a query
> - dropped tablespaces that hold temporary files on the standby

Thank you for ideas what to verify.

> I told you the remedies above, why don't you like them?

Basically, I want to achieve a situation where replication is not suspended (lag is no more than 3 minutes) and
statements on the standby are not terminated. Based on the collected information, I don’t see any connection between
vacuuming on the master and termination of statements on the standby. I can temporarily disable vacuuming in order to
be 100% sure this is the case. And when I set max_standby_streaming_delay to either -1 or a very big number, it helps
avoid query termination but doesn’t help with the suspended replication. Everything worked with the same configuration
on Postgres version 10.6; the issue started after the version upgrade.

This is the reason why I am very keen to find out the real cause of the conflict.

BR,
Toomas




Re: Conflict with recovery on PG version 11.6

From
Kyotaro Horiguchi
Date:
At Thu, 18 Jun 2020 23:29:49 +0300, Toomas Kristin <toomas.kristin@gmail.com> wrote in 
> Hi,
> 
> > There can be other reasons:
> > 
> > - replicated ACCESS EXCLUSIVE locks that conflict with queries
> > - replicated ACCESS EXCLUSIVE locks that cause deadlocks
> > - buffer pins that are needed for replication but held by a query
> > - dropped tablespaces that hold temporary files on the standby
> 
> Thank you for ideas what to verify.
> 
> > I told you the remedies above, why don't you like them?
> 
> Basically I want to achieve situation where replication is not suspended (lag is not more than 3 minutes) and
> statements on standby are not terminated. Based on collected information I don’t see any connection between vacuuming
> on master and termination of statements on standby. I can temporarily disable vacuuming in order to be 100% sure this
> is the case. And when I set max_standby_streaming_delay either -1 or as a very big number then it helps avoid query
> termination but doesn’t help me about suspended replication. All worked with same configuration on Postgres version
> 10.6, the issue started after version upgrade.
> 
> This is the reason why I am very keen to find out real cause for the conflict.

FWIW, in case you haven't tried it yet: if you can find a DETAIL: line
following the "ERROR: canceling..." message in the server log, it would
narrow down the possibilities.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Conflict with recovery on PG version 11.6

From
Toomas Kristin
Date:
Hi,

Thank you all for help.

> FWIW in case you haven't tried yet, if you could find a DETAILS: line
> following to the ERROR: canceling.." message in server log, it would
> narrow the possibility.


I executed a database dump with 4 jobs, and here are the logs showing how it ended.

2020-06-19 02:38:28 UTC:[30953]:ERROR:40001:canceling statement due to conflict with recovery
2020-06-19 02:38:28 UTC:[30953]:DETAIL:User query might have needed to see row versions that must be removed.
2020-06-19 02:38:28 UTC:[30953]:LOCATION:ProcessInterrupts, postgres.c:3057
2020-06-19 02:38:28 UTC:[30953]:STATEMENT:
2020-06-19 02:38:28 UTC:[30926]:FATAL:40001:terminating connection due to conflict with recovery
2020-06-19 02:38:28 UTC:[30926]:DETAIL:User query might have needed to see row versions that must be removed.
2020-06-19 02:38:28 UTC:[30926]:HINT:In a moment you should be able to reconnect to the database and repeat your command.
2020-06-19 02:38:28 UTC:[30926]:LOCATION:ProcessInterrupts, postgres.c:2987
2020-06-19 02:38:28 UTC:[30952]:ERROR:40001:canceling statement due to conflict with recovery
2020-06-19 02:38:28 UTC:[30952]:DETAIL:User query might have needed to see row versions that must be removed.
2020-06-19 02:38:28 UTC:[30952]:LOCATION:ProcessInterrupts, postgres.c:3057
2020-06-19 02:38:28 UTC:[30952]:STATEMENT:
2020-06-19 02:38:28 UTC:[30953]:LOG:08006:could not receive data from client:Connection reset by peer
2020-06-19 02:38:28 UTC:[30953]:LOCATION:pq_recvbuf, pqcomm.c:978

Does this give anyone any hints? Or how do I find the right version of the source code to analyze?

BR,
Toomas




Re: Conflict with recovery on PG version 11.6

From
Laurenz Albe
Date:
On Fri, 2020-06-19 at 11:46 +0300, Toomas Kristin wrote:
> I executed a database dump with 4 jobs and here are logs how this ended.
> 
> 2020-06-19 02:38:28 UTC:[30953]:ERROR:40001:canceling statement due to conflict with recovery
> 2020-06-19 02:38:28 UTC:[30953]:DETAIL:User query might have needed to see row versions that must be removed.
> 2020-06-19 02:38:28 UTC:[30953]:LOCATION:ProcessInterrupts, postgres.c:3057
> 2020-06-19 02:38:28 UTC:[30953]:STATEMENT:
> 2020-06-19 02:38:28 UTC:[30926]:FATAL:40001:terminating connection due to conflict with recovery
> 2020-06-19 02:38:28 UTC:[30926]:DETAIL:User query might have needed to see row versions that must be removed.
> 2020-06-19 02:38:28 UTC:[30926]:HINT:In a moment you should be able to reconnect to the database and repeat your command.
> 2020-06-19 02:38:28 UTC:[30926]:LOCATION:ProcessInterrupts, postgres.c:2987
> 2020-06-19 02:38:28 UTC:[30952]:ERROR:40001:canceling statement due to conflict with recovery
> 2020-06-19 02:38:28 UTC:[30952]:DETAIL:User query might have needed to see row versions that must be removed.
> 2020-06-19 02:38:28 UTC:[30952]:LOCATION:ProcessInterrupts, postgres.c:3057
> 2020-06-19 02:38:28 UTC:[30952]:STATEMENT:
> 2020-06-19 02:38:28 UTC:[30953]:LOG:08006:could not receive data from client:Connection reset by peer
> 2020-06-19 02:38:28 UTC:[30953]:LOCATION:pq_recvbuf, pqcomm.c:978
> 
> Does it give any hints for anyone? Or how to find right version of source code what to analyse?

Yes, those are conflicts with VACUUM.

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com




Re: Conflict with recovery on PG version 11.6

From
Toomas Kristin
Date:
Hi Laurenz,

> Yes, that are conflicts with VACUUM.

Thank you for the help. hot_standby_feedback did what I expected (no replication lag and no session termination), and
the case is closed. It is just foggy to me how this can happen when there are no vacuuming logs from the time window
when the session was active on the standby host.

BR,
Toomas





Re: Conflict with recovery on PG version 11.6

From
Laurenz Albe
Date:
On Sat, 2020-06-20 at 06:51 +0300, Toomas Kristin wrote:
> Thank you for help. hot_standby_feedback did what I expected (no lag on replication
> and no session termination) and case is closed. Only it is foggy for me how it can
> be when no logs about vacuuming from that time window when session is active on standby host.

Perhaps heap-only tuple (HOT) chain pruning.

Have you got "n_tup_hot_upd" > 0 somewhere in "pg_stat_user_tables"?
Then any reader can "micro-vacuum" blocks with such tuples.
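For example, a query like this (run on the primary) lists the tables with the most HOT updates — pruning of such tuple chains is WAL-logged and can cause snapshot conflicts on the standby even when no explicit VACUUM appears in the logs:

```sql
-- Tables with heap-only-tuple (HOT) updates; any reader can prune
-- dead HOT chain members in a block it visits ("micro-vacuum")
SELECT schemaname, relname, n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE n_tup_hot_upd > 0
ORDER BY n_tup_hot_upd DESC
LIMIT 10;
```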

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com