Thread: Conflict with recovery on PG version 11.6
Hi!
Basically, after upgrading from version 10.6 to 11.5 I see error messages on the streaming replica host: "FATAL: terminating connection due to conflict with recovery" and "ERROR: canceling statement due to conflict with recovery". There were no changes to vacuuming on the master, nor to max_standby_streaming_delay on the replica. I tried to correlate the errors with the vacuuming process on the master, but according to the logs there is no link between them. Somehow I have the feeling that when a query runs longer than the value of max_standby_streaming_delay, it gets terminated regardless of any vacuuming on the master.
Are there any changes in version 11.5 that might cause this?
Is there a good solution other than setting max_standby_streaming_delay = -1 or enabling hot_standby_feedback?
BR,
Toomas
On Wed, 2020-06-17 at 08:42 +0300, Toomas Kristin wrote:
> Basically, after upgrading from version 10.6 to 11.5 I see error messages on the streaming
> replica host: "FATAL: terminating connection due to conflict with recovery" and
> "ERROR: canceling statement due to conflict with recovery".
> [...]
> Is there a good solution other than setting max_standby_streaming_delay = -1 or enabling hot_standby_feedback?

The basic behavior shouldn't have changed since v10.

Check "pg_stat_database_conflicts" to see what kinds of conflicts these are.

The only ways to avoid queries being canceled due to replication conflicts are:

1. Avoid such conflicts in the first place:
   - set "hot_standby_feedback = on" on the standby and/or
     "vacuum_defer_cleanup_age" on the primary to avoid VACUUM conflicts
   - don't lock tables in ACCESS EXCLUSIVE mode

2. Set "max_standby_streaming_delay" to -1.

Note that it can be quite hard to completely avoid replication conflicts.
Having both no delay in applying changes and no canceled queries is often
not possible without seriously crippling autovacuum.

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
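For example, the per-database conflict counters can be inspected on the standby with a query like this (column names as documented for the pg_stat_database_conflicts view):

    SELECT datname,
           confl_tablespace,   -- conflicts with dropped tablespaces
           confl_lock,         -- conflicts with ACCESS EXCLUSIVE locks
           confl_snapshot,     -- conflicts with old snapshots (VACUUM cleanup)
           confl_bufferpin,    -- conflicts with pinned buffers
           confl_deadlock      -- conflicts that ended in deadlocks
    FROM pg_stat_database_conflicts;

Whichever counter increases for your database when a query gets canceled tells you which kind of conflict you are dealing with.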
Hi Laurenz,

Thank you!

What are the reasons for conflicts? Based on the documentation it seems that the only reason can be that vacuum removed tuples that are no longer needed on the master but are still in use on the standby host, and because of that the standby cannot apply modifications until the blocking query either finishes or is terminated. Isn't that so? Or can there be other reasons?

I am just wondering what the impact would be if I increased the value of autovacuum_vacuum_scale_factor in order to force the vacuuming process to postpone cleanup.

BR,
Toomas

> On 17. Jun 2020, at 12:42, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>
> Check "pg_stat_database_conflicts" to see what kinds of conflicts these are.
> [...]
Hi!

What is the difference between the following error messages?

ERROR: canceling statement due to conflict with recovery
FATAL: terminating connection due to conflict with recovery

I tried to correlate process termination on the standby with vacuuming on the master. Here is a sample timeline:

1) 2020-06-17 22:45:42 - last vacuuming before the session
2) 2020-06-18 01:01:12 - session started on the standby
3) 2020-06-18 01:27:53 - session terminated on the standby
4) 2020-06-18 02:59:32 - next vacuuming

Based on that I would say there is no correlation between vacuuming and process termination.

How can I identify the conflict that stops replication and forces a query on the standby host to be terminated?

BR,
Toomas

> On 17. Jun 2020, at 16:38, Toomas Kristin <toomas.kristin@gmail.com> wrote:
>
> What are the reasons for conflicts?
> [...]
Toomas Kristin wrote:
> What are the reasons for conflicts? Based on the documentation it seems that the only
> reason can be that vacuum removed tuples that are no longer needed on the master but are
> still in use on the standby host, and because of that the standby cannot apply
> modifications until the blocking query either finishes or is terminated. Isn't that so?
> Or can there be other reasons?

There can be other reasons:

- replicated ACCESS EXCLUSIVE locks that conflict with queries
- replicated ACCESS EXCLUSIVE locks that cause deadlocks
- buffer pins that are needed for replication but held by a query
- dropped tablespaces that hold temporary files on the standby

> I am just wondering what the impact would be if I increased the value of
> autovacuum_vacuum_scale_factor in order to force the vacuuming process to postpone cleanup.

That won't help, it will just get your primary bloated.

I told you the remedies above, why don't you like them?

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
On Thu, 2020-06-18 at 10:09 +0300, Toomas Kristin wrote:
> What is the difference between the following error messages?
>
> ERROR: canceling statement due to conflict with recovery
> FATAL: terminating connection due to conflict with recovery

One kills your database session, the other doesn't.

> I tried to correlate process termination on the standby with vacuuming on the master.
> [...]
> How can I identify the conflict that stops replication and forces a query on the standby
> host to be terminated?

It need not be caused by VACUUM; look which counter in "pg_stat_database_conflicts" has increased.

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
Hi,

> There can be other reasons:
>
> - replicated ACCESS EXCLUSIVE locks that conflict with queries
> - replicated ACCESS EXCLUSIVE locks that cause deadlocks
> - buffer pins that are needed for replication but held by a query
> - dropped tablespaces that hold temporary files on the standby

Thank you for the ideas on what to verify.

> I told you the remedies above, why don't you like them?

Basically I want to achieve a situation where replication is not suspended (lag is no more than 3 minutes) and statements on the standby are not terminated. Based on the collected information I don't see any connection between vacuuming on the master and the termination of statements on the standby. I can temporarily disable vacuuming in order to be 100% sure that this is the case. And when I set max_standby_streaming_delay either to -1 or to a very big number, it helps avoid query termination but doesn't help with the suspended replication. Everything worked with the same configuration on PostgreSQL version 10.6; the issue started after the version upgrade.

This is the reason why I am very keen to find out the real cause of the conflict.

BR,
Toomas
At Thu, 18 Jun 2020 23:29:49 +0300, Toomas Kristin <toomas.kristin@gmail.com> wrote in
> Basically I want to achieve a situation where replication is not suspended (lag is no more
> than 3 minutes) and statements on the standby are not terminated.
> [...]
> This is the reason why I am very keen to find out the real cause of the conflict.

FWIW, in case you haven't tried it yet: if you can find a DETAIL: line following the "ERROR: canceling ..." message in the server log, it would narrow down the possibilities.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
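For example, assuming the standby logs to a plain text file (the path below is only a placeholder), the DETAIL lines can be pulled out together with the errors they follow:

    grep -A 1 'conflict with recovery' /var/log/postgresql/postgresql.log

The -A 1 prints one line of trailing context after each match, which is where the DETAIL line appears in this log format.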
Hi,

Thank you all for the help.

> FWIW, in case you haven't tried it yet: if you can find a DETAIL: line following the
> "ERROR: canceling ..." message in the server log, it would narrow down the possibilities.

I executed a database dump with 4 jobs, and here are the logs showing how it ended.

2020-06-19 02:38:28 UTC:[30953]:ERROR:40001:canceling statement due to conflict with recovery
2020-06-19 02:38:28 UTC:[30953]:DETAIL:User query might have needed to see row versions that must be removed.
2020-06-19 02:38:28 UTC:[30953]:LOCATION:ProcessInterrupts, postgres.c:3057
2020-06-19 02:38:28 UTC:[30953]:STATEMENT:
2020-06-19 02:38:28 UTC:[30926]:FATAL:40001:terminating connection due to conflict with recovery
2020-06-19 02:38:28 UTC:[30926]:DETAIL:User query might have needed to see row versions that must be removed.
2020-06-19 02:38:28 UTC:[30926]:HINT:In a moment you should be able to reconnect to the database and repeat your command.
2020-06-19 02:38:28 UTC:[30926]:LOCATION:ProcessInterrupts, postgres.c:2987
2020-06-19 02:38:28 UTC:[30952]:ERROR:40001:canceling statement due to conflict with recovery
2020-06-19 02:38:28 UTC:[30952]:DETAIL:User query might have needed to see row versions that must be removed.
2020-06-19 02:38:28 UTC:[30952]:LOCATION:ProcessInterrupts, postgres.c:3057
2020-06-19 02:38:28 UTC:[30952]:STATEMENT:
2020-06-19 02:38:28 UTC:[30953]:LOG:08006:could not receive data from client:Connection reset by peer
2020-06-19 02:38:28 UTC:[30953]:LOCATION:pq_recvbuf, pqcomm.c:978

Does this give any hints to anyone? Or how do I find the right version of the source code to analyze?

BR,
Toomas
On Fri, 2020-06-19 at 11:46 +0300, Toomas Kristin wrote:
> I executed a database dump with 4 jobs, and here are the logs showing how it ended.
>
> 2020-06-19 02:38:28 UTC:[30953]:ERROR:40001:canceling statement due to conflict with recovery
> 2020-06-19 02:38:28 UTC:[30953]:DETAIL:User query might have needed to see row versions that must be removed.
> [...]
>
> Does this give any hints to anyone? Or how do I find the right version of the source code to analyze?

Yes, those are conflicts with VACUUM.

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
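For reference, the corresponding remedy from the list earlier in the thread would look roughly like this on the standby (a minimal sketch; the delay value is only an example, not a recommendation):

    # postgresql.conf on the standby
    hot_standby_feedback = on           # primary retains row versions still needed by standby queries
    max_standby_streaming_delay = 30s   # how long WAL apply may wait before canceling conflicting queries

The trade-off is that hot_standby_feedback delays cleanup on the primary, which is the "crippling autovacuum" cost mentioned earlier in the thread.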
Hi Laurenz,

> Yes, those are conflicts with VACUUM.

Thank you for the help. hot_standby_feedback did what I expected (no lag on replication and no session termination), and the case is closed. It is only foggy to me how that can be when there are no vacuuming logs from the time window in which the session was active on the standby host.

BR,
Toomas
On Sat, 2020-06-20 at 06:51 +0300, Toomas Kristin wrote:
> Thank you for the help. hot_standby_feedback did what I expected (no lag on replication
> and no session termination), and the case is closed. It is only foggy to me how that can
> be when there are no vacuuming logs from the time window in which the session was active
> on the standby host.

Perhaps heap-only tuple (HOT) chain pruning.

Have you got "n_tup_hot_upd" > 0 somewhere in "pg_stat_user_tables"?
Then any reader can "micro-vacuum" blocks with such tuples.

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
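For example, tables with HOT updates can be spotted with a query like this (column names as documented for pg_stat_user_tables):

    SELECT relname, n_tup_upd, n_tup_hot_upd
    FROM pg_stat_user_tables
    WHERE n_tup_hot_upd > 0
    ORDER BY n_tup_hot_upd DESC;

Any backend reading a page with such tuples may prune dead HOT chain members on the fly, and replaying that pruning on the standby can conflict with standby queries even when no explicit VACUUM ran in the relevant time window.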