BUG #13473: VACUUM FREEZE mistakenly cancel standby sessions - Mailing list pgsql-bugs

From marco.nenciarini@2ndquadrant.it
Subject BUG #13473: VACUUM FREEZE mistakenly cancel standby sessions
Date
Msg-id 20150626134310.3876.82768@wrigleys.postgresql.org
Responses Re: BUG #13473: VACUUM FREEZE mistakenly cancel standby sessions  (Marco Nenciarini <marco.nenciarini@2ndquadrant.it>)
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      13473
Logged by:          Marco Nenciarini
Email address:      marco.nenciarini@2ndquadrant.it
PostgreSQL version: 9.4.4
Operating system:   all
Description:

= Symptoms

Consider a simple master -> standby setup with hot_standby_feedback
enabled. If a backend on the standby is holding the cluster xmin and the
master runs VACUUM FREEZE on the same database as the standby's backend,
the vacuum generates a recovery conflict and the query running on the
standby is canceled.

= How to reproduce it

Run the following operations on an idle cluster.

1) connect to the standby and simulate a long running query:

   select pg_sleep(3600);

2) connect to the master and run the following script:

   create table t(id int primary key);
   insert into t select generate_series(1, 10000);
   vacuum freeze verbose t;
   drop table t;

3) after 30 seconds the pg_sleep query on standby will be canceled.

= Expected output

Hot standby feedback should have prevented the query cancellation.

= Analysis

I've run postgres at the DEBUG2 logging level, and I can confirm that the
vacuum correctly sees the OldestXmin propagated by the standby through hot
standby feedback.
The issue is in the heap_xlog_freeze function, which calls
ResolveRecoveryConflictWithSnapshot first thing, passing the cutoff_xid
value as its first argument.
The cutoff_xid is the OldestXmin that was active when the vacuum started,
so it represents a running xid.
However, ResolveRecoveryConflictWithSnapshot expects its first argument to
be latestRemovedXid, which represents the highest xid that has actually
been removed, so there is an off-by-one error.
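
The off-by-one can be illustrated with a small standalone model. This is
a sketch, not PostgreSQL source: xid_t, xid_follows and snapshot_conflicts
are hypothetical names, and the conflict rule used here (a standby snapshot
conflicts when its xmin is not newer than the xid passed in) is an
assumption derived from the meaning of latestRemovedXid described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t xid_t;                 /* stand-in for TransactionId */
#define FIRST_NORMAL_XID ((xid_t) 3)    /* xids 0..2 are reserved */

/* Hypothetical helper: true if a is logically newer than b, using a
 * modulo-2^32 comparison. Only valid for normal xids. */
static bool xid_follows(xid_t a, xid_t b)
{
    return (int32_t) (a - b) > 0;
}

/* Assumed conflict rule: a standby snapshot conflicts with a WAL record
 * when its xmin is NOT newer than the latest removed xid. */
static bool snapshot_conflicts(xid_t snapshot_xmin, xid_t latest_removed_xid)
{
    assert(snapshot_xmin >= FIRST_NORMAL_XID);
    assert(latest_removed_xid >= FIRST_NORMAL_XID);
    return !xid_follows(snapshot_xmin, latest_removed_xid);
}
```

In this model, a standby backend whose xmin equals cutoff_xid is reported
as conflicting when cutoff_xid itself is passed, but not when
cutoff_xid - 1 is passed, which matches the behaviour seen in the
reproduction.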

I've been able to reproduce this issue on every version of postgres since
9.0 (9.0, 9.1, 9.2, 9.3, 9.4 and current master).

= Proposed solution

In heap_xlog_freeze we need to subtract one from the value of cutoff_xid
before passing it to ResolveRecoveryConflictWithSnapshot.
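
A minimal sketch of that adjustment, written as a standalone model rather
than a patch. xid_retreat and latest_removed_from_cutoff are hypothetical
helpers; the retreat logic is modeled on the TransactionIdRetreat macro in
PostgreSQL's transam.h as I understand it, which steps back one xid and
skips the reserved values below FirstNormalTransactionId (3), so that a
plain "cutoff_xid - 1" cannot land on a special xid.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t xid_t;                 /* stand-in for TransactionId */
#define FIRST_NORMAL_XID ((xid_t) 3)    /* xids 0..2 are reserved */

/* Hypothetical helper: return the previous normal xid, wrapping around
 * past the reserved values instead of returning one of them. */
static xid_t xid_retreat(xid_t xid)
{
    assert(xid >= FIRST_NORMAL_XID);
    do {
        xid--;
    } while (xid < FIRST_NORMAL_XID);
    return xid;
}

/* The value heap_xlog_freeze would hand to the conflict check under the
 * proposal: the newest xid strictly older than cutoff_xid. */
static xid_t latest_removed_from_cutoff(xid_t cutoff_xid)
{
    return xid_retreat(cutoff_xid);
}
```

For an ordinary cutoff_xid this is simply cutoff_xid - 1; the loop only
matters when cutoff_xid is the first normal xid after a wraparound.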
