Re: [BUGS] BUG #13473: VACUUM FREEZE mistakenly cancel standby sessions - Mailing list pgsql-hackers

From Marco Nenciarini
Subject Re: [BUGS] BUG #13473: VACUUM FREEZE mistakenly cancel standby sessions
Msg-id 558D58B1.70400@2ndquadrant.it
Responses Re: [BUGS] BUG #13473: VACUUM FREEZE mistakenly cancel standby sessions
List pgsql-hackers
On 26/06/15 15:43, marco.nenciarini@2ndquadrant.it wrote:
> The following bug has been logged on the website:
>
> Bug reference:      13473
> Logged by:          Marco Nenciarini
> Email address:      marco.nenciarini@2ndquadrant.it
> PostgreSQL version: 9.4.4
> Operating system:   all
> Description:
>
> = Symptoms
>
> Consider a simple master -> standby setup with hot_standby_feedback
> enabled. If a backend on the standby is holding the cluster xmin and the
> master runs VACUUM FREEZE in the same database that the standby backend is
> connected to, a conflict is generated and the query running on the standby
> is canceled.
>
> = How to reproduce it
>
> Run the following operations on an idle cluster.
>
> 1) connect to the standby and simulate a long-running query:
>
>    select pg_sleep(3600);
>
> 2) connect to the master and run the following script:
>
>    create table t(id int primary key);
>    insert into t select generate_series(1, 10000);
>    vacuum freeze verbose t;
>    drop table t;
>
> 3) after 30 seconds the pg_sleep query on the standby will be canceled.
>
> = Expected output
>
> Hot standby feedback should have prevented the query cancellation.
>
> = Analysis
>
> I've run postgres at the DEBUG2 logging level, and I can confirm that the
> vacuum correctly sees the OldestXmin propagated by the standby through hot
> standby feedback.
> The issue is in the heap_xlog_freeze function, which first calls
> ResolveRecoveryConflictWithSnapshot, passing the cutoff_xid value as its
> first argument.
> The cutoff_xid is the OldestXmin in effect when the vacuum ran, so it
> represents a running xid.
> However, ResolveRecoveryConflictWithSnapshot expects its first argument to
> be latestRemovedXid, which represents the highest xid that has actually
> been removed, so there is an off-by-one error.
>
> I've been able to reproduce this issue on every version of postgres since
> 9.0 (9.0, 9.1, 9.2, 9.3, 9.4 and current master).
>
> = Proposed solution
>
> In heap_xlog_freeze we need to subtract one from the value of cutoff_xid
> before passing it to ResolveRecoveryConflictWithSnapshot.
>
>
>
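
To make the off-by-one in the analysis concrete, here is a minimal, self-contained sketch. It is not PostgreSQL source code: the type xid_t and the three functions are hypothetical names, XID wraparound is ignored, and the conflict rule is reduced to "a standby snapshot conflicts if its xmin is not newer than latestRemovedXid".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a transaction id; wraparound is ignored. */
typedef uint32_t xid_t;

/*
 * Simplified conflict rule (an assumption of this sketch): a standby
 * snapshot conflicts when its xmin is not newer than the newest xid
 * that has actually been removed.
 */
bool
snapshot_conflicts(xid_t latest_removed_xid, xid_t standby_xmin)
{
    return standby_xmin <= latest_removed_xid;
}

/* Behaviour described in the report: cutoff_xid is passed unchanged. */
bool
freeze_conflicts_reported(xid_t cutoff_xid, xid_t standby_xmin)
{
    return snapshot_conflicts(cutoff_xid, standby_xmin);
}

/* Behaviour with the proposed change: pass cutoff_xid - 1 instead. */
bool
freeze_conflicts_proposed(xid_t cutoff_xid, xid_t standby_xmin)
{
    /* This toy model has no wraparound, so 0 cannot be decremented. */
    assert(cutoff_xid > 0);
    return snapshot_conflicts(cutoff_xid - 1, standby_xmin);
}
```

In this model, when the standby backend holds the cluster xmin and the cutoff equals it (say both are 1000), freeze_conflicts_reported returns true and freeze_conflicts_proposed returns false, while a standby xmin that is genuinely older than the cutoff still conflicts in both.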

Attached is a proposed patch that solves the issue.
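
One detail of "subtract one" deserves care: transaction ids are 32-bit counters that wrap around, and the values 0, 1 and 2 are reserved (invalid, bootstrap and frozen). The sketch below is a hypothetical helper, not taken from the attached patch, showing a decrement that skips the reserved values in the same spirit as PostgreSQL's TransactionIdRetreat macro. The names xid_t, FIRST_NORMAL_XID and xid_retreat are assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 32-bit transaction id. */
typedef uint32_t xid_t;

/* XIDs 0, 1 and 2 are reserved; 3 is the first normal one. */
#define FIRST_NORMAL_XID ((xid_t) 3)

/*
 * Step back to the previous normal xid. Unsigned arithmetic wraps
 * from 0 to UINT32_MAX, so the loop skips the reserved values and
 * always terminates on a normal xid.
 */
xid_t
xid_retreat(xid_t xid)
{
    do {
        xid--;
    } while (xid < FIRST_NORMAL_XID);
    return xid;
}
```

For an ordinary value this is a plain decrement, so xid_retreat(1000) yields 999; stepping back from the first normal xid wraps around to UINT32_MAX rather than landing on a reserved value.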

Regards,
Marco

--
Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
marco.nenciarini@2ndQuadrant.it | www.2ndQuadrant.it

Attachment
