Re: [HACKERS] StandbyRecoverPreparedTransactions recovers subtranslinks incorrectly - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: [HACKERS] StandbyRecoverPreparedTransactions recovers subtranslinks incorrectly
Date
Msg-id CANP8+jLy1VdopsdCi3c0=msJCfu-oHAhgPJgtKJ84fFN_8EmOw@mail.gmail.com
In response to Re: [HACKERS] StandbyRecoverPreparedTransactions recovers subtranslinks incorrectly  (Andres Freund <andres@anarazel.de>)
Responses Re: [HACKERS] StandbyRecoverPreparedTransactions recovers subtranslinks incorrectly
List pgsql-hackers
On 23 April 2017 at 01:19, Andres Freund <andres@anarazel.de> wrote:
> On 2017-04-22 19:55:18 -0400, Tom Lane wrote:
>> Now that we've got consistent failure reports about the 009_twophase.pl
>> recovery test, I set out to find out why it's failing.  It looks to me
>> like the reason is that this (twophase.c:2145):
>>
>>             SubTransSetParent(xid, subxid, overwriteOK);
>>
>> ought to be this:
>>
>>             SubTransSetParent(subxid, xid, overwriteOK);
>>
>> because the definition of SubTransSetParent is
>>
>> void
>> SubTransSetParent(TransactionId xid, TransactionId parent, bool overwriteOK)
>>
>> not the other way 'round.
>>
>> While "git blame" blames this line on the recent commit 728bd991c,
>> that just moved the call from somewhere else.  AFAICS this has actually
>> been wrong since StandbyRecoverPreparedTransactions was written,
>> in 361bd1662 of 2010-04-13.
>
>> Also, when I fix that, it gets further but still crashes at the same
>> Assert in SubTransSetParent.  The proximate cause this time seems to be
>> that RecoverPreparedTransactions's calculation of overwriteOK is wrong:
>> it's computing that as "false", but in reality the subtrans link in
>> question has already been set.
>>
>
> Yikes.  This is clearly way undertested.  It's also pretty scary that
> the code has recently been whacked out quite heavily (both 9.6 and
> master), without hitting anything around this - doesn't seem to bode
> well for how in-depth the testing was.

Obviously if there is a bug it's because tests didn't find it and
therefore it is by definition undertested for that specific bug.

I'm not sure what other conclusion you wish to draw, if any?


>> It's not clear to me how much potential this has to create user data
>> corruption, but it doesn't look good at first glance.  Discuss.
>
> Hm. I think it can cause wrong tqual.c results in some edge cases.
> During HS, lastOverflowedXid will be set in some cases, and then
> XidInMVCCSnapshot etc will not find the actual toplevel xid, which'll in
> turn cause lookups in snapshot->subxip (where all HS xids reside) to
> potentially return wrong results.
>
> I was about to say that I don't see how it could result in persistent
> corruption however - the subtrans lookups are only done for
> (snapshot->xmin, snapshot->xmax] and subtrans is truncated
> regularly. But these days CHECKPOINT_END_OF_RECOVERY isn't obligatory
> anymore, so that might be delayed.  Hm.

I've not yet found any reason to believe we return wrong answers in
any case, even though the transient data structure pg_subtrans is
corrupted by the switched call Tom discovered.
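
To make the effect of the switched call concrete, here is a toy sketch
of a child-to-parent map. It is not the real pg_subtrans code: the
toy_* names, the fixed-size array and the simplified topmost lookup are
all assumptions made for illustration. Only the argument order (child
xid first, then parent) is taken from the SubTransSetParent definition
quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-ins for illustration only, NOT PostgreSQL's definitions. */
typedef uint32_t ToyXid;
#define TOY_INVALID_XID 0u
#define TOY_MAX_XID     64u

/* toy_parent[x] is the recorded parent of x, or TOY_INVALID_XID. */
static ToyXid toy_parent[TOY_MAX_XID];

/* Clear every link, as if the map had just been initialised. */
static void
toy_reset(void)
{
    for (ToyXid i = 0; i < TOY_MAX_XID; i++)
        toy_parent[i] = TOY_INVALID_XID;
}

/* Same argument order as SubTransSetParent: child first, then parent. */
static void
toy_set_parent(ToyXid xid, ToyXid parent)
{
    assert(xid != TOY_INVALID_XID && xid < TOY_MAX_XID);
    assert(parent != TOY_INVALID_XID && parent < TOY_MAX_XID);
    toy_parent[xid] = parent;
}

/*
 * Follow parent links until an xid with no recorded parent is reached.
 * The step counter only guards against cycles in this toy.
 */
static ToyXid
toy_get_topmost(ToyXid xid)
{
    unsigned    steps = 0;

    while (toy_parent[xid] != TOY_INVALID_XID && steps++ < TOY_MAX_XID)
        xid = toy_parent[xid];
    return xid;
}
```

With a top-level xid of 10 and a subxid of 11, toy_set_parent(11, 10)
makes toy_get_topmost(11) return 10. The switched form,
toy_set_parent(10, 11), leaves 11 with no parent at all and records 11
as the parent of 10, so in this toy a lookup starting from either xid
ends at 11 rather than at the top-level xid.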

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
