On 23 April 2017 at 01:19, Andres Freund <andres@anarazel.de> wrote:
> On 2017-04-22 19:55:18 -0400, Tom Lane wrote:
>> Now that we've got consistent failure reports about the 009_twophase.pl
>> recovery test, I set out to find out why it's failing. It looks to me
>> like the reason is that this (twophase.c:2145):
>>
>> SubTransSetParent(xid, subxid, overwriteOK);
>>
>> ought to be this:
>>
>> SubTransSetParent(subxid, xid, overwriteOK);
>>
>> because the definition of SubTransSetParent is
>>
>> void
>> SubTransSetParent(TransactionId xid, TransactionId parent, bool overwriteOK)
>>
>> not the other way 'round.
>>
>> While "git blame" blames this line on the recent commit 728bd991c,
>> that just moved the call from somewhere else. AFAICS this has actually
>> been wrong since StandbyRecoverPreparedTransactions was written,
>> in 361bd1662 of 2010-04-13.
>
>> Also, when I fix that, it gets further but still crashes at the same
>> Assert in SubTransSetParent. The proximate cause this time seems to be
>> that RecoverPreparedTransactions's calculation of overwriteOK is wrong:
>> it's computing that as "false", but in reality the subtrans link in
>> question has already been set.
>>
>
> Yikes. This is clearly way undertested. It's also pretty scary that
> the code has recently been whacked out quite heavily (both 9.6 and
> master), without hitting anything around this - doesn't seem to bode
> well for how in-depth the testing was.
Obviously if there is a bug it's because tests didn't find it and
therefore it is by definition undertested for that specific bug.
I'm not sure what other conclusion you wish to draw, if any?
>> It's not clear to me how much potential this has to create user data
>> corruption, but it doesn't look good at first glance. Discuss.
>
> Hm. I think it can cause wrong tqual.c results in some edge cases.
> During HS, lastOverflowedXid will be set in some cases, and then
> XidInMVCCSnapshot etc will not find the actual toplevel xid, which'll in
> turn cause lookups snapshot->subxip (where all HS xids reside)
> potentially return wrong results.
>
> I was about to say that I don't see how it could result in persistent
> corruption however - the subtrans lookups are only done for
> (snapshot->xmin, snapshot->xmax] and subtrans is truncated
> regularly. But these days CHECKPOINT_END_OF_RECOVERY isn't obligatory
> anymore, so that might be delayed. Hm.
I've not found any reason, yet, to believe we return wrong answers in
any case, even though the transient data structure pg_subtrans is
corrupted by the switched call Tom discovers.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services