Re: Re: SPGiST versus hot standby - question about conflict resolution rules - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: Re: SPGiST versus hot standby - question about conflict resolution rules
Date
Msg-id CA+U5nMKVu2BRug_ORh1FGYEiQ5jKPfkUecCzUUp=VAwU0VTg8w@mail.gmail.com
Whole thread Raw
In response to Re: SPGiST versus hot standby - question about conflict resolution rules  (Noah Misch <noah@leadboat.com>)
List pgsql-hackers
On Thu, Apr 19, 2012 at 7:55 AM, Noah Misch <noah@leadboat.com> wrote:
> On Mon, Mar 12, 2012 at 10:50:36PM -0400, Tom Lane wrote:
>> There is one more (known) stop-ship problem in SPGiST, which I'd kind of
>> like to get out of the way now before I let my knowledge of that code
>> get swapped out again.  This is that SPGiST is unsafe for use by hot
>> standby slaves.
>
> I suspect that swap-out has passed, but ...
>
>> The problem comes from "redirect" tuples, which are short-lifespan
>> objects that replace a tuple that's been moved to another page.
>> A redirect tuple can be recycled as soon as no active indexscan could
>> be "in flight" from the parent index page to the moved tuple.  SPGiST
>> implements this by marking each redirect tuple with the XID of the
>> creating transaction, and assuming that the tuple can be recycled once
>> that XID is below the OldestXmin horizon (implying that all active
>> transactions started after it ended).  This is fine as far as
>> transactions on the master are concerned, but there is no guarantee that
>> the recycling WAL record couldn't be replayed on a hot standby slave
>> while there are still HS transactions that saw the old state of the
>> parent index tuple.
>>
>> Now, btree has a very similar problem with deciding when it's safe to
>> recycle a deleted index page: it has to wait out transactions that could
>> be in flight to the page, and it does that by marking deleted pages with
>> XIDs.  I see that the problem has been patched for btree by emitting a
>> special WAL record just before a page is recycled.  However, I'm a bit
>> nervous about copying that solution, because the details are a bit
>> different.  In particular, I see that btree marks deleted pages with
>> ReadNewTransactionId() --- that is, the next-to-be-assigned XID ---
>> rather than the XID of the originating transaction, and then it
>> subtracts one from the XID before sending it to the WAL stream.
>> The comments about this are not clear enough for me, and so I'm
>
> Attempting to write an explanation for that btree code led me conclude that
> the code is incorrect.  (FWIW, I caused that wrongness.)  I will start a
> separate thread to fix it.

Wrong or not, we need to better document why we picked
ReadNewTransactionID(), rather than OldestXmin which seems the more
obvious and cheaper choice.

> All hot standby transactions holding snapshots taken before the startup
> process applies the tuple-mover transaction's commit record will have xmin <=
> its XID.  Therefore, passing that XID to ResolveRecoveryConflictWithSnapshot()
> meets the need here precisely.

Yes, agreed. i.e. don't subtract 1.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


pgsql-hackers by date:

Previous
From: Jeff Davis
Date:
Subject: Re: 9.3 Pre-proposal: Range Merge Join
Next
From: Susanne Ebrecht
Date:
Subject: Re: Bug tracker tool we need