Re: SPGiST versus hot standby - question about conflict resolution rules - Mailing list pgsql-hackers
From:           Noah Misch
Subject:        Re: SPGiST versus hot standby - question about conflict resolution rules
Date:           2012-04-19 06:55:16
Msg-id:         20120419065516.GC12337@tornado.leadboat.com
In response to: SPGiST versus hot standby - question about conflict resolution rules (Tom Lane <tgl@sss.pgh.pa.us>)
Responses:      Re: Re: SPGiST versus hot standby - question about conflict resolution rules
                Re: Re: SPGiST versus hot standby - question about conflict resolution rules
List:           pgsql-hackers
On Mon, Mar 12, 2012 at 10:50:36PM -0400, Tom Lane wrote:
> There is one more (known) stop-ship problem in SPGiST, which I'd kind of
> like to get out of the way now before I let my knowledge of that code
> get swapped out again.  This is that SPGiST is unsafe for use by hot
> standby slaves.

I suspect that swap-out has passed, but ...

> The problem comes from "redirect" tuples, which are short-lifespan
> objects that replace a tuple that's been moved to another page.
> A redirect tuple can be recycled as soon as no active indexscan could
> be "in flight" from the parent index page to the moved tuple.  SPGiST
> implements this by marking each redirect tuple with the XID of the
> creating transaction, and assuming that the tuple can be recycled once
> that XID is below the OldestXmin horizon (implying that all active
> transactions started after it ended).  This is fine as far as
> transactions on the master are concerned, but there is no guarantee that
> the recycling WAL record couldn't be replayed on a hot standby slave
> while there are still HS transactions that saw the old state of the
> parent index tuple.
>
> Now, btree has a very similar problem with deciding when it's safe to
> recycle a deleted index page: it has to wait out transactions that could
> be in flight to the page, and it does that by marking deleted pages with
> XIDs.  I see that the problem has been patched for btree by emitting a
> special WAL record just before a page is recycled.  However, I'm a bit
> nervous about copying that solution, because the details are a bit
> different.  In particular, I see that btree marks deleted pages with
> ReadNewTransactionId() --- that is, the next-to-be-assigned XID ---
> rather than the XID of the originating transaction, and then it
> subtracts one from the XID before sending it to the WAL stream.
> The comments about this are not clear enough for me, and so I'm

Attempting to write an explanation for that btree code led me to conclude
that the code is incorrect.  (FWIW, I caused that wrongness.)  I will start
a separate thread to fix it.

> wondering whether it's okay to use the originating transaction XID
> in a similar way, or if we need to modify SPGiST's rule for how to
> mark redirection tuples.  I think that the use of ReadNewTransactionId
> is because btree page deletion happens in VACUUM, which does not have
> its own XID; this is unlike the situation for SPGiST where creation of
> redirects is caused by index tuple insertion, so there is a surrounding
> transaction with a real XID.  But it's not clear to me how
> GetConflictingVirtualXIDs makes use of the limitXmin and whether a live
> XID is okay to pass to it, or whether we actually need "next XID - 1".
>
> Info appreciated.

GetConflictingVirtualXIDs() selects transactions with pgxact->xmin <=
limitXmin.  The prototype use case was VACUUM, where limitXmin is the xmax
of a dead tuple cleaned from a page.  Transactions with xmin <= limitXmin
might still regard the limitXmin XID as uncommitted; we conflict so they
cannot fail to see the tuple we're about to purge.

All hot standby transactions holding snapshots taken before the startup
process applies the tuple-mover transaction's commit record will have
xmin <= its XID.  Therefore, passing that XID to
ResolveRecoveryConflictWithSnapshot() meets the need here precisely.

vacuum_defer_cleanup_age and hot_standby_feedback inform GetOldestXmin(),
so the master will delay cleanup long enough to prevent conflicts when so
configured.

Thanks,
nm
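A minimal standalone sketch, in C, of the selection rule described above.
This is not the actual procarray.c code; StandbyBackend,
xid_precedes_or_equals(), and report_conflicts() are invented stand-ins.
It only illustrates that a backend conflicts when its snapshot xmin is <=
limitXmin under the wraparound-aware modulo-2^32 comparison PostgreSQL
uses for XIDs:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

/* Wraparound-aware "a precedes or equals b": XIDs live on a modulo-2^32
 * circle, so compare via the signed difference. */
static bool
xid_precedes_or_equals(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) <= 0;
}

/* Invented stand-in for a backend's shared snapshot state. */
typedef struct StandbyBackend
{
    int           backend_id;
    TransactionId xmin;       /* 0 means "holds no snapshot" */
} StandbyBackend;

/* Report every backend whose snapshot might still regard the limitXmin
 * XID as uncommitted, i.e. xmin <= limitXmin; those are the conflicts. */
static void
report_conflicts(const StandbyBackend *backends, int n,
                 TransactionId limitXmin)
{
    for (int i = 0; i < n; i++)
    {
        if (backends[i].xmin != 0 &&
            xid_precedes_or_equals(backends[i].xmin, limitXmin))
            printf("backend %d (xmin %u) conflicts with limitXmin %u\n",
                   backends[i].backend_id,
                   (unsigned) backends[i].xmin,
                   (unsigned) limitXmin);
    }
}

int
main(void)
{
    StandbyBackend backends[] = {
        {1, 1000},              /* snapshot predates the horizon: conflicts */
        {2, 1501},              /* snapshot taken after it: safe */
        {3, 0},                 /* holds no snapshot: safe */
    };

    report_conflicts(backends, 3, 1500);
    return 0;
}

Under this rule, passing the redirect-creating transaction's XID as
limitXmin catches exactly the standby snapshots that could still be in
flight from the parent index page to the old tuple location.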