Re: Index Skip Scan (new UniqueKeys) - Mailing list pgsql-hackers
From: Peter Geoghegan
Subject: Re: Index Skip Scan (new UniqueKeys)
Msg-id: CAH2-Wzkyx9A28bb5yNiN+6iRDJpGS0sQdAhyWEAA_yjrsh98-Q@mail.gmail.com
In response to: Re: Index Skip Scan (new UniqueKeys) (Dmitry Dolgov <9erthalion6@gmail.com>)
List: pgsql-hackers
On Sat, Jul 11, 2020 at 9:10 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > > + currItem = &so->currPos.items[so->currPos.lastItem];
> > > + itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
> > > + nextOffset = ItemPointerGetOffsetNumber(&itup->t_tid);
>
> Do you mean this last part with t_tid, which could also have a tid array
> in case of posting tuple format?

Yeah. There is a TID array at the end of the index tuple when the tuple is a posting list tuple (as indicated by BTreeTupleIsPosting()). It isn't safe to assume that t_tid is a heap TID for this reason, even in code that only ever considers data items (that is, non-high-key tuples, AKA non-pivot tuples) on a leaf page. (Though new/incoming tuples cannot be posting list tuples either, so you'll see the assumption that t_tid is just a heap TID in parts of nbtinsert.c -- though only for the new/incoming item.)

> Well, it's obviously wrong, thanks for noticing. What is necessary is to
> compare two index tuples, the start and the next one, to test if they're
> the same (in which case if I'm not mistaken probably we can compare item
> pointers). I've got this question when I was about to post a new version
> with changes to address feedback from Andy, now I'll combine them and
> send a cumulative patch.

This sounds like approximately the same problem as the one that _bt_killitems() has to deal with as of Postgres 13. It is handled in a way that is admittedly pretty tricky, even though the code does not need to be 100% certain that it's "the same" tuple. Deduplication makes that a somewhat fuzzy concept. In principle there could be one big index tuple instead of 5 tuples, even though the logical contents of the page have not changed between the time we recorded heap TIDs locally and the time _bt_killitems() tried to match on those heap TIDs to kill some index tuples (via kill_prior_tuple) -- a concurrent deduplication pass could do that. Your code needs to be prepared for stuff like that.
Ultimately posting list tuples are just a matter of understanding the on-disk representation -- a "Small Matter of Programming". Even without deduplication there are potential hazards from the physical deletion of LP_DEAD-marked tuples in _bt_vacuum_one_page() (which is not code that runs in VACUUM, despite the name).

Make sure that you hold a buffer pin on the leaf page throughout, because you need to do that to make sure that VACUUM cannot concurrently recycle heap TIDs. If VACUUM *is* able to concurrently recycle heap TIDs then it'll be subtly broken. _bt_killitems() is safe because it either holds on to a pin or gives up when the LSN changes at all. (ISTM that your only choice is to hold on to a leaf page pin, since you cannot just decide to give up in the way that _bt_killitems() sometimes can.)

Note that the rules surrounding buffer locks/pins for nbtree were tightened up a tiny bit today -- see commit 4a70f829. Also, it's no longer okay to use raw LockBuffer() calls in nbtree, so you're going to have to fix that up when you next rebase -- you must use the new _bt_lockbuf() wrapper function instead, so that the new Valgrind instrumentation is used. This shouldn't be hard.

Perhaps you can use Valgrind to verify that this patch doesn't have any unsafe buffer accesses. I recall problems like that in earlier versions of the patch series. Valgrind should be able to detect most bugs like that (though see comments within _bt_lockbuf() for details of a bug in this area that Valgrind cannot detect even now).

--
Peter Geoghegan