
From: Peter Geoghegan
Subject: Re: Index Skip Scan (new UniqueKeys)
Msg-id: CAH2-Wzkyx9A28bb5yNiN+6iRDJpGS0sQdAhyWEAA_yjrsh98-Q@mail.gmail.com
In response to: Re: Index Skip Scan (new UniqueKeys) (Dmitry Dolgov <9erthalion6@gmail.com>)
List: pgsql-hackers
On Sat, Jul 11, 2020 at 9:10 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > > +               currItem = &so->currPos.items[so->currPos.lastItem];
> > > +               itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
> > > +               nextOffset = ItemPointerGetOffsetNumber(&itup->t_tid);
>
> Do you mean this last part with t_tid, which could also have a tid array
> in case of posting tuple format?

Yeah. There is a TID array at the end of the tuple when the tuple is a
posting list tuple (as indicated by BTreeTupleIsPosting()). It isn't
safe to assume that t_tid is a heap TID for this reason, even in code
that only ever considers data items (that is, non-high-key tuples, AKA
non-pivot tuples) on a leaf page. (Though new/incoming tuples cannot
be posting list tuples either, so you'll see the assumption that t_tid
is just a heap TID in parts of nbtinsert.c -- though only for the
new/incoming item.)
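
To make that concrete, here is a minimal sketch of pulling heap TIDs
out of a leaf-page non-pivot tuple (the helper name is invented, not
something from your patch; the macros are the real ones from
src/include/access/nbtree.h as of Postgres 13):

/*
 * Sketch only: return the i'th heap TID from a leaf-page non-pivot
 * tuple, which may or may not be a posting list tuple.
 */
static ItemPointer
skipscan_get_heap_tid(IndexTuple itup, int i)
{
    if (BTreeTupleIsPosting(itup))
    {
        /* TID array at the end of the tuple, kept in sorted order */
        Assert(i < BTreeTupleGetNPosting(itup));
        return BTreeTupleGetPostingN(itup, i);
    }

    /* Plain non-pivot tuple: t_tid really is a heap TID */
    Assert(i == 0);
    return &itup->t_tid;
}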

> Well, it's obviously wrong, thanks for noticing. What is necessary is to
> compare two index tuples, the start and the next one, to test if they're
> the same (in which case if I'm not mistaken probably we can compare item
> pointers). I've got this question when I was about to post a new version
> with changes to address feedback from Andy, now I'll combine them and
> send a cumulative patch.

This sounds like approximately the same problem as the one that
_bt_killitems() has to deal with as of Postgres 13. This is handled in
a way that is admittedly pretty tricky, even though the code does not
need to be 100% certain that it's "the same" tuple. Deduplication kind
of makes that a fuzzy concept. In principle there could be one big
index tuple instead of 5 tuples, even though the logical contents of
the page have not changed between the time we recorded heap TIDs in
local memory and the time _bt_killitems() tries to match on those heap
TIDs to kill some index tuples (the kill_prior_tuple optimization) --
a concurrent deduplication pass could do that. Your code needs to be
prepared for stuff like that.
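
For example, matching a remembered heap TID against the current page
contents has to look inside posting lists, and can never rely on tuple
identity or offset numbers. A sketch (hypothetical helper, not what
_bt_killitems() literally does):

/*
 * Sketch only: does a leaf tuple still contain the heap TID we
 * remembered earlier?  A concurrent deduplication pass may have merged
 * the original tuple into a bigger posting list tuple, so match on
 * heap TIDs alone.
 */
static bool
tuple_contains_heap_tid(IndexTuple itup, ItemPointer htid)
{
    if (BTreeTupleIsPosting(itup))
    {
        for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
        {
            if (ItemPointerEquals(BTreeTupleGetPostingN(itup, i), htid))
                return true;
        }
        return false;
    }

    return ItemPointerEquals(&itup->t_tid, htid);
}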

Ultimately posting list tuples are just a matter of understanding the
on-disk representation -- a "Small Matter of Programming". Even
without deduplication there are potential hazards from the physical
deletion of LP_DEAD-marked tuples in _bt_vacuum_one_page() (which is
not code that runs in VACUUM, despite the name). Make sure that you
hold a buffer pin on the leaf page throughout, because you need to do
that to make sure that VACUUM cannot concurrently recycle heap TIDs.
If VACUUM *is* able to concurrently recycle heap TIDs, then your code
will be subtly broken. _bt_killitems() is safe because it either holds on to a
pin or gives up when the LSN changes at all. (ISTM that your only
choice is to hold on to a leaf page pin, since you cannot just decide
to give up in the way that _bt_killitems() sometimes can.)
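
The overall pattern in _bt_killitems() is roughly the following
(paraphrased, not verbatim; "rel" here stands for the index relation):

if (BTScanPosIsPinned(so->currPos))
{
    /* Pin held throughout: VACUUM cannot have recycled our TIDs */
    _bt_lockbuf(rel, so->currPos.buf, BT_READ);
}
else
{
    Buffer      buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ);

    if (BufferGetLSNAtomic(buf) != so->currPos.lsn)
    {
        /* Page modified since we read it -- saved TIDs can't be trusted */
        _bt_relbuf(rel, buf);
        return;
    }
    so->currPos.buf = buf;
}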

Note that the rules surrounding buffer locks/pins for nbtree were
tightened up a tiny bit today -- see commit 4a70f829. Also, it's no
longer okay to use raw LockBuffer() calls in nbtree, so you're going
to have to fix that up when you next rebase -- you must use the new
_bt_lockbuf() wrapper function instead, so that the new Valgrind
instrumentation is used. This shouldn't be hard.
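
The mechanical change is something like this (sketch; again "rel" is
the index relation):

/* Before (raw bufmgr calls, no longer okay in nbtree): */
LockBuffer(buf, BT_READ);
/* ... examine the page ... */
LockBuffer(buf, BUFFER_LOCK_UNLOCK);

/* After commit 4a70f829, so the Valgrind instrumentation kicks in: */
_bt_lockbuf(rel, buf, BT_READ);
/* ... examine the page ... */
_bt_unlockbuf(rel, buf);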

Perhaps you can use Valgrind to verify that this patch doesn't have
any unsafe buffer accesses. I recall problems like that in earlier
versions of the patch series. Valgrind should be able to detect most
bugs like that (though see comments within _bt_lockbuf() for details
of a bug in this area that Valgrind cannot detect even now).

-- 
Peter Geoghegan
