Re: Index Skip Scan - Mailing list pgsql-hackers

From Dmitry Dolgov
Subject Re: Index Skip Scan
Date
Msg-id 20200127103016.ieflsboqveb6iduc@localhost
Whole thread Raw
In response to Re: Index Skip Scan  (Dmitry Dolgov <9erthalion6@gmail.com>)
Responses RE: Index Skip Scan  (Floris Van Nee <florisvannee@Optiver.com>)
List pgsql-hackers
> On Wed, Jan 22, 2020 at 10:36:03PM +0100, Dmitry Dolgov wrote:
>
> > Let me try to clarify. After we return the first tuple, so->currPos.buf is pointing to page=1 in my example (it's
theonly page after all). We've returned item=8. Then the split happens and the items get rearranged as in my example.
We'restill pointing with so->currPos.buf to page=1, but the page now contains [2,4]. The split happened to the right,
sothere's a page=2 with [5,6,8], however the ongoing index scan is unaware of that.
 
> > Now _bt_skip gets called to fetch the next tuple. It starts by checking _bt_scankey_within_page(scan,
so->skipScanKey,so->currPos.buf, dir), the result of which will be 'true': we're comparing the skip key to the low key
ofthe page. So it thinks the next key can be found on the current page. It locks the page and does a _binsrch, finding
item=4to be returned.
 
> >
> > The problem here is that _bt_scankey_within_page mistakenly returns true, thereby limiting the search to just the
pagethat it's pointing to already.
 
> > It may be fine to just fix this function to return the proper value (I guess it'd also need to look at the high key
inthis example). It could also be fixed by not looking at the lo/hi key of the page, but to use the local tuple buffer
instead.We already did a _read_page once, so if we have any matching tuples on that specific page, we have them locally
inthe buffer already. That way we never need to lock the same page twice.
 
>
> Oh, that's what you mean. Yes, I was somehow tricked by the name of this
> function and didn't notice that it checks only one boundary, so in case
> of backward scan it returns wrong result. I think in the situation
> you've describe it would actually not find any item on the current page
> and restart from the root, but nevertheless we need to check for both
> keys in _bt_scankey_within_page.

Thanks again everyone for commentaries and clarification. Here is the
version, where hopefully I've addressed all the mentioned issues.

As mentioned in the _bt_skip commentaries, before we were moving left to
check the next page to avoid significant issues in case if ndistinct was
underestimated and we need to skip too often. To make it work safe in
presense of splits we need to remember an original page and move right
again until we find a page with the right link pointing to it. It's not
clear whether it's worth to increase complexity for such sort of "edge
case" with ndistinct estimation while moving left, so at least for now
we ignore this in the implementation and just start from the root
immediately.

Offset based code in moving forward/reading backward was replaced with
remembering a start index tuple and an attempt to find it on the new
page. Also a missing page lock before _bt_scankey_within_page was added
and _bt_scankey_within_page checks for both page boundaries.

Attachment

pgsql-hackers by date:

Previous
From: Kyotaro Horiguchi
Date:
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Next
From: Kyotaro Horiguchi
Date:
Subject: Re: [PATCH] /src/backend/access/transam/xlog.c, tiny improvements