Re: Index Skip Scan - Mailing list pgsql-hackers
From | Dmitry Dolgov |
---|---|
Subject | Re: Index Skip Scan |
Date | |
Msg-id | 20200122213603.ndlm6t4ryivdwxs7@localhost Whole thread Raw |
In response to | Re: Index Skip Scan (Floris Van Nee <florisvannee@Optiver.com>) |
Responses |
Re: Index Skip Scan
Re: Index Skip Scan |
List | pgsql-hackers |
> On Wed, Jan 22, 2020 at 05:24:43PM +0000, Floris Van Nee wrote: > > > > Anyone please correct me if I'm wrong, but I think one case where the current patch relies on some data from the pageit has locked before it in checking this hi/lo key. I think it's possible for the following sequence to happen. Supposewe have a very simple one leaf-page btree containing four elements: leaf page 1 = [2,4,6,8] > > > We do a backwards index skip scan on this and have just returned our first tuple (8). The buffer is left pinned butunlocked. Now, someone else comes in and inserts a tuple (value 5) into this page, but suppose the page happens to befull. So a page split occurs. As far as I know, a page split could happen at any random element in the page. One of thesituations we could be left with is: > > > Leaf page 1 = [2,4] > > > Leaf page 2 = [5,6,8] > > > However, our scan is still pointing to leaf page 1. > > > In case if we just returned a tuple, the next action would be either > >> check the next page for another key or search down to the tree. Maybe > > But it won't look at the 'next page for another key', but rather at the 'same page or another key', right? In the _bt_scankey_within_pageshortcut we're taking, there's no stepping to a next page involved. It just locks the page again thatit previously also locked. Yep, it would look only on the same page. Not sure what do you mean by "another key", if the current key is not found within the current page at the first stage, we restart from the root. > > I'm missing something in your scenario, but the latter will land us on a > > required page (we do not point to any leaf here), and before the former > > there is a check for high/low key. Is there anything else missing? > > Let me try to clarify. After we return the first tuple, so->currPos.buf is pointing to page=1 in my example (it's the onlypage after all). We've returned item=8. Then the split happens and the items get rearranged as in my example. We're stillpointing with so->currPos.buf to page=1, but the page now contains [2,4]. The split happened to the right, so there'sa page=2 with [5,6,8], however the ongoing index scan is unaware of that. > Now _bt_skip gets called to fetch the next tuple. It starts by checking _bt_scankey_within_page(scan, so->skipScanKey,so->currPos.buf, dir), the result of which will be 'true': we're comparing the skip key to the low key ofthe page. So it thinks the next key can be found on the current page. It locks the page and does a _binsrch, finding item=4to be returned. > > The problem here is that _bt_scankey_within_page mistakenly returns true, thereby limiting the search to just the pagethat it's pointing to already. > It may be fine to just fix this function to return the proper value (I guess it'd also need to look at the high key inthis example). It could also be fixed by not looking at the lo/hi key of the page, but to use the local tuple buffer instead.We already did a _read_page once, so if we have any matching tuples on that specific page, we have them locally inthe buffer already. That way we never need to lock the same page twice. Oh, that's what you mean. Yes, I was somehow tricked by the name of this function and didn't notice that it checks only one boundary, so in case of backward scan it returns wrong result. I think in the situation you've describe it would actually not find any item on the current page and restart from the root, but nevertheless we need to check for both keys in _bt_scankey_within_page.
pgsql-hackers by date: