Thread: Index Skip Scan (new UniqueKeys)
Hi,

Here is a new version of the index skip scan patch, based on the v8 patch for the UniqueKeys implementation from [1]. I want to start a new thread to simplify navigation; hopefully I didn't forget anyone who actively participated in the discussion.

To simplify reviewing I've split it into several parts:

* The first two are taken from [1], just for reference and to make cfbot happy.
* Extend-UniqueKeys consists of the changes that need to be made for UniqueKeys to be used in skip scan. Essentially this is a reduced version of the previous implementation from Jesper & David, based on the new UniqueKeys infrastructure, as it does the very same thing.
* Index-Skip-Scan contains the non-AM-specific code and the overall infrastructure, including configuration elements and so on.
* Btree-implementation contains the btree-specific code to implement amskip, introduced in the previous patch.
* The last one contains just the documentation bits.

Interestingly enough, with the new UniqueKeys implementation skipping is unexpectedly applied in some tests. For those I've disabled indexskipscan to avoid confusion.

[1]: https://www.postgresql.org/message-id/flat/CAKU4AWrwZMAL%3DuaFUDMf4WGOVkEL3ONbatqju9nSXTUucpp_pw%40mail.gmail.com
Attachment
On Tue, Jun 9, 2020 at 6:20 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> Hi,
>
> Here is a new version of the index skip scan patch, based on the v8 patch for
> the UniqueKeys implementation from [1]. I want to start a new thread to
> simplify navigation; hopefully I didn't forget anyone who actively
> participated in the discussion.
>
> To simplify reviewing I've split it into several parts:
> * The first two are taken from [1], just for reference and to make cfbot happy.
> * Extend-UniqueKeys consists of the changes that need to be made for
> UniqueKeys to be used in skip scan. Essentially this is a reduced
> version of the previous implementation from Jesper & David, based on the
> new UniqueKeys infrastructure, as it does the very same thing.
> * Index-Skip-Scan contains the non-AM-specific code and the overall
> infrastructure, including configuration elements and so on.
> * Btree-implementation contains the btree-specific code to implement amskip,
> introduced in the previous patch.
> * The last one contains just the documentation bits.
>
> Interestingly enough, with the new UniqueKeys implementation skipping is
> unexpectedly applied in some tests. For those I've disabled
> indexskipscan to avoid confusion.
>
> [1]: https://www.postgresql.org/message-id/flat/CAKU4AWrwZMAL%3DuaFUDMf4WGOVkEL3ONbatqju9nSXTUucpp_pw%40mail.gmail.com
Thanks for the patch.
I've just gotten a rough idea of the patch; it looks like we have to narrow down
the use cases where we can use this method. Consider the example below:
create table j1(i int, im5 int, im100 int, im1000 int);
insert into j1 select i, i%5, i%100, i%1000 from generate_series(1, 10000000)i;
create index on j1(im5, i);
insert into j1 values (1, 1, 0, 0);
analyze j1;
demo=# select distinct * from j1 where i < 2;
i | im5 | im100 | im1000
---+-----+-------+--------
1 | 1 | 1 | 1
(1 row)
demo=# set enable_indexskipscan to off;
SET
demo=# select distinct * from j1 where i < 2;
i | im5 | im100 | im1000
---+-----+-------+--------
1 | 1 | 0 | 0
1 | 1 | 1 | 1
(2 rows)
drop index j1_im5_i_idx;
create index on j1(im5, i, im100);
demo=# select distinct im5, i, im100 from j1 where i < 2;
im5 | i | im100
-----+---+-------
1 | 1 | 0
1 | 1 | 1
(2 rows)
demo=# set enable_indexskipscan to on;
SET
demo=# select distinct im5, i, im100 from j1 where i < 2;
im5 | i | im100
-----+---+-------
1 | 1 | 0
(1 row)
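The failure mode in the example above can be reproduced outside the database with a toy simulation (illustrative only, not patch code): a skip scan over an index on (im5, i) deduplicates on that prefix, but the rows also differ in im100/im1000, which the index never sees.

```python
import itertools

# Two rows that agree on the indexed columns (im5, i) but differ in the
# non-indexed columns (im100, im1000); both satisfy i < 2.
rows = [(1, 1, 1, 1), (1, 1, 0, 0)]  # (i, im5, im100, im1000)

# A skip scan over an index on (im5, i) emits one row per distinct
# (im5, i) prefix and jumps over the rest.
index_order = sorted(rows, key=lambda r: (r[1], r[0]))
skip_result = [next(g) for _, g
               in itertools.groupby(index_order, key=lambda r: (r[1], r[0]))]

# SELECT DISTINCT * must keep both rows, since they differ in im100/im1000,
# but the prefix-based skip collapses them into one.
assert len(set(rows)) == 2
assert len(skip_result) == 1
```

This is exactly the discrepancy shown in the psql output: one row with skipping enabled, two rows without.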
Best Regards
Andy Fan
> On Thu, Jun 11, 2020 at 04:14:07PM +0800, Andy Fan wrote:
>
> I've just gotten a rough idea of the patch; it looks like we have to narrow
> down the use cases where we can use this method.

Hi,

Not exactly narrow it down, but rather get rid of a wrong usage of skipping for
index scan. Since skipping for it was added later than for index only scan, I
can imagine there are still blind spots, so it's good that you've looked. In
this particular case, when the index expressions do not fully cover the
expressions the result needs to be distinct on, skipping just doesn't have
enough information and should not be used. I'll add it to the next version,
thanks!
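The rule sketched in this reply — skipping is only safe when the index fully covers the expressions the result must be distinct on — can be written as a standalone predicate. This is a hypothetical illustration, not the actual planner API:

```python
def skip_scan_applicable(distinct_exprs, index_exprs):
    """Hypothetical planner-side predicate (not the patch's real code):
    skipping may only drive DISTINCT elimination when every expression
    the result must be distinct on is covered by the index; otherwise
    jumping over an index prefix silently drops rows that differ only
    in uncovered columns."""
    return set(distinct_exprs) <= set(index_exprs)

# SELECT DISTINCT * needs (i, im5, im100, im1000), but the index only
# has (im5, i), so a skip scan must not be used:
assert not skip_scan_applicable({"i", "im5", "im100", "im1000"},
                                {"im5", "i"})
# SELECT DISTINCT im5, i over the same index is fine:
assert skip_scan_applicable({"im5", "i"}, {"im5", "i"})
```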
On Tue, Jun 9, 2020 at 3:20 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> * Btree-implementation contains btree specific code to implement amskip,
> introduced in the previous patch.

The way that you're dealing with B-Tree tuples here needs to account
for posting list tuples:

> +			currItem = &so->currPos.items[so->currPos.lastItem];
> +			itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
> +			nextOffset = ItemPointerGetOffsetNumber(&itup->t_tid);

But I wonder more generally what the idea here is. The comments that
immediately follow provide some hints:

> +			/*
> +			 * To check if we returned the same tuple, try to find a
> +			 * startItup on the current page. For that we need to update
> +			 * scankey to match the whole tuple and set nextkey to return
> +			 * an exact tuple, not the next one. If the nextOffset is the
> +			 * same as before, it means we are in the loop, return offnum
> +			 * to the original position and jump further
> +			 */

Why does it make sense to use the offset number like this? It isn't
stable or reliable. The patch goes on to do this:

> +			startOffset = _bt_binsrch(scan->indexRelation,
> +									  so->skipScanKey,
> +									  so->currPos.buf);
> +
> +			page = BufferGetPage(so->currPos.buf);
> +			maxoff = PageGetMaxOffsetNumber(page);
> +
> +			if (nextOffset <= startOffset)
> +			{

Why compare a heap TID's offset number (an offset number for a heap
page) to another offset number for a B-Tree leaf page? They're
fundamentally different things.

--
Peter Geoghegan
Hi Dmitry,

Also took another look at the patch now, and found a case of incorrect
data. It looks related to the new way of creating the paths, as I can't
recall seeing this in earlier versions.

create table t1 as select a,b,b%5 as c, random() as d from generate_series(1, 10) a, generate_series(1,100) b;
create index on t1 (a,b,c);

postgres=# explain select distinct on (a) * from t1 order by a,b desc,c;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Sort  (cost=2.92..2.94 rows=10 width=20)
   Sort Key: a, b DESC, c
   ->  Index Scan using t1_a_b_c_idx on t1  (cost=0.28..2.75 rows=10 width=20)
         Skip scan: true
(4 rows)

With the 'order by a, b desc, c' we expect the value of column 'b' to
always be 100. With index_skipscan enabled, it always gives 1 though.
It's not correct that the planner chooses a skip scan followed by sort
in this case.

-Floris
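Why the skip-scan-plus-sort plan is wrong here can be shown with a toy simulation (not patch code): the skip scan returns, per prefix `a`, the first tuple in index order (`b` ascending), and a sort applied afterwards cannot recover the `b=100` rows that DISTINCT ON with `ORDER BY a, b DESC` requires, because those rows were already skipped.

```python
import itertools

# Simulated table rows with an index on (a, b); index order is (a ASC, b ASC).
rows = [(a, b) for a in range(1, 4) for b in range(1, 101)]
index_order = sorted(rows)

def skip_scan_first_per_prefix(ordered):
    """A skip scan emits only the first index tuple of each distinct
    'a' prefix, then jumps to the next prefix."""
    return [next(g) for _, g in itertools.groupby(ordered, key=lambda t: t[0])]

# Correct answer for DISTINCT ON (a) ... ORDER BY a, b DESC: b must be 100.
correct = [max((r for r in rows if r[0] == a), key=lambda t: t[1])
           for a in range(1, 4)]
assert all(b == 100 for _, b in correct)

# The skip scan over the ASC index returns b=1 for every prefix; sorting
# its output afterwards cannot bring back the rows that were never scanned.
skipped = skip_scan_first_per_prefix(index_order)
assert all(b == 1 for _, b in skipped)
```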
> On Wed, Jul 08, 2020 at 03:44:26PM -0700, Peter Geoghegan wrote:
>
> The way that you're dealing with B-Tree tuples here needs to account
> for posting list tuples:
>
> > +			currItem = &so->currPos.items[so->currPos.lastItem];
> > +			itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
> > +			nextOffset = ItemPointerGetOffsetNumber(&itup->t_tid);

Do you mean this last part with t_tid, which could also have a tid array
in case of the posting tuple format?

> > +			/*
> > +			 * To check if we returned the same tuple, try to find a
> > +			 * startItup on the current page. For that we need to update
> > +			 * scankey to match the whole tuple and set nextkey to return
> > +			 * an exact tuple, not the next one. If the nextOffset is the
> > +			 * same as before, it means we are in the loop, return offnum
> > +			 * to the original position and jump further
> > +			 */
>
> Why does it make sense to use the offset number like this? It isn't
> stable or reliable. The patch goes on to do this:
>
> > +			startOffset = _bt_binsrch(scan->indexRelation,
> > +									  so->skipScanKey,
> > +									  so->currPos.buf);
> > +
> > +			page = BufferGetPage(so->currPos.buf);
> > +			maxoff = PageGetMaxOffsetNumber(page);
> > +
> > +			if (nextOffset <= startOffset)
> > +			{
>
> Why compare a heap TID's offset number (an offset number for a heap
> page) to another offset number for a B-Tree leaf page? They're
> fundamentally different things.

Well, it's obviously wrong, thanks for noticing. What is necessary is to
compare two index tuples, the start and the next one, to test if they're
the same (in which case, if I'm not mistaken, we can probably compare item
pointers). I got this question when I was about to post a new version with
changes to address the feedback from Andy; now I'll combine them and send
a cumulative patch.
> On Fri, Jul 10, 2020 at 05:03:37PM +0000, Floris Van Nee wrote:
>
> Also took another look at the patch now, and found a case of incorrect
> data. It looks related to the new way of creating the paths, as I
> can't recall seeing this in earlier versions.
>
> create table t1 as select a,b,b%5 as c, random() as d from generate_series(1, 10) a, generate_series(1,100) b;
> create index on t1 (a,b,c);
>
> postgres=# explain select distinct on (a) * from t1 order by a,b desc,c;
>                                   QUERY PLAN
> -------------------------------------------------------------------------------
>  Sort  (cost=2.92..2.94 rows=10 width=20)
>    Sort Key: a, b DESC, c
>    ->  Index Scan using t1_a_b_c_idx on t1  (cost=0.28..2.75 rows=10 width=20)
>          Skip scan: true
> (4 rows)

Good point, thanks for looking at this. With the latest planner version
there are indeed more possibilities to use skipping. It never occurred to
me that some of those paths would still rely on the index scan returning
the full data set. I'll look into the details and add verification to
prevent putting something like this on top of a skip scan in the next
version.
> Good point, thanks for looking at this. With the latest planner version
> there are indeed more possibilities to use skipping. It never occurred to
> me that some of those paths would still rely on the index scan returning
> the full data set. I'll look into the details and add verification to
> prevent putting something like this on top of a skip scan in the next
> version.

I believe the required changes are something like in the attached patch. There were a few things I changed:

- build_uniquekeys was constructing the list incorrectly. For a DISTINCT a,b, it would create two unique keys, one with a and one with b. However, it should be one unique key with (a,b).
- the uniquekeys that are built still contain some redundant keys that are normally eliminated from the path keys lists.
- the distinct_pathkeys may be NULL, even though there's a possibility for skipping. But it wouldn't create the uniquekeys in this case. This makes the planner not choose skip scans even though it could. For example in queries that do SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b; since a is constant, it's eliminated from the regular pathkeys.
- to combat the issues mentioned earlier, there's now a check in build_index_paths that checks if the query_pathkeys matches the useful_pathkeys. Note that we have to use the path keys here rather than any of the unique keys. The unique keys are only Expr nodes - they do not contain the necessary information about ordering. Due to elimination of some constant path keys, we have to search the attributes of the index to find the correct prefix to use in skipping.
- creating the skip scan path did not actually fill the Path's unique keys. It should just set this to query_uniquekeys.

I've attached the first two unique-keys patches (v9, 0001, 0002), your patches rebased on v9 of unique keys (0003-0006), plus a diff patch (0007) that applies my suggested changes on top of it.

-Floris
Attachment
> I've attached the first two unique-keys patches (v9, 0001, 0002), your
> patches rebased on v9 of unique keys (0003-0006), plus a diff patch (0007)
> that applies my suggested changes on top of it.

I just realized there's another thing that looks a bit strange too. From reading the thread, I thought it should be the case that in create_distinct_paths, it is checked whether the uniqueness in the unique_pathlist matches the uniqueness that is needed by the query. This means that I think what it should be comparing is this:

- The generated index path should have some path-level unique keys set.
- The path-level unique keys must be at least as strict as the query-level unique keys. E.g. if path-level is unique on (a), then query-level must be (a), or possibly (a,b).

I've changed the patch to compare the path-level keys (set in create index path) with the query-level keys in create_distinct_path. Currently, I don't think the previous implementation was an actual issue leading to incorrect queries, but it would be causing problems if we tried to extend the uniqueness for distinct to join rels etc.

One question about the unique keys - probably for Andy or David: I've looked in the archives to find arguments for/against using Expr nodes or EquivalenceClasses in the UniqueKeys patch. However, I couldn't really find a clear answer about why the current patch uses Expr rather than EquivalenceClasses. At some point David mentioned "that probably Expr nodes were needed rather than EquivalenceClasses", but it's not really clear to me why. What were the thoughts behind this?

-Floris
Attachment
> On Sun, Jul 12, 2020 at 12:48:47PM +0000, Floris Van Nee wrote:
>
> > Good point, thanks for looking at this. With the latest planner version
> > there are indeed more possibilities to use skipping. It never occurred to
> > me that some of those paths would still rely on the index scan returning
> > the full data set. I'll look into the details and add verification to
> > prevent putting something like this on top of a skip scan in the next
> > version.
>
> I believe the required changes are something like in the attached patch.
> There were a few things I changed:
> - build_uniquekeys was constructing the list incorrectly. For a DISTINCT
>   a,b, it would create two unique keys, one with a and one with b. However,
>   it should be one unique key with (a,b).

Yes, I've also noticed that while preparing the fix for an index scan not
covered by the index, and included it.

> - the uniquekeys that are built still contain some redundant keys that are
>   normally eliminated from the path keys lists.

I guess you're talking about:

+		if (EC_MUST_BE_REDUNDANT(ec))
+			continue;

Can you add some test cases to your changes to show the effect of it? It
seems to me redundant keys are already eliminated at this point by either
make_pathkeys_for_uniquekeys or even earlier for distinct on, but it could
be that I've missed something.

Along the same lines I'm also curious about this part:

-		ListCell   *k;
-		List	   *exprs = NIL;
-
-		foreach(k, ec->ec_members)
-		{
-			EquivalenceMember *mem = (EquivalenceMember *) lfirst(k);
-			exprs = lappend(exprs, mem->em_expr);
-		}
-
-		result = lappend(result, makeUniqueKey(exprs, false, false));
+		EquivalenceMember *mem = (EquivalenceMember*) lfirst(list_head(ec->ec_members));

I'm curious about this myself, maybe someone can clarify. It looks like,
generally speaking, there could be more than one member (if not
ec_has_volatile), "representing knowledge that multiple items are
effectively equal". Is this information not interesting enough to preserve
it in unique keys?
> - the distinct_pathkeys may be NULL, even though there's a possibility for
>   skipping. But it wouldn't create the uniquekeys in this case. This makes
>   the planner not choose skip scans even though it could. For example in
>   queries that do SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b;
>   since a is constant, it's eliminated from the regular pathkeys.

What would be the point of skipping if it's a constant?

> - to combat the issues mentioned earlier, there's now a check in
>   build_index_paths that checks if the query_pathkeys matches the
>   useful_pathkeys. Note that we have to use the path keys here rather than
>   any of the unique keys. The unique keys are only Expr nodes - they do not
>   contain the necessary information about ordering. Due to elimination of
>   some constant path keys, we have to search the attributes of the index to
>   find the correct prefix to use in skipping.

IIUC here you mean this function, right?

+		prefix = find_index_prefix_for_pathkey(root,
+											   index,
+											   BackwardScanDirection,
+											   llast_node(PathKey,
+														  root->distinct_pathkeys));

Doesn't it duplicate the job already done in build_index_pathkeys by
building those pathkeys again? If yes, probably it's possible to reuse
useful_pathkeys. Not sure about unordered indexes, but it looks like
query_pathkeys should also match in this case.

Will also look at the follow-up questions in the next email.
> > - the uniquekeys that are built still contain some redundant keys that are
> >   normally eliminated from the path keys lists.
>
> I guess you're talking about:
>
> +		if (EC_MUST_BE_REDUNDANT(ec))
> +			continue;
>
> Can you add some test cases to your changes to show the effect of it? It
> seems to me redundant keys are already eliminated at this point by either
> make_pathkeys_for_uniquekeys or even earlier for distinct on, but it could
> be that I've missed something.

The build_uniquekeys function calls make_pathkeys_for_uniquekeys, which checks
for uniqueness using pathkey_is_unique, but not for constantness. Consider a
query like:

SELECT DISTINCT ON (a,b) * FROM t1 WHERE a=1 ORDER BY a,b,c

All the regular path keys filter out 'a' for constantness here - so they would
end up with distinct pathkeys of (b) and sort path keys of (b,c). The unique
keys would end up with (a,b) though. In the previous patch, it compared the
pathkeys to the unique keys in create_distinct_paths, so it wouldn't consider
the skip scan.

Due to the other changes I made in create_distinct_paths/query_has_uniquekeys_for,
it will choose a correct plan now, even without the EC_MUST_BE_REDUNDANT check,
so it's difficult to give an actual failing test case now. However, since all
code filters out constant keys, I think uniqueness should do it too - otherwise
you could get into problems later on. It's also more consistent: if you already
know something is unique by just (b), it doesn't make sense to store that it's
unique by (a,b). Now that I think of it, the best place to do this
EC_MUST_BE_REDUNDANT check is probably inside make_pathkeys_for_uniquekeys,
rather than build_uniquekeys. It's probably good to move it there.

> Along the same lines I'm also curious about this part:
>
> -		ListCell   *k;
> -		List	   *exprs = NIL;
> -
> -		foreach(k, ec->ec_members)
> -		{
> -			EquivalenceMember *mem = (EquivalenceMember *) lfirst(k);
> -			exprs = lappend(exprs, mem->em_expr);
> -		}
> -
> -		result = lappend(result, makeUniqueKey(exprs, false, false));
> +		EquivalenceMember *mem = (EquivalenceMember*) lfirst(list_head(ec->ec_members));
>
> I'm curious about this myself, maybe someone can clarify. It looks like,
> generally speaking, there could be more than one member (if not
> ec_has_volatile), "representing knowledge that multiple items are
> effectively equal". Is this information not interesting enough to preserve
> it in unique keys?

Yeah, that's a good question. Hence my question about the choice for Expr
rather than EquivalenceClass for the UniqueKeys patch to Andy/David. When
storing just Expr, it is rather difficult to check equivalence in joins, for
example. Suppose, later on, we decide to add join support to the distinct skip
scan. Consider a query like this:

SELECT DISTINCT t1.a FROM t1 JOIN t2 ON t1.a=t2.a

As far as my understanding goes (I didn't look into it in detail though), I
think here the distinct_pathkey will have an EqClass {t1.a, t2.a}. That
results in a UniqueKey with expr (t1.a) (because currently we only take the
first Expr in the list to construct the UniqueKey). We could also construct
*two* unique keys, one with Expr (t1.a) and one with Expr (t2.a), but I don't
think that's the correct approach either, as it will explode when you have
multiple pathkeys, each having multiple Exprs inside their EqClasses.

That makes it difficult to check if we can perform the DISTINCT skip scan on
table t2 as well (theoretically we could, but for that we need to check
equivalence classes rather than expressions).

> > - the distinct_pathkeys may be NULL, even though there's a possibility for
> >   skipping. But it wouldn't create the uniquekeys in this case. This makes
> >   the planner not choose skip scans even though it could. For example in
> >   queries that do SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b;
> >   since a is constant, it's eliminated from the regular pathkeys.
>
> What would be the point of skipping if it's a constant?

For the query: SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b;
there may be 1000s of records with a=1. We're only interested in the first one
though. The traditional non-skip approach would still scan all records with
a=1. Skip would just fetch the first one with a=1 and then skip to the next
prefix and stop the scan.

> > - to combat the issues mentioned earlier, there's now a check in
> >   build_index_paths that checks if the query_pathkeys matches the
> >   useful_pathkeys. Note that we have to use the path keys here rather than
> >   any of the unique keys. The unique keys are only Expr nodes - they do not
> >   contain the necessary information about ordering. Due to elimination of
> >   some constant path keys, we have to search the attributes of the index to
> >   find the correct prefix to use in skipping.
>
> IIUC here you mean this function, right?
>
> +		prefix = find_index_prefix_for_pathkey(root,
> +											   index,
> +											   BackwardScanDirection,
> +											   llast_node(PathKey,
> +														  root->distinct_pathkeys));
>
> Doesn't it duplicate the job already done in build_index_pathkeys by
> building those pathkeys again? If yes, probably it's possible to reuse
> useful_pathkeys. Not sure about unordered indexes, but it looks like
> query_pathkeys should also match in this case.

Yeah, there's definitely some double work there, but the actual impact may be
limited - it doesn't actually allocate a new path key, but looks it up in
root->canon_pathkeys and returns that path key.
I wrote it like this because I couldn't find a way to identify, from a certain
PathKey, the actual location in the index of that column. The constructed path
keys list filters out all redundant path keys. An index on (a,a,b,a,b) becomes
path keys (a,b). Now if we skip on (a,b) we actually need to use prefix=3. But
how to get from PathKey=b to that number 3, I didn't find a solid way except
doing this. Maybe there is one though?

-Floris
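The prefix question in this last paragraph can be illustrated with a small sketch. The function name and representation are made up for illustration; the real lookup would have to walk the index's column list rather than the deduplicated pathkeys:

```python
def skip_prefix_for_column(index_columns, target_column):
    """For an index whose column list may contain duplicates, the skip
    prefix must cover every index column up to and including the first
    occurrence of the target pathkey's column. Returns that 1-based
    prefix length, or None if the column isn't in the index."""
    for pos, col in enumerate(index_columns, start=1):
        if col == target_column:
            return pos
    return None

# An index on (a,a,b,a,b) yields path keys (a,b); skipping on (a,b)
# needs prefix=3, because 'b' first appears as the third index column.
assert skip_prefix_for_column(["a", "a", "b", "a", "b"], "b") == 3
assert skip_prefix_for_column(["a", "a", "b", "a", "b"], "a") == 1
```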
On Sat, Jul 11, 2020 at 9:10 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > > +			currItem = &so->currPos.items[so->currPos.lastItem];
> > > +			itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
> > > +			nextOffset = ItemPointerGetOffsetNumber(&itup->t_tid);
>
> Do you mean this last part with t_tid, which could also have a tid array
> in case of the posting tuple format?

Yeah. There is a TID array at the end of the index tuple when the tuple is
a posting list tuple (as indicated by BTreeTupleIsPosting()). It isn't safe
to assume that t_tid is a heap TID for this reason, even in code that only
ever considers data items (that is, non-high-key tuples AKA non-pivot
tuples) on a leaf page. (Though new/incoming tuples cannot be posting list
tuples either, so you'll see the assumption that t_tid is just a heap TID
in parts of nbtinsert.c -- though only for the new/incoming item.)

> Well, it's obviously wrong, thanks for noticing. What is necessary is to
> compare two index tuples, the start and the next one, to test if they're
> the same (in which case, if I'm not mistaken, we can probably compare item
> pointers). I got this question when I was about to post a new version with
> changes to address the feedback from Andy; now I'll combine them and send
> a cumulative patch.

This sounds like approximately the same problem as the one that
_bt_killitems() has to deal with as of Postgres 13. This is handled in a
way that is admittedly pretty tricky, even though the code does not need
to be 100% certain that it's "the same" tuple. Deduplication kind of makes
that a fuzzy concept. In principle there could be one big index tuple
instead of 5 tuples, even though the logical contents of the page have not
been changed between the time we recorded heap TIDs in local state and the
time _bt_killitems() tried to match on those heap TIDs to
kill_prior_tuple-kill some index tuples -- a concurrent deduplication pass
could do that. Your code needs to be prepared for stuff like that.

Ultimately posting list tuples are just a matter of understanding the
on-disk representation -- a "Small Matter of Programming". Even without
deduplication there are potential hazards from the physical deletion of
LP_DEAD-marked tuples in _bt_vacuum_one_page() (which is not code that
runs in VACUUM, despite the name). Make sure that you hold a buffer pin
on the leaf page throughout, because you need to do that to make sure that
VACUUM cannot concurrently recycle heap TIDs. If VACUUM *is* able to
concurrently recycle heap TIDs then it'll be subtly broken.
_bt_killitems() is safe because it either holds on to a pin or gives up
when the LSN changes at all. (ISTM that your only choice is to hold on to
a leaf page pin, since you cannot just decide to give up in the way that
_bt_killitems() sometimes can.)

Note that the rules surrounding buffer locks/pins for nbtree were
tightened up a tiny bit today -- see commit 4a70f829. Also, it's no longer
okay to use raw LockBuffer() calls in nbtree, so you're going to have to
fix that up when you next rebase -- you must use the new _bt_lockbuf()
wrapper function instead, so that the new Valgrind instrumentation is
used. This shouldn't be hard. Perhaps you can use Valgrind to verify that
this patch doesn't have any unsafe buffer accesses. I recall problems like
that in earlier versions of the patch series. Valgrind should be able to
detect most bugs like that (though see comments within _bt_lockbuf() for
details of a bug in this area that Valgrind cannot detect even now).

--
Peter Geoghegan
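The posting-list hazard described above can be modeled in a few lines. This is a deliberately simplified stand-in for the on-disk format, not the actual nbtree structs: a leaf tuple may carry either one heap TID or an array of them, so code that reads `t_tid` as a single heap TID must first check for the posting-list case.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LeafTuple:
    """Toy stand-in for a B-Tree leaf tuple: after deduplication a
    single tuple may represent many heap TIDs (a posting list)."""
    key: int
    heap_tid: Optional[tuple] = None      # plain tuple: one (block, offset)
    posting_list: Optional[list] = None   # posting tuple: many TIDs, sorted

    def is_posting(self) -> bool:
        return self.posting_list is not None

    def all_heap_tids(self) -> list:
        # Correct: handle both representations explicitly.
        return self.posting_list if self.is_posting() else [self.heap_tid]

plain = LeafTuple(key=1, heap_tid=(10, 3))
posting = LeafTuple(key=2, posting_list=[(10, 4), (11, 1), (12, 7)])

assert plain.all_heap_tids() == [(10, 3)]
# Naively treating a posting tuple's "t_tid" field as one heap TID would
# miss two of its three TIDs:
assert len(posting.all_heap_tids()) == 3
```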
> On Tue, Jul 14, 2020 at 06:18:50PM +0000, Floris Van Nee wrote:
>
> Due to the other changes I made in
> create_distinct_paths/query_has_uniquekeys_for, it will choose a correct
> plan now, even without the EC_MUST_BE_REDUNDANT check, so it's difficult
> to give an actual failing test case now. However, since all code filters
> out constant keys, I think uniqueness should do it too - otherwise you
> could get into problems later on. It's also more consistent: if you
> already know something is unique by just (b), it doesn't make sense to
> store that it's unique by (a,b). Now that I think of it, the best place
> to do this EC_MUST_BE_REDUNDANT check is probably inside
> make_pathkeys_for_uniquekeys, rather than build_uniquekeys.
> It's probably good to move it there.

That would be my suggestion as well.

> > Along the same lines I'm also curious about this part:
> >
> > -		ListCell   *k;
> > -		List	   *exprs = NIL;
> > -
> > -		foreach(k, ec->ec_members)
> > -		{
> > -			EquivalenceMember *mem = (EquivalenceMember *) lfirst(k);
> > -			exprs = lappend(exprs, mem->em_expr);
> > -		}
> > -
> > -		result = lappend(result, makeUniqueKey(exprs, false, false));
> > +		EquivalenceMember *mem = (EquivalenceMember*) lfirst(list_head(ec->ec_members));
> >
> > I'm curious about this myself, maybe someone can clarify. It looks like,
> > generally speaking, there could be more than one member (if not
> > ec_has_volatile), "representing knowledge that multiple items are
> > effectively equal". Is this information not interesting enough to
> > preserve it in unique keys?
>
> Yeah, that's a good question. Hence my question about the choice for Expr
> rather than EquivalenceClass for the UniqueKeys patch to Andy/David. When
> storing just Expr, it is rather difficult to check equivalence in joins,
> for example. Suppose, later on, we decide to add join support to the
> distinct skip scan. Consider a query like this:
>
> SELECT DISTINCT t1.a FROM t1 JOIN t2 ON t1.a=t2.a
>
> As far as my understanding goes (I didn't look into it in detail though),
> I think here the distinct_pathkey will have an EqClass {t1.a, t2.a}. That
> results in a UniqueKey with expr (t1.a) (because currently we only take
> the first Expr in the list to construct the UniqueKey). We could also
> construct *two* unique keys, one with Expr (t1.a) and one with Expr
> (t2.a), but I don't think that's the correct approach either, as it will
> explode when you have multiple pathkeys, each having multiple Exprs
> inside their EqClasses.

One UniqueKey can have multiple corresponding expressions, which also gives
us the possibility of having one unique key with (t1.a, t2.a), and then it
looks similar to an EquivalenceClass.

> > > - the distinct_pathkeys may be NULL, even though there's a possibility
> > >   for skipping. But it wouldn't create the uniquekeys in this case.
> > >   This makes the planner not choose skip scans even though it could.
> > >   For example in queries that do SELECT DISTINCT ON (a) * FROM t1
> > >   WHERE a=1 ORDER BY a,b; since a is constant, it's eliminated from
> > >   the regular pathkeys.
> >
> > What would be the point of skipping if it's a constant?
>
> For the query: SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b;
> there may be 1000s of records with a=1. We're only interested in the
> first one though. The traditional non-skip approach would still scan all
> records with a=1. Skip would just fetch the first one with a=1 and then
> skip to the next prefix and stop the scan.

The idea behind this query sounds questionable to me; more transparent would
be to do this without distinct, since skipping will actually do exactly the
same stuff, just under another name. But if allowing skipping on constants
does not bring significant changes in the code, it's probably fine.

> > > - to combat the issues mentioned earlier, there's now a check in
> > >   build_index_paths that checks if the query_pathkeys matches the
> > >   useful_pathkeys. Note that we have to use the path keys here rather
> > >   than any of the unique keys. The unique keys are only Expr nodes -
> > >   they do not contain the necessary information about ordering. Due to
> > >   elimination of some constant path keys, we have to search the
> > >   attributes of the index to find the correct prefix to use in
> > >   skipping.
> >
> > IIUC here you mean this function, right?
> >
> > +		prefix = find_index_prefix_for_pathkey(root,
> > +											   index,
> > +											   BackwardScanDirection,
> > +											   llast_node(PathKey,
> > +														  root->distinct_pathkeys));
> >
> > Doesn't it duplicate the job already done in build_index_pathkeys by
> > building those pathkeys again? If yes, probably it's possible to reuse
> > useful_pathkeys. Not sure about unordered indexes, but it looks like
> > query_pathkeys should also match in this case.
>
> Yeah, there's definitely some double work there, but the actual impact may
> be limited - it doesn't actually allocate a new path key, but looks it up
> in root->canon_pathkeys and returns that path key.
> I wrote it like this because I couldn't find a way to identify, from a
> certain PathKey, the actual location in the index of that column. The
> constructed path keys list filters out all redundant path keys. An index
> on (a,a,b,a,b) becomes path keys (a,b). Now if we skip on (a,b) we
> actually need to use prefix=3. But how to get from PathKey=b to that
> number 3, I didn't find a solid way except doing this. Maybe there is one
> though?

I don't think there is a direct way, but why not modify build_index_paths to
also provide this information, or compare the index_pathkeys expressions
with the indextlist without actually creating those pathkeys again?

And a couple of words about this thread [1]. It looks to me like a strange
way of interacting with the community. Are you going to duplicate everything
there, or what are your plans? At the very least you could try to include
everyone involved in the recipients list, not exclude some of the authors.
[1]: https://www.postgresql.org/message-id/flat/e4b623692a1447d4a13ac80fa271c8e6%40opammb0561.comp.optiver.com
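For intuition, the skip-to-the-next-prefix behaviour described above can be
modelled outside the B-tree. This is a toy Python sketch over a sorted list
of tuples, not the patch's code; the function name and data are invented,
and `bisect_right` with a numeric sentinel stands in for descending the
tree to the next distinct prefix:

```python
from bisect import bisect_right

def distinct_prefixes(sorted_rows, prefix_len=1):
    """Toy skip scan: instead of reading every row, jump past all rows
    sharing the current prefix, the way a B-tree skip would reposition
    at the next distinct key. Assumes numeric columns so float('inf')
    works as an upper-bound sentinel (a simplification)."""
    result = []
    i = 0
    while i < len(sorted_rows):
        prefix = sorted_rows[i][:prefix_len]
        result.append(prefix)
        # the "skip": position just past the last row with this prefix
        i = bisect_right(sorted_rows, prefix + (float('inf'),), i)
    return result

rows = [(1, 1), (1, 2), (2, 1), (2, 3), (3, 1)]
assert distinct_prefixes(rows) == [(1,), (2,), (3,)]
```

Each distinct prefix costs one O(log n) jump rather than a walk over all
its duplicates, which is why the win grows with the number of duplicates
per prefix.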
> One UniqueKey can have multiple corresponding expressions, which gives us
> also possibility of having one unique key with (t1.a, t2.a) and it looks
> now similar to EquivalenceClass.

I believe the current definition of a unique key with two expressions
(t1.a, t2.a) means that it's unique on the tuple (t1.a, t2.a) - this gives
weaker guarantees than uniqueness on (t1.a) and uniqueness on (t2.a).

> The idea behind this query sounds questionable to me, more transparent
> would be to do this without distinct, skipping will actually do exactly
> the same stuff just under another name. But if allowing skipping on
> constants do not bring significant changes in the code probably it's
> fine.

Yeah indeed, I didn't say it's a query that people should generally
write. :-) It's better written as a regular SELECT with LIMIT 1 of course.
However, it's more about being consistent and predictable to the user: if
a SELECT DISTINCT ON (a) * FROM t1 runs fast, then it doesn't make sense
to the user if a SELECT DISTINCT ON (a) * FROM t1 WHERE a=2 runs slow. And
supporting it also makes the implementation more consistent, with few code
changes.

> > Yeah, there's definitely some double work there, but the actual impact
> > may be limited - it doesn't actually allocate a new path key, but it
> > looks it up in root->canon_pathkeys and returns that path key.
> >
> > I wrote it like this, because I couldn't find a way to identify from a
> > certain PathKey the actual location in the index of that column. The
> > constructed path keys list filters out all redundant path keys. An
> > index on (a,a,b,a,b) becomes path keys (a,b). Now if we skip on (a,b)
> > we actually need to use prefix=3. But how to get from PathKey=b to
> > that number 3, I didn't find a solid way except doing this. Maybe
> > there is though?
>
> I don't think there is a direct way, but why not modify
> build_index_paths to also provide this information, or compare
> index_pathkeys expressions with indextlist without actually creating
> those pathkeys again?

I agree there could be other ways - I don't currently have a strong
preference for either. I can have a look at this later.

> And a couple of words about this thread [1]. It looks to me like a
> strange way of interacting with the community. Are you going to
> duplicate everything there, or what are your plans? At the very least
> you could try to include everyone involved in the recipients list, not
> exclude some of the authors.

When I wrote the first mail in the thread, I went to this thread [1] and
included everyone from there, but I see now that I only included the to:
and cc: people and forgot the original thread author, you. I'm sorry about
that - I should've looked better to make sure I had everyone.

In any case, my plan is to keep the patch at least applicable to master,
as I believe it can be helpful for discussions about both patches.

[1] https://www.postgresql.org/message-id/20200609102247.jdlatmfyeecg52fi%40localhost
> On Tue, Jul 21, 2020 at 04:35:55PM -0700, Peter Geoghegan wrote:
> > Well, it's obviously wrong, thanks for noticing. What is necessary is
> > to compare two index tuples, the start and the next one, to test if
> > they're the same (in which case if I'm not mistaken probably we can
> > compare item pointers). I've got this question when I was about to
> > post a new version with changes to address feedback from Andy, now
> > I'll combine them and send a cumulative patch.
>
> This sounds like approximately the same problem as the one that
> _bt_killitems() has to deal with as of Postgres 13. This is handled in a
> way that is admittedly pretty tricky, even though the code does not need
> to be 100% certain that it's "the same" tuple. Deduplication kind of
> makes that a fuzzy concept. In principle there could be one big index
> tuple instead of 5 tuples, even though the logical contents of the page
> have not been changed between the time we recorded heap TIDs locally and
> the time _bt_killitems() tried to match on those heap TIDs to kill some
> index tuples (via kill_prior_tuple) -- a concurrent deduplication pass
> could do that. Your code needs to be prepared for stuff like that.
>
> Ultimately posting list tuples are just a matter of understanding the
> on-disk representation -- a "Small Matter of Programming". Even without
> deduplication there are potential hazards from the physical deletion of
> LP_DEAD-marked tuples in _bt_vacuum_one_page() (which is not code that
> runs in VACUUM, despite the name). Make sure that you hold a buffer pin
> on the leaf page throughout, because you need to do that to make sure
> that VACUUM cannot concurrently recycle heap TIDs. If VACUUM *is* able
> to concurrently recycle heap TIDs then it'll be subtly broken.
> _bt_killitems() is safe because it either holds on to a pin or gives up
> when the LSN changes at all. (ISTM that your only choice is to hold on
> to a leaf page pin, since you cannot just decide to give up in the way
> that _bt_killitems() sometimes can.)

I see, thanks for the clarification. You're right, in this part of the
implementation there is no way to give up if the LSN changes like
_bt_killitems does. As far as I can see the leaf page is already pinned
all the time between reading relevant tuples and comparing them, so I only
need to handle posting list tuples.
On Mon, 13 Jul 2020 at 10:18, Floris Van Nee <florisvannee@optiver.com> wrote:
> One question about the unique keys - probably for Andy or David: I've
> looked in the archives to find arguments for/against using Expr nodes or
> EquivalenceClasses in the Unique Keys patch. However, I couldn't really
> find a clear answer about why the current patch uses Expr rather than
> EquivalenceClasses. At some point David mentioned "that probably Expr
> nodes were needed rather than EquivalenceClasses", but it's not really
> clear to me why. What were the thoughts behind this?

I'm still not quite sure on this either way. I did think
EquivalenceClasses were more suitable before I wrote the POC patch for
unique keys. But after that, I had in mind that Exprs might be better. The
reason I thought this was due to the fact that the DISTINCT clause list is
a bunch of Exprs, and if the UniqueKeys were EquivalenceClasses then
checking to see if the DISTINCT can be skipped turned into something more
complex that required looking through lists of ec_members rather than just
checking if the uniquekey exprs were a subset of the DISTINCT clause.

Thinking about it a bit harder, if we did use Exprs then it would mean a
case like the following wouldn't work for Andy's DISTINCT no-op stuff:

CREATE TABLE xy (x int primary key, y int not null);
SELECT DISTINCT y FROM xy WHERE x=y;

whereas if we use EquivalenceClasses then we'll find that we have an EC
with x,y in it and can skip the DISTINCT since we have a UniqueKey
containing that EquivalenceClass.

Also, looking at what Andy wrote to make a case like the following work in
his populate_baserel_uniquekeys() function in the 0002 patch:

CREATE TABLE ab (a int, b int, primary key(a,b));
SELECT DISTINCT a FROM ab WHERE b = 1;

it's a bit uninspiring. Really what we want here when checking if we can
skip doing the DISTINCT is a UniqueKey set using EquivalenceClasses, as we
can just insist that any unmatched UniqueKey items have ec_is_const ==
true. However, that means we have to loop through the ec_members of the
EquivalenceClasses in the uniquekeys during the DISTINCT check. That's
particularly bad when you consider that in a partitioned table case there
might be an ec_member for each child partition, and there could be 1000s
of child partitions, and following those ec_members chains is going to be
too slow.

My current thoughts are that we should be using EquivalenceClasses but we
should first add some infrastructure to make them perform better. My
current thoughts are that we do something like what I mentioned in [1] or
something more like what Andres mentions in [2]. After that, we could
either make EquivalenceClass.ec_members a hash table or binary search
tree. Or even perhaps just have a single hash table/BST for all
EquivalenceClasses that allows very fast lookups from {Expr} ->
{EquivalenceClass}. I think an Expr can only belong in a single non-merged
EquivalenceClass. So when we do merging of EquivalenceClasses we could
just repoint that data structure to point to the new EquivalenceClass.
We'd never point to ones that have ec_merged != NULL. This would also
allow us to fix the poor performance in regards to
get_eclass_for_sort_expr() for partitioned tables.

So, it seems the patch dependency chain for skip scans just got a bit longer :-(

David

[1] https://www.postgresql.org/message-id/CAApHDvrEXcadNYAAdq6RO0eKZUG6rRHXJGAbpzj8y432gCD9bA%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/20190920051857.2fhnvhvx4qdddviz%40alap3.anarazel.de#c3add3919c534591eae2179a6c82742c
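The single {Expr} -> {EquivalenceClass} lookup structure with repointing on
merge, as sketched above, can be modelled in a few lines. This is a toy
Python model for illustration only (class and method names are invented,
this is not PostgreSQL code):

```python
# Toy model of the idea above: one hash table mapping each expression to
# its current, non-merged EquivalenceClass. On merge, the absorbed
# class's members are repointed to the surviving class, so a lookup is
# O(1) instead of walking ec_members lists (which may have one entry per
# child partition).

class EquivalenceClass:
    def __init__(self, members):
        self.members = set(members)
        self.ec_merged = None      # set when this class is merged away

class ECIndex:
    def __init__(self):
        self.by_expr = {}          # {expr -> EquivalenceClass}

    def add(self, ec):
        for expr in ec.members:
            self.by_expr[expr] = ec

    def merge(self, keep, gone):
        """Merge 'gone' into 'keep', repointing all of gone's members;
        we never leave an entry pointing at a class with ec_merged set."""
        keep.members |= gone.members
        for expr in gone.members:
            self.by_expr[expr] = keep
        gone.ec_merged = keep
        return keep

    def lookup(self, expr):
        return self.by_expr.get(expr)

ecs = ECIndex()
a_class, b_class = EquivalenceClass({'t1.a'}), EquivalenceClass({'t2.a'})
ecs.add(a_class)
ecs.add(b_class)
merged = ecs.merge(a_class, b_class)   # e.g. after t1.a = t2.a join qual
assert ecs.lookup('t1.a') is merged and ecs.lookup('t2.a') is merged
```

The invariant the model captures is the one David states: an Expr belongs
to exactly one non-merged EquivalenceClass, so the table can always be
repointed at merge time without ambiguity.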
Hi David:
On Fri, Jul 31, 2020 at 11:07 AM David Rowley <dgrowleyml@gmail.com> wrote:
I admit that EquivalenceClasses have better expressive power. There are two
more cases we can handle better with EquivalenceClasses: SELECT DISTINCT a,
b, c FROM t WHERE a = b; Currently the UniqueKey is (a, b, c), but it had
better be (a, c) and (b, c). A similar case arises with GROUP BY.

After realizing this, I am still hesitant to do that, due to the
complexity. If we do that, we may have to maintain EquivalenceClasses in
one more place, or make the existing EquivalenceClasses list longer. For
example: SELECT pk FROM t; The current infrastructure doesn't create any
EquivalenceClass for pk, so we would have to create a new one in this case
and reuse some existing ones in other cases. Finally, since
EquivalenceClasses are not so transparent to the upper layers, we would
depend on the infrastructure change to look up an EquivalenceClass quickly
from an Expr.

I rethought the case you provided above; IIUC, there is such an issue for
joinrel, and then we can just add an EC check to
populate_baserel_uniquekeys. As for the DISTINCT/GROUP BY case, we should
build the UniqueKeys from root->distinct_pathkeys and root->group_pathkeys,
where the EquivalenceClasses are already there.

I am still not insisting on either Expr or EquivalenceClasses right now; if
we need to change to EquivalenceClasses, I'd check whether more places need
to be taken care of before doing that.
Best Regards
Andy Fan
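To make the Expr-vs-EquivalenceClass trade-off discussed above concrete,
here is a toy Python model (invented names, not planner code) of the two
flavours of the "can DISTINCT be skipped?" check: with plain Exprs it is a
simple subset test, while the EquivalenceClass flavour can additionally
treat unmatched-but-constant keys (ec_is_const) as satisfied, which is what
makes SELECT DISTINCT a FROM ab WHERE b = 1 provable:

```python
def distinct_is_noop(unique_keys, distinct_exprs):
    """Expr flavour: DISTINCT is a no-op iff some UniqueKey's
    expressions are all present in the DISTINCT clause."""
    return any(set(uk) <= set(distinct_exprs) for uk in unique_keys)

def distinct_is_noop_ec(unique_keys, distinct_exprs, const_exprs):
    """EquivalenceClass flavour: an unmatched UniqueKey item is also
    fine if its equivalence class is constant (ec_is_const), as a
    WHERE b = 1 qual makes b's class constant."""
    return any(all(e in distinct_exprs or e in const_exprs for e in uk)
               for uk in unique_keys)

# Table ab with PRIMARY KEY (a, b): the unique key is {a, b}.
assert distinct_is_noop([{'a', 'b'}], {'a', 'b', 'c'})
# SELECT DISTINCT a FROM ab; -- not provably unique on 'a' alone
assert not distinct_is_noop([{'a', 'b'}], {'a'})
# SELECT DISTINCT a FROM ab WHERE b = 1; -- provable via the const EC
assert distinct_is_noop_ec([{'a', 'b'}], {'a'}, {'b'})
```

The performance concern in the thread is precisely that the second flavour,
implemented over real ec_members lists rather than sets, degrades when each
class carries one member per child partition.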
> On Mon, Jul 27, 2020 at 12:24:31PM +0200, Dmitry Dolgov wrote:
>
> I see, thanks for clarification. You're right, in this part of
> implementation there is no way to give up if LSN changes like
> _bt_killitems does. As far as I can see the leaf page is already pinned
> all the time between reading relevant tuples and comparing them, I only
> need to handle posting list tuples.

Here is a new version that hopefully addresses most of the concerns
mentioned in this thread so far. As before, the first two patches are taken
from the UniqueKeys thread and attached only for reference. The list of
changes includes:

* fix for an index scan not being fully covered
* rebase on the latest UniqueKey patch
* taking into account posting list tuples (although I must say I couldn't
  produce a test that will hit this part, so I would appreciate if someone
  could take a look)
* fixes suggested by Floris, with adjustments as discussed in the thread

There are no changes related to EquivalenceClasses vs expressions, which
would probably be my next target. Having this in mind I must admit I'm not
super excited about the possibility of including another patch as a
dependency without clear prospects and plans for it.

Thanks for the feedback folks!
On Sat, Aug 15, 2020 at 7:09 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> Here is a new version that hopefully address most of the concerns
> mentioned in this thread so far. As before, first two patches are taken
> from UniqueKeys thread and attached only for the reference. List of
> changes includes:

Some thoughts on this version of the patch series (I'm focussing on
v36-0005-Btree-implementation-of-skipping.patch again):

* I see the following compiler warning:

/code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c:
In function ‘populate_baserel_uniquekeys’:
/code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c:797:13:
warning: ‘expr’ may be used uninitialized in this function
[-Wmaybe-uninitialized]
  797 |  else if (!list_member(unique_index->rel->reltarget->exprs, expr))
      |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* Perhaps the warning is related to this nearby code that I noticed
Valgrind complains about:

==1083468== VALGRINDERROR-BEGIN
==1083468== Invalid read of size 4
==1083468==    at 0x59568A: get_exprs_from_uniqueindex (uniquekeys.c:771)
==1083468==    by 0x593C5B: populate_baserel_uniquekeys (uniquekeys.c:140)
==1083468==    by 0x56AEA5: set_plain_rel_size (allpaths.c:586)
==1083468==    by 0x56AADB: set_rel_size (allpaths.c:412)
==1083468==    by 0x56A8CD: set_base_rel_sizes (allpaths.c:323)
==1083468==    by 0x56A5A7: make_one_rel (allpaths.c:185)
==1083468==    by 0x5AB426: query_planner (planmain.c:269)
==1083468==    by 0x5AF02C: grouping_planner (planner.c:2058)
==1083468==    by 0x5AD202: subquery_planner (planner.c:1015)
==1083468==    by 0x5ABABF: standard_planner (planner.c:405)
==1083468==    by 0x5AB7F8: planner (planner.c:275)
==1083468==    by 0x6E6F84: pg_plan_query (postgres.c:875)
==1083468==    by 0x6E70C4: pg_plan_queries (postgres.c:966)
==1083468==    by 0x6E7497: exec_simple_query (postgres.c:1158)
==1083468==    by 0x6EBCD3: PostgresMain (postgres.c:4309)
==1083468==    by 0x624284: BackendRun (postmaster.c:4541)
==1083468==    by 0x623995: BackendStartup (postmaster.c:4225)
==1083468==    by 0x61FB70: ServerLoop (postmaster.c:1742)
==1083468==    by 0x61F309: PostmasterMain (postmaster.c:1415)
==1083468==    by 0x514AF2: main (main.c:209)
==1083468==  Address 0x75f13e0 is 4,448 bytes inside a block of size 8,192 alloc'd
==1083468==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==1083468==    by 0x8C15C8: AllocSetAlloc (aset.c:919)
==1083468==    by 0x8CEA52: palloc (mcxt.c:964)
==1083468==    by 0x267F25: systable_beginscan (genam.c:373)
==1083468==    by 0x8682CE: SearchCatCacheMiss (catcache.c:1359)
==1083468==    by 0x868167: SearchCatCacheInternal (catcache.c:1299)
==1083468==    by 0x867E2C: SearchCatCache1 (catcache.c:1167)
==1083468==    by 0x8860B2: SearchSysCache1 (syscache.c:1123)
==1083468==    by 0x8BD482: check_enable_rls (rls.c:66)
==1083468==    by 0x68A113: get_row_security_policies (rowsecurity.c:134)
==1083468==    by 0x683C2C: fireRIRrules (rewriteHandler.c:2045)
==1083468==    by 0x687340: QueryRewrite (rewriteHandler.c:3962)
==1083468==    by 0x6E6EB1: pg_rewrite_query (postgres.c:784)
==1083468==    by 0x6E6D23: pg_analyze_and_rewrite (postgres.c:700)
==1083468==    by 0x6E7476: exec_simple_query (postgres.c:1155)
==1083468==    by 0x6EBCD3: PostgresMain (postgres.c:4309)
==1083468==    by 0x624284: BackendRun (postmaster.c:4541)
==1083468==    by 0x623995: BackendStartup (postmaster.c:4225)
==1083468==    by 0x61FB70: ServerLoop (postmaster.c:1742)
==1083468==    by 0x61F309: PostmasterMain (postmaster.c:1415)
==1083468==
==1083468== VALGRINDERROR-END

(You'll see the same error if you run Postgres under Valgrind + "make
installcheck", though I don't think that the queries in question are tests
that you yourself wrote.)

* IndexScanDescData.xs_itup comments could stand to be updated here --
IndexScanDescData.xs_want_itup is no longer just about index-only scans.

* Do we really need the AM-level boolean flag/argument named "scanstart"?
Why not just follow the example of btgettuple(), which determines whether
or not the scan has been initialized based on the current scan position?

Just because you set so->currPos.buf to InvalidBuffer doesn't mean you
cannot or should not take the same approach as btgettuple(). And even if
you can't take exactly the same approach, I would still think that the
scan's opaque B-Tree state should remember if it's the first call to
_bt_skip() (rather than some subsequent call) in some other way (e.g.
carrying a "scanstart" bool flag directly).

A part of my objection to "scanstart" is that it seems to require that much
of the code within _bt_skip() get another level of indentation... which
makes it even more difficult to follow.

* I don't understand what _bt_scankey_within_page() comments mean when they
refer to "the page highkey". It looks like this function examines the
highest data item on the page, not the high key. It is highly confusing to
refer to a tuple as the page high key if it isn't the tuple from the
P_HIKEY offset number on a non-rightmost page, which is a pivot tuple even
on the leaf level (as indicated by BTreeTupleIsPivot()).

* Why does _bt_scankey_within_page() have an unused "ScanDirection dir"
argument?

* Why is it okay to do anything important based on the
_bt_scankey_within_page() return value?

If the page is empty, then how can we know that it's okay to go to the next
value? I'm concerned that there could be subtle bugs in this area. VACUUM
will usually just delete the empty page. But it won't always do so, for a
variety of reasons that aren't worth going into now. This could mask bugs
in this area. I'm concerned about patterns like this one from _bt_skip():

    while (!nextFound)
    {
        ....

        if (_bt_scankey_within_page(scan, so->skipScanKey,
                                    so->currPos.buf, dir))
        {
            ...
        }
        else
            /*
             * If startItup could be not found within the current page,
             * assume we found something new
             */
            nextFound = true;
        ....
    }

Why would you assume that "we found something new" here? In general I just
don't understand the design of _bt_skip(). I get the basic idea of what
you're trying to do, but it could really use better comments.

* The "jump one more time if it's the same as at the beginning" thing seems
scary to me. Maybe you should be doing something with the actual high key
here.

* Tip: You might find cases involving "empty but not yet deleted" pages a
bit easier to test by temporarily disabling page deletion. You can modify
nbtree.c to look like this:

index a1ad22f785..db977a0300 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -1416,6 +1416,7 @@ backtrack:
 		Assert(!attempt_pagedel || nhtidslive == 0);
 	}
 
+	attempt_pagedel = false;
 	if (attempt_pagedel)
 	{
 		MemoryContext oldcontext;

That's all I have for now.
--
Peter Geoghegan
On Mon, Sep 21, 2020 at 5:59 PM Peter Geoghegan <pg@bowt.ie> wrote:
> That's all I have for now.

One more thing. I don't think that this should be a bitwise AND:

    if ((offnum > maxoff) & (so->currPos.nextPage == P_NONE))
    {
        ....
    }

--
Peter Geoghegan
> On Mon, Sep 21, 2020 at 05:59:32PM -0700, Peter Geoghegan wrote:
>
> * I see the following compiler warning:
>
> /code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c:
> In function ‘populate_baserel_uniquekeys’:
> /code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c:797:13:
> warning: ‘expr’ may be used uninitialized in this function
> [-Wmaybe-uninitialized]
>   797 |  else if (!list_member(unique_index->rel->reltarget->exprs, expr))
>       |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is mostly for the UniqueKeys patch, which is attached here only as a
dependency, but I'll prepare changes for that. Interestingly enough I
can't reproduce this warning, but if I understand correctly gcc has some
history of spurious uninitialized warnings, so I guess it could be version
dependent.

> * Perhaps the warning is related to this nearby code that I noticed
> Valgrind complains about:
>
> ==1083468== VALGRINDERROR-BEGIN
> ==1083468== Invalid read of size 4
> ==1083468==    at 0x59568A: get_exprs_from_uniqueindex (uniquekeys.c:771)
> ==1083468==    by 0x593C5B: populate_baserel_uniquekeys (uniquekeys.c:140)

This also belongs to the UniqueKeys patch, but at least I can reproduce
this one. My guess is that nkeycolumns should be used there, not ncolumns,
which is visible in the index_including tests. Same as for the previous
one, I will prepare the corresponding changes.

> * Do we really need the AM-level boolean flag/argument named "scanstart"?
> Why not just follow the example of btgettuple(), which determines whether
> or not the scan has been initialized based on the current scan position?
>
> Just because you set so->currPos.buf to InvalidBuffer doesn't mean you
> cannot or should not take the same approach as btgettuple(). And even if
> you can't take exactly the same approach, I would still think that the
> scan's opaque B-Tree state should remember if it's the first call to
> _bt_skip() (rather than some subsequent call) in some other way (e.g.
> carrying a "scanstart" bool flag directly).

Yes, agree, carrying this flag inside the opaque state would be better.

> * Why is it okay to do anything important based on the
> _bt_scankey_within_page() return value?
>
> If the page is empty, then how can we know that it's okay to go to the
> next value? I'm concerned that there could be subtle bugs in this area.
> VACUUM will usually just delete the empty page. But it won't always do
> so, for a variety of reasons that aren't worth going into now. This
> could mask bugs in this area. I'm concerned about patterns like this one
> from _bt_skip():
>
>     while (!nextFound)
>     {
>         ....
>
>         if (_bt_scankey_within_page(scan, so->skipScanKey,
>                                     so->currPos.buf, dir))
>         {
>             ...
>         }
>         else
>             /*
>              * If startItup could be not found within the current page,
>              * assume we found something new
>              */
>             nextFound = true;
>         ....
>     }
>
> Why would you assume that "we found something new" here? In general I
> just don't understand the design of _bt_skip(). I get the basic idea of
> what you're trying to do, but it could really use better comments.

Yeah, I'll put more effort into clear comments. There are two different
ways in which _bt_scankey_within_page is being used.

The first one is to check if it's possible to skip traversal of the tree
from the root in case what we're looking for could be on the current page.
In this case an empty page would mean we need to search from the root, so
I'm not sure what the issue could be here?

The second one (that you've highlighted above) I admit is probably the
most questionable part of the patch, and open for suggestions how to
improve it. It's required for one particular case with a cursor, when the
scan advances forward but reads backward. What could happen here is that
we found one valid item, but the next one e.g. does not pass the scan key
conditions, and we end up with the previous item again. I'm not entirely
sure how the presence of an empty page could change this scenario, could
you please show an example?

> * The "jump one more time if it's the same as at the beginning" thing
> seems scary to me. Maybe you should be doing something with the actual
> high key here.

Same as for the previous question, can you give a hint what you mean by
"doing something with the actual high key"?
> On Tue, Oct 06, 2020 at 05:20:39PM +0200, Dmitry Dolgov wrote:
> > On Mon, Sep 21, 2020 at 05:59:32PM -0700, Peter Geoghegan wrote:
> >
> > [...]
> >
> > * Do we really need the AM-level boolean flag/argument named
> > "scanstart"? Why not just follow the example of btgettuple(), which
> > determines whether or not the scan has been initialized based on the
> > current scan position?
> >
> > Just because you set so->currPos.buf to InvalidBuffer doesn't mean you
> > cannot or should not take the same approach as btgettuple(). And even
> > if you can't take exactly the same approach, I would still think that
> > the scan's opaque B-Tree state should remember if it's the first call
> > to _bt_skip() (rather than some subsequent call) in some other way
> > (e.g. carrying a "scanstart" bool flag directly).
>
> Yes, agree, carrying this flag inside the opaque state would be better.

Here is a new version which doesn't require the "scanstart" argument and
contains a few other changes to address the issues mentioned earlier. It's
also based on the latest UniqueKeys patches with the valgrind issue fixed
(as before they're attached just for reference, you can find more in the
original thread). I didn't rework the commentaries yet, will post that
soon (need to get an inspiration first, probably via reading Shakespeare
unless someone has better suggestions).

> > * Why is it okay to do anything important based on the
> > _bt_scankey_within_page() return value?
> >
> > If the page is empty, then how can we know that it's okay to go to the
> > next value? I'm concerned that there could be subtle bugs in this
> > area. VACUUM will usually just delete the empty page. But it won't
> > always do so, for a variety of reasons that aren't worth going into
> > now. This could mask bugs in this area.
> >
> > [...]
> >
> > Why would you assume that "we found something new" here? In general I
> > just don't understand the design of _bt_skip(). I get the basic idea
> > of what you're trying to do, but it could really use better comments.
>
> Yeah, I'll put more efforts into clear comments. There are two different
> ways in which _bt_scankey_within_page is being used.
>
> The first one is to check if it's possible to skip traversal of the tree
> from root in case if what we're looking for could be on the current
> page. In this case an empty page would mean we need to search from the
> root, so not sure what could be the issue here?
>
> The second one (that you've highlighted above) I admit is probably the
> most questionable part of the patch and open for suggestions how to
> improve it. It's required for one particular case with a cursor when
> scan advances forward but reads backward. What could happen here is we
> found one valid item, but the next one e.g. do not pass scan key
> conditions, and we end up with the previous item again. I'm not entirely
> sure how presence of an empty page could change this scenario, could you
> please show an example?
>
> > * The "jump one more time if it's the same as at the beginning" thing
> > seems scary to me. Maybe you should be doing something with the actual
> > high key here.
>
> Same as for the previous question, can you give a hint what do you mean
> by "doing something with the actual high key"?

The question is still there and I would really appreciate clarification
about what exact scenarios I need to look for with empty pages. I've tried
to perform testing with the "attempt_pagedel = false" suggestion, but
didn't find anything suspicious.
On 24/10/2020 19:45, Dmitry Dolgov wrote:
> Here is a new version which doesn't require "scanstart" argument and
> contains few other changes to address the issues mentioned earlier. It's
> also based on the latest UniqueKeys patches with the valgrind issue
> fixed (as before they're attached also just for the references, you can
> find more in the original thread). I didn't rework commentaries yet,
> will post it soon (need to get an inspiration first, probably via
> reading Shakespeare unless someone has better suggestions).

I had a quick look at this patch. I haven't been following this thread, so
sorry if I'm repeating old arguments, but here we go:

- I'm surprised you need a new index AM function (amskip) for this. Can't
you just restart the scan with index_rescan()? The btree AM can check if
the new keys are on the same page, and optimize the rescan accordingly,
like amskip does. That would speed up e.g. nested loop scans too, where
the keys just happen to be clustered.

- Does this optimization apply to bitmap index scans?

- This logic in build_index_paths() is not correct:

> +	/*
> +	 * Skip scan is not supported when there are qual conditions, which are not
> +	 * covered by index. The reason for that is that those conditions are
> +	 * evaluated later, already after skipping was applied.
> +	 *
> +	 * TODO: This implementation is too restrictive, and doesn't allow e.g.
> +	 * index expressions. For that we need to examine index_clauses too.
> +	 */
> +	if (root->parse->jointree != NULL)
> +	{
> +		ListCell *lc;
> +
> +		foreach(lc, (List *)root->parse->jointree->quals)
> +		{
> +			Node *expr, *qual = (Node *) lfirst(lc);
> +			Var *var;
> +			bool found = false;
> +
> +			if (!is_opclause(qual))
> +			{
> +				not_empty_qual = true;
> +				break;
> +			}
> +
> +			expr = get_leftop(qual);
> +
> +			if (!IsA(expr, Var))
> +			{
> +				not_empty_qual = true;
> +				break;
> +			}
> +
> +			var = (Var *) expr;
> +
> +			for (int i = 0; i < index->ncolumns; i++)
> +			{
> +				if (index->indexkeys[i] == var->varattno)
> +				{
> +					found = true;
> +					break;
> +				}
> +			}
> +
> +			if (!found)
> +			{
> +				not_empty_qual = true;
> +				break;
> +			}
> +		}
> +	}

If you care whether the qual is evaluated by the index AM or not, you need
to also check that the operator is indexable. Attached is a query that
demonstrates that problem.

I'm actually a bit confused why we need this condition. The IndexScan
executor node should call amskip() only after checking the additional
quals, no?

Also, you should probably check that the index quals are in the same
operator family as the one used for the DISTINCT.

- Heikki
> On Mon, Nov 30, 2020 at 04:42:20PM +0200, Heikki Linnakangas wrote: > > I had a quick look at this patch. I haven't been following this thread, so > sorry if I'm repeating old arguments, but here we go: Thanks! > - I'm surprised you need a new index AM function (amskip) for this. Can't > you just restart the scan with index_rescan()? The btree AM can check if the > new keys are on the same page, and optimize the rescan accordingly, like > amskip does. That would speed up e.g. nested loop scans too, where the keys > just happen to be clustered. An interesting point. At the moment I'm not sure whether it's possible to implement skipping via index_rescan or not, need to take a look. But checking if the new keys are on the same page would introduce some overhead I guess, wouldn't it be too invasive to add it into already existing btree AM? > - Does this optimization apply to bitmap index scans? No, from what I understand it doesn't. > - This logic in build_index_paths() is not correct: > > > + /* > > + * Skip scan is not supported when there are qual conditions, which are not > > + * covered by index. The reason for that is that those conditions are > > + * evaluated later, already after skipping was applied. > > + * > > + * TODO: This implementation is too restrictive, and doesn't allow e.g. > > + * index expressions. For that we need to examine index_clauses too. 
> > + */ > > + if (root->parse->jointree != NULL) > > + { > > + ListCell *lc; > > + > > + foreach(lc, (List *)root->parse->jointree->quals) > > + { > > + Node *expr, *qual = (Node *) lfirst(lc); > > + Var *var; > > + bool found = false; > > + > > + if (!is_opclause(qual)) > > + { > > + not_empty_qual = true; > > + break; > > + } > > + > > + expr = get_leftop(qual); > > + > > + if (!IsA(expr, Var)) > > + { > > + not_empty_qual = true; > > + break; > > + } > > + > > + var = (Var *) expr; > > + > > + for (int i = 0; i < index->ncolumns; i++) > > + { > > + if (index->indexkeys[i] == var->varattno) > > + { > > + found = true; > > + break; > > + } > > + } > > + > > + if (!found) > > + { > > + not_empty_qual = true; > > + break; > > + } > > + } > > + } > > If you care whether the qual is evaluated by the index AM or not, you need > to also check that the operator is indexable. Attached is a query that > demonstrates that problem. > ... > Also, you should probably check that the index quals are in the operator > family as that used for the DISTINCT. Yes, good point, will change this in the next version. > I'm actually a bit confused why we need this condition. The IndexScan > executor node should call amskip() only after checking the additional quals, > no? This part I don't quite get, what exactly you mean by checking the additional quals in the executor node? But at the end of the day this condition was implemented exactly to address the described issue, which was found later and added to the tests.
On 01/12/2020 22:21, Dmitry Dolgov wrote: >> On Mon, Nov 30, 2020 at 04:42:20PM +0200, Heikki Linnakangas wrote: >> >> I had a quick look at this patch. I haven't been following this thread, so >> sorry if I'm repeating old arguments, but here we go: > > Thanks! > >> - I'm surprised you need a new index AM function (amskip) for this. Can't >> you just restart the scan with index_rescan()? The btree AM can check if the >> new keys are on the same page, and optimize the rescan accordingly, like >> amskip does. That would speed up e.g. nested loop scans too, where the keys >> just happen to be clustered. > > An interesting point. At the moment I'm not sure whether it's possible > to implement skipping via index_rescan or not, need to take a look. But > checking if the new keys are on the same page would introduce some > overhead I guess, wouldn't it be too invasive to add it into already > existing btree AM? I think it'll be OK. But if it's not, you could add a hint argument to index_rescan() to hint the index AM that the new key is known to be greater than the previous key. >> - Does this optimization apply to bitmap index scans? > > No, from what I understand it doesn't. Would it be hard to add? Don't need to solve everything in the first version of this, but I think in principle you could do the same optimization for bitmap index scans, so if the current API can't do it, that's maybe an indication that the API isn't quite right. >> - This logic in build_index_paths() is not correct: >> >>> + /* >>> + * Skip scan is not supported when there are qual conditions, which are not >>> + * covered by index. The reason for that is that those conditions are >>> + * evaluated later, already after skipping was applied. >>> + * >>> + * TODO: This implementation is too restrictive, and doesn't allow e.g. >>> + * index expressions. For that we need to examine index_clauses too. 
>>> + */ >>> + if (root->parse->jointree != NULL) >>> + { >>> + ListCell *lc; >>> + >>> + foreach(lc, (List *)root->parse->jointree->quals) >>> + { >>> + Node *expr, *qual = (Node *) lfirst(lc); >>> + Var *var; >>> + bool found = false; >>> + >>> + if (!is_opclause(qual)) >>> + { >>> + not_empty_qual = true; >>> + break; >>> + } >>> + >>> + expr = get_leftop(qual); >>> + >>> + if (!IsA(expr, Var)) >>> + { >>> + not_empty_qual = true; >>> + break; >>> + } >>> + >>> + var = (Var *) expr; >>> + >>> + for (int i = 0; i < index->ncolumns; i++) >>> + { >>> + if (index->indexkeys[i] == var->varattno) >>> + { >>> + found = true; >>> + break; >>> + } >>> + } >>> + >>> + if (!found) >>> + { >>> + not_empty_qual = true; >>> + break; >>> + } >>> + } >>> + } >> >> If you care whether the qual is evaluated by the index AM or not, you need >> to also check that the operator is indexable. Attached is a query that >> demonstrates that problem. >> ... >> Also, you should probably check that the index quals are in the operator >> family as that used for the DISTINCT. > > Yes, good point, will change this in the next version. > >> I'm actually a bit confused why we need this condition. The IndexScan >> executor node should call amskip() only after checking the additional quals, >> no? > > This part I don't quite get, what exactly you mean by checking the > additional quals in the executor node? But at the end of the day this > condition was implemented exactly to address the described issue, which > was found later and added to the tests. As I understand this, the executor logic goes like this: query: SELECT DISTINCT ON (a, b) a, b FROM foo where c like '%y%' and a like 'a%' and b = 'b'; 1. Call index_beginscan, keys: a >= 'a', b = 'b' 2. Call index_getnext, which returns first row to the Index Scan node 3. Evaluates the qual "c like '%y%'" on the tuple. If it's false, goto step 2 to get next tuple. 4. Return tuple to parent node 5. index_amskip(), to the next tuple with a > 'a'. 
Goto 2. The logic should work fine, even if there are quals that are not indexable, like "c like '%y'" in the above example. So why doesn't it work? What am I missing? - Heikki
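The five steps above can be sketched as a toy simulation, with a sorted in-memory list standing in for the index and bisect standing in for the descent from the root (everything here is illustrative Python, not PostgreSQL code):

```python
import bisect

# Toy "index": rows sorted on (a, b, c), as a covering btree would store them.
rows = sorted([
    ("a1", "b", "xx"), ("a1", "b", "xy"), ("a2", "b", "ay"),
    ("a2", "b", "zz"), ("a3", "b", "yy"), ("a3", "c", "yy"),
])

def distinct_on_skip(rows, c_pred):
    """SELECT DISTINCT ON (a, b) a, b WHERE b = 'b' AND c_pred(c), by skipping."""
    result = []
    i = bisect.bisect_left(rows, ("a", "", ""))           # 1. begin scan: a >= 'a'
    while i < len(rows):
        a, b, c = rows[i]                                 # 2. get next tuple
        if b == "b" and c_pred(c):                        # 3. check the quals
            result.append((a, b))                         # 4. return tuple to parent
            i = bisect.bisect_right(rows, (a, "\uffff", "\uffff"))  # 5. skip: a > current
        else:
            i += 1                                        # qual failed: plain "goto 2"
    return result
```

Note how the first row of the a1 group fails the c qual and the loop still reports (a1, b): a failed qual advances one tuple ("goto 2") rather than skipping the whole prefix, which is why the ordering of skipping versus qual evaluation matters.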
> On Tue, Dec 01, 2020 at 10:59:22PM +0200, Heikki Linnakangas wrote: > > > > - Does this optimization apply to bitmap index scans? > > > > No, from what I understand it doesn't. > > Would it be hard to add? Don't need to solve everything in the first > version of this, but I think in principle you could do the same > optimization for bitmap index scans, so if the current API can't do it, > that's maybe an indication that the API isn't quite right. I would expect it should not be hard, as at the moment all parts seem relatively generic. But of course I need to check; it seems no one had bitmap index scans in mind while developing this patch. > > > I'm actually a bit confused why we need this condition. The IndexScan > > > executor node should call amskip() only after checking the additional quals, > > > no? > > > > This part I don't quite get, what exactly you mean by checking the > > additional quals in the executor node? But at the end of the day this > > condition was implemented exactly to address the described issue, which > > was found later and added to the tests. > > As I understand this, the executor logic goes like this: > > query: SELECT DISTINCT ON (a, b) a, b FROM foo where c like '%y%' and a > like 'a%' and b = 'b'; > > 1. Call index_beginscan, keys: a >= 'a', b = 'b' > > 2. Call index_getnext, which returns first row to the Index Scan node > > 3. Evaluates the qual "c like '%y%'" on the tuple. If it's false, goto step > 2 to get next tuple. > > 4. Return tuple to parent node > > 5. index_amskip(), to the next tuple with a > 'a'. Goto 2. > > The logic should work fine, even if there are quals that are not indexable, > like "c like '%y'" in the above example. So why doesn't it work? What am I > missing? To remind myself how it works, I went through this sequence, and from what I understand the qual "c like '%y%'" is evaluated in this case in ExecQual, not after index_getnext_tid (and rows returned after skipping are reported as filtered out).
So by the time index_skip is reached, only the quals on a & b have been evaluated. Or did you mean something else? Another small detail is that in the current implementation there is no "goto 2" in the last step. Originally it was like that, but since skipping returns the exact position we need, there was something like "find a value, then do one step back so that index_getnext will find it". Unfortunately this stepping-back part turned out to be a source of trouble, and getting rid of it even made the code somewhat more concise. But of course I'm open to suggestions for improvements.
On Wed, Dec 2, 2020 at 9:59 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > On 01/12/2020 22:21, Dmitry Dolgov wrote: > >> On Mon, Nov 30, 2020 at 04:42:20PM +0200, Heikki Linnakangas wrote: > >> - I'm surprised you need a new index AM function (amskip) for this. Can't > >> you just restart the scan with index_rescan()? The btree AM can check if the > >> new keys are on the same page, and optimize the rescan accordingly, like > >> amskip does. That would speed up e.g. nested loop scans too, where the keys > >> just happen to be clustered. > > > > An interesting point. At the moment I'm not sure whether it's possible > > to implement skipping via index_rescan or not, need to take a look. But > > checking if the new keys are on the same page would introduce some > > overhead I guess, wouldn't it be too invasive to add it into already > > existing btree AM? > > I think it'll be OK. But if it's not, you could add a hint argument to > index_rescan() to hint the index AM that the new key is known to be > greater than the previous key. FWIW here's what I wrote about that years ago[1]: > It works by adding a new index operation 'skip' which the executor > code can use during a scan to advance to the next value (for some > prefix of the index's columns). That may be a terrible idea and > totally unnecessary... but let me explain my > reasoning: > > 1. Perhaps some index implementations can do something better than a > search for the next key value from the root. Is it possible or > desirable to use the current position as a starting point for a btree > traversal? I don't know. > > 2. It seemed that I'd need to create a new search ScanKey to use the > 'rescan' interface for skipping to the next value, but I already had > an insertion ScanKey so I wanted a way to just reuse that. But maybe > there is some other way to reuse existing index interfaces, or maybe > there is an easy way to make a new search ScanKey from the existing >insertion ScanKey? 
[1] https://www.postgresql.org/message-id/CADLWmXWALK8NPZqdnRQiPnrzAnic7NxYKynrkzO_vxYr8enWww%40mail.gmail.com
Hi Dmitry, On Sun, Oct 25, 2020 at 1:45 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > On Tue, Oct 06, 2020 at 05:20:39PM +0200, Dmitry Dolgov wrote: > > > On Mon, Sep 21, 2020 at 05:59:32PM -0700, Peter Geoghegan wrote: > > > > > > * I see the following compiler warning: > > > > > > /code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c: > > > In function ‘populate_baserel_uniquekeys’: > > > /code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c:797:13: > > > warning: ‘expr’ may be used uninitialized in this function > > > [-Wmaybe-uninitialized] > > > 797 | else if (!list_member(unique_index->rel->reltarget->exprs, expr)) > > > | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > > This is mostly for UniqueKeys patch, which is attached here only as a > > dependency, but I'll prepare changes for that. Interesting enough I > > can't reproduce this warning, but if I understand correctly gcc has some > > history of spurious uninitialized warnings, so I guess it could be > > version dependent. > > > > > * Perhaps the warning is related to this nearby code that I noticed > > > Valgrind complains about: > > > > > > ==1083468== VALGRINDERROR-BEGIN > > > ==1083468== Invalid read of size 4 > > > ==1083468== at 0x59568A: get_exprs_from_uniqueindex (uniquekeys.c:771) > > > ==1083468== by 0x593C5B: populate_baserel_uniquekeys (uniquekeys.c:140) > > > > This also belongs to UniqueKeys patch, but at least I can reproduce this > > one. My guess is that nkeycolums should be used there, not ncolums, > > which is visible in index_incuding tests. The same as previous one, will > > prepare corresponding changes. > > > > > * Do we really need the AM-level boolean flag/argument named > > > "scanstart"? Why not just follow the example of btgettuple(), which > > > determines whether or not the scan has been initialized based on the > > > current scan position? 
> > > > > > Just because you set so->currPos.buf to InvalidBuffer doesn't mean you > > > cannot or should not take the same approach as btgettuple(). And even > > > if you can't take exactly the same approach, I would still think that > > > the scan's opaque B-Tree state should remember if it's the first call > > > to _bt_skip() (rather than some subsequent call) in some other way > > > (e.g. carrying a "scanstart" bool flag directly). > > > > Yes, agree, carrying this flag inside the opaque state would be better. > > Here is a new version which doesn't require "scanstart" argument and > contains few other changes to address the issues mentioned earlier. It's > also based on the latest UniqueKeys patches with the valgrind issue > fixed (as before they're attached also just for the references, you can > find more in the original thread). I didn't rework commentaries yet, > will post it soon (need to get an inspiration first, probably via > reading Shakespeare unless someone has better suggestions). > > > > * Why is it okay to do anything important based on the > > > _bt_scankey_within_page() return value? > > > > > > If the page is empty, then how can we know that it's okay to go to the > > > next value? I'm concerned that there could be subtle bugs in this > > > area. VACUUM will usually just delete the empty page. But it won't > > > always do so, for a variety of reasons that aren't worth going into > > > now. This could mask bugs in this area. I'm concerned about patterns > > > like this one from _bt_skip(): > > > > > > while (!nextFound) > > > { > > > .... > > > > > > if (_bt_scankey_within_page(scan, so->skipScanKey, > > > so->currPos.buf, dir)) > > > { > > > ... > > > } > > > else > > > /* > > > * If startItup could be not found within the current page, > > > * assume we found something new > > > */ > > > nextFound = true; > > > .... > > > } > > > > > > Why would you assume that "we found something new" here? 
In general I > > > just don't understand the design of _bt_skip(). I get the basic idea > > > of what you're trying to do, but it could really use better comments. > > > > Yeah, I'll put more efforts into clear comments. There are two different > > ways in which _bt_scankey_within_page is being used. > > > > The first one is to check if it's possible to skip traversal of the tree > > from root in case if what we're looking for could be on the current > > page. In this case an empty page would mean we need to search from the > > root, so not sure what could be the issue here? > > > > The second one (that you've highlighted above) I admit is probably the > > most questionable part of the patch and open for suggestions how to > > improve it. It's required for one particular case with a cursor when > > scan advances forward but reads backward. What could happen here is we > > found one valid item, but the next one e.g. do not pass scan key > > conditions, and we end up with the previous item again. I'm not entirely > > sure how presence of an empty page could change this scenario, could you > > please show an example? > > > > > *The "jump one more time if it's the same as at the beginning" thing > > > seems scary to me. Maybe you should be doing something with the actual > > > high key here. > > > > Same as for the previous question, can you give a hint what do you mean > > by "doing something with the actual high key"? > > The question is still there and I would really appreciate clarification > about what exactly scenarios I need to look for with empty pages. I've > tried to perform testing with "attempt_pagedel = false" suggestion, but > didn't find anything suspicious. Status update for a commitfest entry. This patch entry has been "Waiting on Author" on CF app and the discussion seems inactive from the last CF. Could you share the current status of this patch? Heikki already sent review comments and there was a discussion but the WoA status is correct? 
If it needs reviews, please rebase the patches and set it to "Needs Reviews" on CF app. If you're not working on this, I'm going to set it to "Returned with Feedback", barring objections. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
> On Thu, Jan 28, 2021 at 09:49:26PM +0900, Masahiko Sawada wrote: > Hi Dmitry, > > Status update for a commitfest entry. > > This patch entry has been "Waiting on Author" on CF app and the > discussion seems inactive from the last CF. Could you share the > current status of this patch? Heikki already sent review comments and > there was a discussion but the WoA status is correct? If it needs > reviews, please rebase the patches and set it to "Needs Reviews" on CF > app. If you're not working on this, I'm going to set it to "Returned > with Feedback", barring objections. Yes, I'm still on it. In fact, I sketched out a couple of changes to address Heikki's feedback almost immediately, but was distracted by the subscripting work. I will try to send a new version of the patch soon.
> On Wed, Mar 17, 2021 at 03:28:00AM +0100, Tomas Vondra wrote: > Hi, > > I took a look at the new patch series, focusing mostly on the uniquekeys > part. It'd be a bit tedious to explain all the review comments here, so > attached is a patch series with a "review" patch for some of the parts. Great, thanks. > Most of it is fairly small (corrections to comments etc.), I'll go over > the more serious part so that we can discuss it here. I'll keep it split > per parts of the original patch series. > I suggest looking for XXX and FIXME comments in all the review patches. > > > 0001 > ---- > > .... > > 0002 > ---- > In fact both 0001 & 0002 belong to another thread, which these days span [1], [2]. I've included them only because they happened to be a dependency for index skip scan following David suggestions, sorry if it's confusing. At the same time the author behind 0001 & 0002 is present in this thread as well, maybe Andy can answer these comments right here and better than me. > 0003 > ---- > > Just some comments/whitespace. > > > 0004 > ---- > > I wonder why we don't include this in explain TEXT format? Seems it > might make it harder to write regression tests for this? It's easier to > just check that we deduced the right unique key(s) than having to > construct an example where it actually changes the plan. Yeah, good point. I believe originally it was like that to not make explain too verbose for skip scans, but displaying prefix definitely could be helpful for testing, so will do this (and address other comments as well). [1]: https://www.postgresql.org/message-id/flat/CAKU4AWpQjAqJwQ2X-aR9g3+ZHRzU1k8hNP7A+_mLuOv-n5aVKA@mail.gmail.com [2]: https://www.postgresql.org/message-id/flat/CAKU4AWrU35c9g3cE15JmVwh6B2Hzf4hf7cZUkRsiktv7AKR3Ag@mail.gmail.com
On 3/17/21 6:02 PM, Dmitry Dolgov wrote: >> On Wed, Mar 17, 2021 at 03:28:00AM +0100, Tomas Vondra wrote: >> Hi, >> >> I took a look at the new patch series, focusing mostly on the uniquekeys >> part. It'd be a bit tedious to explain all the review comments here, so >> attached is a patch series with a "review" patch for some of the parts. > > Great, thanks. > >> Most of it is fairly small (corrections to comments etc.), I'll go over >> the more serious part so that we can discuss it here. I'll keep it split >> per parts of the original patch series. >> I suggest looking for XXX and FIXME comments in all the review patches. >> >> >> 0001 >> ---- >> >> .... >> >> 0002 >> ---- >> > > In fact both 0001 & 0002 belong to another thread, which these days > span [1], [2]. I've included them only because they happened to be a > dependency for index skip scan following David suggestions, sorry if > it's confusing. > > At the same time the author behind 0001 & 0002 is present in this thread > as well, maybe Andy can answer these comments right here and better than me. > Ah, sorry for the confusion. In that case the review comments probably belong to the other threads, so we should move the discussion there. It's not clear to me which of the threads is the right one. >> 0003 >> ---- >> >> Just some comments/whitespace. >> >> >> 0004 >> ---- >> >> I wonder why we don't include this in explain TEXT format? Seems it >> might make it harder to write regression tests for this? It's easier to >> just check that we deduced the right unique key(s) than having to >> construct an example where it actually changes the plan. > > Yeah, good point. I believe originally it was like that to not make > explain too verbose for skip scans, but displaying prefix definitely > could be helpful for testing, so will do this (and address other > comments as well). > Cool. Thanks. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, Here is another take on the patch with a couple of changes: * I've removed for now UniqueKeys parts. The interaction of skip scan & unique keys patch was actually not that big, so the main difference is that now the structure itself went away, a list of unique expressions is used instead. All the suggestions about how this feature should look like from the planning perspective are still there. On the one hand it will allow to develop both patches independently and avoid confusion for reviewers, on the other UniqueKeys could be easily incorporated back when needed. * Support for skipping in case of moving backward on demand (scroll cursor) is moved into a separate patch. This is implemented via returning false from IndexSupportsBackwardScan in case if it's a skip scan node, which in turn adds Materialize node on top when needed. The name SupportsBackwardScan was a bit confusing for me, but it seems it's only being used with a cursorOptions check for CURSOR_OPT_SCROLL. Eventually those cases when BackwardScanDirection is used are still handled by amskip. This change didn't affect the test coverage, all the test cases supported in previous patch versions are still there. About Materialize node, I guess it could be less performant than a "native" support, but it simplifies the implementation significantly to the point that most parts, which were causing questions before, are now located in the isolated patch. My idea here is to concentrate efforts on the first three patches in this series, and consider the rest of them as an experiment field. * IndexScan support was also relocated into a separate patch, the first three patches are now only about IndexOnlyScan. * Last bits of reviews were incorporated and rebased.
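For illustration, the Materialize-on-top approach can be modelled in a few lines: the skip scan underneath only ever produces rows forward, while the node above caches what has been produced so a scroll cursor can fetch backward from the cache (toy Python, not executor code; all names are made up):

```python
class Materialize:
    """Cache rows from a forward-only source so the cursor can scroll back."""

    def __init__(self, source):
        self.source = iter(source)  # forward-only producer (the "skip scan")
        self.rows = []              # materialized rows seen so far
        self.pos = -1               # index of the row the cursor is on

    def fetch_forward(self):
        if self.pos + 1 == len(self.rows):
            try:
                self.rows.append(next(self.source))  # pull one more row
            except StopIteration:
                return None                          # past the last row
        self.pos += 1
        return self.rows[self.pos]

    def fetch_backward(self):
        if self.pos <= 0:
            self.pos = -1
            return None                              # before the first row
        self.pos -= 1
        return self.rows[self.pos]
```

Here a backward fetch past the first row returns None and parks the cursor before it, so a subsequent forward fetch replays the cached rows in order.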
> On Fri, May 21, 2021 at 05:31:38PM +0200, Dmitry Dolgov wrote: > Hi, > > Here is another take on the patch with a couple of changes: > > * I've removed for now UniqueKeys parts. The interaction of skip scan & > unique keys patch was actually not that big, so the main difference is > that now the structure itself went away, a list of unique expressions > is used instead. All the suggestions about how this feature should > look like from the planning perspective are still there. On the one > hand it will allow to develop both patches independently and avoid > confusion for reviewers, on the other UniqueKeys could be easily > incorporated back when needed. > > * Support for skipping in case of moving backward on demand (scroll > cursor) is moved into a separate patch. This is implemented via > returning false from IndexSupportsBackwardScan in case if it's a skip > scan node, which in turn adds Materialize node on top when needed. The > name SupportsBackwardScan was a bit confusing for me, but it seems > it's only being used with a cursorOptions check for CURSOR_OPT_SCROLL. > Eventually those cases when BackwardScanDirection is used are still > handled by amskip. This change didn't affect the test coverage, all > the test cases supported in previous patch versions are still there. > > About Materialize node, I guess it could be less performant than a > "native" support, but it simplifies the implementation significantly > to the point that most parts, which were causing questions before, are > now located in the isolated patch. My idea here is to concentrate > efforts on the first three patches in this series, and consider the > rest of them as an experiment field. > > * IndexScan support was also relocated into a separate patch, the first > three patches are now only about IndexOnlyScan. > > * Last bits of reviews were incorporated and rebased. 
While the patch is still waiting for a review, I was motivated by the thread [1] to think about it from the interface point of view. Consider an index skip scan being just like a normal index scan with a set of underspecified leading search keys. It makes sense to have the same structure "begin scan" -> "get the next tuple" -> "end scan" (now I'm not sure if amskip is a good name to represent that, don't have anything better yet). But the "underspecified" part is currently indeed interpreted in a limited way -- as "missing" keys -- and is expressed only via the prefix size. Another option would be e.g. leading keys constrained by a range of values, so generally speaking it makes sense to extend the amount of information provided for skipping. As a naive approach I've added a new patch into the series, containing the extra data structure (ScanLooseKeys, doesn't have much meaning yet except somehow representing keys for skipping) used for index skip scan. Any thoughts about it? Besides that the new patch version contains some cleaning up and addresses comments around leaf page pinning from [1]. The idea behind the series structure is still the same: the first three patches contain the essence of the implementation (hoping to help concentrate review), the rest are more "experimental". [1]: https://www.postgresql.org/message-id/flat/CAH2-WzmUscvoxVkokHxP=uPTDjSi0tJkFpUPD-CeA35dvn-CMw@mail.gmail.com
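To make the "underspecified leading keys" idea concrete: with a plain prefix, the skip target is "first entry with a larger leading value", and a range constraint on the leading key simply brackets that same primitive. A toy sketch over a sorted list (illustrative Python only; the actual ScanLooseKeys structure is not modelled here):

```python
import bisect

def distinct_leading(rows, lo, hi):
    """Distinct values of the leading column within [lo, hi], by skipping.

    rows is a sorted list of (a, b) string tuples standing in for an index.
    Each iteration is one "descent": jump to the first entry whose leading
    key exceeds the current value, i.e. the next distinct prefix.
    """
    out = []
    i = bisect.bisect_left(rows, (lo,))          # position on the range start
    while i < len(rows) and rows[i][0] <= hi:    # stop at the range end
        a = rows[i][0]
        out.append(a)
        i = bisect.bisect_right(rows, (a, "\uffff"))  # skip past this prefix
    return out
```

Each loop iteration corresponds to one descent, so the cost grows with the number of distinct leading values in the range rather than with the number of index entries.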
On Sat, Jan 22, 2022 at 1:32 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> On Fri, May 21, 2021 at 05:31:38PM +0200, Dmitry Dolgov wrote:
> Hi,
>
> Here is another take on the patch with a couple of changes:
>
> * I've removed for now UniqueKeys parts. The interaction of skip scan &
> unique keys patch was actually not that big, so the main difference is
> that now the structure itself went away, a list of unique expressions
> is used instead. All the suggestions about how this feature should
> look like from the planning perspective are still there. On the one
> hand it will allow to develop both patches independently and avoid
> confusion for reviewers, on the other UniqueKeys could be easily
> incorporated back when needed.
>
> * Support for skipping in case of moving backward on demand (scroll
> cursor) is moved into a separate patch. This is implemented via
> returning false from IndexSupportsBackwardScan in case if it's a skip
> scan node, which in turn adds Materialize node on top when needed. The
> name SupportsBackwardScan was a bit confusing for me, but it seems
> it's only being used with a cursorOptions check for CURSOR_OPT_SCROLL.
> Eventually those cases when BackwardScanDirection is used are still
> handled by amskip. This change didn't affect the test coverage, all
> the test cases supported in previous patch versions are still there.
>
> About Materialize node, I guess it could be less performant than a
> "native" support, but it simplifies the implementation significantly
> to the point that most parts, which were causing questions before, are
> now located in the isolated patch. My idea here is to concentrate
> efforts on the first three patches in this series, and consider the
> rest of them as an experiment field.
>
> * IndexScan support was also relocated into a separate patch, the first
> three patches are now only about IndexOnlyScan.
>
> * Last bits of reviews were incorporated and rebased.
> While the patch is still waiting for a review, I was motivated by the
> thread [1] to think about it from the interface point of view. Consider
> an index skip scan being just like a normal index scan with a set of
> underspecified leading search keys. It makes sense to have the same
> structure "begin scan" -> "get the next tuple" -> "end scan" (now I'm
> not sure if amskip is a good name to represent that, don't have anything
> better yet). But the "underspecified" part is currently indeed
> interpreted in a limited way -- as "missing" keys -- and is expressed
> only via the prefix size. Another option would be e.g. leading keys
> constrained by a range of values, so generally speaking it makes sense
> to extend the amount of information provided for skipping.
> As a naive approach I've added a new patch into the series, containing
> the extra data structure (ScanLooseKeys, doesn't have much meaning yet
> except somehow representing keys for skipping) used for index skip scan.
> Any thoughts about it?
> Besides that the new patch version contains some cleaning up and
> addresses comments around leaf page pinning from [1]. The idea
> behind the series structure is still the same: the first three patches
> contain the essence of the implementation (hoping to help concentrate
> review), the rest are more "experimental".
> [1]: https://www.postgresql.org/message-id/flat/CAH2-WzmUscvoxVkokHxP=uPTDjSi0tJkFpUPD-CeA35dvn-CMw@mail.gmail.com
Hi,
+ /* If same EC already is already in the list, then not unique */
The word already is duplicated.
+ * make_pathkeys_for_uniquekeyclauses
The func name in the comment is different from the actual func name.
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
The year should be 2022 :-)
make_pathkeys_for_uniquekeys() is called by build_uniquekeys(). Should make_pathkeys_for_uniquekeys() be moved into uniquekeys.c ?
+query_has_uniquekeys_for(PlannerInfo *root, List *path_uniquekeys,
+ bool allow_multinulls)
It seems allow_multinulls is not checked in the func. Can the parameter be removed ?
+ Path *newpath;
+
+ newpath = (Path *) create_projection_path(root, rel, subpath,
+ scanjoin_target);
+
You can remove variable newpath and assign to lfirst(lc) directly.
+add_path(RelOptInfo *parent_rel, Path *new_path)
+add_unique_path(RelOptInfo *parent_rel, Path *new_path)
It seems the above two functions can be combined into one which takes parent_rel->pathlist / parent_rel->unique_pathlist as the third parameter.
Cheers
> On Sun, Jan 23, 2022 at 04:25:04PM -0800, Zhihong Yu wrote:
> On Sat, Jan 22, 2022 at 1:32 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > Besides that, the new patch version contains some cleanup and
> > addresses comments around leaf page pinning from [1]. The idea
> > behind the series structure is still the same: the first three patches
> > contain the essence of the implementation (hoping to help concentrate
> > the review), the rest are more "experimental".
> >
> > [1]:
> > https://www.postgresql.org/message-id/flat/CAH2-WzmUscvoxVkokHxP=uPTDjSi0tJkFpUPD-CeA35dvn-CMw@mail.gmail.com
>
> Hi,
>
> +       /* If same EC already is already in the list, then not unique */
>
> The word already is duplicated.
>
> + * make_pathkeys_for_uniquekeyclauses
>
> The func name in the comment is different from the actual func name.

Thanks for the review! Right, both of the above make sense. I'll wait a
bit in case there are more comments, and then post a new version with
all the changes at once.

> + * Portions Copyright (c) 2020, PostgreSQL Global Development Group
>
> The year should be 2022 :-)

Now you see how old this patch is :)

> make_pathkeys_for_uniquekeys() is called by build_uniquekeys(). Should
> make_pathkeys_for_uniquekeys() be moved into uniquekeys.c ?

It's actually placed there by analogy with make_pathkeys_for_sortclauses
(the immediately preceding function), so I think moving it into
uniquekeys.c would only cause more confusion.

> +query_has_uniquekeys_for(PlannerInfo *root, List *path_uniquekeys,
> +                         bool allow_multinulls)
>
> It seems allow_multinulls is not checked in the func. Can the parameter
> be removed ?

Right, it could be removed. I believe it was somewhat important when the
patch was tightly coupled with the UniqueKeys patch, where it was put to
use.

> +            Path       *newpath;
> +
> +            newpath = (Path *) create_projection_path(root, rel, subpath,
> +                                                      scanjoin_target);
>
> You can remove variable newpath and assign to lfirst(lc) directly.

Yes, but I've followed the same style for create_projection_path as in
many other invocations of this function in planner.c -- I would prefer
to keep it uniform.

> +add_path(RelOptInfo *parent_rel, Path *new_path)
> +add_unique_path(RelOptInfo *parent_rel, Path *new_path)
>
> It seems the above two functions can be combined into one which takes
> parent_rel->pathlist / parent_rel->unique_pathlist as the third
> parameter.

Sure, but here I've intentionally split it into separate functions,
otherwise a lot of unrelated call sites would have to be updated to
provide the third parameter.
On Mon, Jan 24, 2022 at 8:51 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> On Sun, Jan 23, 2022 at 04:25:04PM -0800, Zhihong Yu wrote:
> On Sat, Jan 22, 2022 at 1:32 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > Besides that the new patch version contains some cleaning up and
> > addresses commentaries around leaf page pinning from [1]. The idea
> > behind the series structure is still the same: the first three patches
> > contains the essence of the implementation (hoping to help concentrate
> > review), the rest are more "experimental".
> >
> > [1]:
> > https://www.postgresql.org/message-id/flat/CAH2-WzmUscvoxVkokHxP=uPTDjSi0tJkFpUPD-CeA35dvn-CMw@mail.gmail.com
>
> Hi,
>
> + /* If same EC already is already in the list, then not unique */
>
> The word already is duplicated.
>
> + * make_pathkeys_for_uniquekeyclauses
>
> The func name in the comment is different from the actual func name.
Thanks for the review! Right, both of the above make sense. I'll wait a
bit in case there are more comments, and then post a new version with
all the changes at once.
> + * Portions Copyright (c) 2020, PostgreSQL Global Development Group
>
> The year should be 2022 :-)
Now you see how old this patch is :)
> make_pathkeys_for_uniquekeys() is called by build_uniquekeys(). Should
> make_pathkeys_for_uniquekeys() be moved into uniquekeys.c ?
It's actually placed there by analogy with make_pathkeys_for_sortclauses
(the immediately preceding function), so I think moving it into
uniquekeys.c would only cause more confusion.
> +query_has_uniquekeys_for(PlannerInfo *root, List *path_uniquekeys,
> + bool allow_multinulls)
>
> It seems allow_multinulls is not checked in the func. Can the parameter be
> removed ?
Right, it could be removed. I believe it was somewhat important when the
patch was tightly coupled with the UniqueKeys patch, where it was put to
use.
Hi,
It would be nice to take this unused parameter out of this patch; it
should be added in the patch series where it is actually used.
Cheers
On 6/9/20 06:22, Dmitry Dolgov wrote:
> Here is a new version of index skip scan patch, based on v8 patch for
> UniqueKeys implementation from [1]. I want to start a new thread to
> simplify navigation, hopefully I didn't forget anyone who actively
> participated in the discussion.

This CommitFest entry has been closed with RwF at [1].

Thanks for all the feedback given!

[1] https://www.postgresql.org/message-id/ab8636e7-182f-886a-3a39-f3fc279ca45d%40redhat.com

Best regards,
Dmitry & Jesper