Re: Index Skip Scan - Mailing list pgsql-hackers

From Dmitry Dolgov
Subject Re: Index Skip Scan
Date
Msg-id 20200208152440.l2sllrla7utvgexr@localhost
Whole thread Raw
In response to Re: Index Skip Scan  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: Index Skip Scan
Re: Index Skip Scan
List pgsql-hackers
> On Sat, Feb 08, 2020 at 03:22:17PM +0100, Tomas Vondra wrote:
>
> I've done some testing and benchmarking of the v31 patch, looking for
> regressions, costing issues etc. Essentially, I've ran a bunch of SELECT
> DISTINCT queries on data sets of various size, number of distinct values
> etc. The results are fairly large, so I've uploaded them to github
>
>     https://github.com/tvondra/skip-scan-test

Thanks a lot for testing!

> There are a couple of regressions, where the plan with skipscan enables
> is ~10x slower. But this seems to happen only in high-cardinality cases
> where we misestimate the number of groups. Consider a table with two
> independent columns
>
>   CREATE TABLE t (a text, b text);
>   INSERT INTO t SELECT
>     md5((10000*random())::int::text),
>     md5((10000*random())::int::text)
>   FROM generate_series(1,1000000) s(i);
>
>   CREATE INDEX ON t(a,b);
>
>   ANALYZE;
>
> which then behaves like this:
>
>   test=# select * from (select distinct a,b from t) foo offset 10000000;
>   Time: 3138.222 ms (00:03.138)
>   test=# set enable_indexskipscan = off;
>   Time: 0.312 ms
>   test=# select * from (select distinct a,b from t) foo offset 10000000;
>   Time: 199.749 ms
>
> So in this case the skip scan is ~15x slower than the usual plan (index
> only scan + unique). The reason why this happens is pretty simple - to
> estimate the number of groups we multiply the ndistinct estimates for
> the two columns (which both have n_distinct = 10000), but then we cap
> the estimate to 10% of the table. But when the columns are independent
> with high cardinalities that under-estimates the actual value, making
> the cost for skip scan much lower than it should be.

The current implementation checks if we can find the next value on the
same page to do a shortcut instead of tree traversal and improve such
kind of situations, but I can easily imagine that it's still not enough
in some extreme situations.

> I don't think this is an issue the skipscan patch needs to fix, though.
> Firstly, the regressed cases are a tiny minority. Secondly, we already
> have a way to improve the root cause - creating extended stats with
> ndistinct coefficients generally makes the problem go away.

Yes, I agree.

> One interesting observation however is that this regression only
> happened with text columns but not with int or bigint. My assumption is
> that this is due to text comparisons being much more expensive. Not sure
> if there is something we could do to deal with this - reduce the number
> of comparisons or something?

Hm, interesting. I need to check that we do not do any unnecessary
comparisons.

> On Sat, Feb 08, 2020 at 02:11:59PM +0100, Tomas Vondra wrote:
> > Yes, I've mentioned that already in one of the previous emails :) The
> > simplest way I see to achieve what we want is to do something like in
> > attached modified version with a new hasDeclaredCursor field. It's not a
> > final version though, but posted just for discussion, so feel free to
> > suggest any improvements or alternatives.
>
> IMO the proper fix for this case (moving forward, reading backwards) is
> simply making it work by properly checking deleted tuples etc. Not sure
> why that would be so much complex (haven't tried implementing it)?

It's probably not that complex by itself, but requires changing
responsibilities isolation. At the moment current implementation leaves
jumping over a tree fully to _bt_skip, and heap visibility checks only
to IndexOnlyNext. To check deleted tuples properly we need to either
verify a corresponding heap tuple visibility inside _bt_skip (as I've
mentioned in one of the previous emails, checking if an index tuple is
dead is not enough), or teach the code in IndexOnlyNext to understand
that _bt_skip can lead to returning the same tuple while moving forward
& reading backward. Do you think it's still makes sense to go this way?



pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: Internal key management system
Next
From: Andres Freund
Date:
Subject: Re: Internal key management system