Re: Index Skip Scan - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Index Skip Scan
Date
Msg-id 20191110211832.vv4jbscszz74jete@development
Whole thread Raw
In response to Re: Index Skip Scan  (Dmitry Dolgov <9erthalion6@gmail.com>)
Responses Re: Index Skip Scan
List pgsql-hackers
Hi,

I've looked at the patch again - in general it seems in pretty good
shape, all the issues I found are mostly minor.

Firstly, I'd like to point out that not all of the things I complained
about in my 2019/06/23 review got addressed. Those were mostly related
to formatting and code style, and the attached patch fixes some (but
maybe not all) of them.

The patch also tweaks wording of some bits in the docs and comments that
I found unclear. Would be nice if a native speaker could take a look.

A couple more comments:


1) pathkey_is_unique

The one additional issue I found is pathkey_is_unique - it's not really
explained what "unique" means and hot it's different from "redundant"
(which has quite a long explanation before pathkey_is_redundant).

My understanding is that pathkey is "unique" when it's EC does not match
an EC of another pathkey in the list. But if that's the case, then the
function name is wrong - it does exactly the opposite (i.e. it returns
'true' when the pathkey is *not* unique).


2) explain

I wonder if we should print the "Skip scan" info always, or similarly to
"Inner Unique" which does this:

/* try not to be too chatty about this in text mode */
     if (es->format != EXPLAIN_FORMAT_TEXT ||
         (es->verbose && ((Join *) plan)->inner_unique))
         ExplainPropertyBool("Inner Unique",
                             ((Join *) plan)->inner_unique,
                             es);
     break;

I'd do the same thing for skip scan - print it only in verbose mode, or
when using non-text output format.


3) There's an annoying limitation that for this to kick in, the order of
expressions in the DISTINCT clause has to match the index, i.e. with
index on (a,b,c) the skip scan will only kick in for queries using

    DISTINCT a
    DISTINCT a,b
    DISTINCT a,b,c

but not e.g. DISTINCT a,c,b. I don't think there's anything forcing us
to sort result of DISTINCT in any particular case, except that we don't
consider the other orderings "interesting" so we don't really consider
using the index (so no chance of using the skip scan).

That leads to pretty annoying speedups/slowdowns due to seemingly
irrelevant changes:

-- everything great, a,b,c matches an index
test=# explain (analyze, verbose) select distinct a,b,c from t;
                                                              QUERY PLAN
             
 

-------------------------------------------------------------------------------------------------------------------------------------
  Index Only Scan using t_a_b_c_idx on public.t  (cost=0.42..565.25 rows=1330 width=12) (actual time=0.016..10.387
rows=1331loops=1)
 
    Output: a, b, c
    Skip scan: true
    Heap Fetches: 1331
  Planning Time: 0.106 ms
  Execution Time: 10.843 ms
(6 rows)

-- slow, mismatch with index
test=# explain (analyze, verbose) select distinct a,c,b from t;
                                                         QUERY PLAN
   
 

---------------------------------------------------------------------------------------------------------------------------
  HashAggregate  (cost=22906.00..22919.30 rows=1330 width=12) (actual time=802.067..802.612 rows=1331 loops=1)
    Output: a, c, b
    Group Key: t.a, t.c, t.b
    ->  Seq Scan on public.t  (cost=0.00..15406.00 rows=1000000 width=12) (actual time=0.010..355.361 rows=1000000
loops=1)
          Output: a, b, c
  Planning Time: 0.076 ms
  Execution Time: 803.078 ms
(7 rows)

-- fast again, the extra ordering allows using the index again
test=# explain (analyze, verbose) select distinct a,c,b from t order by a,b,c;
                                                              QUERY PLAN
             
 

-------------------------------------------------------------------------------------------------------------------------------------
  Index Only Scan using t_a_b_c_idx on public.t  (cost=0.42..565.25 rows=1330 width=12) (actual time=0.035..12.120
rows=1331loops=1)
 
    Output: a, c, b
    Skip scan: true
    Heap Fetches: 1331
  Planning Time: 0.053 ms
  Execution Time: 12.632 ms
(6 rows)


This is a more generic issue, not specific to this patch, of course. I
think we saw it with the incremental sort patch, IIRC. I wonder how
difficult would it be to fix this here (not necessarily in v1).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: pg_upgrade fails to detect unsupported arrays and ranges
Next
From: Justin Pryzby
Date:
Subject: psql \d for wide tables / pattern for individual columns