Re: PoC/WIP: Extended statistics on expressions - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: PoC/WIP: Extended statistics on expressions
Date
Msg-id 4654037b-8df8-77fb-2981-67be93aebdf1@enterprisedb.com
Whole thread Raw
In response to Re: PoC/WIP: Extended statistics on expressions  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: PoC/WIP: Extended statistics on expressions
List pgsql-hackers
Hi,

attached is a significantly improved version of the patch, allowing
defining extended statistics on expressions. This fixes most of the
problems in the previous WIP version and AFAICS it does pass all
regression tests (including under valgrind). There's a bunch of FIXMEs
and a couple loose ends, but overall I think it's ready for reviews.


Overall, the patch does two main things:

* it adds a new "expressions" statistics kind, building per-expression
statistics (i.e it's similar to having expression index)

* it allows using expressions in definition of extended statistics, and
properly handles that in all existing statistics kinds (dependencies,
mcv, ndistinct)


The expression handling mostly copies what we do for indexes, with
similar restrictions - no volatile functions, aggregates etc. The list
of expressions is stored in pg_statistic_ext catalog, but unlike for
indexes we don't need to worry about the exact order of elements, so
there are no "0" for expressions in stxkeys etc. We simply assume the
expressions come after simple columns, and that's it.

To reference expressions in the built statistics (e.g. in a dependency)
we use "special attnums" computed from the expression index by adding
MaxHeapAttributeNumber. So the first expression has attnum 1601 etc.

This mapping expressions to attnums is used both while building and
applying the statistics to clauses, as it makes the whole process much
simpler than dealing with attnums and expressions entirely separately.


The first part allows us to do something like this:

    CREATE TABLE t (a int);
    INSERT INTO t SELECT i FROM generate_series(1,1000000) s(i);
    ANALYZE t;

    EXPLAIN (ANALYZE, TIMING OFF)
    SELECT * FROM t WHERE mod(a,10) = 0;

    CREATE STATISTICS s (expressions) ON mod(a,10) FROM t;
    ANALYZE t;

    EXPLAIN (ANALYZE, TIMING OFF)
    SELECT * FROM t WHERE mod(a,10) = 0;

Without the statistics we get this:

                                 QUERY PLAN
    --------------------------------------------------------
     Seq Scan on t  (cost=0.00..19425.00 rows=5000 width=4)
                    (actual rows=100000 loops=1)
       Filter: (mod(a, 10) = 0)
       Rows Removed by Filter: 900000
     Planning Time: 0.216 ms
     Execution Time: 157.552 ms
    (5 rows)

while with the statistics we get this

                                 QUERY PLAN
    ----------------------------------------------------------
     Seq Scan on t  (cost=0.00..19425.00 rows=100900 width=4)
                    (actual rows=100000 loops=1)
       Filter: (mod(a, 10) = 0)
       Rows Removed by Filter: 900000
     Planning Time: 0.399 ms
     Execution Time: 157.530 ms
    (5 rows)

So that's pretty nice improvement. In practice you could get the same
effect by creating an expression index

    CREATE INDEX ON t (mod(a,10));

but of course that's far from free - there's cost to maintain the index,
it blocks HOT, and it takes space on disk. The statistics have none of
these issues.

Implementation-wise, this simply builds per-column statistics for each
expression, and stashes them into a new column in pg_statistic_ext_data
catalog as an array of elements with pg_statistic composite type. And
then in selfuncs.c we look not just at indexes, but also at this when
looking for expression stats.

So that gives us the per-expression stats. This is enabled by default
when you don't specify the statistics type and the definition includes
any expression that is not a simple column reference. Otherwise you may
also request it explicitly by using "expressions" in the CREATE.


Now, the second part is really just a natural extension of the existing
stats to also work with expressions. The easiest thing is probably to
show some examples, so consider this:

    CREATE TABLE t (a INT, b INT, c INT);
    INSERT INTO t SELECT i, i, i FROM generate_series(1,1000000) s(i);
    ANALYZE t;

which without any statistics gives us this:

    EXPLAIN (ANALYZE, TIMING OFF)
    SELECT 1 FROM t WHERE mod(a,10) = 0 AND mod(b,5) = 0;

                           QUERY PLAN
    ------------------------------------------------------
     Seq Scan on t  (cost=0.00..25406.00 rows=25 width=4)
                    (actual rows=100000 loops=1)
       Filter: ((mod(a, 10) = 0) AND (mod(b, 5) = 0))
       Rows Removed by Filter: 900000
     Planning Time: 0.080 ms
     Execution Time: 161.445 ms
    (5 rows)


    EXPLAIN (ANALYZE, TIMING OFF)
    SELECT 1 FROM t GROUP BY mod(a,10), mod(b,5);

                               QUERY PLAN
    ------------------------------------------------------------------
     HashAggregate  (cost=76656.00..99468.50 rows=1000000 width=12)
                    (actual rows=10 loops=1)
       Group Key: mod(a, 10), mod(b, 5)
       Planned Partitions: 32  Batches: 1  Memory Usage: 1561kB
       ->  Seq Scan on t  (cost=0.00..20406.00 rows=1000000 width=8)
                          (actual rows=1000000 loops=1)
     Planning Time: 0.232 ms
     Execution Time: 514.446 ms
    (6 rows)

and now let's add statistics on the expressions:

    CREATE STATISTICS s ON mod(a,10), mod(b,5) FROM t;
    ANALYZE t;

which ends up with these spot-on estimates:

    EXPLAIN (ANALYZE, TIMING OFF)
    SELECT 1 FROM t WHERE mod(a,10) = 0 AND mod(b,5) = 0;

                            QUERY PLAN
    ---------------------------------------------------------
     Seq Scan on t  (cost=0.00..25406.00 rows=97400 width=4)
                    (actual rows=100000 loops=1)
       Filter: ((mod(a, 10) = 0) AND (mod(b, 5) = 0))
       Rows Removed by Filter: 900000
     Planning Time: 0.366 ms
     Execution Time: 159.207 ms
    (5 rows)

    EXPLAIN (ANALYZE, TIMING OFF)
    SELECT 1 FROM t GROUP BY mod(a,10), mod(b,5);

                               QUERY PLAN
    -----------------------------------------------------------------
     HashAggregate  (cost=25406.00..25406.15 rows=10 width=12)
                    (actual rows=10 loops=1)
       Group Key: mod(a, 10), mod(b, 5)
       Batches: 1  Memory Usage: 24kB
       ->  Seq Scan on t  (cost=0.00..20406.00 rows=1000000 width=8)
                          (actual rows=1000000 loops=1)
     Planning Time: 0.299 ms
     Execution Time: 530.793 ms
    (6 rows)

Of course, this is a very simple query, but hopefully you get the idea.


There's about two main areas where I think might be hidden issues:

1) We're kinda faking the pg_statistic entries, and I suppose there
might be some loose ends (e.g. with respect to ACLs etc.).

2) Memory management while evaluating the expressions during analyze is
kinda simplistic, we're probably keeping the memory around longer than
needed etc.



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

pgsql-hackers by date:

Previous
From: Alexander Lakhin
Date:
Subject: Re: More time spending with "delete pending"
Next
From: Tom Lane
Date:
Subject: Re: [bug+patch] Inserting DEFAULT into generated columns from VALUES RTE