From Tomas Vondra
Subject PoC/WIP: Extended statistics on expressions
Msg-id ad7891d2-e90c-b446-9fe2-7419143847d7@enterprisedb.com
Responses Re: PoC/WIP: Extended statistics on expressions  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-hackers
Hi,

Attached is a PoC/WIP patch adding support for extended statistics on
expressions. This is by no means "ready" - most of the stuff works, but
often in a rather hackish way. I certainly don't expect this to pass
regression tests, for example.

There's an example demonstrating how this works for two queries at the
end of this message. Now let's talk about the main parts of the patch:

1) extending grammar to allow expressions, not just plain columns

   Fairly straightforward, I think. I'm sure the logic deciding which
   expressions are allowed is not 100% correct (e.g. volatile functions
   etc.), but that's a detail we can deal with later.
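
   To illustrate, the extended grammar accepts the same syntax used in
   the example at the end of this message, while volatile expressions
   are the kind of detail I mean - those should presumably be rejected:

      CREATE STATISTICS s (ndistinct, mcv) ON mod(a,3), mod(a,7) FROM t;

      -- presumably this should fail (volatile expression):
      CREATE STATISTICS s_bad (ndistinct)
          ON mod(a,3), mod(a, (random()*10)::int) FROM t;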

2) store the expressions in pg_statistic_ext catalog

   I ended up adding a separate column, similar to indexprs, except that
   the order of columns/expressions does not matter, so we don't need to
   bother with storing 0s in stxkeys - we simply consider expressions to
   be "after" all the simple columns.
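
   The stored expressions can then be inspected much like index
   expressions, e.g. (assuming the new column is called stxexprs, by
   analogy with pg_index.indexprs):

      SELECT stxname, stxkeys,
             pg_get_expr(stxexprs, stxrelid) AS exprs
        FROM pg_statistic_ext;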

3) build statistics

   This works for all three types of statistics we have (mcv,
   dependencies and ndistinct), although the code changes are often
   very hackish "to make it work".

   The main challenge here was how to represent the expressions in the
   statistics - e.g. in ndistinct, which tracks ndistinct estimates for
   combinations of columns, and so far we used attnums for that. I
   decided the easiest way is to keep doing that, but offset the
   expressions by MaxHeapAttributeNumber. That seems to work, but maybe
   there's a better way.
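
   For illustration, with MaxHeapAttributeNumber = 1600, statistics on
   (a, mod(a,3)) would internally use attnums {1, 1601} (assuming a is
   the first column) - anything above 1600 means "expression". The
   serialized result should be visible in pg_statistic_ext_data:

      SELECT s.stxname, d.stxdndistinct
        FROM pg_statistic_ext s
        JOIN pg_statistic_ext_data d ON (d.stxoid = s.oid)
       WHERE s.stxname = 's1';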

4) apply the statistics

   This is the hard part, really, and the exact state of the support
   depends on type of statistics.

   For ndistinct coefficients, it generally works. I'm sure there may be
   bugs in estimate_num_groups, etc. but in principle it works.

   For MCV lists, it generally works too - you can define statistics on
   the expressions and the estimates should improve. The main downside
   here is that it requires at least two expressions, otherwise we can't
   build/apply the extended statistics. So for example

      SELECT * FROM t WHERE mod(a,100) = 10 AND mod(b,11) = 0

   may be estimated "correctly", but once you drop either of the
   conditions the estimate gets much worse, because we don't have stats
   for the individual expressions. That's rather annoying - it does not
   break the extended MCV, but the behavior will certainly cause
   confusion.

   For functional dependencies, the estimation does not work yet. Also,
   the missing per-column statistics have a bigger impact than for MCV
   lists, because while MCV can work fine without them, the dependencies
   rely heavily on the per-column estimates - we only apply "corrections"
   based on the dependency degree, so we still need (good) per-column
   estimates, which does not quite work with expressions.


   Of course, the lack of per-expression statistics may be somewhat
   fixed by adding indexes on expressions, but that's kinda expensive.
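
   That is, something like this gives ANALYZE a place to collect
   per-expression statistics, at the cost of building and maintaining
   the index:

      CREATE INDEX t_mod_idx ON t (mod(a,100));
      ANALYZE t;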


Now, a simple example demonstrating how this improves estimates - let's
create a table with 1M rows, and run queries with mod() expressions on
it. It could just as well be date_trunc() or something similar.


table with 1M rows
==================

test=# create table t (a int);
test=# insert into t select i from generate_series(1,1000000) s(i);
test=# analyze t;


poorly estimated queries
========================

test=# explain (analyze, timing off) select * from t where mod(a,3) = 0
and mod(a,7) = 0;
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..24425.00 rows=25 width=4) (actual rows=47619 loops=1)
   Filter: ((mod(a, 3) = 0) AND (mod(a, 7) = 0))
   Rows Removed by Filter: 952381
 Planning Time: 0.329 ms
 Execution Time: 156.675 ms
(5 rows)

test=# explain (analyze, timing off) select mod(a,3), mod(a,7) from t
group by 1, 2;
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 HashAggregate  (cost=75675.00..98487.50 rows=1000000 width=8) (actual rows=21 loops=1)
   Group Key: mod(a, 3), mod(a, 7)
   Planned Partitions: 32  Batches: 1  Memory Usage: 1561kB
   ->  Seq Scan on t  (cost=0.00..19425.00 rows=1000000 width=8) (actual rows=1000000 loops=1)
 Planning Time: 0.277 ms
 Execution Time: 502.803 ms
(6 rows)


improved estimates
==================

test=# create statistics s1 (ndistinct) on mod(a,3), mod(a,7) from t;
test=# analyze t;

test=# explain (analyze, timing off) select mod(a,3), mod(a,7) from t
group by 1, 2;
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 HashAggregate  (cost=24425.00..24425.31 rows=21 width=8) (actual rows=21 loops=1)
   Group Key: mod(a, 3), mod(a, 7)
   Batches: 1  Memory Usage: 24kB
   ->  Seq Scan on t  (cost=0.00..19425.00 rows=1000000 width=8) (actual rows=1000000 loops=1)
 Planning Time: 0.135 ms
 Execution Time: 500.092 ms
(6 rows)

test=# create statistics s2 (mcv) on mod(a,3), mod(a,7) from t;
test=# analyze t;

test=# explain (analyze, timing off) select * from t where mod(a,3) = 0
and mod(a,7) = 0;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..24425.00 rows=46433 width=4) (actual rows=47619 loops=1)
   Filter: ((mod(a, 3) = 0) AND (mod(a, 7) = 0))
   Rows Removed by Filter: 952381
 Planning Time: 0.702 ms
 Execution Time: 152.280 ms
(5 rows)


Clearly, estimates for both queries are significantly improved. Of
course, this example is kinda artificial/simplistic.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

