Re: multivariate statistics (v19) - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: multivariate statistics (v19)
Msg-id: 277d9678-7a35-a746-0eb5-41d4bcd4ef55@2ndquadrant.com
In response to: Re: multivariate statistics (v19) (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses: Re: multivariate statistics (v19) (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List: pgsql-hackers
Hi everyone,

thanks for the reviews. Let me sum up the feedback so far and outline my 
plans for the next patch version, which I'd like to submit for CF 2016-11.


1) syntax changes

I agree with the changes proposed by Dean, although only a subset of the 
syntax is going to be supported until we add support for either join or 
partial statistics. So something like this:
 CREATE STATISTICS name
   [ WITH (options) ]
   ON (column1, column2 [, ...])
   FROM table

That should not be a difficult change.
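
For illustration, creating a statistic on two correlated columns would 
then look something like this (the table/column names and the option 
name are made up here, and the exact set of options is still open):

 CREATE STATISTICS s1 WITH (dependencies)
   ON (zip_code, city) FROM addresses;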


2) catalog names

I'm not sure what the best names are, so I'm fine with whatever the 
consensus is.

That being said, I'm not sure I like extending the catalog to also 
support non-multivariate statistics (for example, statistics on 
expressions). While that would be a clearly useful feature, it seems 
like a slightly different use case and perhaps a separate catalog would 
be better. So maybe pg_statistic_ext is not the best name.


3) special data type(s) to store statistics

I agree using an opaque bytea value is not very nice. I see Heikki 
proposed using something like pg_node_tree, and maybe storing all the 
statistics in a single value.

I assume pg_node_tree was meant only as an inspiration for how to build 
a pseudo-type on top of a varlena value. I agree that's a good idea, and 
I plan to do something like that - say, adding pg_mcv, pg_histogram, 
pg_ndistinct and pg_dependencies data types.

Heikki also mentioned that maybe JSONB would be a good way to store the 
statistics. I don't think so - firstly, it only supports a subset of 
data types, so we'd be unable to store statistics for some data types 
(or we'd have to store them as text, which sucks). Also, there's a fair 
amount of smartness in how the statistics are stored (e.g. how the 
histogram bucket boundaries are deduplicated, or how the estimation uses 
the serialized representation directly). We'd lose all of that when 
using JSONB.

Similarly for storing all the statistics in a single value - I see no 
reason why keeping the statistics in separate columns would be a bad 
idea (after all, that's kinda the point of relational databases). Also, 
there are perfectly valid cases when the caller only needs a particular 
type of statistic - e.g. when estimating GROUP BY we'll only need the 
ndistinct coefficients. Why should we force the caller to fetch and 
detoast everything, and throw away probably 99% of that?

So my plan here is to define pseudo types similar to how pg_node_tree is 
defined. That does not seem like a tremendous amount of work.
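
To make the "separate columns" point concrete, the catalog might have 
one column of the appropriate pseudo-type per statistic type (the column 
names below are hypothetical, just to show the idea), and e.g. the GROUP 
BY estimation would then fetch and detoast only the single column it 
actually needs:

 -- hypothetical columns: standist pg_ndistinct, stadeps pg_dependencies,
 -- stamcv pg_mcv, stahist pg_histogram
 SELECT standist FROM pg_statistic_ext
  WHERE starelid = 'tbl'::regclass;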


4) functional dependencies

Several people mentioned they don't like how functional dependencies are 
detected at ANALYZE time, particularly that there's a sudden jump 
between 0 and 1. Instead, a continuous "dependency degree" between 0 and 
1 was proposed.
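
Just to make that concrete, one plausible definition of the degree for a 
dependency (a => b) is the fraction of rows that fall into groups where 
the value of "a" determines a single value of "b". Roughly the following 
query over the sample (a sketch only - not what ANALYZE would literally 
run, and "tbl", "a", "b" are made up):

 SELECT coalesce(sum(cnt) FILTER (WHERE nb = 1), 0)::float8
        / sum(cnt)::float8 AS degree
   FROM (SELECT a, count(DISTINCT b) AS nb, count(*) AS cnt
           FROM tbl GROUP BY a) s;

That yields 1.0 for a perfect dependency, and drops towards 0.0 as the 
dependency weakens.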

I'm fine with that, although it makes "clause reduction" (deciding that 
we don't need to estimate one of the clauses at all, as it's implied by 
another clause - e.g. with (zip code => city), a clause on the city is 
implied by a clause on the zip code) impossible. But that's fine - the 
functional dependencies will still be much less expensive than the other 
statistics.

I'm wondering how this will interact with transitivity, though. IIRC the 
current implementation is able to detect transitive dependencies, and 
uses that to reduce storage space etc.

In any case, this significantly complicates the functional dependencies, 
which were meant as a trivial type of statistics, mostly to establish 
the shared infrastructure. Which brings me to ndistinct.


5) ndistinct

So far, the ndistinct coefficients were lumped at the very end of the 
patch, and the statistic was only built but not used for any sort of 
estimation. I agree with Dean that perhaps it'd be better to move this 
to the very beginning, and use it as the simplest statistic to build the 
infrastructure instead of functional dependencies (all the more so given 
the changes to functional dependencies discussed in the preceding 
section).

I think it's probably a good idea and I plan to do that, so the patch 
series will probably look like this:
   * 001 - CREATE STATISTICS infrastructure with ndistinct coefficients
   * 002 - use ndistinct coefficients to improve GROUP BY estimates
           (see the example below)
   * 003 - use ndistinct coefficients in clausesel.c (not sure)
   * 004 - add functional dependencies (build + clausesel.c)
   * 005 - add multivariate MCV (build + clausesel.c)
   * 006 - add multivariate histograms (build + clausesel.c)
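
To illustrate the GROUP BY case (002): for a query like

 SELECT a, b, count(*) FROM tbl GROUP BY a, b;

the planner currently has to multiply the per-column ndistinct estimates 
for "a" and "b", which may overestimate the number of groups by a lot 
when the columns are correlated (think zip code and city). With the 
multivariate statistic we could use the ndistinct coefficient for (a, b) 
directly. (Again, "tbl", "a" and "b" are made up.)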
 

I'm not sure about using the ndistinct coefficients in clausesel.c to 
estimate regular conditions - that's the use case for which ndistinct 
coefficients were originally proposed by Kyotaro-san - but I seem to 
remember it was non-trivial to choose the best statistics when other 
types of stats were available. I'll look into that.


6) combining statistics

I've decided not to re-submit this part of the patch until the basic 
functionality gets in. I do think it's a very useful feature (despite 
having my doubts about the existing implementation), but it clearly 
distracts people.

Instead, the patch will use some simple selection strategy (e.g. using 
the single statistic covering the most conditions), or perhaps something 
a bit more advanced (e.g. preferring non-overlapping statistics). But 
nothing complicated.


7) enriching the query plan

Sadly, none of the reviews provides any sort of feedback on how to 
enrich the query plan with information about statistics (instead of 
doing that in clausesel.c in an ad-hoc, ephemeral manner).

So I'm still a bit stuck on this :-(


8) join statistics

Not directly related to the current patch, but I recommend reading this 
paper quantifying the impact of each part of the query optimizer 
(estimates, cost model, plan enumeration):
    http://www.vldb.org/pvldb/vol9/p204-leis.pdf

The one conclusion I take from it is that we really need to think about 
improving the join estimates somehow, because they're by far the most 
significant source of issues (and the hardest one to fix).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


