Re: multivariate statistics (v19) - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: multivariate statistics (v19)
Date
Msg-id 0924a7a9-a219-1366-d3bc-2ddd98bc9269@2ndquadrant.com
Whole thread Raw
In response to Re: multivariate statistics v14  (Tatsuo Ishii <ishii@postgresql.org>)
Responses Re: multivariate statistics (v19)
Re: multivariate statistics (v19)
Re: multivariate statistics (v19)
List pgsql-hackers
Hi,

Attached is v19 of the "multivariate stats" patch series - essentially
v18 rebased on top of current master. Aside from a few bug fixes, the
main improvement is addition of SGML docs demonstrating the statistics
in a way similar to the current "Row Estimation Examples" (and the docs
are actually in the same section). I've tried to keep the right amount
of technical detail (and pointing to the right README for additional
details), but this may need improvements. I have not written docs
explaining how statistics may be combined yet (more about this later).


There are two general design questions that I'd like to get feedback on:


1) enriching the query tree with multivariate statistics info

Right now all the stuff related to multivariate statistics estimation
happens in clausesel.c - matching condition to statistics, selection of
statistics to use (if there are multiple usable stats), etc. So pretty
much all this info is internal to clausesel.c and does not get outside.

I'm starting to think that some of the steps (matching quals to stats,
selection of stats) should happen in a "preprocess" step before the
actual estimation, storing the information (which stats to use, etc.) in
a new type of node in the query tree - something like RestrictInfo.

I believe this needs to happen sometime after deconstruct_jointree() as
that builds RestrictInfos nodes, and looking at planmain.c, right after
extract_restriction_or_clauses seems about right. Haven't tried, though.

This would move all the "statistics selection" logic from clausesel.c,
separating it from the "actual estimation" and simplifying the code.

But more importantly, I think we'll need to show some of the data in
EXPLAIN output. With per-column statistics it's fairly straightforward
to determine which statistics are used and how. But with multivariate
stats things are often more complicated - there may be multiple
candidate statistics (e.g. histograms covering different subsets of the
conditions), it's possible to apply them in different orders, etc.

But EXPLAIN can't show the info if it's ephemeral and available only
within clausesel.c (and thrown away after the estimation).


2) combining multiple statistics

I think the ability to combine multivariate statistics (covering
different subsets of conditions) is important and useful, but I'm
starting to think that the current implementation may not be the correct
one (which is why I haven't written the SGML docs about this part of the
patch series yet).

Assume there's a table "t" with 3 columns (a, b, c), and that we're
estimating query:

    SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3

but that we only have two statistics (a,b) and (b,c). The current patch
does about this:

    P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2)

i.e. it estimates the first two conditions using (a,b), and then
estimates (c=3) using (b,c) with "b=2" as a condition. Now, this is very
efficient, but it only works as long as the query contains conditions
"connecting" the two statistics. So if we remove the "b=2" condition
from the query, this stops working.

But it's possible to do this differently, e.g. by doing this:

    P(a=1) * P(c=3|a=1)

where P(c=3|a=1) is using (b,c), but uses (a,b) to restrict the set of
buckets (if the statistics is a histogram) to consider. In pseudo-code,
it might look like this:

    buckets = {}
    foreach bucket x in (b,c):
        foreach bucket y in (a,b):
           if y matches (a=1) and overlap(x,y):
               buckets := buckets + x

which is the part of (b,c) matching (a=1), allowing us to compute the
conditional probability.

It may get more complicated, of course. In particular, there may be
different types of statistics, and we need to be able to "match" them
against each other. With just MCV lists and histograms that's probably
easy enough, but if we add other types of statistics, it may get way
more complicated.

I still think this is a useful capability, but perhaps there are better
ideas how to do that. In any case, it only affects the last part of the
patch (0006).


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: Increasing timeout of poll_query_until for TAP tests
Next
From: Alfred Perlstein
Date:
Subject: Re: Why we lost Uber as a user