Re: Using multiple extended statistics for estimates - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Using multiple extended statistics for estimates
Date
Msg-id 20191106195440.wtxsyhiwdidl37hv@development
Whole thread Raw
In response to Using multiple extended statistics for estimates  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: Using multiple extended statistics for estimates  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
On Mon, Oct 28, 2019 at 04:20:48PM +0100, Tomas Vondra wrote:
>Hi,
>
>PostgreSQL 10 introduced extended statistics, allowing us to consider
>correlation between columns to improve estimates, and PostgreSQL 12
>added support for MCV statistics. But we still had the limitation that
>we only allowed using a single extended statistics per relation, i.e.
>given a table with two extended stats
>
>   CREATE TABLE t (a int, b int, c int, d int);
>   CREATE STATISTICS s1 (mcv) ON a, b FROM t;
>   CREATE STATISTICS s2 (mcv) ON c, d FROM t;
>
>and a query
>
>   SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1 AND d = 1;
>
>we only ever used one of the statistics (and we considered them in a not
>particularly well determined order).
>
>This patch addresses this by using as many extended stats as possible,
>by adding a loop to statext_mcv_clauselist_selectivity(). In each step
>we pick the "best" applicable statistics (in the sense of covering the
>most attributes) and factor it into the oveall estimate.
>
>All this happens where we'd originally consider applying a single MCV
>list, i.e. before even considering the functional dependencies, so
>roughly like this:
>
>   while ()
>   {
>       ... apply another MCV list ...
>   }
>
>   ... apply functional dependencies ...
>
>
>I've both in the loop, but I think that'd be wrong - the MCV list is
>expected to contain more information about individual values (compared
>to functional deps, which are column-level).
>

Here is a slightly polished v2 of the patch, the main difference being
that computing clause_attnums was moved to a separate function.

This is a fairly simple patch, and it's not entirely new functionality
(applying multiple statistics was part of the very first patch seris,
although of course in a very different form). So unless there are
objections, I'd like to get this committed sometime next week.

There's room for improvement, of course, for example when handling
overlapping statistics. Consider a table with columns (a,b,c) and two
extended statistics on (a,b) and (b,c), and query with one clause per
column

   SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1

In this case the patch does not help, because we apply (a,b) and then we
have just a single clause remaining. What we could do is still apply the
(b,c) statistic, using the already-estimated clause on b as a condition.
So essentially we'd compute

    P(a=1 && b=1) * P(c=1 | b=1)

But that'll require larger changes, and I see it as an evolution of the
current patch.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: define bool in pgtypeslib_extern.h
Next
From: Tomas Vondra
Date:
Subject: Re: Using multiple extended statistics for estimates