Re: Proper usage of ndistinct vs. dependencies extended statistics - Mailing list pgsql-hackers

From David Rowley
Subject Re: Proper usage of ndistinct vs. dependencies extended statistics
Date
Msg-id CAKJS1f8159XZ6xJgxSQq98jRdVRF2JWFwiWhJys6OQJn3r_DcA@mail.gmail.com
Whole thread Raw
In response to Proper usage of ndistinct vs. dependencies extended statistics  (Paul Martinez <hellopfm@gmail.com>)
List pgsql-hackers
On Thu, 11 Apr 2019 at 11:53, Paul Martinez <hellopfm@gmail.com> wrote:
>
> I have some questions about the different types of extended statistics
> that were introduced in Postgres 10.
> - Which types of queries are each statistic type supposed to improve?

Multivariate ndistinct stats are aimed to improve distinct estimates
over groups of columns.  These can help in cases like GROUP BY a,b,
SELECT DISTINCT a,b, SELECT a,b FROM x UNION SELECT a,b FROM y;  They
also help in determining the number of times an index will be
rescanned in cases like nested loops with a parameterised inner path.

I see multivariate ndistinct estimates are not used for normal
selectivity estimates for unknown values.  e.g PREPARE q1 (int, int)
AS SELECT * FROM t1 WHERE a = $1 and b = $2; still assumes a and b are
independent even when ndistinct stats exist on the two columns.

There are a few other usages too. See calls of estimate_num_groups()

dependency stats just handle WHERE clauses (or more accurately,
clauses containing a reference to a single relation.  These only
handle equality OpExprs.  e.g "a = 10 and y = 3", not "a < 6 and y =
3".  Further stat types (most common values) added in PG12 aim to
allow inequality operators too.

> - When should one type of statistic be used over the other? Should they
>   both always be used?

If they both always should be always used then we'd likely not have
bothered making the types optional.   Both ndistinct and dependency
stats are fairly cheap to calculate and store, so it might not be too
big an issue adding both types if you're not sure. With these two
types there's not really any choice for the planner to decide to use
one or the other, it just makes use of the ones it can use for the
given situation.   That won't be the case as more stats types get
added. In PG12, for example, we had to choose of MCV stats should be
applied before dependencies stats.  That might be a no-brainer, but
perhaps the future there will be stats types where the order to apply
them is not so clear, although in those cases it might be questionable
why you'd want to define more than one type of stats on the same set
of columns.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: block-level incremental backup
Next
From: Michael Paquier
Date:
Subject: Re: Cleanup/remove/update references to OID column