pgsql: Improve ANALYZE's strategy for finding MCVs. - Mailing list pgsql-committers

From Dean Rasheed
Subject pgsql: Improve ANALYZE's strategy for finding MCVs.
Date
Msg-id E1eywjV-0005PP-Kr@gemulon.postgresql.org
Whole thread Raw
Responses Re: pgsql: Improve ANALYZE's strategy for finding MCVs.  (Simon Riggs <simon@2ndquadrant.com>)
List pgsql-committers
Improve ANALYZE's strategy for finding MCVs.

Previously, a value was included in the MCV list if its frequency was
25% larger than the estimated average frequency of all nonnull values
in the table.  For uniform distributions, that can lead to values
being included in the MCV list and significantly overestimated on the
basis of relatively few (sometimes just 2) instances being seen in the
sample.  For non-uniform distributions, it can lead to too few values
being included in the MCV list, since the overall average frequency
may be dominated by a small number of very common values, while the
remaining values may still have a large spread of frequencies, causing
both substantial overestimation and underestimation of the remaining
values.  Furthermore, increasing the statistics target may have little
effect because the overall average frequency will remain relatively
unchanged.

Instead, populate the MCV list with the largest set of common values
that are statistically significantly more common than the average
frequency of the remaining values.  This takes into account the
variance of the sample counts, which depends on the counts themselves
and on the proportion of the table that was sampled.  As a result, it
constrains the relative standard error of estimates based on the
frequencies of values in the list, reducing the chances of too many
values being included.  At the same time, it allows more values to be
included, since the MCVs need only be more common than the remaining
non-MCVs, rather than the overall average.  Thus it tends to produce
fewer MCVs than the previous code for uniform distributions, and more
for non-uniform distributions, reducing estimation errors in both
cases.  In addition, the algorithm responds better to increasing the
statistics target, allowing more values to be included in the MCV list
when more of the table is sampled.

Jeff Janes, substantially modified by me. Reviewed by John Naylor and
Tomas Vondra.

Discussion: https://postgr.es/m/CAMkU=1yvdGvW9TmiLAhz2erFnvnPFYHbOZuO+a=4DVkzpuQ2tw@mail.gmail.com

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/b5db1d93d2a6e2d3186f8798a5d06e07b7536a1d

Modified Files
--------------
src/backend/commands/analyze.c | 216 ++++++++++++++++++++++++++++++-----------
1 file changed, 158 insertions(+), 58 deletions(-)


pgsql-committers by date:

Previous
From: Andres Freund
Date:
Subject: pgsql: Add file containing extensions of the LLVM C API.
Next
From: Simon Riggs
Date:
Subject: Re: pgsql: Improve ANALYZE's strategy for finding MCVs.