Thread: How to tell ANALYZE to collect statistics from the whole table?

How to tell ANALYZE to collect statistics from the whole table?

From
AlexK987
Date:
The documentation states that "The extent of analysis can be controlled by
adjusting the default_statistics_target configuration variable". It looks
like I can tell Postgres to create more histograms with more bins, and more
distinct values. This implicitly means that Postgres will use a larger
random subset to calculate statistics.

However, this is not what I want. My data may be quite skewed, and I want
full control over the size of the sample. I want to explicitly tell Postgres
to analyze the whole table. How can I accomplish that?



--
View this message in context:
http://postgresql.nabble.com/How-to-tell-ANALYZE-to-collect-statistics-from-the-whole-table-tp5835339.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.


Re: How to tell ANALYZE to collect statistics from the whole table?

From
Tomas Vondra
Date:
Hi,

On 25.1.2015 00:33, AlexK987 wrote:
> The documentation states that "The extent of analysis can be
> controlled by adjusting the default_statistics_target configuration
> variable". It looks like I can tell Postgres to create more
> histograms with more bins, and more distinct values. This implicitly
> means that Postgres will use a larger random subset to calculate
> statistics.
>
> However, this is not what I want. My data may be quite skewed, and I
> want full control over the size of the sample. I want to explicitly
> tell Postgres to analyze the whole table. How can I accomplish that?

I don't think there's an official way to do that - at least I can't
think of one. The only thing you can do is increasing statistics target
(either globally by setting default_statistics_target, or per column
using ALTER TABLE ... SET STATISTICS).

As you noticed, this however controls two things - sample size and how
detailed the statistics (MCV list / histogram) will be. The statistics
target is used as upper bound for number of MCV items / histogram bins,
and the number of sampled rows is (300 * statistics_target). With
default_statistics_target = 10000 (which si the max allowed value since
9.0), this produces very detailed stats and uses sample of ~3M rows.

It's a bit more complicated though, because there's an algorithm that
decides how many MCV items / histogram buckets to actually create, based
on the data. So you may not get more detailed stats, even when using
larger sample.

That being said, I really doubt increasing the statistics target above
10000 (or even sampling the whole table) will help you in practice.
Might be worth showing an example of a bad estimate with your data, or
maybe a test case to play with.

--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: How to tell ANALYZE to collect statistics from the whole table?

From
AlexK987
Date:
Tomas,

Thank you for a very useful reply. Right now I do not have a case of poor
performance caused by strong data skew which is not properly reflected in
statistics. I was being defensive, trying to prevent every possible thing
that might go wrong.



--
View this message in context:
http://postgresql.nabble.com/How-to-tell-ANALYZE-to-collect-statistics-from-the-whole-table-tp5835339p5835344.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.


Re: How to tell ANALYZE to collect statistics from the whole table?

From
Tomas Vondra
Date:
On 25.1.2015 02:04, AlexK987 wrote:
> Tomas,
>
> Thank you for a very useful reply. Right now I do not have a case of
> poor performance caused by strong data skew which is not properly
> reflected in statistics. I was being defensive, trying to prevent
> every possible thing that might go wrong.

OK. My recommendation is not to mess with default_statistics unless you
actually have to (e.g. increasing the value on all tables, withouth a
query where the current value causes trouble). It increases time to plan
the queries, collect statistics (ANALYZE / autovacuum) etc.

regards
--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: How to tell ANALYZE to collect statistics from the whole table?

From
Tom Lane
Date:
AlexK987 <alex.cue.987@gmail.com> writes:
> The documentation states that "The extent of analysis can be controlled by
> adjusting the default_statistics_target configuration variable". It looks
> like I can tell Postgres to create more histograms with more bins, and more
> distinct values. This implicitly means that Postgres will use a larger
> random subset to calculate statistics.

> However, this is not what I want. My data may be quite skewed, and I want
> full control over the size of the sample. I want to explicitly tell Postgres
> to analyze the whole table. How can I accomplish that?

You can't, and you wouldn't want to if you could, because that would
result in slurping the entire table into backend local memory.  All
the rows constituting the "random sample" are held in memory while
doing the statistical calculations.

In practice, the only stat that would be materially improved by taking
enormously large samples would be the number-of-distinct-values estimate.
There's already a way you can override ANALYZE's estimate of that number
if you need to.

            regards, tom lane


Re: How to tell ANALYZE to collect statistics from the whole table?

From
Jeff Janes
Date:
On Sat, Jan 24, 2015 at 9:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
AlexK987 <alex.cue.987@gmail.com> writes:
> The documentation states that "The extent of analysis can be controlled by
> adjusting the default_statistics_target configuration variable". It looks
> like I can tell Postgres to create more histograms with more bins, and more
> distinct values. This implicitly means that Postgres will use a larger
> random subset to calculate statistics.

> However, this is not what I want. My data may be quite skewed, and I want
> full control over the size of the sample. I want to explicitly tell Postgres
> to analyze the whole table. How can I accomplish that?

You can't, and you wouldn't want to if you could, because that would
result in slurping the entire table into backend local memory.  All
the rows constituting the "random sample" are held in memory while
doing the statistical calculations.

In practice, the only stat that would be materially improved by taking
enormously large samples would be the number-of-distinct-values estimate.
There's already a way you can override ANALYZE's estimate of that number
if you need to.

The accuracy of the list of most common values could also be improved a lot by increasing the sample. 

Cheers,

Jeff