Re: default_statistics_target WAS: max_wal_senders must die - Mailing list pgsql-hackers

From Josh Berkus
Subject Re: default_statistics_target WAS: max_wal_senders must die
Date
Msg-id 4CBF9169.4030408@agliodbs.com
In response to Re: default_statistics_target WAS: max_wal_senders must die  (Greg Stark <gsstark@mit.edu>)
Responses Re: default_statistics_target WAS: max_wal_senders must die
List pgsql-hackers
> Why? Afaict this has been suggested multiple times by people who don't
> justify it in any way except with handwavy -- larger samples are
> better. The sample size is picked based on what sample statistics
> tells us we need to achieve a given 95th percentile confidence
> interval for the bucket size given.

I also just realized that I confused myself ... we don't really want
more MCVs.  What we want is more *samples* from which to derive a small
number of MCVs.  Right now the number of samples and the number of MCVs
are inextricably bound, and they shouldn't be.  On larger tables, you're
correct that we don't necessarily want more MCVs; we just need more
samples to figure out those MCVs accurately.
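
For concreteness, the coupling I'm complaining about: std_typanalyze()
derives both knobs from the one statistics target.  A minimal sketch of
the arithmetic (Python, just illustrating the formulas as of current
code):

def analyze_knobs(stats_target: int):
    # Both ANALYZE knobs come from the same target in std_typanalyze():
    #   sample rows  = 300 * target  (the Chaudhuri et al. histogram bound)
    #   MCV-list cap = target
    # so you can't enlarge the sample without also enlarging the MCV list.
    return {"sample_rows": 300 * stats_target,
            "max_mcvs": stats_target}

print(analyze_knobs(100))   # default target: 30000 rows, <= 100 MCVs
print(analyze_knobs(1000))  # 10x the sample, but also 10x the MCV cap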

> Can you explain when this would and wouldn't bias the sample for the
> users so they can decide whether to use it or not?

Sure.  There's some good math in various ACM papers on this.  The
basic point is that block-based sampling has to be accompanied by an
increased sample size, or you are lowering your confidence level.  But
since block-based sampling lets you increase your sample size without
increasing I/O or RAM usage, you *can* take a larger sample ... a
*much* larger sample if you have small rows.
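
A toy simulation makes the tradeoff concrete (Python; the table layout
and numbers are invented for illustration).  On a physically clustered
column, a block sample of the same row count has much higher estimator
variance than a row sample, and you buy the accuracy back by sampling
more blocks:

import random, statistics

ROWS_PER_BLOCK = 100
N_BLOCKS = 1000
# Clustered column: equal values sit on neighbouring pages.
table = [i // 10 for i in range(N_BLOCKS) for _ in range(ROWS_PER_BLOCK)]

def row_sample_est(n_rows):
    s = random.sample(table, n_rows)
    return s.count(42) / len(s)

def block_sample_est(n_blocks):
    rows = []
    for b in random.sample(range(N_BLOCKS), n_blocks):
        rows.extend(table[b * ROWS_PER_BLOCK:(b + 1) * ROWS_PER_BLOCK])
    return rows.count(42) / len(rows)

def stdev(estimator, arg, trials=200):
    return statistics.pstdev(estimator(arg) for _ in range(trials))

print("true fraction:", table.count(42) / len(table))        # 0.01
print("row sample, 3000 rows:   ", stdev(row_sample_est, 3000))
print("block sample, 30 blocks: ", stdev(block_sample_est, 30))   # ~10x worse
print("block sample, 300 blocks:", stdev(block_sample_est, 300))  # recovers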

The algorithms for deriving stats from a block-based sample are a bit
more complex, because the code needs to determine the level of physical
correlation in the blocks sampled and skew the stats based on that.  So
there would be an increase in CPU time.  As a result, we'd probably give
some advice like "random sampling for small tables, block-based for
large ones".
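
One standard tool here is the survey-sampling "design effect":
intra-block correlation shrinks the effective sample size, and the
estimator has to widen its error bars (or grab more blocks) to
compensate.  This is just the textbook Kish formula, sketched below,
not a claim about what our code would actually implement:

def effective_sample_size(n_rows, rows_per_block, rho):
    # Kish design effect for cluster (block) sampling:
    #   deff = 1 + (m - 1) * rho
    # where m is rows per block and rho is the intraclass correlation
    # of the column's values within a block.
    deff = 1 + (rows_per_block - 1) * rho
    return n_rows / deff

print(effective_sample_size(3000, 100, 0.0))  # 3000.0: as good as row sampling
print(effective_sample_size(3000, 100, 0.9))  # ~33: heavily clustered column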

> I think increasing the MCV is too simplistic since we don't really
> have any basis for any particular value. I think what we need are some
> statistics nerds to come along and say here's this nice tool from
> which you can make the following predictions and understand how
> increasing or decreasing the data set size affects the accuracy of the
> predictions.

Agreed.

Nathan?

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

