Re: default_statistics_target WAS: max_wal_senders must die - Mailing list pgsql-hackers

From Robert Haas
Subject Re: default_statistics_target WAS: max_wal_senders must die
Date
Msg-id AANLkTi=h5+ar4LBngmcZVcbRct9R7t9Qk7+rHBxpb4Y4@mail.gmail.com
Whole thread Raw
In response to Re: default_statistics_target WAS: max_wal_senders must die  (Greg Stark <gsstark@mit.edu>)
Responses Re: default_statistics_target WAS: max_wal_senders must die
List pgsql-hackers
On Wed, Oct 20, 2010 at 6:38 PM, Greg Stark <gsstark@mit.edu> wrote:
> On Wed, Oct 20, 2010 at 3:15 PM, Josh Berkus <josh@agliodbs.com> wrote:
>>>> Maybe what should be done about this is to have separate sizes for the
>>>> MCV list and the histogram, where the MCV list is automatically sized
>>>> during ANALYZE.
>>
>> It's been suggested multiple times that we should base our sample size
>> on a % of the table, or at least offer that as an option.
>
> Why? Afaict this has been suggested multiple times by people who don't
> justify it in any way except with handwavy -- larger samples are
> better. The sample size is picked based on what sample statistics
> tells us we need to achieve a given 95th percentile confidence
> interval for the bucket size given.
>
> Robert pointed out one reason we would want smaller buckets for larger
> tables but nobody has explained why we would want smaller confidence
> intervals for the same size buckets. That amounts to querying larger
> tables for the same percentage of the table but wanting more precise
> estimates than you want for smaller tables.

Yes, I think a percentage of the table is going to break down either
at the high end or the low end.  Hand-waving (but based on
experience), for a 1000 row table a statistics target of 10 is
probably approximately right and 100 is too much and 1 is too little.
But for a 1,000,000 row table 10,000 is probably too much and even
1,000 is pushing it.  So using a constant percentage of table rows
doesn't feel right.  I had a thought today that it might make sense to
use an exponential curve, like min(2 * N^(1/3), 10).  I can't really
justify that mathematically, but that doesn't mean it won't work well
in practice.

>> So, I think we could easily be quite a bit smarter than just increasing
>> the MCV.  Although that might be a nice start.
>
> I think increasing the MCV is too simplistic since we don't really
> have any basis for any particular value. I think what we need are some
> statistics nerds to come along and say here's this nice tool from
> which you can make the following predictions and understand how
> increasing or decreasing the data set size affects the accuracy of the
> predictions.

I'm not sure that's realistic, because everything depends on what
queries you're running, and you can get arbitrary answers by
postulating arbitrary queries.  However, this does not make me excited
about "doing nothing".

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


pgsql-hackers by date:

Previous
From: Merlin Moncure
Date:
Subject: Re: WIP: extensible enums
Next
From: Robert Haas
Date:
Subject: Re: WIP: extensible enums