Re: Query optimizer 8.0.1 (and 8.0) - Mailing list pgsql-hackers

From pgsql@mohawksoft.com
Subject Re: Query optimizer 8.0.1 (and 8.0)
Date
Msg-id 16805.24.91.171.78.1107800884.squirrel@mail.mohawksoft.com
In response to Re: Query optimizer 8.0.1 (and 8.0)  (Bruno Wolff III <bruno@wolff.to>)
Responses Re: Query optimizer 8.0.1 (and 8.0)  (Bruno Wolff III <bruno@wolff.to>)
List pgsql-hackers
> On Mon, Feb 07, 2005 at 11:27:59 -0500,
>   pgsql@mohawksoft.com wrote:
>>
>> It is inarguable that increasing the sample size increases the accuracy
>> of a study, especially when the diversity of the subject is unknown. It is
>> known that reducing a sample size increases the probability of error in
>> any poll or study. The required sample size depends on the variance of the
>> whole. It is mathematically unsound to ASSUME any sample size is valid
>> without understanding the standard deviation of the set.
>
> For large populations, the accuracy of estimates of statistics based on
> random samples from that population is not very sensitive to population
> size and depends primarily on the sample size. So you would not expect to
> need larger sample sizes on larger data sets, for data sets over some
> minimum size.

That assumes a fairly low standard deviation. If the standard deviation is
low, then a minimal sample size works fine. If there were zero deviation in
the data, then a sample of one would work fine.

If the standard deviation is high, then you need more samples. If you have
a high standard deviation and a large data set, you need more samples than
you would need for a smaller data set.
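To put rough numbers on this, here is a minimal sketch using the textbook sample-size formula for estimating a mean, with a finite population correction. Nothing here comes from analyze.c: the function name, the 95% z-value of 1.96, and the example deviations and margin are all assumptions chosen for illustration.

```python
import math


def required_sample_size(std_dev: float, margin: float,
                         population: int, z: float = 1.96) -> int:
    """Rows needed to estimate a mean to within +/- margin.

    Textbook formula, not PostgreSQL's: n0 = (z * sigma / E)^2, then the
    finite population correction n = n0 / (1 + n0 / N).
    z = 1.96 assumes a 95% confidence level (illustrative assumption).
    """
    if margin <= 0 or population <= 0 or std_dev < 0:
        raise ValueError("margin and population must be positive, std_dev >= 0")
    n0 = (z * std_dev / margin) ** 2          # size needed for an infinite population
    n = n0 / (1.0 + n0 / population)          # shrink for a finite table
    return max(1, math.ceil(n))               # zero deviation still needs one row


# Hypothetical deviations, same margin of error, two table sizes.
for sigma in (0.0, 5.0, 50.0):
    for rows in (10_000, 4_600_000):
        print(f"sigma={sigma:5.1f} rows={rows:>9}: "
              f"n={required_sample_size(sigma, 1.0, rows)}")
```

Under these assumptions the required size is driven mostly by the deviation, and the table size matters mainly when the deviation is high enough that the uncorrected size approaches the row count of the smaller table.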

In the current implementation of analyze.c, the default is 100 samples. On
a table of 10,000 rows, that is probably a good number to characterize the
data well enough for the query optimizer (a 1% sample). For a table with 4.6
million rows, that is only about 0.002%.
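The arithmetic is easy to check. A small sketch follows, taking the 100-sample figure above as given; the helper name is hypothetical and has no counterpart in analyze.c.

```python
def sample_fraction_percent(sample_rows: int, table_rows: int) -> float:
    """Return the share of the table covered by the sample, in percent."""
    if table_rows <= 0:
        raise ValueError("table_rows must be positive")
    # A sample cannot cover more rows than the table holds.
    return 100.0 * min(sample_rows, table_rows) / table_rows


# The two table sizes discussed above, with a fixed 100-row sample.
for rows in (10_000, 4_600_000):
    pct = sample_fraction_percent(100, rows)
    print(f"{rows:>9} rows: {pct:.4f}% sampled")
```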

Think about an irregularly occurring event, unevenly distributed throughout
the data set. A randomized sampling strategy normalized across the whole
data set with too few samples will mischaracterize the event or even miss
it altogether.
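As a rough illustration of the "miss it altogether" case, the sketch below computes the chance that a sample contains no row at all from a rare event. It assumes rows are drawn independently and uniformly at random, which is a simplification and not a description of what analyze.c does; the function name and the event fractions are made up for the example.

```python
def prob_sample_misses(event_fraction: float, sample_rows: int) -> float:
    """Probability that none of the sampled rows belongs to the event.

    Assumes independent, uniform row draws (a simplifying assumption):
    each draw misses with probability (1 - p), so all n miss with (1 - p)^n.
    """
    if not 0.0 <= event_fraction <= 1.0:
        raise ValueError("event_fraction must be between 0 and 1")
    if sample_rows < 0:
        raise ValueError("sample_rows must be non-negative")
    return (1.0 - event_fraction) ** sample_rows


# Hypothetical event frequencies, with the 100-row sample discussed above.
for fraction in (0.05, 0.01, 0.001):
    miss = prob_sample_misses(fraction, 100)
    print(f"event in {fraction:.1%} of rows: missed with probability {miss:.3f}")
```

Under that assumption, an event present in 0.1% of the rows is absent from a 100-row sample about nine times out of ten.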

