Re: Query optimizer 8.0.1 (and 8.0) - Mailing list pgsql-hackers

From pgsql@mohawksoft.com
Subject Re: Query optimizer 8.0.1 (and 8.0)
Date
Msg-id 16623.24.91.171.78.1107793679.squirrel@mail.mohawksoft.com
In response to Re: Query optimizer 8.0.1 (and 8.0)  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Query optimizer 8.0.1 (and 8.0)  (Bruno Wolff III <bruno@wolff.to>)
List pgsql-hackers
> pgsql@mohawksoft.com writes:
>> On a very basic level, why bother sampling the whole table at all? Why
>> not check one block and infer all information from that? Because we know
>> that isn't enough data. In a table of 4.6 million rows, can you say with
>> any mathematical certainty that a sample of 100 points can be, in any
>> way, representative?
>
> This is a statistical argument, not a rhetorical one, and I'm not going
> to bother answering handwaving.  Show me some mathematical arguments for
> a specific sampling rule and I'll listen.
>

Tom, I am floored by this response; I am shaking my head in disbelief.

It is inarguable that increasing the sample size increases the accuracy of
a study, especially when the diversity of the subject is unknown. It is
equally well known that reducing the sample size increases the probability
of error in any poll or study. The required sample size depends on the
variance of the whole population. It is mathematically unsound to ASSUME
that any sample size is valid without understanding the standard deviation
of the set.

http://geographyfieldwork.com/MinimumSampleSize.htm
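
To make the dependence on variance concrete, here is a minimal sketch of the
textbook minimum-sample-size rule for estimating a mean, n = (z * sigma / E)^2.
It assumes a simple random sample from a large population; the function name
and the example numbers are illustrative and are not taken from PostgreSQL's
ANALYZE code.

```python
import math
from statistics import NormalDist


def min_sample_size(sigma, margin, confidence=0.95):
    """Smallest n for which the confidence interval on the mean has
    half-width <= margin, given population standard deviation sigma.

    Assumes simple random sampling from a large population (no finite
    population correction).
    """
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return math.ceil((z * sigma / margin) ** 2)


if __name__ == "__main__":
    # Same target margin, ten times the standard deviation:
    # the required sample grows roughly a hundredfold.
    print(min_sample_size(sigma=10, margin=1))
    print(min_sample_size(sigma=100, margin=1))
```

The point of the sketch is only that sigma appears squared: without some
estimate of the spread, no fixed sample size can be justified in advance.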

Again, I understand why you used the Vitter algorithm, but it has been
proven insufficient (as used) with the US Census TIGER database. We
understand this because we have seen that the random sampling, as
implemented, gathers too little information to properly characterize the
variance in the data.
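
As a rough illustration of how a small sample can miss structure in a large
table, the sketch below estimates the chance that a value is absent from the
sample altogether. The 4.6 million rows and the 100-point sample come from the
quoted text above; the 100 matching rows and the 3000-row sample are
hypothetical numbers chosen for the example. It treats draws as independent
(sampling with replacement), which is a close approximation when the sample is
tiny relative to the table.

```python
def prob_value_missed(table_rows, value_rows, sample_rows):
    """Approximate probability that a simple random sample of sample_rows
    rows contains no row carrying a value that occupies value_rows rows
    of a table with table_rows rows (independent-draw approximation)."""
    return (1.0 - value_rows / table_rows) ** sample_rows


if __name__ == "__main__":
    table_rows = 4_600_000   # table size from the quoted text
    value_rows = 100         # hypothetical: rows holding one particular value
    for sample_rows in (100, 3000):   # 3000 is a hypothetical larger sample
        p = prob_value_missed(table_rows, value_rows, sample_rows)
        print(sample_rows, round(p, 4))
```

Under these assumptions the value is missed far more often than it is seen,
at either sample size, so the sample alone says little about how many such
values exist.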


