Re: Query optimizer 8.0.1 (and 8.0) - Mailing list pgsql-hackers

From Mark Kirkwood
Subject Re: Query optimizer 8.0.1 (and 8.0)
Msg-id 4207DC01.9060102@coretech.co.nz
In response to Re: Query optimizer 8.0.1 (and 8.0)  (pgsql@mohawksoft.com)
List pgsql-hackers
Maybe I am missing something - ISTM that you can increase your 
statistics target for those larger tables to obtain a larger (i.e. 
better) sample.
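
For concreteness, the knob is the per-column statistics target (a sketch;
"person" and "lastname" are placeholder names). In 8.0 the default target
is 10 and it can be raised per column:

    ALTER TABLE person ALTER COLUMN lastname SET STATISTICS 1000;
    ANALYZE person;

ANALYZE then samples roughly 300 * target rows for that column, so a
larger target directly buys a larger sample.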

regards

Mark

pgsql@mohawksoft.com wrote:
> Any and all random sampling assumes a degree of uniform distribution. This
> is the basis of the model. It assumes that chunks of the whole will be
> representative of the whole (to some degree). This works when normal
> variations are more or less distributed uniformly. As variations and
> trends become less uniformly distributed, more samples are required to
> characterize them.
> 
> Douglas Adams had a great device called the "Total Perspective Vortex"
> which inferred the whole of the universe from a piece of fairy cake. It was
> a subtle play on the absurd notion that a very small sample could lead to
> an understanding of an infinitely larger whole.
> 
> On a very basic level, why bother sampling the whole table at all? Why not
> check one block and infer all information from that? Because we know that
> isn't enough data. In a table of 4.6 million rows, can you say with any
> mathematical certainty that a sample of 100 points can be, in any way,
> representative?
> 
> Another problem with random sampling is trend analysis. Often there
> are minor trends in data. Ron pointed out the lastname/firstname trend.
> Although there seems to be no correlation between first names in the table,
> there are clearly groups or clusters of ordered data, an ordering
> that is missed by too small a sample.
> 
> I understand why you chose the Vitter algorithm: it provides a
> basically sound methodology for sampling without knowledge of the size of
> the whole. But I think we can do better. I would suggest using the current
> algorithm the first time through, then adjusting the number of samples [n]
> based on the previous estimate of the size of the table [N]. Each
> successive ANALYZE will become more accurate. The Vitter algorithm is
> still useful, as [N] will always be an estimate.
>
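For scale, a rough figure on the sampling argument above (standard
sampling theory, not from the thread): if the sample really were uniform
random, the standard error of an estimated fraction p from n sampled rows
is sqrt(p * (1 - p) / n), independent of table size. With n = 3000 and
p = 0.5 that is about 0.9 percentage points, whether the table has 4.6
million rows or 4.6 billion. The real complaint is that clustered,
ordered data violates the uniform-randomness assumption, and that is
when a fixed small n stops being enough.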

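For reference, here is a minimal sketch of the reservoir technique being
discussed (Vitter's Algorithm R), assuming rows arrive as a stream of
unknown length. It is illustrative only, not PostgreSQL's actual
acquire_sample_rows():

#include <stdlib.h>

/*
 * Keep a uniform random sample of n items from a stream whose total
 * length is not known in advance (Vitter's Algorithm R).
 */
void
reservoir_sample(const double *stream, long len, double *reservoir, int n)
{
    long i;

    for (i = 0; i < len; i++)
    {
        if (i < n)
            reservoir[i] = stream[i];       /* fill the reservoir first */
        else
        {
            long j = random() % (i + 1);    /* pick j uniformly in [0, i] */

            if (j < n)
                reservoir[j] = stream[i];   /* replace with prob. n/(i+1) */
        }
    }
}

The quoted proposal amounts to making n a function of the [N] estimated
by the previous ANALYZE, so the sample size tracks table growth while
Algorithm R still copes with the true N being unknown during the scan.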

