
From Michael Dean
Subject Re: More thoughts about planner's cost estimates
Msg-id 4480A204.3070407@sourceview.com
In response to Re: More thoughts about planner's cost estimates  (Greg Stark <gsstark@mit.edu>)
Responses Re: More thoughts about planner's cost estimates  (David Fetter <david@fetter.org>)
List pgsql-hackers
Greg Stark wrote:

>Josh Berkus <josh@agliodbs.com> writes:
>
>>>    Using a variety of synthetic and real-world data sets, we show that
>>>    distinct sampling gives estimates for distinct values queries that
>>>    are within 0%-10%, whereas previous methods were typically 50%-250% off,
>>>    across the spectrum of data sets and queries studied.
>>
>>Aha.  It's a question of the level of error permissible.  For our
>>estimates, being 100% off is actually OK.  That's why I was looking at 5%
>>block sampling; it stays within the range of +/- 50% n-distinct in 95% of
>>cases.
>
>Well, let's see. Say for example we're scanning 50,000 records out of 1M
>records. Using the theorem from Charikar et al. quoted as Theorem 1 in section
>2, we get that there is at least a 5% chance of getting a result with a ratio
>error of at least sqrt(((1M - 50k) / (2 * 50k)) * ln(20)), or 5.33.
>
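As a sanity check on that arithmetic, here is a minimal Python sketch of the
bound exactly as quoted (n = 1M rows, r = 50k sampled rows, delta = 0.05; the
100k in the denominator above is 2r):

    import math

    n = 1_000_000      # total rows
    r = 50_000         # rows in the 5% sample
    delta = 0.05       # allowed failure probability

    # Charikar et al., Theorem 1 (as quoted above): with probability at
    # least delta, the ratio error is at least
    # sqrt(((n - r) / (2 * r)) * ln(1 / delta)).
    bound = math.sqrt(((n - r) / (2 * r)) * math.log(1 / delta))
    print(f"ratio error >= {bound:.2f}")                       # ~5.33
    print(f"range: {100 / bound:.0f}%..{100 * bound:.0f}%")    # ~19%..533%
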
>So no, even assuming you have a good unbiased sample, a 5% sample is only
>going to get you a result within 19%-533% of the real value 95% of the
>time, not 50%-250%. And that's a negative result, so it's *at best* 19%-533%.
>On some distributions the estimate may fall outside that range far more than
>5% of the time.
>
>This is entirely unlike the histograms, where we have a solid foundation for a
>positive result. We can guarantee that the result will be outside +/- x% *at
>most* 5% of the time. For most distributions it'll be outside that range even
>less often.
>
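For contrast, one way to make that kind of positive guarantee concrete (the
Hoeffding bound and the numbers here are illustrative assumptions, not
anything from this thread): a histogram-style estimate of a bucket fraction p
from r randomly sampled rows lets you budget the failure probability in
advance:

    import math

    # A bucket's true fraction p is estimated by the sample fraction
    # p_hat.  Hoeffding's inequality gives a two-sided tail bound:
    #   P(|p_hat - p| > eps) <= 2 * exp(-2 * r * eps**2)
    r = 50_000    # sample size
    eps = 0.01    # absolute error of +/- 1 percentage point
    fail = 2 * math.exp(-2 * r * eps * eps)
    print(f"P(error > {eps}) <= {fail:.4f}")   # ~0.0001, far below 5%
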
>And a 5% sample is pretty big. In fact, my earlier tests showed the i/o for
>5% block sampling took just as long as reading all the blocks. Even if we
>figure out what's causing that (IMHO surprising) result and improve matters, I
>would only expect it to be 3-4x faster than a full scan.
>
>http://archives.postgresql.org/pgsql-hackers/2006-01/msg00285.php
>
I'm sorry to interrupt your esoteric (to me) discussion, but I have a
very simple question: how would you define a "good unbiased sample"? My
statistics professor Dan Price (rest his soul) would tell me there are
only random samples of some sort, and "other", which are always biased
and never good. Excuse my absolutes; I didn't mean Good or Evil. And how
do you calculate a level of error without randomization? And are you
talking about Type I or Type II error?
Michael
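
One toy illustration of why the "good unbiased sample" caveat matters (the
clustered layout below is hypothetical, not anything measured in this
thread): when equal values are stored physically together, a sample of whole
blocks sees far fewer distinct values than a simple random sample of the same
number of rows:

    import random

    # Hypothetical layout: 1M rows, values physically clustered so that
    # each 100-row block holds a single value (10,000 distinct values).
    random.seed(1)
    rows = [i // 100 for i in range(1_000_000)]

    # Simple random sample of 50,000 rows (unbiased).
    row_sample = random.sample(rows, 50_000)

    # Block sample: 500 whole blocks of 100 rows = 50,000 rows.
    blocks = [rows[b * 100:(b + 1) * 100] for b in range(10_000)]
    block_sample = [v for blk in random.sample(blocks, 500) for v in blk]

    print(len(set(row_sample)))    # close to 10,000 distinct values seen
    print(len(set(block_sample)))  # exactly 500 distinct values seen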

