Re: Overhauling GUCS - Mailing list pgsql-hackers

From: Gregory Stark
Subject: Re: Overhauling GUCS
Msg-id: 87tzg1mwhh.fsf@oxford.xeocode.com
In response to: Re: Overhauling GUCS (Josh Berkus <josh@agliodbs.com>)
Responses: Re: pg_statistics and sample size WAS: Overhauling GUCS (Josh Berkus <josh@agliodbs.com>)
List: pgsql-hackers
"Josh Berkus" <josh@agliodbs.com> writes:

> Greg,
>
>> Speak to the statisticians. Our sample size is calculated using the same
>> theory behind polls which sample 600 people to learn what 250 million
>> people are going to do on election day. You do NOT need (significantly)
>> larger samples for larger populations.
>
> Your analogy is bad.  For elections, the voters have only a few choices.
> In a 300 million row table, there could be 300 million different values,
> and the histogram becomes less accurate for every order of magnitude it
> falls short of 300 million.

I think you're right that you need, for example, 600 people *for each answer*
to get good poll results. So for a two-way election that's roughly even,
that's about 1,200 people. If one of the candidates is much less popular you
might have to sample many more people before you have 600 people in that
bucket.

The analogous case in our situation is not having 300 million distinct
values, since we're not gathering info on specific values, only on the
buckets. We need, for example, 600 samples *for each bucket*. Each bucket is
chosen to hold the same number of samples, so we always need the same number
of samples for a given number of buckets, regardless of how big the table is.
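
To make that concrete, here's a quick back-of-the-envelope simulation -- just
a toy Python sketch with made-up numbers (600 samples per bucket, a uniformly
distributed column), not anything resembling what ANALYZE actually does. The
worst-case error in the estimated equi-depth bucket boundaries comes out
about the same whether the "table" has a million values or a hundred million:

import random

def bucket_boundary_error(population, n_buckets=10, samples_per_bucket=600):
    # Estimate equi-depth bucket boundaries from a random sample of a
    # uniformly distributed column; return the worst boundary error as a
    # fraction of the value range.
    sample = sorted(random.randrange(population)
                    for _ in range(n_buckets * samples_per_bucket))
    worst = 0.0
    for b in range(1, n_buckets):
        estimated = sample[b * samples_per_bucket - 1]
        true_boundary = population * b // n_buckets
        worst = max(worst, abs(estimated - true_boundary) / population)
    return worst

for n in (10**6, 10**8):
    print(n, round(bucket_boundary_error(n), 4))

Both lines should print an error on the order of a percent; the population
size drops out, only the samples-per-bucket figure matters.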

>> Also, our estimates for n_distinct are very unreliable. The math behind
>> sampling for statistics just doesn't work the same way for properties
>> like n_distinct. For that Josh is right, we *would* need a sample size
>> proportional to the whole data set which would practically require us to
>> scan the whole table (and have a technique for summarizing the results
>> in a nearly constant sized data structure).
>
> Actually, a number of papers have shown block-based algorithms which can
> arrive at reasonably confident estimates (between 50% and 250% of the true
> value) based on scanning only 5% of *blocks*.  Simon did some work on this
> a couple years ago, but he and I had difficulty convincing -hackers that a
> genuine problem existed.

Really? Could you send references? The paper I read surveyed previous work
and found that you needed to scan up to 50% of the table to get good
results. Granted, 50-250% is considerably looser than what that paper
considered "good" results, so the two aren't entirely inconsistent, but I
thought the previous results were much worse than that.
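
To illustrate why n_distinct is so much harder than the histogram -- again a
toy Python sketch with invented numbers, nothing like the real estimator --
take a table in which every value occurs exactly twice. The only evidence a
fixed-size sample gives you about the duplication is the number of
within-sample collisions, and that evidence evaporates as the table grows:

import random

sample_size = 30000
for n_rows in (10**6, 10**7, 10**8):
    # Row i holds the value i // 2, so every value appears in exactly two
    # rows and the true n_distinct is n_rows / 2.
    sampled_rows = random.sample(range(n_rows), sample_size)
    values = [row // 2 for row in sampled_rows]
    collisions = len(values) - len(set(values))
    print(n_rows, "rows:", collisions, "duplicated values in the sample")

You see a few hundred collisions at a million rows, a few dozen at ten
million, and only a handful at a hundred million, so an estimator working
from a fixed-size sample is extrapolating from less and less information.
That's why the sample really would have to grow with the table (or be
replaced by a block-level or full scan feeding a compact summary structure).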

> You're correct that we'd need to change pg_statistic, though.  For one 
> thing, we need to separate the sample size from the histogram size.

That amounts to giving users control over the sample size per bucket, which
allows them to get a more or less accurate estimate for a range covering a
single bucket without changing the size of the bucket.
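
As a rough illustration (same toy setup as above: uniform data, made-up
sample sizes): holding the number of buckets fixed and varying only the
samples per bucket tightens the precision of a single bucket boundary roughly
as 1/sqrt(samples per bucket), which is exactly the knob someone estimating a
single-bucket range would want.

import random

n_buckets, population, trials = 10, 10**7, 200
for samples_per_bucket in (100, 600, 2400):
    errors = []
    for _ in range(trials):
        sample = sorted(random.randrange(population)
                        for _ in range(n_buckets * samples_per_bucket))
        # estimated upper edge of the first bucket vs. the true 10% quantile
        boundary = sample[samples_per_bucket - 1]
        errors.append(abs(boundary - population // n_buckets) / population)
    print(samples_per_bucket, "samples/bucket: mean boundary error",
          round(sum(errors) / trials, 4))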

I'm a bit puzzled which direction you want to go.

> Also, we seem to be getting pretty far away from the original GUC 
> discussion.

Thank heavens :)

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!

