
From: Josh Berkus
Subject: Re: Overhauling GUCS
Date:
Msg-id: 200806091649.22281.josh@agliodbs.com
In response to: Re: Overhauling GUCS  (Gregory Stark <stark@enterprisedb.com>)
Responses: Re: Overhauling GUCS  (Gregory Stark <stark@enterprisedb.com>)
List: pgsql-hackers
Greg,

> Speak to the statisticians. Our sample size is calculated using the same
> theory behind polls which sample 600 people to learn what 250 million
> people are going to do on election day. You do NOT need (significantly)
> larger samples for larger populations.

Your analogy is bad.  In an election, voters have only a few choices.
In a 300-million-row table there could be 300 million different values,
and the histogram becomes less accurate for every order of magnitude
the sample falls short of that.
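
As a rough illustration (a toy simulation, not anything from ANALYZE,
and 30,000 rows is just a stand-in for a typical fixed sample), the
share of a column's values that a fixed-size sample can even observe
collapses as the number of distinct values grows, which is what starves
the histogram and MCV list:

    import random

    SAMPLE_ROWS = 30_000   # stand-in for a typical fixed ANALYZE sample

    def fraction_of_distincts_seen(n_distinct):
        # Uniformly sample SAMPLE_ROWS values from a column with n_distinct
        # equally likely values; report what share of those values we saw.
        seen = {random.randrange(n_distinct) for _ in range(SAMPLE_ROWS)}
        return len(seen) / n_distinct

    for n in (10_000, 1_000_000, 100_000_000):
        print(f"{n:>11,} distinct -> sample sees {fraction_of_distincts_seen(n):.2%}")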

> Also, our estimates for n_distinct are very unreliable. The math behind
> sampling for statistics just doesn't work the same way for properties
> like n_distinct. For that Josh is right, we *would* need a sample size
> proportional to the whole data set which would practically require us to
> scan the whole table (and have a technique for summarizing the results
> in a nearly constant sized data structure).

Actually, a number of papers have shown block-based algorithms which can 
arrive at reasonably confident estimates (within 50% to 250% of the actual 
value) based on scanning only 5% of *blocks*.  Simon did some work on this 
a couple of years ago, but he and I had difficulty convincing -hackers that 
a genuine problem existed.
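
For the flavor of it, here is a minimal sketch of the block-sampling
idea, assuming a random 5% of blocks and a Chao-style extrapolation from
the singleton/doubleton counts; the estimators in the actual papers are
considerably more sophisticated:

    import random
    from collections import Counter

    def estimate_ndistinct(blocks, block_fraction=0.05):
        # 'blocks' is the table as a list of blocks, each a list of column values.
        n = max(1, int(len(blocks) * block_fraction))
        sampled = random.sample(blocks, n)
        freqs = Counter(v for block in sampled for v in block)
        d  = len(freqs)                                  # distinct values seen
        f1 = sum(1 for c in freqs.values() if c == 1)    # seen exactly once
        f2 = sum(1 for c in freqs.values() if c == 2)    # seen exactly twice
        # Bias-corrected Chao1 extrapolation from the sample to the table.
        return d + (f1 * (f1 - 1)) / (2.0 * (f2 + 1))

The particular estimator isn't the point; the point is that the unit of
I/O is whole blocks, so the cost is capped by the block fraction rather
than growing with the table.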

You're correct that we'd need to change pg_statistic, though.  For one 
thing, we need to separate the sample size from the histogram size.
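
To make the coupling concrete: right now one statistics target drives
both the number of histogram buckets and, through a fixed 300x
multiplier, how many rows ANALYZE samples.  A sketch of what decoupling
would mean (the names below are hypothetical, not existing GUCs or
pg_statistic columns):

    # Illustrative only: hypothetical knobs, not existing settings.
    stats_target = 100
    sample_rows  = 300 * stats_target     # today: sample size follows the target

    # Decoupled: histogram resolution and sample size chosen independently.
    histogram_buckets = 100
    sample_rows       = 3_000_000         # e.g. sized to the table instead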

Also, we seem to be getting pretty far away from the original GUC 
discussion.

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco

