
From: Josh Berkus
Subject: Re: Overhauling GUCS
Date:
Msg-id: 200806091649.22281.josh@agliodbs.com
In response to: Re: Overhauling GUCS  (Gregory Stark <stark@enterprisedb.com>)
Responses: Re: Overhauling GUCS  (Gregory Stark <stark@enterprisedb.com>)
List: pgsql-hackers
Greg,

> Speak to the statisticians. Our sample size is calculated using the same
> theory behind polls which sample 600 people to learn what 250 million
> people are going to do on election day. You do NOT need (significantly)
> larger samples for larger populations.

Your analogy is bad.  In an election, voters have only a few choices.
In a 300-million-row table there could be 300 million different values,
and the histogram becomes less accurate for every order of magnitude
the sample falls short of that.
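
As a rough illustration (a toy simulation, not anything from ANALYZE,
and 30,000 rows is just a stand-in for a typical fixed sample), the
share of a column's values that a fixed-size sample can even observe
collapses as the number of distinct values grows, which is what starves
the histogram and MCV list:

    import random

    SAMPLE_ROWS = 30_000   # stand-in for a typical fixed ANALYZE sample

    def fraction_of_distincts_seen(n_distinct):
        # Uniformly sample SAMPLE_ROWS values from a column with n_distinct
        # equally likely values; report what share of those values we saw.
        seen = {random.randrange(n_distinct) for _ in range(SAMPLE_ROWS)}
        return len(seen) / n_distinct

    for n in (10_000, 1_000_000, 100_000_000):
        print(f"{n:>11,} distinct -> sample sees {fraction_of_distincts_seen(n):.2%}")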

> Also, our estimates for n_distinct are very unreliable. The math behind
> sampling for statistics just doesn't work the same way for properties
> like n_distinct. For that Josh is right, we *would* need a sample size
> proportional to the whole data set which would practically require us to
> scan the whole table (and have a technique for summarizing the results
> in a nearly constant sized data structure).

Actually, a number of papers have shown block-based algorithms which can 
arrive at reasonably confident estimates (within 50% to 250% of the actual 
value) based on scanning only 5% of *blocks*.  Simon did some work on this 
a couple of years ago, but he and I had difficulty convincing -hackers that 
a genuine problem existed.
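
For the flavor of it, here is a minimal sketch of the block-sampling
idea, assuming a random 5% of blocks and a Chao-style extrapolation from
the singleton/doubleton counts; the estimators in the actual papers are
considerably more sophisticated:

    import random
    from collections import Counter

    def estimate_ndistinct(blocks, block_fraction=0.05):
        # 'blocks' is the table as a list of blocks, each a list of column values.
        n = max(1, int(len(blocks) * block_fraction))
        sampled = random.sample(blocks, n)
        freqs = Counter(v for block in sampled for v in block)
        d  = len(freqs)                                  # distinct values seen
        f1 = sum(1 for c in freqs.values() if c == 1)    # seen exactly once
        f2 = sum(1 for c in freqs.values() if c == 2)    # seen exactly twice
        # Bias-corrected Chao1 extrapolation from the sample to the table.
        return d + (f1 * (f1 - 1)) / (2.0 * (f2 + 1))

The particular estimator isn't the point; the point is that the unit of
I/O is whole blocks, so the cost is capped by the block fraction rather
than growing with the table.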

You're correct that we'd need to change pg_statistic, though.  For one 
thing, we need to separate the sample size from the histogram size.
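
To make the coupling concrete: right now one statistics target drives
both the number of histogram buckets and, through a fixed 300x
multiplier, how many rows ANALYZE samples.  A sketch of what decoupling
would mean (the names below are hypothetical, not existing GUCs or
pg_statistic columns):

    # Illustrative only: hypothetical knobs, not existing settings.
    stats_target = 100
    sample_rows  = 300 * stats_target     # today: sample size follows the target

    # Decoupled: histogram resolution and sample size chosen independently.
    histogram_buckets = 100
    sample_rows       = 3_000_000         # e.g. sized to the table instead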

Also, we seem to be getting pretty far away from the original GUC 
discussion.

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco

