
From Greg Stark
Subject Re: More thoughts about planner's cost estimates
Date
Msg-id 8764jkhgwb.fsf@stark.xeocode.com
In response to Re: More thoughts about planner's cost estimates  (Josh Berkus <josh@agliodbs.com>)
Responses Re: More thoughts about planner's cost estimates  (Josh Berkus <josh@agliodbs.com>)
Re: More thoughts about planner's cost estimates  ("Jim C. Nasby" <jnasby@pervasive.com>)
List pgsql-hackers
Josh Berkus <josh@agliodbs.com> writes:

> 1. n-distinct estimation is bad, as previously discussed;
> 
> 2. our current heuristics sampling methods prevent us from sampling more than
> 0.5% of any reasonably large table, causing all statistics on those tables to
> be off for any table with irregular distribution of values;

I'm convinced these two are more connected than you believe. For estimating the
distribution of values over a range, using small samples is on a solid
statistical basis. It's the same reason poll results based on only a few hundred
people can accurately predict elections in a country with hundreds of millions
of voters.

However, for things like estimating the number of distinct values, there's
absolutely no solid foundation for these small samples. Your accuracy is going
to be simply proportional to the sample size, meaning you can't get anything
trustworthy without sampling a large enough portion of the table that a full
sequential scan would be just as fast.
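
To make that concrete, here's a throwaway simulation (Python, a synthetic
column, nothing to do with the actual ANALYZE code): a 0.5% sample nails the
median, but neither the raw distinct count in the sample nor a naive scale-up
comes anywhere near the true n_distinct, and the sample alone can't tell this
column apart from one where every value is unique.

    # Rough simulation: a 0.5% sample is fine for range stats, bad for n_distinct.
    import random, statistics

    N = 1_000_000
    REPEATS = 20
    table = [i // REPEATS for i in range(N)]     # 50,000 distinct values, 20 rows each
    random.shuffle(table)

    sample = random.sample(table, N // 200)      # 0.5% sample = 5,000 rows

    # Quantile-style estimates converge nicely even at this sample size:
    print("true median   :", statistics.median(table))    # ~25,000
    print("sample median :", statistics.median(sample))   # also ~25,000

    # Distinct-value estimates do not: nearly every sampled row looks unique.
    d = len(set(sample))
    print("distinct in sample      :", d)         # ~4,800
    print("true n_distinct         :", 50_000)
    print("scaled-by-fraction guess:", d * 200)   # ~950,000 -- wildly off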

I might be interested in implementing that algorithm that was posted a while
back for generating good unbiased samples of discrete values. The algorithm was
quite clever and well described, and the paper claimed impressively good
results.

However it will only make sense if people are willing to accept that analyze
will need a full table scan -- at least for tables where the DBA knows that
good n_distinct estimates are necessary.
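
For what it's worth, a full scan for this doesn't have to mean sorting the
table or holding every value in memory. As a purely illustrative sketch (this
is NOT the algorithm from that paper, just a generic one-pass k-minimum-values
style counter, in Python):

    # One-pass, bounded-memory distinct-count sketch (k minimum hash values).
    import hashlib, heapq

    def kmv_ndistinct(rows, k=1024):
        heap = []      # negated hashes, so heapq's min-heap acts as a max-heap
        kept = set()   # hashes currently in the heap, to skip duplicates cheaply
        for v in rows:
            h = int.from_bytes(hashlib.blake2b(str(v).encode(),
                                               digest_size=8).digest(), "big")
            x = h / 2.0**64                   # map the hash into [0, 1)
            if x in kept:
                continue
            if len(heap) < k:
                heapq.heappush(heap, -x)
                kept.add(x)
            elif x < -heap[0]:                # smaller than the current k-th minimum
                kept.discard(-heapq.heappushpop(heap, -x))
                kept.add(x)
        if len(heap) < k:
            return len(heap)                  # fewer than k distinct values seen
        return int((k - 1) / -heap[0])        # standard KMV estimate

    # kmv_ndistinct(i % 50_000 for i in range(1_000_000)) lands near 50,000.

One sequential pass and a tiny amount of state gets the estimate within a few
percent; the expensive part is the scan itself, which is exactly the cost
people would have to accept.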

> 3. We don't have any method to analyze inter-column correlation within a table;
> 
> 4. We don't keep statistics on foreign key correlation;

Gosh these would be nice but they sound like hard problems. Has anybody even
made any headway in brainstorming how to tackle them?

> 5. random_page_cost (as previously discussed) is actually a function of
> relatively immutable hardware statistics, and as such should not need to exist
> as a GUC once the cost model is fixed.

I don't think that's true at all. Not all hardware is the same.

Certainly the need to twiddle this GUC should be drastically reduced if the
cache effects are modelled properly and the only excuses left are legitimate
hardware differences.
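
To put rough numbers on it, here's a back-of-the-envelope sketch (Python,
made-up drive figures, and emphatically not what the planner computes) of how
the observed random-vs-sequential ratio depends on both the hardware and the
cache hit rate:

    # Hardware-only ratio of a random page fetch to a sequential page read,
    # then blended with an assumed cache hit fraction.  Pure napkin math.
    def effective_random_page_cost(seek_ms, rotational_ms, transfer_mb_s,
                                   page_kb=8, cache_hit_fraction=0.0):
        seq_ms = page_kb / 1024.0 / transfer_mb_s * 1000.0   # ms to stream one page
        random_ms = seek_ms + rotational_ms + seq_ms         # ms for a cold random fetch
        raw_ratio = random_ms / seq_ms
        # Treat a cache hit as costing about the same as a sequential page read
        # (both are negligible next to a seek), so the visible ratio shrinks as
        # the hit rate climbs.
        return cache_hit_fraction * 1.0 + (1.0 - cache_hit_fraction) * raw_ratio

    # A commodity 7200rpm drive: ~8ms seek, ~4ms half-rotation, ~60MB/s streaming.
    print(effective_random_page_cost(8.0, 4.0, 60.0))                            # ~90 uncached
    print(effective_random_page_cost(8.0, 4.0, 60.0, cache_hit_fraction=0.95))   # ~5-6

The raw hardware ratio is enormous, and most of what drags it down to something
sane is caching; model that explicitly and what's left to configure really is
just the per-box hardware numbers.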

-- 
greg


