
From Greg Stark
Subject Re: More thoughts about planner's cost estimates
Date
Msg-id 8764jkhgwb.fsf@stark.xeocode.com
In response to Re: More thoughts about planner's cost estimates  (Josh Berkus <josh@agliodbs.com>)
Responses Re: More thoughts about planner's cost estimates  (Josh Berkus <josh@agliodbs.com>)
Re: More thoughts about planner's cost estimates  ("Jim C. Nasby" <jnasby@pervasive.com>)
List pgsql-hackers
Josh Berkus <josh@agliodbs.com> writes:

> 1. n-distinct estimation is bad, as previously discussed;
> 
> 2. our current heuristics sampling methods prevent us from sampling more than
> 0.5% of any reasonably large table, causing all statistics on those tables to
> be off for any table with irregular distribution of values;

I'm convinced these two are more connected than you believe. For estimating the
distribution of values over a range, using small samples is on a solid
statistical basis. It's the same reason poll results based on only a few hundred
people can accurately predict elections in a country with hundreds of millions
of voters.

However, for things like estimating the number of distinct values, there's
absolutely no solid foundation for these small samples. Your accuracy is going
to be simply proportional to the sample size, meaning you can't get anything
trustworthy without sampling a large enough portion of the table that a full
sequential scan would be just as fast.
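
To make that concrete, here's a throwaway simulation (Python, a synthetic
column, nothing to do with the actual ANALYZE code): a 0.5% sample nails the
median, but neither the raw distinct count in the sample nor a naive scale-up
comes anywhere near the true n_distinct, and the sample alone can't tell this
column apart from one where every value is unique.

    # Rough simulation: a 0.5% sample is fine for range stats, bad for n_distinct.
    import random, statistics

    N = 1_000_000
    REPEATS = 20
    table = [i // REPEATS for i in range(N)]     # 50,000 distinct values, 20 rows each
    random.shuffle(table)

    sample = random.sample(table, N // 200)      # 0.5% sample = 5,000 rows

    # Quantile-style estimates converge nicely even at this sample size:
    print("true median   :", statistics.median(table))    # ~25,000
    print("sample median :", statistics.median(sample))   # also ~25,000

    # Distinct-value estimates do not: nearly every sampled row looks unique.
    d = len(set(sample))
    print("distinct in sample      :", d)         # ~4,800
    print("true n_distinct         :", 50_000)
    print("scaled-by-fraction guess:", d * 200)   # ~950,000 -- wildly off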

I might be interested in implementing that algorithm that was posted a while
back for generating good unbiased samples of discrete values. The algorithm was
quite clever and well described, and the paper claimed impressively good
results.

However it will only make sense if people are willing to accept that analyze
will need a full table scan -- at least for tables where the DBA knows that
good n_distinct estimates are necessary.
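
For what it's worth, a full scan for this doesn't have to mean sorting the
table or holding every value in memory. As a purely illustrative sketch (this
is NOT the algorithm from that paper, just a generic one-pass k-minimum-values
style counter, in Python):

    # One-pass, bounded-memory distinct-count sketch (k minimum hash values).
    import hashlib, heapq

    def kmv_ndistinct(rows, k=1024):
        heap = []      # negated hashes, so heapq's min-heap acts as a max-heap
        kept = set()   # hashes currently in the heap, to skip duplicates cheaply
        for v in rows:
            h = int.from_bytes(hashlib.blake2b(str(v).encode(),
                                               digest_size=8).digest(), "big")
            x = h / 2.0**64                   # map the hash into [0, 1)
            if x in kept:
                continue
            if len(heap) < k:
                heapq.heappush(heap, -x)
                kept.add(x)
            elif x < -heap[0]:                # smaller than the current k-th minimum
                kept.discard(-heapq.heappushpop(heap, -x))
                kept.add(x)
        if len(heap) < k:
            return len(heap)                  # fewer than k distinct values seen
        return int((k - 1) / -heap[0])        # standard KMV estimate

    # kmv_ndistinct(i % 50_000 for i in range(1_000_000)) lands near 50,000.

One sequential pass and a tiny amount of state gets the estimate within a few
percent; the expensive part is the scan itself, which is exactly the cost
people would have to accept.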

> 3. We don't have any method to analyze inter-column correlation within a table;
> 
> 4. We don't keep statistics on foreign key correlation;

Gosh these would be nice but they sound like hard problems. Has anybody even
made any headway in brainstorming how to tackle them?

> 5. random_page_cost (as previously discussed) is actually a function of
> relatively immutable hardware statistics, and as such should not need to exist
> as a GUC once the cost model is fixed.

I don't think that's true at all. Not all hardware is the same.

Certainly the need to twiddle this GUC should be drastically reduced if the
cache effects are modelled properly and the only excuses left are legitimate
hardware differences.
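
To put rough numbers on it, here's a back-of-the-envelope sketch (Python,
made-up drive figures, and emphatically not what the planner computes) of how
the observed random-vs-sequential ratio depends on both the hardware and the
cache hit rate:

    # Hardware-only ratio of a random page fetch to a sequential page read,
    # then blended with an assumed cache hit fraction.  Pure napkin math.
    def effective_random_page_cost(seek_ms, rotational_ms, transfer_mb_s,
                                   page_kb=8, cache_hit_fraction=0.0):
        seq_ms = page_kb / 1024.0 / transfer_mb_s * 1000.0   # ms to stream one page
        random_ms = seek_ms + rotational_ms + seq_ms         # ms for a cold random fetch
        raw_ratio = random_ms / seq_ms
        # Treat a cache hit as costing about the same as a sequential page read
        # (both are negligible next to a seek), so the visible ratio shrinks as
        # the hit rate climbs.
        return cache_hit_fraction * 1.0 + (1.0 - cache_hit_fraction) * raw_ratio

    # A commodity 7200rpm drive: ~8ms seek, ~4ms half-rotation, ~60MB/s streaming.
    print(effective_random_page_cost(8.0, 4.0, 60.0))                            # ~90 uncached
    print(effective_random_page_cost(8.0, 4.0, 60.0, cache_hit_fraction=0.95))   # ~5-6

The raw hardware ratio is enormous, and most of what drags it down to something
sane is caching; model that explicitly and what's left to configure really is
just the per-box hardware numbers.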

-- 
greg


