Thread: More thoughts about planner's cost estimates

More thoughts about planner's cost estimates

From
Tom Lane
Date:
I've been thinking about the planner's costing problems again,
particularly in connection with Philippe Lang's complaint here:
http://archives.postgresql.org/pgsql-general/2006-05/msg01526.php
Investigation showed that the planner is much too optimistic about the
usefulness of a multi-index BitmapAnd plan, and since that comparison is
just a cost-estimate comparison, it implies we are underestimating the
cost of an index scan.  A typical example from his results is
   ->  BitmapAnd  (cost=12.30..12.30 rows=1 width=0) (actual time=0.306..0.306 rows=0 loops=13628)
         ->  Bitmap Index Scan on lw_id_workflow  (cost=0.00..2.02 rows=7 width=0) (actual time=0.009..0.009 rows=7 loops=13628)
               Index Cond: (lw.id_workflow = "outer".id)
         ->  Bitmap Index Scan on lw_ordre  (cost=0.00..10.03 rows=1437 width=0) (actual time=0.293..0.293 rows=1714 loops=13628)
               Index Cond: (ordre = $4)
 

There are two variables involved here: the cost of touching an index page
and the cost of processing an index tuple.  Given the two comparable data
points above, we can solve for those numbers, and it turns out that the
real cost ratio on Philippe's machine is about 50 to 1.  Which says that
for him, cpu_index_tuple_cost plus cpu_operator_cost should be around
0.02, nearly an order of magnitude higher than their current default
values (0.001 and 0.0025 respectively).

In general it seems to me that for CPU-bound databases, the default values
of the cpu_xxx_cost variables are too low.  I am tempted to raise the
default value of cpu_index_tuple_cost to 0.005, which would be half of
cpu_tuple_cost and twice cpu_operator_cost; that seems more in line with
my feel for the relative costs of things.  For a CPU-bound database all
three would need to go up more than that.  But rather than telling people
to manipulate all three of these variables individually, I think it might
also be a good idea to provide a new GUC variable named something like
"cpu_speed_scale" that would just be a multiplier for the other variables.
It would default to 1.0 and our standard advice for CPU-bound databases
would be "decrease random_page_cost to 1.0 and raise cpu_speed_scale to
10.0 or so".  Seems cleaner than telling people to muck with three or so
individual values.
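
To be concrete, all the new knob would do is multiply; a trivial sketch
(cpu_speed_scale does not exist today, and the numbers are just the current
defaults):

    #include <stdio.h>

    int
    main(void)
    {
        double      cpu_speed_scale = 10.0;     /* the proposed single knob */
        double      cpu_tuple_cost = 0.01;      /* current defaults */
        double      cpu_index_tuple_cost = 0.001;
        double      cpu_operator_cost = 0.0025;

        printf("effective: tuple %.3f, index tuple %.3f, operator %.4f\n",
               cpu_tuple_cost * cpu_speed_scale,
               cpu_index_tuple_cost * cpu_speed_scale,
               cpu_operator_cost * cpu_speed_scale);
        return 0;
    }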

Another thing that's bothering me is that the index access cost computation
(in genericcostestimate) is looking sillier and sillier:
    /*
     * Estimate the number of index pages that will be retrieved.
     *
     * For all currently-supported index types, the first page of the index is
     * a metadata page, and we should figure on fetching that plus a pro-rated
     * fraction of the remaining pages.
     */
    if (index->pages > 1 && index->tuples > 0)
    {
        numIndexPages = (numIndexTuples / index->tuples) * (index->pages - 1);
        numIndexPages += 1;        /* count the metapage too */
        numIndexPages = ceil(numIndexPages);
    }
    else
        numIndexPages = 1.0;
 
    /*
     * Compute the index access cost.
     *
     * Disk cost: our generic assumption is that the index pages will be read
     * sequentially, so they have cost 1.0 each, not random_page_cost.
     */
    *indexTotalCost = numIndexPages;
 

There are several things wrong with this, at least in its application to
btrees.  It's not counting descent of the btree (ie, touches of the root
page and intermediate-level pages).  On the other hand it's surely
overcharging for metapage touches.  As of CVS tip we cache the metapage in
the relcache, so it's probably fair to disregard that cost altogether.
And on the third hand, when we need to retrieve multiple leaf pages it's
over-optimistic to assume that those'll be purely sequential fetches.
(That would be approximately true in a freshly-built btree, but not in one
that's suffered any amount of page splitting or recycling.)

I've desisted from touching this mainly because the obvious sanity
adjustments, such as adding something for tree descent and charging
random_page_cost not 1.0 for leaf page touches, would increase the
estimated cost of index usage, which seemed like not the direction we
wanted to go in.  So I figured that the errors were more or less
cancelling each other.  But the addition of bitmap index scans is now
exposing the weaknesses, so we need to face up to the problem.

I am inclined to think that a reasonable model is to continue to estimate
the number of index leaf pages touched as above (pro-rating against the
total number of index entries), to charge random_page_cost per leaf page
touched, and not to count anything for metapage or tree descent.  I
justify the latter on the grounds that upper tree levels will tend to stay
in cache because they're touched so often.  random_page_cost per leaf page
may be an overestimate, since sometimes logically adjacent leaf pages will
be physically adjacent too, but not charging for tree descent should help
to cancel that out.  With this model, the disk cost to fetch a single
index entry will be estimated as random_page_cost (default 4.0) rather
than the current fixed 2.0.  This shouldn't hurt things too much for
simple indexscans --- especially since anyone running with a reduced
random_page_cost won't see as much change.  And it will increase the costs
for bitmap scans that cross many index pages, which is what we need in
light of Philippe's numbers.
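
In code terms the disk-cost part of that proposal amounts to something like
the following (an illustrative sketch, not a patch):

    #include <math.h>

    static double
    proposed_index_disk_cost(double numIndexTuples, double indexTuples,
                             double indexPages, double random_page_cost)
    {
        double      numLeafPages;

        /*
         * Pro-rate the leaf pages (everything except the metapage) by the
         * fraction of index entries we expect to fetch; charge nothing for
         * the metapage or for descending the upper levels.
         */
        if (indexPages > 1 && indexTuples > 0)
            numLeafPages = ceil((numIndexTuples / indexTuples) *
                                (indexPages - 1));
        else
            numLeafPages = 1.0;
        if (numLeafPages < 1.0)
            numLeafPages = 1.0;

        /* charge random_page_cost, not 1.0, per leaf page touched */
        return numLeafPages * random_page_cost;
    }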

Now we have seen a lot of cases in which indexscans were being drastically
overestimated, so increasing the cost estimate even more may seem like a
horrid idea.  But I believe that most or all of these cases were ones
where the same index was being visited many times, and so the real
estimation failure is not accounting for cache effects across multiple
indexscans.  Rather than being afraid to raise the cost estimate for an
indexscan in isolation, I think it's time to bite the bullet and do
something about that.

The big difficulty in modeling cache effects from multiple scans is that
we usually don't know how many times the index will be scanned.  If we did
have that number, I think it'd be reasonable to use the Mackert and Lohman
approximation already used in cost_index to estimate the total number of
index leaf pages fetched over N scans, and then divide by N to get our
per-scan estimated cost.  But because the planner builds cost estimates
bottom-up, in most scenarios we don't have any idea whether an indexscan
plan node will be iterated once or many times in the finished plan.
However there is one case where we do have some information: when
computing a best_inner_indexscan() for a nestloop join, it'd be reasonable
to use the estimated size of the outer relation as the loop count.
And this is exactly the case where we are falling down in practice:
the complaints we've seen are mostly about overestimating the cost of a
nestloop-with-inner-index-scan.  So I think handling just this case would
be a big step forward, even if it's not a theoretically pure general
solution.
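
For illustration, the per-scan amortization would have roughly this shape,
using the same Mackert-Lohman formula that the comments in cost_index already
describe (a sketch only, not planner code):

    /*
     * T = pages in the index, Ns = index tuples fetched summed over all N
     * scans, b = buffer pages assumed available.
     */
    static double
    amortized_leaf_pages_per_scan(double T, double Ns, double b, double N)
    {
        double      pages_fetched;

        if (T <= b)
        {
            pages_fetched = (2.0 * T * Ns) / (2.0 * T + Ns);
            if (pages_fetched > T)
                pages_fetched = T;
        }
        else
        {
            double      lim = 2.0 * T * b / (2.0 * T - b);

            if (Ns <= lim)
                pages_fetched = (2.0 * T * Ns) / (2.0 * T + Ns);
            else
                pages_fetched = b + (Ns - lim) * (T - b) / T;
        }
        return pages_fetched / N;       /* per-scan share of the fetches */
    }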

Thoughts, gripes, better ideas?
        regards, tom lane


Re: More thoughts about planner's cost estimates

From
Josh Berkus
Date:
Tom,

As you know, this is something I think about a bit too, though not 
nearly as deeply as you.

> In general it seems to me that for CPU-bound databases, the default values
> of the cpu_xxx_cost variables are too low.  I am tempted to raise the
> default value of cpu_index_tuple_cost to 0.005, which would be half of
> cpu_tuple_cost and twice cpu_operator_cost; that seems more in line with
> my feel for the relative costs of things.  For a CPU-bound database all
> three would need to go up more than that.  But rather than telling people
> to manipulate all three of these variables individually, I think it might
> also be a good idea to provide a new GUC variable named something like
> "cpu_speed_scale" that would just be a multiplier for the other variables.

Yes, please!  I've done a bit of playing around with the cpu_* 
variables, and frankly have always manipulated them by the same multiple.

Also, my testing has shown that on *large* databases (20x RAM +) with 
fast processors (Opteron) the cpu_* values are too high.  But a lot of 
that is offsetting estimates which are too high elsewhere ... as is 
random_page_cost.


> I've desisted from touching this mainly because the obvious sanity
> adjustments, such as adding something for tree descent and charging
> random_page_cost not 1.0 for leaf page touches, would increase the
> estimated cost of index usage, which seemed like not the direction we
> wanted to go in.  So I figured that the errors were more or less
> cancelling each other.  But the addition of bitmap index scans is now
> exposing the weaknesses, so we need to face up to the problem.

Yeah.  I've refrained from proposing changes because it's a 
pick-up-sticks.  If we start modifying the model, we need to fix 
*everything*, not just one item.  And then educate our users that they 
need to use the GUC variables in a different way.  Here's the issues I see:

1. n-distinct estimation is bad, as previously discussed;

2. our current heuristics sampling methods prevent us from sampling more 
than 0.5% of any reasonably large table, causing all statistics on those 
tables to be off for any table with irregular distribution of values;

3. We don't have any method to analyze inter-column correlation within a 
table;

4. We don't keep statistics on foreign key correlation;

5. random_page_cost (as previously discussed) is actually a function of 
relatively immutable hardware statistics, and as such should not need to 
exist as a GUC once the cost model is fixed.

6. We haven't added any way to estimate rows returned from SRFs.

> I am inclined to think that a reasonable model is to continue to estimate
> the number of index leaf pages touched as above (pro-rating against the
> total number of index entries), to charge random_page_cost per leaf page
> touched, and not to count anything for metapage or tree descent.  I
> justify the latter on the grounds that upper tree levels will tend to stay
> in cache because they're touched so often.  random_page_cost per leaf page
> may be an overestimate, since sometimes logically adjacent leaf pages will
> be physically adjacent too, but not charging for tree descent should help
> to cancel that out.  With this model, the disk cost to fetch a single
> index entry will be estimated as random_page_cost (default 4.0) rather
> than the current fixed 2.0.  This shouldn't hurt things too much for
> simple indexscans --- especially since anyone running with a reduced
> random_page_cost won't see as much change.  And it will increase the costs
> for bitmap scans that cross many index pages, which is what we need in
> light of Philippe's numbers.

Per above, I think anything involving random_page_cost is the wrong way 
to go.  RPC is a GUC assigned by the DBA on pure guesswork.  We should 
be decreasing its use or eliminating it entirely, not increasing the 
amount we use it.

For btrees, we should be able to accurately estimate the cost of the 
index access given:
a) the depth of the tree;
b) the likelihood that requested values are adjacent;
c) whether the index and tables are already in shared_mem, or else (d) 
and (e) below:
d) the probability that the index pages are cached in memory, which is 
composed of:
    (1) the frequency with which that index is accessed, modified by
    (2) whether the index is a fraction of available ram, or larger than ram
 
e) the probability that the relevant table pages are cached in memory, 
determined by the same two factors.

*None* of this should involve a userset parameter (random_page_cost) 
since as you pointed out months ago, the ratio of seq/seek access has 
remained relatively constant over the past 6 years of HDD and I/O 
engineering.

Sadly, I don't have the math for all of the above but I'm hoping someone 
(either here or on ACM) does.

Also, this would require us to have different *formulas* and not merely 
different cost factors for different kinds of indexes.  I believe that 
this is something that Jie is already struggling with.  Jie?

> The big difficulty in modeling cache effects from multiple scans is that
> we usually don't know how many times the index will be scanned.  If we did
> have that number, I think it'd be reasonable to use the Mackert and Lohman
> approximation already used in cost_index to estimate the total number of
> index leaf pages fetched over N scans, and then divide by N to get our
> per-scan estimated cost.  But because the planner builds cost estimates
> bottom-up, in most scenarios we don't have any idea whether an indexscan
> plan node will be iterated once or many times in the finished plan.

I think we can work around this by figuring out whether the index has 
already been scanned in the plan, something we *can* know.  So the first 
scan will be full cost and remaining scans will be fractional cost. 
Possibly not as good as Mackert/Lohman, but it allows us reasonably 
accurate *overall* estimates without needing to move away from bottom-up 
plan building.
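
Something with this shape, say (the fraction is a number we'd have to pick;
purely a sketch):

    /*
     * Full cost for the first scan of a given index within a plan, a
     * fraction of that for later scans.  repeat_fraction is a made-up knob.
     */
    static double
    scan_cost_with_reuse(double full_scan_cost, int prior_scans_of_index,
                         double repeat_fraction)
    {
        return (prior_scans_of_index == 0) ?
            full_scan_cost : full_scan_cost * repeat_fraction;
    }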


> Thoughts, gripes, better ideas?

yes.  I think if we're going to fix the cost model, we should overhaul 
it and target 8.3.   Realistically, once we start tinkering with the 
cost model most queries are going to get worse before they get better.

--Josh


Re: More thoughts about planner's cost estimates

From
Greg Stark
Date:
Josh Berkus <josh@agliodbs.com> writes:

> 1. n-distinct estimation is bad, as previously discussed;
> 
> 2. our current heuristics sampling methods prevent us from sampling more than
> 0.5% of any reasonably large table, causing all statistics on those tables to
> be off for any table with irregular distribution of values;

I'm convinced these two are more connected than you believe. For distributions
of values in a range, using small samples rests on a solid statistical basis. It's
the same reason poll results based on only a few hundred people can accurately
predict elections in a country with hundreds of millions of voters.

However for things like estimating discrete values there's absolutely no solid
foundation for these small samples. Your accuracy is going to be simply
proportional to the sample size, meaning you can't get anything trustworthy
without sampling large enough portions of the table that a full sequential
scan would be just as fast.

I might be interested in implementing that algorithm that was posted a while
back that involved generating good unbiased samples of discrete values. The
algorithm was quite clever and well described, and the paper claimed impressively
good results.

However it will only make sense if people are willing to accept that analyze
will need a full table scan -- at least for tables where the DBA knows that
good n_distinct estimates are necessary.

> 3. We don't have any method to analyze inter-column correlation within a table;
> 
> 4. We don't keep statistics on foreign key correlation;

Gosh these would be nice but they sound like hard problems. Has anybody even
made any headway in brainstorming how to tackle them?

> 5. random_page_cost (as previously discussed) is actually a function of
> relatively immutable hardware statistics, and as such should not need to exist
> as a GUC once the cost model is fixed.

I don't think that's true at all. Not all hardware is the same.

Certainly the need to twiddle this GUC should be drastically reduced if the
cache effects are modelled properly and the only excuses left are legitimate
hardware differences.

-- 
greg



Re: More thoughts about planner's cost estimates

From
Josh Berkus
Date:
Greg,

> I'm convinced these two are more connected than you believe.

Actually, I think they are inseparable.

> I might be interested in implementing that algorithm that was posted a
> while back that involved generating good unbiased samples of discrete
> values. The algorithm was quite clever and well described and paper
> claimed impressively good results.
>
> However it will only make sense if people are willing to accept that
> analyze will need a full table scan -- at least for tables where the DBA
> knows that good n_distinct estimates are necessary.

What about block-based sampling?   Sampling 1 in 20 disk pages, rather than 
1 in 20 rows, should require significantly less scanning, and yet give us 
a large enough sample for reasonable accuracy.

> > 3. We don't have any method to analyze inter-column correlation within
> > a table;
> >
> > 4. We don't keep statistics on foreign key correlation;
>
> Gosh these would be nice but they sound like hard problems. Has anybody
> even made any headway in brainstorming how to tackle them?

There's no time like the present!

Actually, these  both seem like fairly straightforward problems 
storage-wise.  The issue is deriving the statistics, for tables with many 
columns or FKs.  

> > 5. random_page_cost (as previously discussed) is actually a function
> > of relatively immutable hardware statistics, and as such should not
> > need to exist as a GUC once the cost model is fixed.
>
> I don't think that's true at all. Not all hardware is the same.
>
> Certainly the need to twiddle this GUC should be drastically reduced if
> the cache effects are modelled properly and the only excuses left are
> legitimate hardware differences.

OK ... but still, it should become a "little knob" rather than the "big 
knob" it is currently.


-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco


Re: More thoughts about planner's cost estimates

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> Yeah.  I've refrained from proposing changes because it's a 
> pick-up-sticks.  If we start modifying the model, we need to fix 
> *everything*, not just one item.  And then educate our users that they 
> need to use the GUC variables in a different way.  Here's the issues I see:

> [ various deficiencies in our statistics gathering ]

Those are all valid points, but pretty much orthogonal to what I'm on
about today.  The problems I'm thinking about are seen even when the
planner's rowcount estimates are dead on (as indeed they mostly were
in Philippe's example).

> 5. random_page_cost (as previously discussed) is actually a function of 
> relatively immutable hardware statistics, and as such should not need to 
> exist as a GUC once the cost model is fixed.

Well, it'll still exist as a GUC for the same reasons the other ones are
GUCs.  But I think the main reasons people have needed to twiddle it are(1) their database is completely RAM-resident
(whereRPC    *should* logically be 1.0), or(2) they're trying to compensate for the overestimation of    nestloop
indexscans.
The changes I'm proposing should help with (2).

> For btrees, we should be able to accurately estimate the cost of the 
> index access given:
> a) the depth of the tree;

Although logically the tree descent *ought* to be a factor, I haven't
seen any evidence that we need to take it into account; in fact all the
evidence points to the conclusion that we're better off ignoring it.
My theory about why is that the upper levels of the tree stay in cache.
I have to admit that's just handwaving, but counting additional disk
hits to fetch the first index tuple is clearly not the direction the
cost model needs to go in.  Look again at the examples in Philippe's
output: an indexscan fetching 1700 tuples (about 5 leaf pages probably)
took 32x longer than one fetching 7 tuples (presumably all on the same
leaf page).  There's no way that a model that counts an additional
couple of page fetches for tree descent is going to arrive at those
numbers.  And we see this over and over again: incredibly small times
to fetch a single index tuple, at least on the average when the index
is being hit many times in one query.

> c) whether the index and tables are already in shared_mem, or else (d) 
> and (e) below:
> d) the probability that the index pages are cached in memory, which is 
> composed of:
>     (1) the frequency with which that index is accessed, modified by
>     (2) whether the index is a fraction of available ram, or larger than ram
> e) the probability that the relevant table pages are cached in memory, 
> determined by the same two factors.

These would all be nice things to know, but I'm afraid it's pie in the
sky.  We have no reasonable way to get those numbers.  (And if we could
get them, there would be another set of problems, namely plan instability:
the planner's choices would become very difficult to reproduce.)

>> The big difficulty in modeling cache effects from multiple scans is that
>> we usually don't know how many times the index will be scanned.

> I think we can work around this by figuring out whether the index has 
> already been scanned in the plan, something we *can* know.  So the first 
> scan will be full cost and remaining scans will be fractional cost. 

Uh, no, because what we're normally thinking about is independent probes
into the index with different keys.  For a small number of probes you
have to figure that those all hit different leaf pages and aren't going
to amortize well.  As the number of probes goes up, you can start
charging fractional costs because it's more likely you're hitting a leaf
page you hit recently.  The Mackert/Lohman equations do this reasonably
well --- it's possible we can find something better, but those seem like
a good place to start.  The immediate problem is to get an
infrastructure in place that gives us some numbers to apply equations to.
        regards, tom lane


Re: More thoughts about planner's cost estimates

From
Greg Stark
Date:
Josh Berkus <josh@agliodbs.com> writes:

> > However it will only make sense if people are willing to accept that
> > analyze will need a full table scan -- at least for tables where the DBA
> > knows that good n_distinct estimates are necessary.
> 
> What about block-based sampling?   Sampling 1 in 20 disk pages, rather than 
> 1 in 20 rows, should require significantly less scanning, and yet give us 
> a large enough sample for reasonable accuracy.

a) We already use block based sampling to reduce overhead. If you're talking
   about using the entire block and not just randomly sampled tuples from
   within those blocks then your sample will be biased.
 

b) It *still* wouldn't give you reasonable accuracy for n_distinct estimates.
   Your probability of being accurate for discrete processes like n_distinct
   is going to be more or less proportional to your sample. So sampling 5% of
   the table gives you only a 5% chance of an accurate prediction. More
   or less.
 

> Actually, these  both seem like fairly straightforward problems 
> storage-wise.  The issue is deriving the statistics, for tables with many 
> columns or FKs.  

Well there are problems on many levels:

1) You have n^2 possible two-column combinations. That's a lot of processing
   and storage.

2) It isn't even clear what data you're exactly looking for. Certainly
   "correlation" is just shorthand here and isn't what you're actually looking
   for. You need something like a complete histogram for the dependent column
   for every possible value of the forcing column. That would be an enormous
   amount of storage.
 

> OK ... but still, it should become a "little knob" rather than the "big 
> knob" it is currently.

Sure, the more accurately you model the cache the less need there would be to
adjust random_page_cost. I'm not sure how accurate Postgres can really get
unless it switches philosophies towards using O_DIRECT and doing its own
caching and scheduling. But it could certainly be much better than today.
Tom's comment makes it look like there's hope for pretty accurate cache
modelling for nested loop plans.

-- 
greg



Re: More thoughts about planner's cost estimates

From
Josh Berkus
Date:
Greg, Tom,

> a) We already use block based sampling to reduce overhead. If you're
> talking about using the entire block and not just randomly sampled
> tuples from within those blocks then your sample will be biased.

There are actually some really good equations to work with this, estimating 
both row correlation and n-distinct based on sampling complete blocks.  
See prior discussions on this list on N-distinct.

> 1) You have n^2 possible two-column combinations. That's a lot of
> processing and storage.

Yes, that's the hard problem to solve.  Actually, btw, it's n!, not n^2.

> 2) It isn't even clear what data you're exactly looking for. Certainly
>    "correlation" is just shorthand here and isn't what you're actually
> looking for. 

Actually, I'd think that a correlation number estimate (0 = complete 
uncorrelated, 1 = completely correlated) would be sufficient to improve 
row count estimation significantly, without incurring the vast overhead of 
histogram X histogram manipulation.

> Those are all valid points, but pretty much orthogonal to what I'm on
> about today.  The problems I'm thinking about are seen even when the
> planner's rowcount estimates are dead on (as indeed they mostly were
> in Philippe's example).

OK, I was afraid they were interdependant.  You would know better than me.

> Well, it'll still exist as a GUC for the same reasons the other ones are
> GUCs.  But I think the main reasons people have needed to twiddle it are
>     (1) their database is completely RAM-resident (where RPC
>         *should* logically be 1.0), or
>     (2) they're trying to compensate for the overestimation of
>         nestloop indexscans.
> The changes I'm proposing should help with (2).

Right.  What I'm saying is that (1) should be derived from 
estimated_cache_size and DBSIZE, not by setting an additional GUC.

> >     (1) the frequency with which that index is accessed, modified by
> >     (2) whether the index is a fraction of available ram, or larger than ram
> > e) the probability that the relevant table pages are cached in memory,
> > determined by the same two factors.
>
> These would all be nice things to know, but I'm afraid it's pie in the
> sky.  We have no reasonable way to get those numbers.  (And if we could
> get them, there would be another set of problems, namely plan
> instability: the planner's choices would become very difficult to
> reproduce.)

Hmmm ... I think those numbers are possible to infer, at least on a gross 
level.  Whether the cost of calculating them outweighs the benefit of 
having them is another question, resolvable only through some 
experimentation.

> a good place to start.  The immediate problem is to get an
> infrastructure in place that gives us some numbers to apply equations
> to.

No arguments, of course.

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco


Re: More thoughts about planner's cost estimates

From
Josh Berkus
Date:
Greg,

> > 1) You have n^2 possible two-column combinations. That's a lot of
> > processing and storage.
>
> Yes, that's the hard problem to solve.  Actually, btw, it's n!, not n^2.

Ooops, bad math.  Andrew pointed out it's actually n*(n-1)/2, not n!.

Also, we could omit columns unlikely to correlate, such as large text
columns, bytea and numerics with high precisions.  Also, we probably don't
need to correlate UNIQUE columns inside ... I think.

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco


Re: More thoughts about planner's cost estimates

From
"Jim C. Nasby"
Date:
On Thu, Jun 01, 2006 at 02:25:56PM -0400, Greg Stark wrote:
> 
> Josh Berkus <josh@agliodbs.com> writes:
> 
> > 1. n-distinct estimation is bad, as previously discussed;
> > 
> > 2. our current heuristics sampling methods prevent us from sampling more than
> > 0.5% of any reasonably large table, causing all statistics on those tables to
> > be off for any table with irregular distribution of values;
> 
> I'm convinced these two are more connected than you believe. For distributions
> of values in a range using small samples is on solid statistical basis. It's
> the same reason poll results based on only a few hundred people can accurately
> predict elections in a country with hundreds of millions of voters.
> 
> However for things like estimating discrete values there's absolutely no solid
> foundation for these small samples. Your accuracy is going to be simply
> proportional to the sample size, meaning you can't get anything trustworthy
> without sampling large enough portions of the table that a full sequential
> scan would be just as fast.
> 
> I might be interested in implementing that algorithm that was posted a while
> back that involved generating good unbiased samples of discrete values. The
> algorithm was quite clever and well described and paper claimed impressively
> good results.
> 
> However it will only make sense if people are willing to accept that analyze
> will need a full table scan -- at least for tables where the DBA knows that
> good n_distinct estimates are necessary.
There *might* be an alternative....

Suppose that the executor could keep track of what values it's seeing
from a table as it's actually running a query. These could be used to
build statistical information without actually running analyze.

Of course, the problem is that you could end up with some seriously
skewed statistics, depending on what your query patterns are. But in
some ways that's not bad; you're optimizing for the queries you most
often run.

One possible way to handle this would be to keep track of how many times
each value has been seen, as well as when it was last seen, so you have
some idea of the quality of the data. Another idea is to somehow blend
these stats with the traditional method.
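
A minimal sketch of what might be tracked per value (purely hypothetical,
names made up):

    #include <stdint.h>
    #include <time.h>

    typedef struct ExecValueStat
    {
        uint32_t    value_hash;     /* hash of the datum the executor saw */
        uint64_t    times_seen;     /* how many times it has been seen */
        time_t      last_seen;      /* when it was last seen */
    } ExecValueStat;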

> > 3. We don't have any method to analyze inter-column correlation within a table;
> > 
> > 4. We don't keep statistics on foreign key correlation;
> 
> Gosh these would be nice but they sound like hard problems. Has anybody even
> made any headway in brainstorming how to tackle them?
A *huge* improvement would be gathering statistics on all multi-column
indexes. Some of the stats might not make sense in this context, but
others (such as correlation) would.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: More thoughts about planner's cost estimates

From
"Jim C. Nasby"
Date:
On Thu, Jun 01, 2006 at 03:15:09PM -0400, Tom Lane wrote:
> These would all be nice things to know, but I'm afraid it's pie in the
> sky.  We have no reasonable way to get those numbers.  (And if we could
> get them, there would be another set of problems, namely plan instability:
> the planner's choices would become very difficult to reproduce.)

Speaking of plan instability, something that's badly needed is the
ability to steer away from query plans that *might* be the most optimal,
but also will fail horribly should the cost estimates be wrong. People
generally don't care about getting the absolutely most optimal plan;
they do care about NOT getting a plan that's horribly bad. One possible
way to do this would be to have the estimator calculate a worst-case
cost to go along with the best case cost, or maybe even 3 numbers:
ideal, what we think will happen, and worst-case. I know that index
scans already have that information, at least in the form of cost for a
plan with 0 correlation and one with perfect correlation.

Does anyone have any experience developing genetic algorithms? I think
coming up with cost estimators might be an ideal use for that
technology...
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: More thoughts about planner's cost estimates

From
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
> Speaking of plan instability, something that's badly needed is the
> ability to steer away from query plans that *might* be the most optimal,
> but also will fail horribly should the cost estimates be wrong.

You sure that doesn't leave us with the empty set :-( ?  Any plan
we pick will be horrible under some scenario.  I do agree that the
current lowest-cost-and-nothing-else criterion can pick some pretty
brittle plans, but it's not that easy to see how to improve it.
I don't think either "best case" or "worst case" are particularly
helpful concepts here.  You'd almost have to try to develop an
estimated probability distribution, and that's way more info than
we have.

> People generally don't care about getting the absolutely most optimal
> plan; they do care about NOT getting a plan that's horribly bad.

If 8.2 runs a query at half the speed of 8.1, people will be unhappy,
and they won't be mollified if you tell them that that plan is "better"
because it would have degraded less quickly if the planner's estimates
were wrong.  The complaint that started this thread (Philippe Lang's
a couple days ago) was in fact of exactly that nature: his query was
running slower than it had been because the planner was picking bitmap
scans instead of plain index scans.  Well, those bitmap scans would have
been a lot more robust in the face of bad rowcount estimates, but the
estimates weren't wrong.
        regards, tom lane


Re: More thoughts about planner's cost estimates

From
Greg Stark
Date:
Josh Berkus <josh@agliodbs.com> writes:

> Greg, Tom,
> 
> > a) We already use block based sampling to reduce overhead. If you're
> > talking about using the entire block and not just randomly sampled
> > tuples from within those blocks then your sample will be biased.
> 
> There are actually some really good equations to work with this, estimating 
> both row correlation and n-distinct based on sampling complete blocks.  
> See prior discussions on this list on N-distinct.

In the prior discussions someone posted the paper with the algorithm I
mentioned. That paper mentions that previous work showed poor results at
estimating n_distinct even with sample sizes as large as 50% or more.

> > 1) You have n^2 possible two-column combinations. That's a lot of
> > processing and storage.
> 
> Yes, that's the hard problem to solve.  Actually, btw, it's n!, not n^2.

You have O(n^2) possible *two-column* combinations and O(n!) n-column
combinations. I was hoping two-column combinations would be enough information
to extrapolate from for larger combinations.

> > 2) It isn't even clear what data you're exactly looking for. Certainly
> >    "correlation" is just shorthand here and isn't what you're actually
> > looking for. 
> 
> Actually, I'd think that a correlation number estimate (0 = complete 
> uncorrelated, 1 = completely correlated) would be sufficient to improve 
> row count estimation significantly, without incurring the vast overhead of 
> histogram X histogram manipulation.

No, correlation is actually quite uninteresting here. Consider for example two
columns "state" and "area code". The "area code" is pretty much 100% dependent
on state, but will show very little correlation. Similarly things like
"product name" and "catalog number" or even "author" and "genre" would be
problem cases but have little correlation. And you can easily imagine queries
that specify restrictive clauses on two such columns for which the existing
statistics will overestimate the selectivity because it assumes no
interdependency.

Hopefully you're right that you don't need complete histograms. Perhaps
there's some statistics concept they don't teach in stats 101 that would cover
this need. What we're looking for is a function f(a,b,x) that gives the net
selectivity given a and b, the selectivity of two clauses based on two
columns, and x some simple value that can be easily calculated by looking at
the data in advance.
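
One strawman shape for such a function, just to make the idea concrete (no
statistical pedigree claimed): treat x as a degree of dependence in [0,1] and
interpolate between the independent estimate a*b and the fully-dependent
estimate min(a,b):

    /* x = 0: independent, selectivity a*b; x = 1: fully dependent, min(a,b) */
    static double
    net_selectivity(double a, double b, double x)
    {
        double      independent = a * b;
        double      dependent = (a < b) ? a : b;

        return independent + x * (dependent - independent);
    }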

-- 
greg



Re: More thoughts about planner's cost estimates

From
David Fetter
Date:
On Thu, Jun 01, 2006 at 08:36:16PM -0400, Greg Stark wrote:
> 
> Josh Berkus <josh@agliodbs.com> writes:
> 
> > Greg, Tom,
> > 
> > > a) We already use block based sampling to reduce overhead.  If
> > > you're talking about using the entire block and not just
> > > randomly sampled tuples from within those blocks then your
> > > sample will be biased.
> > 
> > There are actually some really good equations to work with this,
> > estimating both row correlation and n-distinct based on sampling
> > complete blocks.  See prior discussions on this list on
> > N-distinct.
> 
> In the prior discussions someone posted the paper with the algorithm
> I mentioned.  That paper mentions that previous work showed poor
> results at estimating n_distinct even with sample sizes as large as
> 50% or more.

Which paper?  People have referenced several different ones.

> > > 1) You have n^2 possible two-column combinations.  That's a lot
> > > of processing and storage.
> > 
> > Yes, that's the hard problem to solve.  Actually, btw, it's n!,
> > not n^2.
> 
> You have O(n^2) possible *two-column* combinations and O(n!)
> n-column combinations.  I was hoping two-column combinations would
> be enough information to extrapolate from for larger combinations.

The math nerd in me says that this is bad math, but it might work
"well enough" by ass-u-ming a lack of higher-order correllations.

> > > 2) It isn't even clear what data you're exactly looking for.
> > > Certainly "correlation" is just shorthand here and isn't what
> > > you're actually looking for. 
> > 
> > Actually, I'd think that a correlation number estimate (0 =
> > complete uncorrelated, 1 = completely correlated) would be
> > sufficient to improve row count estimation significantly, without
> > incurring the vast overhead of histogram X histogram manipulation.
> 
> No, correlation is actually quite uninteresting here.  Consider for
> example two columns "state" and "area code".  The "area code" is
> pretty much 100% dependent on state, but will show very little
> correlation.

There are quantitative tests of independence available.  I'm not sure
whether any of these have been applied even theoretically in
DBMS-land.

> Hopefully you're right that you don't need complete histograms.
> Perhaps there's some statistics concept they don't teach in stats
> 101 that would cover this need.  What we're looking for is a
> function f(a,b,x) that gives the net selectivity given a and b, the
> selectivity of two clauses based on two columns, and x some simple
> value that can be easily calculated by looking at the data in
> advance.

That would be neat :)

Cheers,
D
-- 
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778        AIM: dfetter666                             Skype: davidfetter

Remember to vote!


Re: More thoughts about planner's cost estimates

From
Mark Kirkwood
Date:
Tom Lane wrote:

> Another thing that's bothering me is that the index access cost computation
> (in genericcostestimate) is looking sillier and sillier:
> 
>     /*
>      * Estimate the number of index pages that will be retrieved.
>      *
>      * For all currently-supported index types, the first page of the index is
>      * a metadata page, and we should figure on fetching that plus a pro-rated
>      * fraction of the remaining pages.
>      */
>     if (index->pages > 1 && index->tuples > 0)
>     {
>         numIndexPages = (numIndexTuples / index->tuples) * (index->pages - 1);
>         numIndexPages += 1;        /* count the metapage too */
>         numIndexPages = ceil(numIndexPages);
>     }
>     else
>         numIndexPages = 1.0;
> 
>     /*
>      * Compute the index access cost.
>      *
>      * Disk cost: our generic assumption is that the index pages will be read
>      * sequentially, so they have cost 1.0 each, not random_page_cost.
>      */
>     *indexTotalCost = numIndexPages;
> 
> There are several things wrong with this, at least in its application to
> btrees.  It's not counting descent of the btree (ie, touches of the root
> page and intermediate-level pages).  On the other hand it's surely
> overcharging for metapage touches.  As of CVS tip we cache the metapage in
> the relcache, so it's probably fair to disregard that cost altogether.
> And on the third hand, when we need to retrieve multiple leaf pages it's
> over-optimistic to assume that those'll be purely sequential fetches.
> (That would be approximately true in a freshly-built btree, but not in one
> that's suffered any amount of page splitting or recycling.)
> 
> 
> I am inclined to think that a reasonable model is to continue to estimate
> the number of index leaf pages touched as above (pro-rating against the
> total number of index entries), to charge random_page_cost per leaf page
> touched, and not to count anything for metapage or tree descent.  I
> justify the latter on the grounds that upper tree levels will tend to stay
> in cache because they're touched so often.  random_page_cost per leaf page
> may be an overestimate, since sometimes logically adjacent leaf pages will
> be physically adjacent too, but not charging for tree descent should help
> to cancel that out.  With this model, the disk cost to fetch a single
> index entry will be estimated as random_page_cost (default 4.0) rather
> than the current fixed 2.0.  This shouldn't hurt things too much for
> simple indexscans --- especially since anyone running with a reduced
> random_page_cost won't see as much change.  And it will increase the costs
> for bitmap scans that cross many index pages, which is what we need in
> light of Philippe's numbers.
> 

This sounds good to me, although the 2.0 -> 4.0 cost jump may cause 
(more) cases of people seeing their index scans in pre-8.2 versions 
becoming some other type of access in 8.2. I guess a comment about 
testing existing applications could be placed in the 8.2 release notes?

> Now we have seen a lot of cases in which indexscans were being drastically
> overestimated, so increasing the cost estimate even more may seem like a
> horrid idea.  But I believe that most or all of these cases were ones
> where the same index was being visited many times, and so the real
> estimation failure is not accounting for cache effects across multiple
> indexscans.  Rather than being afraid to raise the cost estimate for an
> indexscan in isolation, I think it's time to bite the bullet and do
> something about that.
> 
> The big difficulty in modeling cache effects from multiple scans is that
> we usually don't know how many times the index will be scanned.  If we did
> have that number, I think it'd be reasonable to use the Mackert and Lohman
> approximation already used in cost_index to estimate the total number of
> index leaf pages fetched over N scans, and then divide by N to get our
> per-scan estimated cost.  But because the planner builds cost estimates
> bottom-up, in most scenarios we don't have any idea whether an indexscan
> plan node will be iterated once or many times in the finished plan.
> However there is one case where we do have some information: when
> computing a best_inner_indexscan() for a nestloop join, it'd be reasonable
> to use the estimated size of the outer relation as the loop count.
> And this is exactly the case where we are falling down in practice:
> the complaints we've seen are mostly about overestimating the cost of a
> nestloop-with-inner-index-scan.  So I think handling just this case would
> be a big step forward, even if it's not a theoretically pure general
> solution.
> 
>

If I understand correctly, this is the case where we normally see folks 
needing add several 'set enable_xxx=off' statements to get the nested 
loop plan that *actually* works best :-). So, also looks good!

Cheers

Mark



Re: More thoughts about planner's cost estimates

From
Tom Lane
Date:
Mark Kirkwood <markir@paradise.net.nz> writes:
> Tom Lane wrote:
>> With this model, the disk cost to fetch a single
>> index entry will be estimated as random_page_cost (default 4.0) rather
>> than the current fixed 2.0.  This shouldn't hurt things too much for
>> simple indexscans --- especially since anyone running with a reduced
>> random_page_cost won't see as much change.  And it will increase the costs
>> for bitmap scans that cross many index pages, which is what we need in
>> light of Philippe's numbers.

> This sounds good to me, although the 2.0 -> 4.0 cost jump may cause 
> (more) cases of people seeing their index scans in pre-8.2 versions 
> becoming some other type of access in 8.2. I guess a comment about 
> testing existing applications could be placed in the 8.2 release notes?

Yeah, that comes with the territory.  One point to note is that with
this model, setting random_page_cost below 2.0 will actually make small
indexscans look *cheaper* than they do now.  So it'll certainly be
possible to make the thing jump in that direction if you need to.
        regards, tom lane


Re: More thoughts about planner's cost estimates

From
Nicolai Petri
Date:
Hi All,

Just a small comment from a mortal user.

On Thursday 01 June 2006 19:28, Josh Berkus wrote:
> 5. random_page_cost (as previously discussed) is actually a function of
> relatively immutable hardware statistics, and as such should not need to
> exist as a GUC once the cost model is fixed.
It's correct that the hardware statistics are quite immutable - per 
tablespace. This suggests to me that the GUC should be a default value only and 
then overridable per tablespace. It's quite normal to have primary (current 
data) and secondary (historical data) storage for larger databases.
This can also fit nicely into the pg_tmp storage if we tie support for multiple 
tmp folders together with tablespaces.

>
> 6. We haven't added any way to estimate rows returned from SRFs.
This would also be very cool - currently the planner can really get annoyed 
when joining SRF functions with tables.

---
Nicolai


Re: More thoughts about planner's cost estimates

From
Kenneth Marshall
Date:
Josh, Greg, and Tom,

I do not know how sensitive the plans will be to the correlation,
but one thought might be to map the histogram X histogram correlation
to a square grid of values. Then you can map them to an integer which
would give you 8 x 8 with binary values, a 5 x 5 with 4 values per
point, or a 4 x 4 with 8 values per point. If close is good enough,
that would be a compact way to store some histogram cross correlation
information.
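
Concretely, the 8 x 8 binary case fits in a single 64-bit integer; a sketch of
the packing (illustrative only):

    #include <stdint.h>

    /* i, j in 0..7: one bit per grid cell */
    static void
    grid_set(uint64_t *grid, int i, int j)
    {
        *grid |= UINT64_C(1) << (i * 8 + j);
    }

    static int
    grid_get(uint64_t grid, int i, int j)
    {
        return (int) ((grid >> (i * 8 + j)) & 1);
    }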

Ken

On Thu, Jun 01, 2006 at 01:50:26PM -0700, Josh Berkus wrote:
> Greg, Tom,
> 
> ...
> > 2) It isn't even clear what data you're exactly looking for. Certainly
> >    "correlation" is just shorthand here and isn't what you're actually
> > looking for. 
> 
> Actually, I'd think that a correlation number estimate (0 = complete 
> uncorrelated, 1 = completely correlated) would be sufficient to improve 
> row count estimation significantly, without incurring the vast overhead of 
> histogram X histogram manipulation. 
> ...


Re: More thoughts about planner's cost estimates

From
Greg Stark
Date:
David Fetter <david@fetter.org> writes:

> > In the prior discussions someone posted the paper with the algorithm
> > I mentioned.  That paper mentions that previous work showed poor
> > results at estimating n_distinct even with sample sizes as large as
> > 50% or more.
> 
> Which paper?  People have referenced several different ones.

Phillip B Gibbons. Distinct Sampling for Highly-Accurate Answers to Distinct
Values Queries and Event Reports. In Proceedings of the 27th VLDB Conference,
Roma, Italy, 2001.

Which says (emphasis in original as italics):
   The most well-studied approach for distinct-values estimation is to
   collect a uniform random sample S of the data, store S in a database, and
   then use S at query time to provide fast, approximate answers to distinct
   values queries [22, 23, 27, 21, 29, 5, 28, 18, 19, 9, 7]. However,
   previous work [28, 18, 9, 7] has shown powerful negative results on the
   quality of distinct-values estimates based on sampling (or other
   techniques that examine only part of the input data), even for the simple
   case of counting the number of distinct values in a column. The strongest
   negative result is due to Charikar et al. [7], who proved that estimating
   the number of distinct values in a column to within a small constant
   factor (with probability > 1/2) requires that *nearly* *the* *entire*
   *data* *set* *be* *sampled*. Moreover, all known sampling-based estimators
   provide unsatisfactory results on data sets of interest [7], even for this
   simple case.
 

And later:
   Using a variety of synthetic and real-world data sets, we show that
   distinct sampling gives estimates for distinct values queries that are
   within 0%-10%, whereas previous methods were typically 50%-250% off,
   across the spectrum of data sets and queries studied.
 

Here "distinct sampling" is the algorithm being proposed which requires
looking at every record and keeping a sample *of the distinct values*. The
"previous methods" are methods based on sampling the records

I haven't read the citation [7] that proves the strong negative result for any
estimator of distinct values based on sampling. It's:
   M. Charikar, S. Chaudhuri, R. Motwani, and V. Narasayya. Towards
   estimation error guarantees for distinct values. In Proc. 19th ACM Symp.
   on Principles of Database Systems, pages 268-279, May 2000.
 


> > Hopefully you're right that you don't need complete histograms.
> > Perhaps there's some statistics concept they don't teach in stats
> > 101 that would cover this need.  What we're looking for is a
> > function f(a,b,x) that gives the net selectivity given a and b, the
> > selectivity of two clauses based on two columns, and x some simple
> > value that can be easily calculated by looking at the data in
> > advance.
> 
> That would be neat :)

Doing a bit of basic searching around I think the tool we're looking for here
is called a "chi-squared test for independence".

I haven't read up on how it works so I'm unclear whether it can be calculated using a
simple O(n) scan or if it would require some non-linear post-processing after
the analyze pass which would be unfortunate.
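
For what it's worth, the statistic itself looks cheap to compute once you have
a two-way contingency table of observed co-occurrence counts; the expensive
part would be building that table from the data.  A rough sketch:

    /* obs is an nrows x ncols table of observed counts, stored row-major */
    static double
    chi_squared(const double *obs, int nrows, int ncols)
    {
        double      total = 0.0;
        double      chi2 = 0.0;
        int         i, j, k;

        for (i = 0; i < nrows * ncols; i++)
            total += obs[i];
        if (total <= 0.0)
            return 0.0;

        for (i = 0; i < nrows; i++)
            for (j = 0; j < ncols; j++)
            {
                double      rowsum = 0.0, colsum = 0.0, expected, diff;

                for (k = 0; k < ncols; k++)
                    rowsum += obs[i * ncols + k];
                for (k = 0; k < nrows; k++)
                    colsum += obs[k * ncols + j];
                expected = rowsum * colsum / total;
                if (expected > 0.0)
                {
                    diff = obs[i * ncols + j] - expected;
                    chi2 += diff * diff / expected;
                }
            }
        return chi2;
    }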

And I haven't found anything that describes how to make use of the resulting
number. Does it actually give a formula f(a,b,x) that gives a nice convenient
expected selectivity for the clause?

Oh, and incidentally, at first glance it seems like calculating it depends on
having good distinct value sampling.

-- 
greg



Re: More thoughts about planner's cost estimates

From
Josh Berkus
Date:
Greg,

>     Using a variety of synthetic and real-world data sets, we show that
>     distinct sampling gives estimates for distinct values queries that
> are within 0%-10%, whereas previous methods were typically 50%-250% off,
> across the spectrum of data sets and queries studied.

Aha.  It's a question of the level of error permissable.   For our 
estimates, being 100% off is actually OK.  That's why I was looking at 5% 
block sampling; it stays within the range of +/- 50% n-distinct in 95% of 
cases.

> Doing a bit of basic searching around I think the tool we're looking for
> here is called a "chi-squared test for independence".

Augh.  I wrote a program (in Pascal) to do this back in 1988.   Now I can't 
remember the math.  For a two-column test it's relatively 
computation-light, though, as I recall ... but I don't remember whether
standard chi-square works with a random sample.


-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco


Re: More thoughts about planner's cost estimates

From
Greg Stark
Date:
Josh Berkus <josh@agliodbs.com> writes:

> >     Using a variety of synthetic and real-world data sets, we show that
> >     distinct sampling gives estimates for distinct values queries that
> > are within 0%-10%, whereas previous methods were typically 50%-250% off,
> > across the spectrum of data sets and queries studied.
> 
> Aha.  It's a question of the level of error permissable.   For our 
> estimates, being 100% off is actually OK.  That's why I was looking at 5% 
> block sampling; it stays within the range of +/- 50% n-distinct in 95% of 
> cases.

Well, let's see. Say for example we're scanning 50,000 records out of 1M
records. Using the Theorem from Charikar et al quoted as Theorem 1 in section
2 we get that there is at least a 5% chance of getting a result with a ratio
error of at least sqrt(((1M-50k)/100k) * ln(20)), or about 5.33.
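
Spelling that arithmetic out (same shape as the expression above):

    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
        double      n = 1e6, r = 50000.0, delta = 0.05;
        double      ratio = sqrt((n - r) / (2.0 * r) * log(1.0 / delta));

        printf("ratio error >= %.2f, i.e. roughly %.0f%% to %.0f%% "
               "of the true value\n",
               ratio, 100.0 / ratio, 100.0 * ratio);
        return 0;
    }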

So no, even assuming you have a good unbiased sample, a 5% sample is only
going to get you to a result within 19%-533% of the real values 95% of the
time. Not 50-250%. And that's a negative result, so it's *at best* 19-533%. On
some distributions it may be outside that range even more than 95% of the
time.

This is entirely unlike the histograms where we have a solid foundation for a
positive result. We can guarantee that the result will be outside +/- x% *at
most* 5% of the time. For most distributions it'll be outside that range even
less.

And a 5% sample is pretty big. In fact my tests earlier showed the i/o from
5% block sampling took just as long as reading all the blocks. Even if we
figure out what's causing that (IMHO surprising) result and improve matters I
would only expect it to be 3-4x faster than a full scan.

http://archives.postgresql.org/pgsql-hackers/2006-01/msg00285.php

-- 
greg



Re: More thoughts about planner's cost estimates

From
Michael Dean
Date:
Greg Stark wrote:

>Josh Berkus <josh@agliodbs.com> writes:
>
>  
>
>>>    Using a variety of synthetic and real-world data sets, we show that
>>>    distinct sampling gives estimates for distinct values queries that
>>>are within 0%-10%, whereas previous methods were typically 50%-250% off,
>>>across the spectrum of data sets and queries studied.
>>>      
>>>
>>Aha.  It's a question of the level of error permissable.   For our 
>>estimates, being 100% off is actually OK.  That's why I was looking at 5% 
>>block sampling; it stays within the range of +/- 50% n-distinct in 95% of 
>>cases.
>>    
>>
>
>Well, let's see. Say for example we're scanning 50,000 records out of 1M
>records. Using the Theorem from Charikar et al quoted as Theorem 1 in section
>2 we get that there is at least a 5% chance of getting a result with a ratio
>error at least sqrt((1M-50k)/100k ln(20)) or 5.33.
>
>So no, even assuming you have a good unbiased sample, a 5% sample is only
>going to get you to a result within 19%-533% of the real values 95% of the
>time. Not 50-250%. And that's a negative result, so it's *at best* 19-533%. On
>some distributions it may be outside that range even more than 95% of the
>time.
>
>This is entirely unlike the histograms where we have a solid foundation for a
>positive result. We can guarantee that the result will be outside +/- x% *at
>most* 5% of the time. For most distributions it'll be outside that range even
>less.
>
>And a 5% sample is a pretty big. In fact my tests earlier showed the i/o from
>5% block sampling took just as long as reading all the blocks. Even if we
>figure out what's causing that (IMHO surprising) result and improve matters I
>would only expect it to be 3-4x faster than a full scan.
>
>http://archives.postgresql.org/pgsql-hackers/2006-01/msg00285.php
>
>  
>
I'm sorry to interrupt your esoteric (to me) discussion, but I have a 
very simple question:  would you define a "good unbiased sample"?  My 
statistics professor Dan Price (rest his soul) would tell me there are 
only random samples of some sort, and "other", which are always biased, 
and never good.  Excuse my absolutes, I didn't mean Good or Evil.  And 
how do you calculate a level of error without randomization? And are you 
talking of type I or type II error?
Michael


Re: More thoughts about planner's cost estimates

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> And a 5% sample is a pretty big. In fact my tests earlier showed the i/o from
> 5% block sampling took just as long as reading all the blocks. Even if we
> figure out what's causing that (IMHO surprising) result and improve matters I
> would only expect it to be 3-4x faster than a full scan.

One way to reduce the I/O pain from extensive sampling would be to turn
VACUUM ANALYZE into a genuine combined operation instead of a mere
notational shorthand for two separate scans.

I'd still be worried about the CPU pain though.  ANALYZE can afford to
expend a pretty fair number of cycles per sampled tuple, but with a
whole-table sample that's going to add up.
        regards, tom lane


Re: More thoughts about planner's cost estimates

From
David Fetter
Date:
On Fri, Jun 02, 2006 at 01:39:32PM -0700, Michael Dean wrote:

> I'm sorry to interrupt your esoteric (to me) discussion, but I have
> a very simple question:  would you define a "good unbiased sample"?
> My statistics professor Dan Price (rest his soul) would tell me
> there are only random samples of some sort, and "other", which are
> always biased, and never good.

What's at issue here is a biased estimator, not a biased sample.  If
you know an unbiased estimator of multiplicity based on a random
sample, that would be great :)

Cheers,
D
-- 
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778        AIM: dfetter666                             Skype: davidfetter

Remember to vote!


Re: More thoughts about planner's cost estimates

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Greg Stark <gsstark@mit.edu> writes:
> > And a 5% sample is pretty big. In fact my tests earlier showed the i/o from
> > 5% block sampling took just as long as reading all the blocks. Even if we
> > figure out what's causing that (IMHO surprising) result and improve matters I
> > would only expect it to be 3-4x faster than a full scan.
> 
> One way to reduce the I/O pain from extensive sampling would be to turn
> VACUUM ANALYZE into a genuine combined operation instead of a mere
> notational shorthand for two separate scans.

Except that we're also looking for every way we can to avoid having vacuum
touch every page, so that large tables with narrow hot spots can still
afford to be vacuumed regularly during peak hours.

But for most users analyze doesn't really have to run as often as vacuum. One
sequential scan per night doesn't seem like that big a deal to me.

> I'd still be worried about the CPU pain though.  ANALYZE can afford to
> expend a pretty fair number of cycles per sampled tuple, but with a
> whole-table sample that's going to add up.

That is a concern. Though the algorithm is pretty efficient, it basically
amounts to hashing all the tuples, keeping only a limited number of hash
buckets, and just throwing away the rest of the data.
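
For concreteness, here is a rough sketch of that hash-and-discard approach.
It is a toy illustration of adaptive sampling over uint32 keys with a
made-up mixing hash, not the actual algorithm or code under discussion:
keep a hash only if its low "level" bits are zero, raise "level" whenever
the kept set overflows, and estimate n_distinct as (kept count) << level.

    #include <stdio.h>
    #include <stdint.h>

    #define MAX_KEPT 64             /* real code would keep thousands */

    static uint32_t kept[MAX_KEPT];
    static int      nkept = 0;
    static int      level = 0;      /* keep only hashes with low bits zero */

    /* toy mixing hash, purely for illustration */
    static uint32_t
    toy_hash(uint32_t x)
    {
        x ^= x >> 16;
        x *= 0x7feb352dU;
        x ^= x >> 15;
        x *= 0x846ca68bU;
        x ^= x >> 16;
        return x;
    }

    static int
    already_kept(uint32_t h)
    {
        for (int i = 0; i < nkept; i++)
            if (kept[i] == h)
                return 1;           /* linear scan; real code would hash */
        return 0;
    }

    /* raise the bar by one bit and drop hashes that no longer qualify */
    static void
    prune(void)
    {
        int j = 0;

        level++;
        for (int i = 0; i < nkept; i++)
            if ((kept[i] & ((1U << level) - 1)) == 0)
                kept[j++] = kept[i];
        nkept = j;
    }

    static void
    add_value(uint32_t val)
    {
        uint32_t h = toy_hash(val);

        /* most tuples are simply thrown away, as described above */
        if ((h & ((1U << level) - 1)) != 0)
            return;
        if (already_kept(h))
            return;
        while (nkept == MAX_KEPT)
            prune();
        /* the new hash must also clear the (possibly raised) bar */
        if ((h & ((1U << level) - 1)) == 0)
            kept[nkept++] = h;
    }

    int
    main(void)
    {
        /* 100,000 tuples, but only 500 distinct values */
        for (uint32_t i = 0; i < 100000; i++)
            add_value(i % 500);

        printf("estimated n_distinct = %d (true value 500)\n",
               nkept << level);
        return 0;
    }

The per-tuple work is just a hash plus a membership test, which is where
the CPU cost of a whole-table pass would go.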

Things could get more complicated if we move beyond simple n_distinct
estimates to more complex stats, such as dependence tests on the data for
multi-column statistics.

-- 
greg



Re: More thoughts about planner's cost estimates

From
Josh Berkus
Date:
Greg, Tom,

> But for most users analyze doesn't really have to run as often as
> vacuum. One sequential scan per night doesn't seem like that big a deal
> to me.

Clearly you don't have any 0.5 TB databases.  

> > I'd still be worried about the CPU pain though.  ANALYZE can afford to
> > expend a pretty fair number of cycles per sampled tuple, but with a
> > whole-table sample that's going to add up.

Agreed.  Despite conventional wisdom, most PostgreSQL databases ... even 
those with high level OLTP or very large DW ... are CPU-bound.    We 
really don't want an ANALYZE which is an order-of-magnitude increase in 
CPU activity.

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco


Re: More thoughts about planner's cost estimates

From
"Todd A. Cook"
Date:
Josh Berkus wrote:
> Greg, Tom,
> 
>> But for most users analyze doesn't really have to run as often as
>> vacuum. One sequential scan per night doesn't seem like that big a deal
>> to me.
> 
> Clearly you don't have any 0.5 TB databases.  

Perhaps something like "ANALYZE FULL"?  Then only those who need the
more precise statistics would pay the cost for a full scan.

-- todd


Re: More thoughts about planner's cost estimates

From
Tom Lane
Date:
I wrote:
> In general it seems to me that for CPU-bound databases, the default values
> of the cpu_xxx_cost variables are too low.  ... rather than telling people
> to manipulate all three of these variables individually, I think it might
> also be a good idea to provide a new GUC variable named something like
> "cpu_speed_scale" that would just be a multiplier for the other variables.
> It would default to 1.0 and our standard advice for CPU-bound databases
> would be "decrease random_page_cost to 1.0 and raise cpu_speed_scale to
> 10.0 or so".  Seems cleaner than telling people to muck with three or so
> individual values.

Nicolai Petri's comment about per-tablespace access costs caused me to
rethink the above proposal.  Instead of inventing "cpu_speed_scale",
which seems rather baroque after thinking about it more, what I now
think we should do is invent a "seq_page_cost" GUC to replace the
traditionally hardwired value of 1.0 cost unit per sequential page
fetch.  Then, if you've got different tablespaces with different disk
speeds, you could imagine having per-tablespace values of seq_page_cost
and random_page_cost, whereas you probably want the CPU cost numbers
to remain the same across all tables.
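
To make that concrete, here is a sketch (not the real cost_seqscan code)
of what the seqscan cost arithmetic would look like once the hardwired
1.0-per-page is replaced by a seq_page_cost knob; the function and the
example numbers are illustrative only:

    #include <stdio.h>

    /*
     * Sketch only: the point is just that the hardwired "1.0 cost unit
     * per sequential page fetch" becomes a knob, so a per-tablespace
     * seq_page_cost could be substituted here later.
     */
    static double
    seqscan_cost(double pages, double tuples,
                 double seq_page_cost, double cpu_tuple_cost)
    {
        double run_cost = 0.0;

        run_cost += seq_page_cost * pages;   /* was: 1.0 * pages */
        run_cost += cpu_tuple_cost * tuples; /* CPU component unchanged */
        return run_cost;
    }

    int
    main(void)
    {
        /* a 10,000-page, 1,000,000-tuple table with default-ish settings */
        printf("seqscan cost = %.1f\n",
               seqscan_cost(10000, 1000000, 1.0, 0.01));
        return 0;
    }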

I don't really want to get into inventing per-tablespace settings right
now, because the need hasn't been demonstrated; but if we ever do want
to do it, this approach will be a whole lot less confusing than
something involving a cpu_speed_scale knob.

This still leaves you twiddling two knobs (now random_page_cost and
seq_page_cost) if you want to set up the planner for an all-in-memory
database.  So it's not any more complicated for that purpose.

One objection to this is that after moving "off the gold standard" of
1.0 = one page fetch, there is no longer any clear meaning to the
cost estimate units; you're faced with the fact that they're just an
arbitrary scale.  I'm not sure that's such a bad thing, though.  For
instance, some people might want to try to tune their settings so that
the estimates are actually comparable to milliseconds of real time.

Comments?
        regards, tom lane


Re: More thoughts about planner's cost estimates

From
Greg Stark
Date:
Josh Berkus <josh@agliodbs.com> writes:

> Greg, Tom,
> 
> > But for most users analyze doesn't really have to run as often as
> > vacuum. One sequential scan per night doesn't seem like that big a deal
> > to me.
> 
> Clearly you don't have any 0.5 TB databases.  

Actually I did, not so long ago.

Sequential scans in an OLTP query would be a disaster. But a single sequential
scan run at a controlled time wouldn't concern me as long as *all* of the
following constraints are met:
a) You can run them at your leisure at off-peak times when your i/o
   bandwidth isn't in short supply.
b) You don't need the results urgently so you don't care if it takes a
   while to run.
c) You don't need many of them at the same time.

Even on your production system surely you occasionally, say, take a full
backup or run "select count(*)" or other sanity checks on the data?

> > > I'd still be worried about the CPU pain though.  ANALYZE can afford to
> > > expend a pretty fair number of cycles per sampled tuple, but with a
> > > whole-table sample that's going to add up.
> 
> Agreed.  Despite conventional wisdom, most PostgreSQL databases ... even 
> those with high level OLTP or very large DW ... are CPU-bound.    We 
> really don't want an ANALYZE which is an order-of-magnitude increase in 
> CPU activity.

I don't think Tom was actually expressing concern about ANALYZE becoming more
expensive, but about tying ANALYZE and VACUUM together and making VACUUM more
expensive. VACUUM is something we want to encourage people to think they can
run all day long, not just occasionally.

-- 
greg



Re: More thoughts about planner's cost estimates

From
Rod Taylor
Date:
> One objection to this is that after moving "off the gold standard" of
> 1.0 = one page fetch, there is no longer any clear meaning to the
> cost estimate units; you're faced with the fact that they're just an
> arbitrary scale.  I'm not sure that's such a bad thing, though.  For
> instance, some people might want to try to tune their settings so that
> the estimates are actually comparable to milliseconds of real time.

Any chance that the correspondence to time could be made a part of the
design on purpose and generally advise people to follow that rule? If we
could tell people to run *benchmark* and use those numbers directly as a
first approximation tuning, it could help quite a bit for people new to
PostgreSQL experiencing poor performance.

effective_cache_size then becomes essentially the last hand-set variable
for medium sized installations.
-- 



Re: More thoughts about planner's cost estimates

From
Tom Lane
Date:
Rod Taylor <pg@rbt.ca> writes:
>> One objection to this is that after moving "off the gold standard" of
>> 1.0 = one page fetch, there is no longer any clear meaning to the
>> cost estimate units; you're faced with the fact that they're just an
>> arbitrary scale.  I'm not sure that's such a bad thing, though.  For
>> instance, some people might want to try to tune their settings so that
>> the estimates are actually comparable to milliseconds of real time.

> Any chance that the correspondence to time could be made a part of the
> design on purpose and generally advise people to follow that rule?

We might eventually get to that point, but I'm hesitant to try to do it
immediately.  For one thing, I really *don't* want to get bug reports
from newbies complaining that the cost estimates are always off by a
factor of X.  (Not but what we haven't gotten some of those anyway :-()
In the short term I see us sticking to the convention that seq_page_cost
is 1.0 in a "typical" database, while anyone who's really hot to try to
make the other happen is free to experiment.

> If we could tell people to run *benchmark* and use those numbers
> directly as a first approximation tuning, it could help quite a bit
> for people new to PostgreSQL experiencing poor performance.

We don't have such a benchmark ... if we did, we could have told
people how to use it to set the variables already.  I'm very very
suspicious of any suggestion that it's easy to derive appropriate
numbers for these settings from one magic benchmark.
        regards, tom lane


Re: More thoughts about planner's cost estimates

From
Jim Nasby
Date:
On Jun 2, 2006, at 5:24 PM, Todd A. Cook wrote:
> Josh Berkus wrote:
>> Greg, Tom,
>>> But for most users analyze doesn't really have to run as often as
>>> vacuum. One sequential scan per night doesn't seem like that big  
>>> a deal
>>> to me.
>> Clearly you don't have any 0.5 TB databases.
>
> Perhaps something like "ANALYZE FULL"?  Then only those who need the
> more precise statistics would pay the cost for a full scan.

Might also be worth adding analyze delay settings, ala  
vacuum_cost_delay.


Re: More thoughts about planner's cost estimates

From
Hannu Krosing
Date:
On one fine day, Sat, 2006-06-03 at 10:43, Jim Nasby wrote:

> Might also be worth adding analyze delay settings, ala  
> vacuum_cost_delay.

Actually we should have delay settings for all potential
(almost-)full-scan service ops, - VACUUM, ANALYSE, CREATE INDEX, ADD
CONSTRAINT, maybe more - so that there would be better chances of
running those on busy databases without disastrous effects.

Probably we should use the same settings for all these, not invent a new
set for each.

-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com




Re: More thoughts about planner's cost estimates

From
Hannu Krosing
Date:
On one fine day, Fri, 2006-06-02 at 16:23, Greg Stark wrote:

> And a 5% sample is pretty big. In fact my tests earlier showed the i/o from
> 5% block sampling took just as long as reading all the blocks. Even if we
> figure out what's causing that (IMHO surprising) result and improve matters I
> would only expect it to be 3-4x faster than a full scan.

You should not be surprised by this once you visualise what happens at
the disk level with all those platters spinning and heads moving :) 

Disks can read at full rotation speed, so skipping (not reading) some
blocks will not make reading the remaining blocks from the same track
faster. And if there are more than 20 8k pages per track, you still have
a very high probability you need to read all tracks.

You may be able to move to the next track a little earlier compared to
reading all blocks, but then you are likely to miss the block from next
track and have to wait a full rotation.

You will get some win from skipping pages only once your % falls so low
that you can also skip a significant number of tracks.
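
To put rough numbers on it, here is a small sketch.  It assumes sampled
blocks are chosen independently and uniformly and a nominal 20 pages per
track, so the chance a track holds at least one sampled page is
1 - (1 - p)^20:

    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
        double fractions[] = {0.05, 0.01, 0.001};
        int    pages_per_track = 20;

        for (int i = 0; i < 3; i++)
        {
            /* chance that a track holds at least one sampled page */
            double hit = 1.0 - pow(1.0 - fractions[i], pages_per_track);

            printf("%4.1f%% block sample -> %2.0f%% of tracks touched\n",
                   100.0 * fractions[i], 100.0 * hit);
        }
        return 0;
    }

With these assumptions a 5% block sample still touches about 64% of the
tracks, a 1% sample about 18%, and only a 0.1% sample gets down to a few
percent of the tracks.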

> http://archives.postgresql.org/pgsql-hackers/2006-01/msg00285.php

Your test program could have got a little better results, if you had
somehow managed to tell the system all the block numbers to read in one
go, not each time the next one after getting the previous one. In the
current version it is quite likely that it had to wait several disk
rotations even for sectors from the same track, as for small steps
it may have missed the next sector. This does not apply to disks that
always read a full track into RAM cache, but even there all tracks are
actually read.

The fact that 5% was not slower than seqscan seems to indicate that
actually all track reads were cached inside the disk or controller.

-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com




Re: More thoughts about planner's cost estimates

From
Mike Benoit
Date:
pgbench appears to already support arbitrary SQL queries with the -f
switch, so why couldn't we just make it a little smarter and have people
enable SQL query logging for a day or two, then pass the log off to
pgbench:

pgbench -f <log file>

Seems to me like that wouldn't be too difficult to do, and would give
much closer "real-world" results than pgbench's built-in benchmark.

On top of that the community could start offering up "template"
benchmarks like: "busy website", "data warehouse", "forums", "financial"
and distribute them with pgbench:

pgbench -f templates/data_warehouse.pgbench
pgbench -f templates/forums.pgbench
...

From that point a brute-force auto-tune utility would be pretty
straightforward to write.

pgautotune -f templates/data_warehouse.bench,myapp.sqllog

Or if one server runs multiple custom apps that you want to tune for:

pgautotune -f myapp1.sqllog,myapp2.sqllog,myapp3.sqllog

Even if it took 48hrs to run, it would be a good burn-in test for a
brand new server. ;)
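
Here is a sketch of what such a brute-force search might look like.
run_workload_ms() is a hypothetical stand-in for "replay the captured log
via pgbench -f under these settings and time it"; neither that helper nor
the pretend timing model inside it exists anywhere:

    #include <float.h>
    #include <stdio.h>

    /* hypothetical: replay the workload with given settings, return ms */
    static double
    run_workload_ms(double random_page_cost, double cpu_tuple_cost)
    {
        /* placeholder model: pretend (2.0, 0.02) happens to be optimal */
        double a = random_page_cost - 2.0;
        double b = (cpu_tuple_cost - 0.02) * 1000.0;

        return 1000.0 + a * a * 50.0 + b * b * 10.0;
    }

    int
    main(void)
    {
        double best_ms = DBL_MAX, best_rpc = 0, best_ctc = 0;

        /* walk a grid of candidate settings, keep the fastest combination */
        for (double rpc = 1.0; rpc <= 4.01; rpc += 0.5)
            for (double ctc = 0.005; ctc <= 0.0401; ctc += 0.005)
            {
                double ms = run_workload_ms(rpc, ctc);

                if (ms < best_ms)
                {
                    best_ms = ms;
                    best_rpc = rpc;
                    best_ctc = ctc;
                }
            }

        printf("best: random_page_cost=%.1f cpu_tuple_cost=%.3f (%.0f ms)\n",
               best_rpc, best_ctc, best_ms);
        return 0;
    }

The expensive part is obviously the repeated workload replays, which is
why a 48hr run doesn't sound far-fetched.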


On Fri, 2006-06-02 at 19:38 -0400, Tom Lane wrote:
> Rod Taylor <pg@rbt.ca> writes:
> >> One objection to this is that after moving "off the gold standard" of
> >> 1.0 = one page fetch, there is no longer any clear meaning to the
> >> cost estimate units; you're faced with the fact that they're just an
> >> arbitrary scale.  I'm not sure that's such a bad thing, though.  For
> >> instance, some people might want to try to tune their settings so that
> >> the estimates are actually comparable to milliseconds of real time.
>
> > Any chance that the correspondence to time could be made a part of the
> > design on purpose and generally advise people to follow that rule?
>
> We might eventually get to that point, but I'm hesitant to try to do it
> immediately.  For one thing, I really *don't* want to get bug reports
> from newbies complaining that the cost estimates are always off by a
> factor of X.  (Not but what we haven't gotten some of those anyway :-()
> In the short term I see us sticking to the convention that seq_page_cost
> is 1.0 in a "typical" database, while anyone who's really hot to try to
> make the other happen is free to experiment.
>
> > If we could tell people to run *benchmark* and use those numbers
> > directly as a first approximation tuning, it could help quite a bit
> > for people new to PostgreSQL experiencing poor performance.
>
> We don't have such a benchmark ... if we did, we could have told
> people how to use it to set the variables already.  I'm very very
> suspicious of any suggestion that it's easy to derive appropriate
> numbers for these settings from one magic benchmark.
>
>             regards, tom lane
>
--
Mike Benoit <ipso@snappymail.ca>

Re: More thoughts about planner's cost estimates

From
Greg Stark
Date:
Hannu Krosing <hannu@skype.net> writes:

> Disks can read at full rotation speed, so skipping (not reading) some
> blocks will not make reading the remaining blocks from the same track
> faster. And if there are more than 20 8k pages per track, you still have
> a very high probability you need to read all tracks.

Well, if there are exactly 20 you would expect a 50% chance of having to read
a track so you would expect double the effective bandwidth. It would have to
be substantially bigger to not have any noticeable effect.

> You may be able to move to the next track a little earlier compared to
> reading all blocks, but then you are likely to miss the block from next
> track and have to wait a full rotation.

No, I don't think that works out. You always have a chance of missing the
block from the next track and having to wait a full rotation and your chance
isn't increased or decreased by seeking earlier in the rotation. So you would
expect each track to take <20 block reads> less time on average.


> Your test program could have got a little better results, if you had
> somehow managed to tell the system all the block numbers to read in one
> go, not each time the next one after getting the previous one.

I was trying to simulate the kind of read pattern that postgres generates
which I believe looks like that.

> The fact that 5% was not slower than seqscan seems to indicate that
> actually all track reads were cached inside the disk or controller.

I dunno, your first explanation seemed pretty convincing and doesn't depend on
specific assumptions about the caching. Moreover this doesn't explain why you
*do* get a speedup when reading less than 5%.


Perhaps what this indicates is that the real meat is in track sampling, not
block sampling.


-- 
greg



Re: More thoughts about planner's cost estimates

From
Greg Stark
Date:
Greg Stark <gsstark@MIT.EDU> writes:

> Perhaps what this indicates is that the real meat is in track sampling, not
> block sampling.

Fwiw, I've done a little benchmarking and I'm starting to think this isn't a
bad idea. I see a dramatic speed improvement for samples of 1-10% as the block
size increases. Presumably this is as Hannu said, reducing the number of
tracks necessary to cover the sample.

I see improvements up to around 256M blocks or so, but my data is pretty
questionable since I'm busy watching tv in Mythtv in another window. It's on
another drive but it still seems to be making the numbers jump around a bit.

I expect there's a trade-off between keeping enough blocks for the sample of
blocks to be representative on the one hand and large blocks being much faster
to read in on the other.

I would suggest something like setting the block size in the block sampling
algorithm to something like max(8k, sqrt(table size)). That gives 8k blocks
for anything up to 255M but takes better advantage of the speed increase
available from sequential i/o for larger tables; from my experiments, about
a 50% increase in speed.

Actually maybe something even more aggressive would be better, maybe
(table size)^.75, so it kicks in sooner than on 256M tables and gets to
larger block sizes on reasonably sized tables.

Note, this doesn't mean anything like changing page sizes, just selecting more
blocks that hopefully lie on the same track when possible.
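
Just to see what those two formulas give, here is a quick sketch.  It
assumes both the table size and the sample-chunk size are measured in
bytes (the units are my assumption), with the chunk clamped to at least
one 8K page:

    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
        /* 64MB, 1GB, 10GB, 100GB */
        double sizes[] = {64e6, 1e9, 10e9, 100e9};

        for (int i = 0; i < 4; i++)
        {
            double sz = sizes[i];
            double chunk_sqrt = fmax(8192.0, sqrt(sz));      /* sqrt rule */
            double chunk_75 = fmax(8192.0, pow(sz, 0.75));   /* ^0.75 rule */

            printf("%6.0f MB table: sqrt rule %7.0f KB, ^0.75 rule %7.0f KB\n",
                   sz / 1e6, chunk_sqrt / 1024.0, chunk_75 / 1024.0);
        }
        return 0;
    }

The ^0.75 rule climbs much faster, which is the "more aggressive" part.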

-- 
greg



Re: More thoughts about planner's cost estimates

From
Greg Stark
Date:
Greg Stark <gsstark@MIT.EDU> writes:

> I see improvements up to around 256M blocks or so, but my data is pretty
> questionable since I'm busy watching tv in Mythtv in another window. It's on
> another drive but it still seems to be making the numbers jump around a bit.

Oops, I meant blocks of 256k there. Sorry.

-- 
greg



Re: More thoughts about planner's cost estimates

From
Greg Stark
Date:
Hannu Krosing <hannu@skype.net> writes:

> On one fine day, Sat, 2006-06-03 at 10:43, Jim Nasby wrote:
> 
> > Might also be worth adding analyze delay settings, ala  
> > vacuum_cost_delay.
> 
> Actually we should have delay settings for all potential
> (almost-)full-scan service ops, - VACUUM, ANALYSE, CREATE INDEX, ADD
> CONSTRAINT, maybe more - so that there would be better chances of
> running those on busy databases without disastrous effects.

What about UPDATE and DELETE and for that matter SELECT?


-- 
greg



Re: More thoughts about planner's cost estimates

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> Hannu Krosing <hannu@skype.net> writes:
>> On one fine day, Sat, 2006-06-03 at 10:43, Jim Nasby wrote:
>>> Might also be worth adding analyze delay settings, ala  
>>> vacuum_cost_delay.

ANALYZE already respects the vacuum delay settings.

>> Actually we should have delay settings for all potential
>> (almost-)full-scan service ops, - VACUUM, ANALYSE, CREATE INDEX, ADD
>> CONSTRAINT, maybe more - so that there would be better chances of
>> running those on busy databases without disastrous effects.

> What about UPDATE and DELETE and for that matter SELECT?

This seems pretty silly.  The point of the delay stuff is to prevent
background maintenance operations from eating an unreasonable share
of resources compared to foreground queries.  I don't see why you'd
put delays into queries --- if your machine is loaded, it's loaded.

I think the existing features are sufficient in this line and that
doing more is just adding complexity for complexity's sake.
        regards, tom lane


Re: More thoughts about planner's cost estimates

From
Hannu Krosing
Date:
On one fine day, Sun, 2006-06-04 at 18:09, Tom Lane wrote:
> Greg Stark <gsstark@mit.edu> writes:
> > Hannu Krosing <hannu@skype.net> writes:
> >> On one fine day, Sat, 2006-06-03 at 10:43, Jim Nasby wrote:
> >>> Might also be worth adding analyze delay settings, ala  
> >>> vacuum_cost_delay.
> 
> ANALYZE already respects the vacuum delay settings.
> 
> >> Actually we should have delay settings for all potential
> >> (almost-)full-scan service ops, - VACUUM, ANALYSE, CREATE INDEX, ADD
> >> CONSTRAINT, maybe more - so that there would be better chances of
> >> running those on busy databases without disastrous effects.
> 
> > What about UPDATE and DELETE and for that matter SELECT?
> 
> This seems pretty silly.  The point of the delay stuff is to prevent
> background maintenance operations from eating an unreasonable share
> of resources compared to foreground queries.  I don't see why you'd
> put delays into queries --- if your machine is loaded, it's loaded.
> 
> I think the existing features are sufficient in this line and that
> doing more is just adding complexity for complexity's sake.

Making CREATE INDEX respect delay settings will be reasonable once we
get it to run without locking the table. 

And if non-locking is doable for ADD/ALTER CONSTRAINT, then it makes
sense there too.

-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com




Re: More thoughts about planner's cost estimates

From
Jim Nasby
Date:
On Jun 4, 2006, at 5:09 PM, Tom Lane wrote:

> Greg Stark <gsstark@mit.edu> writes:
>> Hannu Krosing <hannu@skype.net> writes:
>>> On one fine day, Sat, 2006-06-03 at 10:43, Jim Nasby wrote:
>>>> Might also be worth adding analyze delay settings, ala
>>>> vacuum_cost_delay.
>
> ANALYZE already respects the vacuum delay settings.
>
>>> Actually we should have delay settings for all potential
>>> (almost-)full-scan service ops, - VACUUM, ANALYSE, CREATE INDEX, ADD
>>> CONSTRAINT, maybe more - so that there would be better chances of
>>> running those on busy databases without disastrous effects.
>
>> What about UPDATE and DELETE and for that matter SELECT?
>
> This seems pretty silly.  The point of the delay stuff is to prevent
> background maintenance operations from eating an unreasonable share
> of resources compared to foreground queries.  I don't see why you'd
> put delays into queries --- if your machine is loaded, it's loaded.

'maintenance operations' often also means running large updates. Being
able to run those at a reduced priority would certainly be helpful in
many cases. A better way to accomplish this would be to have the OS
handle prioritized IO scheduling, but since pretty much none of them
seem to do that...
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461





Re: More thoughts about planner's cost estimates

From
Ron Mayer
Date:
Tom Lane wrote:
> 
> One objection to this is that after moving "off the gold standard" of
> 1.0 = one page fetch, there is no longer any clear meaning to the
> cost estimate units; you're faced with the fact that they're just an
> arbitrary scale.  I'm not sure that's such a bad thing, though.

It seems to me the appropriate gold standard is Time, in microseconds
or milliseconds.

The default postgresql.conf can come with a set of hardcoded values
that reasonably approximate some real-world system; and if that's
documented in the file, someone reading it can say "hey, my CPU's about
the same but my disk subsystem is much faster, so I know in which
direction to change things".

And another person may say "ooh, now I know that my 4GHz machines
should have about twice the number here as my 2GHz box".

For people who *really* care a lot (HW vendors?), they could
eventually make measurements on their systems.
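
As an illustration of the cost-units-as-milliseconds idea, here is a
sketch with placeholder timings; the numbers are made up, not measurements
from any real system, and they ignore cache effects, which is one reason
deriving settings from a single benchmark is not as easy as it sounds:

    #include <stdio.h>

    int
    main(void)
    {
        /* hypothetical per-item measurements, in milliseconds */
        double seq_page_ms = 0.05;      /* sequential 8K page read */
        double random_page_ms = 8.0;    /* random 8K page read (seek) */
        double cpu_tuple_ms = 0.002;    /* per-tuple CPU time */

        /* use the measured times directly as the cost settings */
        printf("seq_page_cost    = %.3f\n", seq_page_ms);
        printf("random_page_cost = %.3f\n", random_page_ms);
        printf("cpu_tuple_cost   = %.3f\n", cpu_tuple_ms);

        /* a 10,000-page, 1,000,000-tuple seqscan would then be estimated at */
        printf("example seqscan estimate: %.0f (~ms)\n",
               seq_page_ms * 10000 + cpu_tuple_ms * 1000000);
        return 0;
    }

With that convention, estimated plan costs would come out roughly in
milliseconds, which is exactly the "gold standard is Time" idea above.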