Thread: merge>hash>loop

merge>hash>loop

From: Ian Westmacott
I have this query, where PG (8.1.2) prefers Merge Join over Hash Join
over Nested Loop.  However, this turns out to be increasing order of
performance: the nested loop is fastest.  I was hoping someone might be
able to shed some light on why PG chooses the plans in this order, and
what I might do to influence it otherwise.  Thanks,

itvtrackdata=> explain analyze SELECT mva,blob.* FROM
blobs00000000000033c3_c16010 AS blob NATURAL JOIN
objects00000000000033c3 WHERE
AREA(BOX(POINT(bbox_x1,bbox_y0),POINT(bbox_x0,bbox_y1))#BOX('(50,10),(10,50)'))/AREA(BOX(POINT(bbox_x1,bbox_y0),POINT(bbox_x0,bbox_y1)))>0
AND time>=1263627787-32768 AND time<1273458187 AND finish-start>=8738
ORDER BY time ASC;

                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Sort  (cost=47170.44..47184.46 rows=5607 width=28) (actual
time=2661.980..2663.642 rows=1687 loops=1)
   Sort Key: blob."time"
   ->  Merge Join  (cost=46547.88..46821.32 rows=5607 width=28) (actual
time=2645.685..2657.621 rows=1687 loops=1)
         Merge Cond: ("outer".sva = "inner".sva)
         ->  Sort  (cost=18003.31..18045.36 rows=16821 width=20) (actual
time=181.303..183.092 rows=1741 loops=1)
               Sort Key: blob.sva
               ->  Bitmap Heap Scan on blobs00000000000033c3_c16010 blob
(cost=535.77..16822.65 rows=16821 width=20) (actual time=10.827..177.671
rows=1741 loops=1)
                     Recheck Cond: (("time" >= 1263595019) AND ("time" <
1273458187))
                     Filter: ((area((box(point((bbox_x1)::double
precision, (bbox_y0)::double precision), point((bbox_x0)::double
precision, (bbox_y1)::double precision)) # '(50,50),(10,10)'::box)) /
area(box(point((bbox_x1)::double precision, (bbox_y0)::double
precision), point((bbox_x0)::double precision, (bbox_y1)::double
precision)))) > 0::double precision)
                     ->  Bitmap Index Scan on
idx_blobs00000000000033c3_c16010_time  (cost=0.00..535.77 rows=50462
width=0) (actual time=8.565..8.565 rows=50673 loops=1)
                           Index Cond: (("time" >= 1263595019) AND
("time" < 1273458187))
         ->  Sort  (cost=28544.56..28950.88 rows=162526 width=16)
(actual time=2387.726..2437.429 rows=29969 loops=1)
               Sort Key: objects00000000000033c3.sva
               ->  Seq Scan on objects00000000000033c3
(cost=0.00..14477.68 rows=162526 width=16) (actual time=0.085..826.125
rows=207755 loops=1)
                     Filter: ((finish - "start") >= 8738)
 Total runtime: 2675.037 ms
(16 rows)

itvtrackdata=> set enable_mergejoin to false;
SET
itvtrackdata=> explain analyze SELECT mva,blob.* FROM
blobs00000000000033c3_c16010 AS blob NATURAL JOIN
objects00000000000033c3 WHERE
AREA(BOX(POINT(bbox_x1,bbox_y0),POINT(bbox_x0,bbox_y1))#BOX('(50,10),(10,50)'))/AREA(BOX(POINT(bbox_x1,bbox_y0),POINT(bbox_x0,bbox_y1)))>0
AND time>=1263627787-32768 AND time<1273458187 AND finish-start>=8738
ORDER BY time ASC;

                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Sort  (cost=65783.09..65797.10 rows=5607 width=28) (actual
time=1211.228..1212.671 rows=1687 loops=1)
   Sort Key: blob."time"
   ->  Hash Join  (cost=15419.77..65433.97 rows=5607 width=28) (actual
time=1067.514..1207.727 rows=1687 loops=1)
         Hash Cond: ("outer".sva = "inner".sva)
         ->  Bitmap Heap Scan on blobs00000000000033c3_c16010 blob
(cost=535.77..16822.65 rows=16821 width=20) (actual time=14.720..149.179
rows=1741 loops=1)
               Recheck Cond: (("time" >= 1263595019) AND ("time" <
1273458187))
               Filter: ((area((box(point((bbox_x1)::double precision,
(bbox_y0)::double precision), point((bbox_x0)::double precision,
(bbox_y1)::double precision)) # '(50,50),(10,10)'::box)) /
area(box(point((bbox_x1)::double precision, (bbox_y0)::double
precision), point((bbox_x0)::double precision, (bbox_y1)::double
precision)))) > 0::double precision)
               ->  Bitmap Index Scan on
idx_blobs00000000000033c3_c16010_time  (cost=0.00..535.77 rows=50462
width=0) (actual time=12.880..12.880 rows=50673 loops=1)
                     Index Cond: (("time" >= 1263595019) AND ("time" <
1273458187))
         ->  Hash  (cost=14477.68..14477.68 rows=162526 width=16)
(actual time=1052.729..1052.729 rows=207755 loops=1)
               ->  Seq Scan on objects00000000000033c3
(cost=0.00..14477.68 rows=162526 width=16) (actual time=0.028..684.047
rows=207755 loops=1)
                     Filter: ((finish - "start") >= 8738)
 Total runtime: 1217.938 ms
(13 rows)

itvtrackdata=> set enable_hashjoin to false;
SET
itvtrackdata=> explain analyze SELECT mva,blob.* FROM
blobs00000000000033c3_c16010 AS blob NATURAL JOIN
objects00000000000033c3 WHERE
AREA(BOX(POINT(bbox_x1,bbox_y0),POINT(bbox_x0,bbox_y1))#BOX('(50,10),(10,50)'))/AREA(BOX(POINT(bbox_x1,bbox_y0),POINT(bbox_x0,bbox_y1)))>0
AND time>=1263627787-32768 AND time<1273458187 AND finish-start>=8738
ORDER BY time ASC;

                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Sort  (cost=118258.49..118272.51 rows=5607 width=28) (actual
time=197.204..198.871 rows=1687 loops=1)
   Sort Key: blob."time"
   ->  Nested Loop  (cost=535.77..117909.37 rows=5607 width=28) (actual
time=27.560..192.526 rows=1687 loops=1)
         ->  Bitmap Heap Scan on blobs00000000000033c3_c16010 blob
(cost=535.77..16822.65 rows=16821 width=20) (actual time=27.450..157.266
rows=1741 loops=1)
               Recheck Cond: (("time" >= 1263595019) AND ("time" <
1273458187))
               Filter: ((area((box(point((bbox_x1)::double precision,
(bbox_y0)::double precision), point((bbox_x0)::double precision,
(bbox_y1)::double precision)) # '(50,50),(10,10)'::box)) /
area(box(point((bbox_x1)::double precision, (bbox_y0)::double
precision), point((bbox_x0)::double precision, (bbox_y1)::double
precision)))) > 0::double precision)
               ->  Bitmap Index Scan on
idx_blobs00000000000033c3_c16010_time  (cost=0.00..535.77 rows=50462
width=0) (actual time=24.445..24.445 rows=50673 loops=1)
                     Index Cond: (("time" >= 1263595019) AND ("time" <
1273458187))
         ->  Index Scan using idx_objects00000000000033c3_sva on
objects00000000000033c3  (cost=0.00..6.00 rows=1 width=16) (actual
time=0.013..0.015 rows=1 loops=1741)
               Index Cond: ("outer".sva = objects00000000000033c3.sva)
               Filter: ((finish - "start") >= 8738)
 Total runtime: 200.719 ms
(12 rows)

--
Ian Westmacott <ianw@intellivid.com>
IntelliVid Corp.


Re: merge>hash>loop

From: Tom Lane
Ian Westmacott <ianw@intellivid.com> writes:
> I have this query, where PG (8.1.2) prefers Merge Join over Hash Join
> over Nested Loop.  However, this turns out to be increasing order of
> performance: the nested loop is fastest.  I was hoping someone might be
> able to shed some light on why PG chooses the plans in this order, and
> what I might do to influence it otherwise.  Thanks,

Reducing random_page_cost would influence it to prefer the nestloop.
However, I doubt you're ever going to get really ideal results for
this query --- the estimated row counts are too far off, and the
WHERE conditions are sufficiently bizarre that there's not much hope
that they'd ever be much better.
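
For instance, to try that for just this session (the value 2 here is
only an example; the 8.1 default is 4):

itvtrackdata=> SET random_page_cost = 2;
SET

and then re-run the EXPLAIN ANALYZE above to see whether the nestloop
wins out.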

            regards, tom lane

Re: merge>hash>loop

From: Ian Westmacott
On Fri, 2006-04-14 at 12:13 -0400, Tom Lane wrote:
> Ian Westmacott <ianw@intellivid.com> writes:
> > I have this query, where PG (8.1.2) prefers Merge Join over Hash Join
> > over Nested Loop.  However, this turns out to be increasing order of
> > performance: the nested loop is fastest.  I was hoping someone might be
> > able to shed some light on why PG chooses the plans in this order, and
> > what I might do to influence it otherwise.  Thanks,
>
> Reducing random_page_cost would influence it to prefer the nestloop.
> However, I doubt you're ever going to get really ideal results for
> this query --- the estimated row counts are too far off, and the
> WHERE conditions are sufficiently bizarre that there's not much hope
> that they'd ever be much better.

That's what I feared, thanks.  But then if I simplify things a bit,
such that the row counts are quite good, and PG still chooses a
worse plan, can I conclude anything about my configuration settings
(like random_page_cost)?

itvtrackdata=> explain analyze SELECT mva,blob.* FROM
blobs00000000000033c3_c16010 AS blob NATURAL JOIN
objects00000000000033c3 WHERE time>=1263627787 AND time<1273458187 ORDER
BY time ASC;

                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Sort  (cost=75426.47..75552.21 rows=50295 width=28) (actual
time=2083.624..2146.654 rows=50472 loops=1)
   Sort Key: blob."time"
   ->  Hash Join  (cost=16411.51..70749.84 rows=50295 width=28) (actual
time=1790.746..1910.204 rows=50472 loops=1)
         Hash Cond: ("outer".sva = "inner".sva)
         ->  Bitmap Heap Scan on blobs00000000000033c3_c16010 blob
(cost=533.77..14421.19 rows=50295 width=20) (actual time=12.533..78.284
rows=50472 loops=1)
               Recheck Cond: (("time" >= 1263627787) AND ("time" <
1273458187))
               ->  Bitmap Index Scan on
idx_blobs00000000000033c3_c16010_time  (cost=0.00..533.77 rows=50295
width=0) (actual time=12.447..12.447 rows=50472 loops=1)
                     Index Cond: (("time" >= 1263627787) AND ("time" <
1273458187))
         ->  Hash  (cost=12039.79..12039.79 rows=487579 width=16)
(actual time=1618.128..1618.128 rows=487579 loops=1)
               ->  Seq Scan on objects00000000000033c3
(cost=0.00..12039.79 rows=487579 width=16) (actual time=0.019..833.931
rows=487579 loops=1)
 Total runtime: 2194.492 ms
(11 rows)

itvtrackdata=> set enable_hashjoin to false;
SET
itvtrackdata=> explain analyze SELECT mva,blob.* FROM
blobs00000000000033c3_c16010 AS blob NATURAL JOIN
objects00000000000033c3 WHERE time>=1263627787 AND time<1273458187 ORDER
BY time ASC;

                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Sort  (cost=321096.91..321222.64 rows=50295 width=28) (actual
time=929.893..992.316 rows=50472 loops=1)
   Sort Key: blob."time"
   ->  Nested Loop  (cost=533.77..316420.28 rows=50295 width=28) (actual
time=16.780..719.140 rows=50472 loops=1)
         ->  Bitmap Heap Scan on blobs00000000000033c3_c16010 blob
(cost=533.77..14421.19 rows=50295 width=20) (actual time=16.693..84.299
rows=50472 loops=1)
               Recheck Cond: (("time" >= 1263627787) AND ("time" <
1273458187))
               ->  Bitmap Index Scan on
idx_blobs00000000000033c3_c16010_time  (cost=0.00..533.77 rows=50295
width=0) (actual time=16.546..16.546 rows=50472 loops=1)
                     Index Cond: (("time" >= 1263627787) AND ("time" <
1273458187))
         ->  Index Scan using idx_objects00000000000033c3_sva on
objects00000000000033c3  (cost=0.00..5.99 rows=1 width=16) (actual
time=0.006..0.008 rows=1 loops=50472)
               Index Cond: ("outer".sva = objects00000000000033c3.sva)
 Total runtime: 1039.725 ms
(10 rows)

--
Ian Westmacott <ianw@intellivid.com>
IntelliVid Corp.


Re: merge>hash>loop

From: Tom Lane
Ian Westmacott <ianw@intellivid.com> writes:
> That's what I feared, thanks.  But then if I simplify things a bit,
> such that the row counts are quite good, and PG still chooses a
> worse plan, can I conclude anything about my configuration settings
> (like random_page_cost)?

Well, the other thing that's going on here is that we know we are
overestimating the cost of nestloop-with-inner-indexscan plans.
The current estimation for that is basically "outer scan cost plus N
times inner scan cost" where N is the estimated number of outer tuples;
in other words the repeated indexscan probes are each assumed to happen
from a cold start.  In reality, caching of the upper levels of the index
means that the later index probes are much cheaper than this model
thinks.  We've known about this for some time but no one's yet proposed
a more reasonable cost model.
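
To see that model at work, plug the numbers from the nestloop plan in
the first message into it (a back-of-envelope check that ignores
startup and per-tuple CPU costs):

-- outer scan cost + estimated outer rows * inner indexscan cost
SELECT 16822.65 + 16821 * 6.00 AS modeled_cost;  -- 117748.65, vs 117909.37 in the plan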

In my mind this is tied into another issue, which is that the planner
always costs on the basis of each query starting from zero.  In a real
environment it's much cheaper to use heavily-used indexes than this cost
model suggests, because they'll already be swapped in due to use by
previous queries.  But we haven't got any infrastructure to keep track
of what's been heavily used, let alone a cost model that could make use
of the info.

I think part of the reason that people commonly reduce random_page_cost
to values much lower than physical reality would suggest is that it
provides a crude way of partially compensating for this basic problem.

            regards, tom lane

Re: merge>hash>loop

From: Markus Schaber
Hi, Tom,

Tom Lane wrote:

> Well, the other thing that's going on here is that we know we are
> overestimating the cost of nestloop-with-inner-indexscan plans.
> The current estimation for that is basically "outer scan cost plus N
> times inner scan cost" where N is the estimated number of outer tuples;
> in other words the repeated indexscan probes are each assumed to happen
> from a cold start.  In reality, caching of the upper levels of the index
> means that the later index probes are much cheaper than this model
> thinks.  We've known about this for some time but no one's yet proposed
> a more reasonable cost model.

My spontaneous guess would be to use log(N)*inner instead of N*inner.
I don't have any backing for that; it's just what my intuition tells me
as a first shot.

> In my mind this is tied into another issue, which is that the planner
> always costs on the basis of each query starting from zero.  In a real
> environment it's much cheaper to use heavily-used indexes than this cost
> model suggests, because they'll already be swapped in due to use by
> previous queries.  But we haven't got any infrastructure to keep track
> of what's been heavily used, let alone a cost model that could make use
> of the info.

An easy first approach would be to add a user-tunable cache-probability
value to each index (and possibly table) between 0 and 1. Then simply
multiply random_page_cost by (1 - that value) for each scan.

Later, this value could be automatically tuned by stats analysis or
other means.
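
In terms of today's knobs, the proposal amounts to something like this
(the 0.9 cache probability is an invented figure, purely for
illustration):

-- hypothetical: an index believed 90% cached gets its random-I/O cost scaled down
SELECT 4.0 * (1 - 0.9) AS effective_random_page_cost;  -- 0.4 instead of the default 4.0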

> I think part of the reason that people commonly reduce random_page_cost
> to values much lower than physical reality would suggest is that it
> provides a crude way of partially compensating for this basic problem.

I totally agree with this; it's just what we did here from time to time. :-)

Hmm, how does effective_cache_size correspond with it? Shouldn't a high
effective_cache_size have a similar effect?

Thanks,
Markus
--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org

Re: merge>hash>loop

From: "Jim C. Nasby"
On Tue, Apr 18, 2006 at 12:51:59PM +0200, Markus Schaber wrote:
> > In my mind this is tied into another issue, which is that the planner
> > always costs on the basis of each query starting from zero.  In a real
> > environment it's much cheaper to use heavily-used indexes than this cost
> > model suggests, because they'll already be swapped in due to use by
> > previous queries.  But we haven't got any infrastructure to keep track
> > of what's been heavily used, let alone a cost model that could make use
> > of the info.
>
> An easy first approach would be to add a user-tunable cache-probability
> value to each index (and possibly table) between 0 and 1. Then simply
> multiply random_page_cost by (1 - that value) for each scan.
>
> Later, this value could be automatically tuned by stats analysis or
> other means.

Actually, if you run with stats_block_level turned on you have a
first-order approximation of what is and isn't cached. Perhaps the
planner could make use of this information if it's available.
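
For example, with stats_block_level = on in postgresql.conf, a rough
per-table cache hit ratio can already be read out of the pg_statio
views (column names as of 8.1):

SELECT relname,
       heap_blks_hit,
       heap_blks_read,
       heap_blks_hit::float8
         / nullif(heap_blks_hit + heap_blks_read, 0) AS hit_ratio
  FROM pg_statio_user_tables
 ORDER BY heap_blks_hit + heap_blks_read DESC;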

> > I think part of the reason that people commonly reduce random_page_cost
> > to values much lower than physical reality would suggest is that it
> > provides a crude way of partially compensating for this basic problem.
>
> I totally agree with this; it's just what we did here from time to time. :-)
>
> Hmm, how does effective_cache_size correspond with it? Shouldn't a high
> effective_cache_size have a similar effect?

Generally, effective_cache_size is used to determine the likelihood that
something will be in cache. random_page_cost tells us how expensive it
will be to get that information if it isn't in cache.
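
Both are ordinary settings and easy to inspect; the values shown here
are the stock 8.1 defaults (assuming an untuned postgresql.conf):

SHOW effective_cache_size;  -- 1000 (8kB pages, i.e. ~8MB)
SHOW random_page_cost;      -- 4
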
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: merge>hash>loop

From: Tom Lane
"Jim C. Nasby" <jnasby@pervasive.com> writes:
> Actually, if you run with stats_block_level turned on you have a
> first-order approximation of what is and isn't cached.

Only if those stats decayed (pretty fast) with time; which they don't.
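
The closest thing available today is a manual, all-or-nothing zeroing
rather than a decay (a sketch; this resets all statistics counters for
the current database):

SELECT pg_stat_reset();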

            regards, tom lane

Re: merge>hash>loop

From: Tom Lane
Markus Schaber <schabi@logix-tt.com> writes:
> Hmm, how does effective_cache_size correspond with it? Shouldn't a high
> effective_cache_size have a similar effect?

It seems reasonable to suppose that effective_cache_size ought to be
used as a number indicating how much "stuff" would hang around from
query to query.  Right now it's not used that way...

            regards, tom lane

Re: merge>hash>loop

From: "Jim C. Nasby"
On Tue, Apr 18, 2006 at 06:22:26PM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > Actually, if you run with stats_block_level turned on you have a
> > first-order approximation of what is and isn't cached.
>
> Only if those stats decayed (pretty fast) with time; which they don't.

Good point. :/ I'm guessing there's no easy way to see how many blocks
for a given relation are in shared memory, either...
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: merge>hash>loop

From: "Jim C. Nasby"
On Tue, Apr 18, 2006 at 06:26:48PM -0400, Tom Lane wrote:
> Markus Schaber <schabi@logix-tt.com> writes:
> > Hmm, how does effective_cache_size correspond with it? Shouldn't a high
> > effective_cache_size have a similar effect?
>
> It seems reasonable to suppose that effective_cache_size ought to be
> used as a number indicating how much "stuff" would hang around from
> query to query.  Right now it's not used that way...

Maybe it would be a reasonable first pass to have estimators calculate
the cost if a node found everything it wanted in cache and then do a
linear interpolation between that and the costs we currently come up
with? Something like pg_class.relpages / sum(pg_class.relpages) would
give an idea of how much of a relation is likely to be cached, which
could be used for the linear interpolation.
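
That heuristic is cheap to compute from the catalog, e.g. for one of
the tables in this thread (an 8.1-compatible sketch):

SELECT relname,
       relpages,
       relpages::float8 / (SELECT sum(relpages) FROM pg_class) AS share_of_total
  FROM pg_class
 WHERE relname = 'objects00000000000033c3';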

Of course having *any* idea as to how much of a relation was actually in
shared_buffers (or better yet, the OS cache) would be a lot more
accurate, but this simple method might be a good enough first pass.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: merge>hash>loop

From: Tom Lane
Markus Schaber <schabi@logix-tt.com> writes:
> An easy first approach would be to add a user-tunable cache-probability
> value to each index (and possibly table) between 0 and 1. Then simply
> multiply random_page_cost by (1 - that value) for each scan.

That's not the way you'd need to use it.  But on reflection I do think
there's some merit in a "cache probability" parameter, ranging from zero
(giving current planner behavior) to one (causing the planner to assume
everything is already in cache from prior queries).  We'd have to look
at exactly how such an assumption should affect the cost equations ...

            regards, tom lane

Re: merge>hash>loop

From: Mark Kirkwood
Jim C. Nasby wrote:
> On Tue, Apr 18, 2006 at 06:22:26PM -0400, Tom Lane wrote:
>> "Jim C. Nasby" <jnasby@pervasive.com> writes:
>>> Actually, if you run with stats_block_level turned on you have a
>>> first-order approximation of what is and isn't cached.
>> Only if those stats decayed (pretty fast) with time; which they don't.
>
> Good point. :/ I'm guessing there's no easy way to see how many blocks
> for a given relation are in shared memory, either...

contrib/pg_buffercache will tell you this - what buffers from what
relation are in shared_buffers (if you want to interrogate the OS file
buffer cache, that's a different story - though I've been toying with
writing a utility for FreeBSD that would do this).
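
A typical counting query looks roughly like this (a sketch; it assumes
the module's view is installed in the current database):

SELECT c.relname, count(*) AS buffers
  FROM pg_buffercache b
  JOIN pg_class c ON b.relfilenode = c.relfilenode
 WHERE b.reldatabase = (SELECT oid FROM pg_database
                        WHERE datname = current_database())
 GROUP BY c.relname
 ORDER BY buffers DESC
 LIMIT 10;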

Cheers

Mark

Re: merge>hash>loop

From: "Jim C. Nasby"
On Wed, Apr 19, 2006 at 04:47:40PM +1200, Mark Kirkwood wrote:
> Jim C. Nasby wrote:
> >On Tue, Apr 18, 2006 at 06:22:26PM -0400, Tom Lane wrote:
> >>"Jim C. Nasby" <jnasby@pervasive.com> writes:
> >>>Actually, if you run with stats_block_level turned on you have a
> >>>first-order approximation of what is and isn't cached.
> >>Only if those stats decayed (pretty fast) with time; which they don't.
> >
> >Good point. :/ I'm guessing there's no easy way to see how many blocks
> >for a given relation are in shared memory, either...
>
> contrib/pg_buffercache will tell you this - what buffers from what
> relation are in shared_buffers (if you want to interrogate the OS file

So theoretically with that code we could make the cost estimator
functions more intelligent about actual query costs. Now, how you'd
actually see how those estimates improved...

> buffer cache, that's a different story - though I've been toying with
> writing a utility for FreeBSD that would do this).

Well, the problem is that I doubt anything that OS-specific would be
accepted into core. What we really need is some method that's
OS-agnostic...
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: merge>hash>loop

From: Tom Lane
Mark Kirkwood <markir@paradise.net.nz> writes:
> Jim C. Nasby wrote:
>> Good point. :/ I'm guessing there's no easy way to see how many blocks
>> for a given relation are in shared memory, either...

> contrib/pg_buffercache will tell you this -

I think the key word in Jim's comment was "easy", ie, cheap.  Grovelling
through many thousands of buffers to count the matches to a given
relation doesn't sound appetizing, especially not if it gets done over
again several times during each query-planning cycle.  Trying to keep
centralized counts somewhere would be even worse (because of locking/
contention issues).

            regards, tom lane

Re: merge>hash>loop

From: Mark Kirkwood
Tom Lane wrote:
> Mark Kirkwood <markir@paradise.net.nz> writes:
>> Jim C. Nasby wrote:
>>> Good point. :/ I'm guessing there's no easy way to see how many blocks
>>> for a given relation are in shared memory, either...
>
>> contrib/pg_buffercache will tell you this -
>
> I think the key word in Jim's comment was "easy", ie, cheap.  Grovelling
> through many thousands of buffers to count the matches to a given
> relation doesn't sound appetizing, especially not if it gets done over
> again several times during each query-planning cycle.  Trying to keep
> centralized counts somewhere would be even worse (because of locking/
> contention issues).
>

Yeah - not sensible for a transaction-oriented system - might be OK for
DSS, though.

Cheers

mark

Re: merge>hash>loop

From: "Jim C. Nasby"
On Wed, Apr 19, 2006 at 01:25:28AM -0400, Tom Lane wrote:
> Mark Kirkwood <markir@paradise.net.nz> writes:
> > Jim C. Nasby wrote:
> >> Good point. :/ I'm guessing there's no easy way to see how many blocks
> >> for a given relation are in shared memory, either...
>
> > contrib/pg_buffercache will tell you this -
>
> I think the key word in Jim's comment was "easy", ie, cheap.  Grovelling
> through many thousands of buffers to count the matches to a given
> relation doesn't sound appetizing, especially not if it gets done over
> again several times during each query-planning cycle.  Trying to keep
> centralized counts somewhere would be even worse (because of locking/
> contention issues).

Very true. OTOH, it might not be unreasonable to periodically slog
through the buffers and store that information, perhaps once a minute,
or every X number of transactions.

I think a bigger issue is that we currently have no way to really
measure the effectiveness of the planner. Without that it's impossible
to come up with any real data on whether cost formula A is better or
worse than cost formula B. The only means I can think of for doing this
would be to measure estimated cost vs actual cost, but with the overhead
of EXPLAIN ANALYZE and other variables that might not prove terribly
practical. Maybe someone else has some ideas...
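
One crude yardstick is already visible in this thread: per-node
estimated vs. actual row counts from EXPLAIN ANALYZE. For the blob
scan in the first message, for instance:

SELECT 16821::float8 / 1741 AS estimate_to_actual;  -- ~9.7, a 10x row overestimate
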
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461