Thread: merge>hash>loop
I have this query, where PG (8.1.2) prefers Merge Join over Hash Join
over Nested Loop. However, actual performance improves in exactly that
order: the preferred Merge Join plan is the slowest and the Nested Loop
the fastest. I was hoping someone might be able to shed some light on
why PG chooses the plans in this order, and what I might do to
influence it otherwise. Thanks,

itvtrackdata=> explain analyze SELECT mva,blob.* FROM
blobs00000000000033c3_c16010 AS blob NATURAL JOIN
objects00000000000033c3 WHERE
AREA(BOX(POINT(bbox_x1,bbox_y0),POINT(bbox_x0,bbox_y1))#BOX('(50,10),(10,50)'))/AREA(BOX(POINT(bbox_x1,bbox_y0),POINT(bbox_x0,bbox_y1)))>0
AND time>=1263627787-32768 AND time<1273458187 AND finish-start>=8738
ORDER BY time ASC;

                                   QUERY PLAN
--------------------------------------------------------------------------------
 Sort  (cost=47170.44..47184.46 rows=5607 width=28) (actual time=2661.980..2663.642 rows=1687 loops=1)
   Sort Key: blob."time"
   ->  Merge Join  (cost=46547.88..46821.32 rows=5607 width=28) (actual time=2645.685..2657.621 rows=1687 loops=1)
         Merge Cond: ("outer".sva = "inner".sva)
         ->  Sort  (cost=18003.31..18045.36 rows=16821 width=20) (actual time=181.303..183.092 rows=1741 loops=1)
               Sort Key: blob.sva
               ->  Bitmap Heap Scan on blobs00000000000033c3_c16010 blob  (cost=535.77..16822.65 rows=16821 width=20) (actual time=10.827..177.671 rows=1741 loops=1)
                     Recheck Cond: (("time" >= 1263595019) AND ("time" < 1273458187))
                     Filter: ((area((box(point((bbox_x1)::double precision, (bbox_y0)::double precision), point((bbox_x0)::double precision, (bbox_y1)::double precision)) # '(50,50),(10,10)'::box)) / area(box(point((bbox_x1)::double precision, (bbox_y0)::double precision), point((bbox_x0)::double precision, (bbox_y1)::double precision)))) > 0::double precision)
                     ->  Bitmap Index Scan on idx_blobs00000000000033c3_c16010_time  (cost=0.00..535.77 rows=50462 width=0) (actual time=8.565..8.565 rows=50673 loops=1)
                           Index Cond: (("time" >= 1263595019) AND ("time" < 1273458187))
         ->  Sort  (cost=28544.56..28950.88 rows=162526 width=16) (actual time=2387.726..2437.429 rows=29969 loops=1)
               Sort Key: objects00000000000033c3.sva
               ->  Seq Scan on objects00000000000033c3  (cost=0.00..14477.68 rows=162526 width=16) (actual time=0.085..826.125 rows=207755 loops=1)
                     Filter: ((finish - "start") >= 8738)
 Total runtime: 2675.037 ms
(16 rows)

itvtrackdata=> set enable_mergejoin to false;
SET
itvtrackdata=> explain analyze SELECT mva,blob.* FROM
blobs00000000000033c3_c16010 AS blob NATURAL JOIN
objects00000000000033c3 WHERE
AREA(BOX(POINT(bbox_x1,bbox_y0),POINT(bbox_x0,bbox_y1))#BOX('(50,10),(10,50)'))/AREA(BOX(POINT(bbox_x1,bbox_y0),POINT(bbox_x0,bbox_y1)))>0
AND time>=1263627787-32768 AND time<1273458187 AND finish-start>=8738
ORDER BY time ASC;

                                   QUERY PLAN
--------------------------------------------------------------------------------
 Sort  (cost=65783.09..65797.10 rows=5607 width=28) (actual time=1211.228..1212.671 rows=1687 loops=1)
   Sort Key: blob."time"
   ->  Hash Join  (cost=15419.77..65433.97 rows=5607 width=28) (actual time=1067.514..1207.727 rows=1687 loops=1)
         Hash Cond: ("outer".sva = "inner".sva)
         ->  Bitmap Heap Scan on blobs00000000000033c3_c16010 blob  (cost=535.77..16822.65 rows=16821 width=20) (actual time=14.720..149.179 rows=1741 loops=1)
               Recheck Cond: (("time" >= 1263595019) AND ("time" < 1273458187))
               Filter: ((area((box(point((bbox_x1)::double precision, (bbox_y0)::double precision), point((bbox_x0)::double precision, (bbox_y1)::double precision)) # '(50,50),(10,10)'::box)) / area(box(point((bbox_x1)::double precision, (bbox_y0)::double precision), point((bbox_x0)::double precision, (bbox_y1)::double precision)))) > 0::double precision)
               ->  Bitmap Index Scan on idx_blobs00000000000033c3_c16010_time  (cost=0.00..535.77 rows=50462 width=0) (actual time=12.880..12.880 rows=50673 loops=1)
                     Index Cond: (("time" >= 1263595019) AND ("time" < 1273458187))
         ->  Hash  (cost=14477.68..14477.68 rows=162526 width=16) (actual time=1052.729..1052.729 rows=207755 loops=1)
               ->  Seq Scan on objects00000000000033c3  (cost=0.00..14477.68 rows=162526 width=16) (actual time=0.028..684.047 rows=207755 loops=1)
                     Filter: ((finish - "start") >= 8738)
 Total runtime: 1217.938 ms
(13 rows)

itvtrackdata=> set enable_hashjoin to false;
SET
itvtrackdata=> explain analyze SELECT mva,blob.* FROM
blobs00000000000033c3_c16010 AS blob NATURAL JOIN
objects00000000000033c3 WHERE
AREA(BOX(POINT(bbox_x1,bbox_y0),POINT(bbox_x0,bbox_y1))#BOX('(50,10),(10,50)'))/AREA(BOX(POINT(bbox_x1,bbox_y0),POINT(bbox_x0,bbox_y1)))>0
AND time>=1263627787-32768 AND time<1273458187 AND finish-start>=8738
ORDER BY time ASC;

                                   QUERY PLAN
--------------------------------------------------------------------------------
 Sort  (cost=118258.49..118272.51 rows=5607 width=28) (actual time=197.204..198.871 rows=1687 loops=1)
   Sort Key: blob."time"
   ->  Nested Loop  (cost=535.77..117909.37 rows=5607 width=28) (actual time=27.560..192.526 rows=1687 loops=1)
         ->  Bitmap Heap Scan on blobs00000000000033c3_c16010 blob  (cost=535.77..16822.65 rows=16821 width=20) (actual time=27.450..157.266 rows=1741 loops=1)
               Recheck Cond: (("time" >= 1263595019) AND ("time" < 1273458187))
               Filter: ((area((box(point((bbox_x1)::double precision, (bbox_y0)::double precision), point((bbox_x0)::double precision, (bbox_y1)::double precision)) # '(50,50),(10,10)'::box)) / area(box(point((bbox_x1)::double precision, (bbox_y0)::double precision), point((bbox_x0)::double precision, (bbox_y1)::double precision)))) > 0::double precision)
               ->  Bitmap Index Scan on idx_blobs00000000000033c3_c16010_time  (cost=0.00..535.77 rows=50462 width=0) (actual time=24.445..24.445 rows=50673 loops=1)
                     Index Cond: (("time" >= 1263595019) AND ("time" < 1273458187))
         ->  Index Scan using idx_objects00000000000033c3_sva on objects00000000000033c3  (cost=0.00..6.00 rows=1 width=16) (actual time=0.013..0.015 rows=1 loops=1741)
               Index Cond: ("outer".sva = objects00000000000033c3.sva)
               Filter: ((finish - "start") >= 8738)
 Total runtime: 200.719 ms
(12 rows)

--
Ian Westmacott <ianw@intellivid.com>
IntelliVid Corp.
Ian Westmacott <ianw@intellivid.com> writes:
> I have this query, where PG (8.1.2) prefers Merge Join over Hash Join
> over Nested Loop. However, actual performance improves in exactly
> that order: the preferred Merge Join plan is the slowest and the
> Nested Loop the fastest. I was hoping someone might be able to shed
> some light on why PG chooses the plans in this order, and what I
> might do to influence it otherwise. Thanks,

Reducing random_page_cost would influence it to prefer the nestloop.
However, I doubt you're ever going to get really ideal results for
this query --- the estimated row counts are too far off, and the
WHERE conditions are sufficiently bizarre that there's not much hope
that they'd ever be much better.

			regards, tom lane
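(For reference, lowering random_page_cost for just the current session
might look like this; the value 2 is purely illustrative, the 8.1
default is 4, and the right number is workload-dependent. RESET
random_page_cost restores the configured value afterwards.)

    itvtrackdata=> set random_page_cost to 2;
    SET
    itvtrackdata=> explain analyze SELECT ... ;  -- re-run the same query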
On Fri, 2006-04-14 at 12:13 -0400, Tom Lane wrote:
> Ian Westmacott <ianw@intellivid.com> writes:
> > I have this query, where PG (8.1.2) prefers Merge Join over Hash Join
> > over Nested Loop. However, actual performance improves in exactly
> > that order: the preferred Merge Join plan is the slowest and the
> > Nested Loop the fastest. I was hoping someone might be able to shed
> > some light on why PG chooses the plans in this order, and what I
> > might do to influence it otherwise. Thanks,
>
> Reducing random_page_cost would influence it to prefer the nestloop.
> However, I doubt you're ever going to get really ideal results for
> this query --- the estimated row counts are too far off, and the
> WHERE conditions are sufficiently bizarre that there's not much hope
> that they'd ever be much better.

That's what I feared, thanks. But then if I simplify things a bit,
such that the row counts are quite good, and PG still chooses a worse
plan, can I conclude anything about my configuration settings (like
random_page_cost)?

itvtrackdata=> explain analyze SELECT mva,blob.* FROM
blobs00000000000033c3_c16010 AS blob NATURAL JOIN
objects00000000000033c3 WHERE time>=1263627787 AND time<1273458187
ORDER BY time ASC;

                                   QUERY PLAN
--------------------------------------------------------------------------------
 Sort  (cost=75426.47..75552.21 rows=50295 width=28) (actual time=2083.624..2146.654 rows=50472 loops=1)
   Sort Key: blob."time"
   ->  Hash Join  (cost=16411.51..70749.84 rows=50295 width=28) (actual time=1790.746..1910.204 rows=50472 loops=1)
         Hash Cond: ("outer".sva = "inner".sva)
         ->  Bitmap Heap Scan on blobs00000000000033c3_c16010 blob  (cost=533.77..14421.19 rows=50295 width=20) (actual time=12.533..78.284 rows=50472 loops=1)
               Recheck Cond: (("time" >= 1263627787) AND ("time" < 1273458187))
               ->  Bitmap Index Scan on idx_blobs00000000000033c3_c16010_time  (cost=0.00..533.77 rows=50295 width=0) (actual time=12.447..12.447 rows=50472 loops=1)
                     Index Cond: (("time" >= 1263627787) AND ("time" < 1273458187))
         ->  Hash  (cost=12039.79..12039.79 rows=487579 width=16) (actual time=1618.128..1618.128 rows=487579 loops=1)
               ->  Seq Scan on objects00000000000033c3  (cost=0.00..12039.79 rows=487579 width=16) (actual time=0.019..833.931 rows=487579 loops=1)
 Total runtime: 2194.492 ms
(11 rows)

itvtrackdata=> set enable_hashjoin to false;
SET
itvtrackdata=> explain analyze SELECT mva,blob.* FROM
blobs00000000000033c3_c16010 AS blob NATURAL JOIN
objects00000000000033c3 WHERE time>=1263627787 AND time<1273458187
ORDER BY time ASC;

                                   QUERY PLAN
--------------------------------------------------------------------------------
 Sort  (cost=321096.91..321222.64 rows=50295 width=28) (actual time=929.893..992.316 rows=50472 loops=1)
   Sort Key: blob."time"
   ->  Nested Loop  (cost=533.77..316420.28 rows=50295 width=28) (actual time=16.780..719.140 rows=50472 loops=1)
         ->  Bitmap Heap Scan on blobs00000000000033c3_c16010 blob  (cost=533.77..14421.19 rows=50295 width=20) (actual time=16.693..84.299 rows=50472 loops=1)
               Recheck Cond: (("time" >= 1263627787) AND ("time" < 1273458187))
               ->  Bitmap Index Scan on idx_blobs00000000000033c3_c16010_time  (cost=0.00..533.77 rows=50295 width=0) (actual time=16.546..16.546 rows=50472 loops=1)
                     Index Cond: (("time" >= 1263627787) AND ("time" < 1273458187))
         ->  Index Scan using idx_objects00000000000033c3_sva on objects00000000000033c3  (cost=0.00..5.99 rows=1 width=16) (actual time=0.006..0.008 rows=1 loops=50472)
               Index Cond: ("outer".sva = objects00000000000033c3.sva)
 Total runtime: 1039.725 ms
(10 rows)

--
Ian Westmacott <ianw@intellivid.com>
IntelliVid Corp.
Ian Westmacott <ianw@intellivid.com> writes:
> That's what I feared, thanks. But then if I simplify things a bit,
> such that the row counts are quite good, and PG still chooses a
> worse plan, can I conclude anything about my configuration settings
> (like random_page_cost)?

Well, the other thing that's going on here is that we know we are
overestimating the cost of nestloop-with-inner-indexscan plans.
The current estimation for that is basically "outer scan cost plus N
times inner scan cost" where N is the estimated number of outer tuples;
in other words the repeated indexscan probes are each assumed to happen
from a cold start. In reality, caching of the upper levels of the index
means that the later index probes are much cheaper than this model
thinks. We've known about this for some time but no one's yet proposed
a more reasonable cost model.

In my mind this is tied into another issue, which is that the planner
always costs on the basis of each query starting from zero. In a real
environment it's much cheaper to use heavily-used indexes than this cost
model suggests, because they'll already be swapped in due to use by
previous queries. But we haven't got any infrastructure to keep track
of what's been heavily used, let alone a cost model that could make use
of the info.

I think part of the reason that people commonly reduce random_page_cost
to values much lower than physical reality would suggest is that it
provides a crude way of partially compensating for this basic problem.

			regards, tom lane
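(To make that model concrete with the numbers from the nestloop plan
upthread; a rough back-of-the-envelope sketch only, ignoring the small
per-tuple CPU terms the real costing also includes:)

    # "Outer scan cost plus N times inner scan cost", reconstructed from
    # the estimates in the third EXPLAIN ANALYZE plan above.
    outer_total_cost = 16822.65  # Bitmap Heap Scan on blobs..., estimated
    outer_rows_est = 16821       # estimated outer tuples (actual: only 1741)
    inner_probe_cost = 6.00      # one indexscan probe, costed as a cold start

    estimate = outer_total_cost + outer_rows_est * inner_probe_cost
    print(estimate)  # ~117748.65, close to the planner's 117909.37 total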
Hi, Tom,

Tom Lane wrote:
> Well, the other thing that's going on here is that we know we are
> overestimating the cost of nestloop-with-inner-indexscan plans.
> The current estimation for that is basically "outer scan cost plus N
> times inner scan cost" where N is the estimated number of outer tuples;
> in other words the repeated indexscan probes are each assumed to happen
> from a cold start. In reality, caching of the upper levels of the index
> means that the later index probes are much cheaper than this model
> thinks. We've known about this for some time but no one's yet proposed
> a more reasonable cost model.

My spontaneous guess would be to use log(N)*inner instead of N*inner. I
don't have any backing for that, it's just what my intuition tells me
as a first shot.

> In my mind this is tied into another issue, which is that the planner
> always costs on the basis of each query starting from zero. In a real
> environment it's much cheaper to use heavily-used indexes than this cost
> model suggests, because they'll already be swapped in due to use by
> previous queries. But we haven't got any infrastructure to keep track
> of what's been heavily used, let alone a cost model that could make use
> of the info.

An easy first approach would be to add a user-tunable cache probability
value to each index (and possibly table) between 0 and 1. Then simply
multiply random_page_cost with (1 - that value) for each scan.

Later, this value could be automatically tuned by stats analysis or
other means.

> I think part of the reason that people commonly reduce random_page_cost
> to values much lower than physical reality would suggest is that it
> provides a crude way of partially compensating for this basic problem.

I totally agree with this; it's just what we did here from time to
time. :-)

Hmm, how does effective_cache_size correspond with it? Shouldn't a high
effective_cache_size have a similar effect?

Thanks,
Markus

--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in EU! www.ffii.org
www.nosoftwarepatents.org
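(Purely as an illustration of how far apart the two intuitions land on
the plan discussed upthread; this is not PostgreSQL code, just a
sketch of Markus's log(N) guess against the current model:)

    import math

    N = 16821      # estimated outer tuples from the earlier nestloop plan
    inner = 6.00   # estimated cost of one cold indexscan probe

    current_model = N * inner        # ~100926: every probe assumed cold
    log_guess = math.log(N) * inner  # ~58: almost certainly too optimistic

    # The truth presumably lies in between: the first probes pay the cold
    # cost, later ones find the upper index levels already cached.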
On Tue, Apr 18, 2006 at 12:51:59PM +0200, Markus Schaber wrote:
> > In my mind this is tied into another issue, which is that the planner
> > always costs on the basis of each query starting from zero. In a real
> > environment it's much cheaper to use heavily-used indexes than this cost
> > model suggests, because they'll already be swapped in due to use by
> > previous queries. But we haven't got any infrastructure to keep track
> > of what's been heavily used, let alone a cost model that could make use
> > of the info.
>
> An easy first approach would be to add a user-tunable cache probability
> value to each index (and possibly table) between 0 and 1. Then simply
> multiply random_page_cost with (1 - that value) for each scan.
>
> Later, this value could be automatically tuned by stats analysis or
> other means.

Actually, if you run with stats_block_level turned on you have a
first-order approximation of what is and isn't cached. Perhaps the
planner could make use of this information if it's available.

> > I think part of the reason that people commonly reduce random_page_cost
> > to values much lower than physical reality would suggest is that it
> > provides a crude way of partially compensating for this basic problem.
>
> I totally agree with this; it's just what we did here from time to
> time. :-)
>
> Hmm, how does effective_cache_size correspond with it? Shouldn't a high
> effective_cache_size have a similar effect?

Generally, effective_cache_size is used to determine the likelihood
that something will be in-cache. random_page_cost tells us how
expensive it will be to get that information if it isn't in cache.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
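(For example, with stats_block_level = on the block-level counters are
visible in the pg_statio views; a sketch, using a table name from this
thread. Note the counters are cumulative since they were last reset,
so this is only a crude indicator:)

    SELECT relname, heap_blks_read, heap_blks_hit,
           heap_blks_hit::float8
               / nullif(heap_blks_read + heap_blks_hit, 0) AS hit_ratio
      FROM pg_statio_user_tables
     WHERE relname = 'objects00000000000033c3';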
"Jim C. Nasby" <jnasby@pervasive.com> writes: > Actually, if you run with stats_block_level turned on you have a > first-order approximation of what is and isn't cached. Only if those stats decayed (pretty fast) with time; which they don't. regards, tom lane
Markus Schaber <schabi@logix-tt.com> writes:
> Hmm, how does effective_cache_size correspond with it? Shouldn't a high
> effective_cache_size have a similar effect?

It seems reasonable to suppose that effective_cache_size ought to be
used as a number indicating how much "stuff" would hang around from
query to query. Right now it's not used that way...

			regards, tom lane
On Tue, Apr 18, 2006 at 06:22:26PM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > Actually, if you run with stats_block_level turned on you have a
> > first-order approximation of what is and isn't cached.
>
> Only if those stats decayed (pretty fast) with time; which they don't.

Good point. :/ I'm guessing there's no easy way to see how many blocks
for a given relation are in shared memory, either...
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
On Tue, Apr 18, 2006 at 06:26:48PM -0400, Tom Lane wrote:
> Markus Schaber <schabi@logix-tt.com> writes:
> > Hmm, how does effective_cache_size correspond with it? Shouldn't a high
> > effective_cache_size have a similar effect?
>
> It seems reasonable to suppose that effective_cache_size ought to be
> used as a number indicating how much "stuff" would hang around from
> query to query. Right now it's not used that way...

Maybe it would be a reasonable first pass to have estimators calculate
the cost if a node found everything it wanted in cache, and then do a
linear interpolation between that and the costs we currently come up
with? Something like pg_class.relpages / sum(pg_class.relpages) would
give an idea of how much of a relation is likely to be cached, which
could be used for the linear interpolation.

Of course having *any* idea as to how much of a relation was actually
in shared_buffers (or better yet, the OS cache) would be a lot more
accurate, but this simple method might be a good enough first pass.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
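(A sketch of the interpolation weight Jim describes, computable from
pg_class alone; purely illustrative, the planner does nothing like
this today:)

    -- Fraction of all catalogued pages belonging to one relation: a crude
    -- stand-in for "how much of this table could plausibly stay cached".
    SELECT c.relname,
           c.relpages::float8
               / (SELECT sum(relpages) FROM pg_class) AS page_share
      FROM pg_class c
     WHERE c.relname = 'objects00000000000033c3';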
Markus Schaber <schabi@logix-tt.com> writes:
> An easy first approach would be to add a user-tunable cache probability
> value to each index (and possibly table) between 0 and 1. Then simply
> multiply random_page_cost with (1 - that value) for each scan.

That's not the way you'd need to use it. But on reflection I do think
there's some merit in a "cache probability" parameter, ranging from
zero (giving current planner behavior) to one (causing the planner to
assume everything is already in cache from prior queries). We'd have
to look at exactly how such an assumption should affect the cost
equations ...

			regards, tom lane
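(One way such a parameter might enter the equations; a sketch with
made-up names and an assumed buffer-hit cost, since, as noted above,
the real formulation still needs working out:)

    def effective_page_cost(random_page_cost, cache_probability):
        # Hypothetical: interpolate between "every fetch is a cold random
        # read" (probability 0, current behavior) and "every fetch is a
        # cheap shared-buffer hit" (probability 1).
        assert 0.0 <= cache_probability <= 1.0
        warm_cost = 0.01  # assumed near-free cost of a buffer hit
        return (cache_probability * warm_cost
                + (1.0 - cache_probability) * random_page_cost)

    print(effective_page_cost(4.0, 0.8))  # 0.808, vs. the cold-cache 4.0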
Jim C. Nasby wrote:
> On Tue, Apr 18, 2006 at 06:22:26PM -0400, Tom Lane wrote:
>> "Jim C. Nasby" <jnasby@pervasive.com> writes:
>>> Actually, if you run with stats_block_level turned on you have a
>>> first-order approximation of what is and isn't cached.
>> Only if those stats decayed (pretty fast) with time; which they don't.
>
> Good point. :/ I'm guessing there's no easy way to see how many blocks
> for a given relation are in shared memory, either...

contrib/pg_buffercache will tell you this - what buffers from what
relation are in shared_buffers (if you want to interrogate the os file
buffer cache, that's a different story - tho I've been toying with
doing a utility for FreeBSD that would do this).

Cheers

Mark
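(For reference, the kind of query the module supports, adapted from
the contrib documentation; it counts shared buffers per relation in
the current database:)

    SELECT c.relname, count(*) AS buffers
      FROM pg_class c
      JOIN pg_buffercache b ON b.relfilenode = c.relfilenode
     WHERE b.reldatabase = (SELECT oid FROM pg_database
                             WHERE datname = current_database())
     GROUP BY c.relname
     ORDER BY buffers DESC
     LIMIT 10;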
On Wed, Apr 19, 2006 at 04:47:40PM +1200, Mark Kirkwood wrote:
> Jim C. Nasby wrote:
> > On Tue, Apr 18, 2006 at 06:22:26PM -0400, Tom Lane wrote:
> > > "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > > > Actually, if you run with stats_block_level turned on you have a
> > > > first-order approximation of what is and isn't cached.
> > > Only if those stats decayed (pretty fast) with time; which they don't.
> >
> > Good point. :/ I'm guessing there's no easy way to see how many blocks
> > for a given relation are in shared memory, either...
>
> contrib/pg_buffercache will tell you this - what buffers from what
> relation are in shared_buffers (if you want to interrogate the os file

So theoretically with that code we could make the cost estimator
functions more intelligent about actual query costs. Now, how you'd
actually see how those estimates improved...

> buffer cache, that's a different story - tho I've been toying with
> doing a utility for FreeBSD that would do this).

Well, the problem is that I doubt anything that OS-specific would be
accepted into core. What we really need is some method that's
OS-agnostic...
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
Mark Kirkwood <markir@paradise.net.nz> writes:
> Jim C. Nasby wrote:
>> Good point. :/ I'm guessing there's no easy way to see how many blocks
>> for a given relation are in shared memory, either...

> contrib/pg_buffercache will tell you this -

I think the key word in Jim's comment was "easy", ie, cheap. Grovelling
through many thousands of buffers to count the matches to a given
relation doesn't sound appetizing, especially not if it gets done over
again several times during each query-planning cycle. Trying to keep
centralized counts somewhere would be even worse (because of locking/
contention issues).

			regards, tom lane
Tom Lane wrote:
> Mark Kirkwood <markir@paradise.net.nz> writes:
>> Jim C. Nasby wrote:
>>> Good point. :/ I'm guessing there's no easy way to see how many blocks
>>> for a given relation are in shared memory, either...
>
>> contrib/pg_buffercache will tell you this -
>
> I think the key word in Jim's comment was "easy", ie, cheap. Grovelling
> through many thousands of buffers to count the matches to a given
> relation doesn't sound appetizing, especially not if it gets done over
> again several times during each query-planning cycle. Trying to keep
> centralized counts somewhere would be even worse (because of locking/
> contention issues).

Yeah - not sensible for a transaction-oriented system - might be OK
for DSS tho.

Cheers

Mark
On Wed, Apr 19, 2006 at 01:25:28AM -0400, Tom Lane wrote:
> Mark Kirkwood <markir@paradise.net.nz> writes:
> > Jim C. Nasby wrote:
> >> Good point. :/ I'm guessing there's no easy way to see how many blocks
> >> for a given relation are in shared memory, either...
>
> > contrib/pg_buffercache will tell you this -
>
> I think the key word in Jim's comment was "easy", ie, cheap. Grovelling
> through many thousands of buffers to count the matches to a given
> relation doesn't sound appetizing, especially not if it gets done over
> again several times during each query-planning cycle. Trying to keep
> centralized counts somewhere would be even worse (because of locking/
> contention issues).

Very true. OTOH, it might not be unreasonable to periodically slog
through the buffers and store that information, perhaps once a minute
or every X number of transactions.

I think a bigger issue is that we currently have no way to really
measure the effectiveness of the planner. Without that, it's impossible
to come up with any real data on whether cost formula A is better or
worse than cost formula B. The only means I can think of for doing this
would be to measure estimated cost vs. actual cost, but with the
overhead of EXPLAIN ANALYZE and other variables, that might not prove
terribly practical. Maybe someone else has some ideas...
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
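(One crude way to gather such data would be to scrape estimated vs.
actual row counts out of EXPLAIN ANALYZE text; row-count error is a
proxy for estimator quality, while comparing costs to runtimes would
additionally need a way to calibrate cost units. A hypothetical
sketch; psycopg2, the regex, and the elided query are assumptions,
not an existing tool:)

    import re
    import psycopg2

    # Matches "... rows=EST width=N) (actual time=A..B rows=ACT ..." in
    # each plan line of EXPLAIN ANALYZE output.
    NODE = re.compile(r'rows=(\d+) width=\d+\) '
                      r'\(actual time=[\d.]+\.\.[\d.]+ rows=(\d+)')

    conn = psycopg2.connect("dbname=itvtrackdata")
    cur = conn.cursor()
    cur.execute("EXPLAIN ANALYZE SELECT ...")  # the query under study
    for (line,) in cur.fetchall():
        for est, act in NODE.findall(line):
            print("estimated rows=%s  actual rows=%s" % (est, act))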