Thread: Nested loops overpriced
This query does some sort of analysis on an email archive:

SELECT eh_subj.header_body AS subject,
       count(distinct eh_from.header_body)
FROM email
    JOIN mime_part USING (email_id)
    JOIN email_header eh_subj USING (email_id, mime_part_id)
    JOIN email_header eh_from USING (email_id, mime_part_id)
WHERE eh_subj.header_name = 'subject'
  AND eh_from.header_name = 'from'
  AND mime_part_id = 0
  AND (time >= timestamp '2007-05-05 17:01:59'
       AND time < timestamp '2007-05-05 17:01:59' + interval '60 min')
GROUP BY eh_subj.header_body;

The planner chooses this plan:

                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=87142.18..87366.58 rows=11220 width=184) (actual time=7883.541..8120.647 rows=35000 loops=1)
   ->  Sort  (cost=87142.18..87170.23 rows=11220 width=184) (actual time=7883.471..7926.031 rows=35000 loops=1)
         Sort Key: eh_subj.header_body
         ->  Hash Join  (cost=46283.30..86387.42 rows=11220 width=184) (actual time=5140.182..7635.615 rows=35000 loops=1)
               Hash Cond: (eh_subj.email_id = email.email_id)
               ->  Bitmap Heap Scan on email_header eh_subj  (cost=11853.68..50142.87 rows=272434 width=104) (actual time=367.956..1719.736 rows=280989 loops=1)
                     Recheck Cond: ((mime_part_id = 0) AND (header_name = 'subject'::text))
                     ->  BitmapAnd  (cost=11853.68..11853.68 rows=27607 width=0) (actual time=326.507..326.507 rows=0 loops=1)
                           ->  Bitmap Index Scan on idx__email_header__header_body_subject  (cost=0.00..5836.24 rows=272434 width=0) (actual time=178.041..178.041 rows=280989 loops=1)
                           ->  Bitmap Index Scan on idx__email_header__header_name  (cost=0.00..5880.97 rows=281247 width=0) (actual time=114.574..114.574 rows=280989 loops=1)
                                 Index Cond: (header_name = 'subject'::text)
               ->  Hash  (cost=34291.87..34291.87 rows=11020 width=120) (actual time=4772.148..4772.148 rows=35000 loops=1)
                     ->  Hash Join  (cost=24164.59..34291.87 rows=11020 width=120) (actual time=3131.067..4706.997 rows=35000 loops=1)
                           Hash Cond: (mime_part.email_id = email.email_id)
                           ->  Seq Scan on mime_part  (cost=0.00..8355.81 rows=265804 width=12) (actual time=0.038..514.291 rows=267890 loops=1)
                                 Filter: (mime_part_id = 0)
                           ->  Hash  (cost=24025.94..24025.94 rows=11092 width=112) (actual time=3130.982..3130.982 rows=35000 loops=1)
                                 ->  Hash Join  (cost=22244.54..24025.94 rows=11092 width=112) (actual time=996.556..3069.280 rows=35000 loops=1)
                                       Hash Cond: (eh_from.email_id = email.email_id)
                                       ->  Bitmap Heap Scan on email_header eh_from  (cost=15576.58..16041.55 rows=107156 width=104) (actual time=569.762..1932.017 rows=280990 loops=1)
                                             Recheck Cond: ((mime_part_id = 0) AND (header_name = 'from'::text))
                                             ->  BitmapAnd  (cost=15576.58..15576.58 rows=160 width=0) (actual time=532.217..532.217 rows=0 loops=1)
                                                   ->  Bitmap Index Scan on dummy_index  (cost=0.00..3724.22 rows=107156 width=0) (actual time=116.386..116.386 rows=280990 loops=1)
                                                   ->  Bitmap Index Scan on idx__email_header__from_local  (cost=0.00..5779.24 rows=107156 width=0) (actual time=174.883..174.883 rows=280990 loops=1)
                                                   ->  Bitmap Index Scan on dummy2_index  (cost=0.00..5992.25 rows=107156 width=0) (actual time=173.575..173.575 rows=280990 loops=1)
                                       ->  Hash  (cost=6321.79..6321.79 rows=27694 width=8) (actual time=426.739..426.739 rows=35000 loops=1)
                                             ->  Index Scan using idx__email__time on email  (cost=0.00..6321.79 rows=27694 width=8) (actual time=50.000..375.021 rows=35000 loops=1)
                                                   Index Cond: (("time" >= '2007-05-05 17:01:59'::timestamp without time zone) AND ("time" < '2007-05-05 18:01:59'::timestamp without time zone))
 Total runtime: 8160.442 ms

The estimates all look pretty good and reasonable.

A faster plan, however, is this:

                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=1920309.81..1920534.21 rows=11220 width=184) (actual time=5349.493..5587.536 rows=35000 loops=1)
   ->  Sort  (cost=1920309.81..1920337.86 rows=11220 width=184) (actual time=5349.427..5392.110 rows=35000 loops=1)
         Sort Key: eh_subj.header_body
         ->  Nested Loop  (cost=15576.58..1919555.05 rows=11220 width=184) (actual time=537.938..5094.377 rows=35000 loops=1)
               ->  Nested Loop  (cost=15576.58..475387.23 rows=11020 width=120) (actual time=537.858..4404.330 rows=35000 loops=1)
                     ->  Nested Loop  (cost=15576.58..430265.44 rows=11092 width=112) (actual time=537.768..4024.184 rows=35000 loops=1)
                           ->  Bitmap Heap Scan on email_header eh_from  (cost=15576.58..16041.55 rows=107156 width=104) (actual time=537.621..1801.032 rows=280990 loops=1)
                                 Recheck Cond: ((mime_part_id = 0) AND (header_name = 'from'::text))
                                 ->  BitmapAnd  (cost=15576.58..15576.58 rows=160 width=0) (actual time=500.006..500.006 rows=0 loops=1)
                                       ->  Bitmap Index Scan on dummy_index  (cost=0.00..3724.22 rows=107156 width=0) (actual time=85.025..85.025 rows=280990 loops=1)
                                       ->  Bitmap Index Scan on idx__email_header__from_local  (cost=0.00..5779.24 rows=107156 width=0) (actual time=173.006..173.006 rows=280990 loops=1)
                                       ->  Bitmap Index Scan on dummy2_index  (cost=0.00..5992.25 rows=107156 width=0) (actual time=174.463..174.463 rows=280990 loops=1)
                           ->  Index Scan using email_pkey on email  (cost=0.00..3.85 rows=1 width=8) (actual time=0.005..0.005 rows=0 loops=280990)
                                 Index Cond: (email.email_id = eh_from.email_id)
                                 Filter: (("time" >= '2007-05-05 17:01:59'::timestamp without time zone) AND ("time" < '2007-05-05 18:01:59'::timestamp without time zone))
                     ->  Index Scan using mime_part_pkey on mime_part  (cost=0.00..4.06 rows=1 width=12) (actual time=0.005..0.006 rows=1 loops=35000)
                           Index Cond: ((email.email_id = mime_part.email_id) AND (mime_part.mime_part_id = 0))
               ->  Index Scan using idx__email_header__email_id__mime_part_id on email_header eh_subj  (cost=0.00..130.89 rows=13 width=104) (actual time=0.009..0.015 rows=1 loops=35000)
                     Index Cond: ((email.email_id = eh_subj.email_id) AND (0 = eh_subj.mime_part_id))
                     Filter: (header_name = 'subject'::text)
 Total runtime: 5625.024 ms

Note how spectacularly overpriced this plan is.  The costs for the
nested loops are calculated approximately as the number of outer tuples
times the cost of the inner scan.  So slight overestimations of the
inner scans, such as

   ->  Index Scan using email_pkey on email  (cost=0.00..3.85 rows=1 width=8) (actual time=0.005..0.005 rows=0 loops=280990)

kill this calculation.

Most likely, all of this database is cached, so I tried reducing
seq_page_cost and random_page_cost, but I needed to turn them all the
way down to 0.02 or 0.03, which is almost down at the level of
cpu_tuple_cost.  Is that reasonable?  Or what is wrong here?

PostgreSQL 8.2.1 on x86_64-unknown-linux-gnu
work_mem = 256MB
effective_cache_size = 384MB

The machine has 1GB of RAM.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/
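To make the arithmetic Peter describes concrete, one can plug the
estimates from the faster plan into it.  This is only a back-of-the-
envelope sketch; the planner's real calculation also charges per-tuple
CPU costs, which account for the small remainder:

    -- outer scan total cost + estimated outer rows * inner indexscan cost
    SELECT 16041.55 + 107156 * 3.85;   -- ~428592

That is very close to the 430265.44 shown for the innermost Nested
Loop: each of the 107156 estimated iterations of the inner indexscan is
charged nearly its full standalone cost, even though in practice the
pages it touches are almost immediately all in cache.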
Peter Eisentraut <peter_e@gmx.net> writes:
> Note how spectacularly overpriced this plan is.

Hmm, I'd have expected it to discount the repeated indexscans a lot more
than it seems to be doing for you.  As an example in the regression
database, note what happens to the inner indexscan cost estimate when
the number of outer tuples grows:

regression=# set enable_hashjoin TO 0;
SET
regression=# set enable_mergejoin TO 0;
SET
regression=# set enable_bitmapscan TO 0;
SET
regression=# explain select * from tenk1 a join tenk1 b using (thousand) where a.unique1 = 1;
                                           QUERY PLAN
----------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..52.82 rows=10 width=484)
   ->  Index Scan using tenk1_unique1 on tenk1 a  (cost=0.00..8.27 rows=1 width=244)
         Index Cond: (unique1 = 1)
   ->  Index Scan using tenk1_thous_tenthous on tenk1 b  (cost=0.00..44.42 rows=10 width=244)
         Index Cond: (a.thousand = b.thousand)
(5 rows)

regression=# explain select * from tenk1 a join tenk1 b using (thousand) where a.ten = 1;
                                          QUERY PLAN
---------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..2531.08 rows=9171 width=484)
   ->  Seq Scan on tenk1 a  (cost=0.00..483.00 rows=900 width=244)
         Filter: (ten = 1)
   ->  Index Scan using tenk1_thous_tenthous on tenk1 b  (cost=0.00..2.15 rows=10 width=244)
         Index Cond: (a.thousand = b.thousand)
(5 rows)

This is with 8.2.4, but AFAICS from the CVS logs, 8.2's cost estimation
code hasn't changed since 8.2.1.  What do you get for a comparably
simple case?

			regards, tom lane
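A comparably simple case against Peter's schema might look like the
following.  This is a hypothetical query: the table and column names are
taken from the plans above, and the enable_* settings mirror Tom's
session to force a nestloop plan:

    SET enable_hashjoin TO 0;
    SET enable_mergejoin TO 0;
    SET enable_bitmapscan TO 0;
    EXPLAIN SELECT *
    FROM email JOIN email_header USING (email_id)
    WHERE email.email_id < 1000;

The thing to watch is whether the inner indexscan's cost estimate drops
as the estimated outer row count grows, as it does for tenk1 above.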
On Tuesday, 8 May 2007 17:53, Tom Lane wrote:
> Hmm, I'd have expected it to discount the repeated indexscans a lot more
> than it seems to be doing for you.  As an example in the regression
> database, note what happens to the inner indexscan cost estimate when
> the number of outer tuples grows:

I can reproduce your results in the regression test database.  8.2.1
and 8.2.4 behave the same.

I checked the code around cost_index(), and the assumptions appear to
be correct (at least this query doesn't produce wildly unusual data).
Apparently, however, the caching effects are much more significant than
the model takes into account.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/
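For reference, the model being discussed: cost_index() discounts
repeated scans using the Mackert-Lohman approximation of the number of
distinct pages fetched through a finite LRU cache.  Roughly as the
comments in costsize.c describe it (paraphrased from memory, so treat
the details as approximate):

    PF = min(2TNs/(2T+Ns), T)             when T <= b
         2TNs/(2T+Ns)                     when T > b and Ns <= 2Tb/(2T-b)
         b + (Ns - 2Tb/(2T-b))*(T-b)/T    when T > b and Ns > 2Tb/(2T-b)

    where T = # pages in the table, N = # tuples in the table,
          s = selectivity, and b = # buffer pages assumed available
          (derived from effective_cache_size, prorated over the total
          pages of all tables used in the query)

The key variable for this thread is b: if the planner believes the
cache is oversubscribed, b shrinks and the repeated-scan discount
largely disappears.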
I'm having something weird too... Look:

 Nested Loop Left Join  (cost=93.38..7276.26 rows=93 width=58) (actual time=99.211..4804.525 rows=2108 loops=1)
   ->  Hash Join  (cost=93.38..3748.18 rows=93 width=4) (actual time=0.686..20.632 rows=45 loops=1)
         Hash Cond: ((u.i)::text = (m.i)::text)
         ->  Seq Scan on u  (cost=0.00..2838.80 rows=10289 width=4) (actual time=0.010..7.813 rows=10291 loops=1)
         ->  Hash  (cost=87.30..87.30 rows=30 width=7) (actual time=0.445..0.445 rows=45 loops=1)
               ->  Index Scan using m_pkey on m  (cost=0.00..87.30 rows=30 width=7) (actual time=0.046..0.371 rows=45 loops=1)
                     Index Cond: (t = 1613)
                     Filter: ((a)::text = 'Y'::text)
   ->  Index Scan using s_pkey on s  (cost=0.00..37.33 rows=3 width=58) (actual time=19.864..106.198 rows=47 loops=45)
         Index Cond: ((u.i = s.u) AND ((s.p)::text = '4'::text) AND (s.t = 1613) AND ((s.c)::text = 'cmi.core.total_time'::text))
 Total runtime: 4805.975 ms

And with all the join methods Tom mentioned disabled:

 Nested Loop Left Join  (cost=0.00..16117.12 rows=93 width=58) (actual time=2.706..168.556 rows=2799 loops=1)
   ->  Nested Loop  (cost=0.00..13187.94 rows=93 width=4) (actual time=2.622..125.739 rows=50 loops=1)
         ->  Seq Scan on u  (cost=0.00..2838.80 rows=10289 width=4) (actual time=0.012..9.863 rows=10291 loops=1)
         ->  Index Scan using m_pkey on m  (cost=0.00..0.80 rows=1 width=7) (actual time=0.009..0.009 rows=0 loops=10291)
               Index Cond: ((m.t = 1615) AND ((u.i)::text = (m.i)::text))
               Filter: ((a)::text = 'Y'::text)
   ->  Index Scan using s_pkey on s  (cost=0.00..31.09 rows=2 width=58) (actual time=0.047..0.778 rows=56 loops=50)
         Index Cond: ((u.i = s.u) AND ((s.p)::text = '4'::text) AND (s.t = 1615) AND ((s.c)::text = 'cmi.core.total_time'::text))
 Total runtime: 169.836 ms

This is PostgreSQL 8.2.3 on x86_64-redhat-linux-gnu with
shared_buffers = 1640MB, effective_cache_size = 5400MB, and 8GB of RAM;
all shared_buffers blocks are in use (per pg_buffercache,
relfilenode IS NOT NULL).

Note that even when I set default_statistics_target to 500 and run
"ANALYZE s;", the estimated row count for the index scan on s does not
come close to the actual rows.  Could that be related?

2007/5/9, Peter Eisentraut <peter_e@gmx.net>:
> On Tuesday, 8 May 2007 17:53, Tom Lane wrote:
> > Hmm, I'd have expected it to discount the repeated indexscans a lot more
> > than it seems to be doing for you.  As an example in the regression
> > database, note what happens to the inner indexscan cost estimate when
> > the number of outer tuples grows:
>
> I can reproduce your results in the regression test database.  8.2.1
> and 8.2.4 behave the same.
>
> I checked the code around cost_index(), and the assumptions appear to
> be correct (at least this query doesn't produce wildly unusual data).
> Apparently, however, the caching effects are much more significant than
> the model takes into account.

--
Daniel Cristian Cruz
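On the row estimate for s: the index condition combines several
restrictions (s.p, s.t, s.c), and 8.2 multiplies their selectivities as
if they were independent, so if the columns are correlated, no
statistics target will fix the combined estimate.  A quick way to check
whether that is the issue (a sketch; column names are taken from the
index condition above):

    -- Compare per-column estimates against the combined one.
    -- If each is accurate alone but the combination is far off,
    -- cross-column correlation is the culprit.
    EXPLAIN SELECT * FROM s WHERE (p)::text = '4';
    EXPLAIN SELECT * FROM s WHERE t = 1613;
    EXPLAIN SELECT * FROM s WHERE (c)::text = 'cmi.core.total_time';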
Peter Eisentraut <peter_e@gmx.net> writes:
> On Tuesday, 8 May 2007 17:53, Tom Lane wrote:
>> Hmm, I'd have expected it to discount the repeated indexscans a lot more
>> than it seems to be doing for you.  As an example in the regression
>> database, note what happens to the inner indexscan cost estimate when
>> the number of outer tuples grows:

> I can reproduce your results in the regression test database.  8.2.1
> and 8.2.4 behave the same.

Well, there's something funny going on here.  You've got for instance

   ->  Index Scan using email_pkey on email  (cost=0.00..3.85 rows=1 width=8) (actual time=0.005..0.005 rows=0 loops=280990)
         Index Cond: (email.email_id = eh_from.email_id)
         Filter: (("time" >= '2007-05-05 17:01:59'::timestamp without time zone) AND ("time" < '2007-05-05 18:01:59'::timestamp without time zone))

on the inside of a nestloop whose outer side is predicted to return
107156 rows.  That should've been discounted to *way* less than 3.85
cost units per iteration.  Are you using any nondefault planner
settings?  How big are these tables, anyway?

			regards, tom lane
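Both of Tom's questions can be answered from the catalogs.  A sketch,
assuming standard 8.2 system views and functions (adjust table names as
needed):

    -- Settings that differ from their defaults:
    SELECT name, setting
    FROM pg_settings
    WHERE source NOT IN ('default', 'override');

    -- On-disk sizes of the tables involved:
    SELECT relname, pg_size_pretty(pg_relation_size(oid))
    FROM pg_class
    WHERE relname IN ('email', 'email_header', 'mime_part');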
"Daniel Cristian Cruz" <danielcristian@gmail.com> writes: > -> Nested Loop (cost=0.00..13187.94 rows=93 width=4) (actual time=2.622..125.739 rows=50 loops=1) > -> Seq Scan on u (cost=0.00..2838.80 rows=10289 width=4) (actual time=0.012..9.863 rows=10291 loops=1) > -> Index Scan using m_pkey on m (cost=0.00..0.80 rows=1 width=7) (actual time=0.009..0.009 rows=0 loops=10291) That's not discounting the nested loop for cache effect at all! What is your effective_cache_size for this? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
2007/5/9, Gregory Stark <stark@enterprisedb.com>:
> "Daniel Cristian Cruz" <danielcristian@gmail.com> writes:
>
> >    ->  Nested Loop  (cost=0.00..13187.94 rows=93 width=4) (actual time=2.622..125.739 rows=50 loops=1)
> >          ->  Seq Scan on u  (cost=0.00..2838.80 rows=10289 width=4) (actual time=0.012..9.863 rows=10291 loops=1)
> >          ->  Index Scan using m_pkey on m  (cost=0.00..0.80 rows=1 width=7) (actual time=0.009..0.009 rows=0 loops=10291)
>
> That's not discounting the nested loop for cache effect at all!
>
> What is your effective_cache_size for this?

effective_cache_size is 5400MB.

I forgot to mention a modification to the costs:

cpu_tuple_cost = 0.2

which forced the use of indexes.  I set it to 0.01, and the plan then
has an index scan on m below the hash on u, and is 15% slower:

         Hash Cond: ((u.i)::text = (m.i)::text)
         ->  Seq Scan on u  (cost=0.00..2838.80 rows=10289 width=4) (actual time=0.007..6.138 rows=10292 loops=1)
         ->  Hash  (cost=87.30..87.30 rows=30 width=7) (actual time=0.185..0.185 rows=50 loops=1)
               ->  Index Scan using m_pkey on m  (cost=0.00..87.30 rows=30 width=7) (actual time=0.021..0.144 rows=50 loops=1)
                     Index Cond: (t = 1615)
                     Filter: ((a)::text = 'Y'::text)

I'm still confused, since I don't understand what "That's not
discounting the nested loop for cache effect at all!" means.

Thanks for the help.

--
Daniel Cristian Cruz
On Wednesday, 9 May 2007 16:11, Tom Lane wrote:
> Well, there's something funny going on here.  You've got for instance
>
>    ->  Index Scan using email_pkey on email  (cost=0.00..3.85 rows=1 width=8) (actual time=0.005..0.005 rows=0 loops=280990)
>          Index Cond: (email.email_id = eh_from.email_id)
>          Filter: (("time" >= '2007-05-05 17:01:59'::timestamp without time zone) AND ("time" < '2007-05-05 18:01:59'::timestamp without time zone))
>
> on the inside of a nestloop whose outer side is predicted to return
> 107156 rows.  That should've been discounted to *way* less than 3.85
> cost units per iteration.

This is the new plan with 8.2.4.  It's still got the same problem,
though.

                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=5627064.21..5627718.73 rows=32726 width=184) (actual time=4904.834..5124.585 rows=35000 loops=1)
   ->  Sort  (cost=5627064.21..5627146.03 rows=32726 width=184) (actual time=4904.771..4947.892 rows=35000 loops=1)
         Sort Key: eh_subj.header_body
         ->  Nested Loop  (cost=0.00..5624610.06 rows=32726 width=184) (actual time=0.397..4628.141 rows=35000 loops=1)
               ->  Nested Loop  (cost=0.00..1193387.12 rows=28461 width=120) (actual time=0.322..3960.360 rows=35000 loops=1)
                     ->  Nested Loop  (cost=0.00..1081957.26 rows=28648 width=112) (actual time=0.238..3572.023 rows=35000 loops=1)
                           ->  Index Scan using dummy_index on email_header eh_from  (cost=0.00..13389.15 rows=280662 width=104) (actual time=0.133..1310.248 rows=280990 loops=1)
                           ->  Index Scan using email_pkey on email  (cost=0.00..3.79 rows=1 width=8) (actual time=0.005..0.005 rows=0 loops=280990)
                                 Index Cond: (email.email_id = eh_from.email_id)
                                 Filter: (("time" >= '2007-05-05 17:01:59'::timestamp without time zone) AND ("time" < '2007-05-05 18:01:59'::timestamp without time zone))
                     ->  Index Scan using mime_part_pkey on mime_part  (cost=0.00..3.88 rows=1 width=12) (actual time=0.005..0.006 rows=1 loops=35000)
                           Index Cond: ((email.email_id = mime_part.email_id) AND (mime_part.mime_part_id = 0))
               ->  Index Scan using idx__email_header__email_id__mime_part_id on email_header eh_subj  (cost=0.00..155.47 rows=18 width=104) (actual time=0.009..0.014 rows=1 loops=35000)
                     Index Cond: ((email.email_id = eh_subj.email_id) AND (0 = eh_subj.mime_part_id))
                     Filter: (header_name = 'subject'::text)
 Total runtime: 5161.390 ms

> Are you using any nondefault planner settings?

random_page_cost = 3
effective_cache_size = 384MB

> How big are these tables, anyway?

email         35 MB
email_header  421 MB
mime_part     37 MB

Everything is analyzed, vacuumed, and reindexed.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/
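The missing discount is visible directly in these numbers.  A
back-of-the-envelope check (a sketch; the small remainder is per-tuple
CPU cost):

    -- outer scan total cost + estimated outer rows * inner indexscan cost
    SELECT 13389.15 + 280662 * 3.79;   -- ~1077098

That is almost exactly the 1081957.26 shown for the innermost Nested
Loop: each of the 280662 estimated outer rows is charged nearly the
full 3.79 for its inner indexscan, i.e. essentially no cache-effect
discount is being applied.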
Peter Eisentraut <peter_e@gmx.net> writes:
>> Are you using any nondefault planner settings?

> random_page_cost = 3
> effective_cache_size = 384MB

>> How big are these tables, anyway?

> email         35 MB
> email_header  421 MB
> mime_part     37 MB

Hmmm ... I see at least part of the problem, which is that email_header
is joined twice in this query, which means that it's counted twice in
figuring the total volume of pages competing for cache space.  So the
thing thinks cache space is oversubscribed nearly 3X when in reality
the database is fully cached.  I remember having dithered about whether
to try to avoid counting the same physical relation more than once in
total_table_pages, but this example certainly suggests that we
shouldn't.  Meanwhile, do the estimates get better if you set
effective_cache_size to 1GB or so?

To return to your original comment: if you're trying to model a
situation with a fully cached database, I think it's sensible to set
random_page_cost = seq_page_cost = 0.1 or so.  You had mentioned having
to decrease them to 0.02, which seems unreasonably small to me too, but
maybe with the larger effective_cache_size you won't have to go that
far.

			regards, tom lane
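The oversubscription follows from the sizes Peter posted: with
email_header's 421 MB counted twice, the planner sees roughly
35 + 37 + 2 * 421 = 914 MB of table pages competing for a 384 MB cache.
Tom's two suggestions, as session-level settings one might try (the
values are the ones discussed above; seq_page_cost and GUC units like
'1GB' both exist as of 8.2):

    SET effective_cache_size = '1GB';
    SET seq_page_cost = 0.1;
    SET random_page_cost = 0.1;
    -- then re-run EXPLAIN ANALYZE on the query to compare plan choices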
On Wednesday, 9 May 2007 19:40, Tom Lane wrote:
> I remember having dithered about whether to try to avoid counting the
> same physical relation more than once in total_table_pages, but this
> example certainly suggests that we shouldn't.  Meanwhile, do the
> estimates get better if you set effective_cache_size to 1GB or so?

Yes, that makes the plan significantly cheaper (something like 500,000
instead of 5,000,000), but it is still a lot more expensive than the
hash join (about 100,000).

> To return to your original comment: if you're trying to model a
> situation with a fully cached database, I think it's sensible to set
> random_page_cost = seq_page_cost = 0.1 or so.  You had mentioned having
> to decrease them to 0.02, which seems unreasonably small to me too, but
> maybe with the larger effective_cache_size you won't have to go that
> far.

Heh, when I decrease these parameters, the hash join gets cheaper as
well.  I can't actually get it to pick the nested-loop join.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/
On Wednesday, 9 May 2007 19:40, Tom Lane wrote:
> Hmmm ... I see at least part of the problem, which is that email_header
> is joined twice in this query, which means that it's counted twice in
> figuring the total volume of pages competing for cache space.  So the
> thing thinks cache space is oversubscribed nearly 3X when in reality
> the database is fully cached.

I should add that other, similar queries in this database that do not
involve joining the same table twice produce seemingly optimal plans.
(It picks hash joins, which are actually faster than nested loops.)

--
Peter Eisentraut
http://developer.postgresql.org/~petere/
Peter Eisentraut <peter_e@gmx.net> writes:
> On Wednesday, 9 May 2007 19:40, Tom Lane wrote:
>> Hmmm ... I see at least part of the problem, which is that email_header
>> is joined twice in this query, which means that it's counted twice in
>> figuring the total volume of pages competing for cache space.  So the
>> thing thinks cache space is oversubscribed nearly 3X when in reality
>> the database is fully cached.

> I should add that other, similar queries in this database that do not
> involve joining the same table twice produce seemingly optimal plans.
> (It picks hash joins, which are actually faster than nested loops.)

It strikes me that in a situation like this, where the same table is
being scanned twice by concurrent indexscans, we ought to amortize the
fetches across *both* scans rather than treating them independently;
so there are actually two different ways in which we're being too
pessimistic about the indexscanning cost.  It's difficult to see how to
fix that in the current planner design, however: since it's a bottom-up
process, we have to cost the individual scans without any knowledge of
what approach will be chosen for other scans.

			regards, tom lane