Thread: Apparently useless bitmap scans
There's another odd thing about this plan from yesterday. Query: SELECT eh_subj.header_body AS subject, count(distinct eh_from.header_body) FROM email JOIN mime_part USING (email_id) JOIN email_header eh_subj USING (email_id, mime_part_id) JOIN email_header eh_from USING (email_id, mime_part_id) WHERE eh_subj.header_name = 'subject' AND eh_from.header_name = 'from' AND mime_part_id = 0 AND (time >= timestamp '2007-05-05 17:01:59' AND time < timestamp '2007-05-05 17:01:59' + interval '60 min') GROUP BY eh_subj.header_body; Plan: QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- GroupAggregate (cost=1920309.81..1920534.21 rows=11220 width=184) (actual time=5349.493..5587.536 rows=35000 loops=1) -> Sort (cost=1920309.81..1920337.86 rows=11220 width=184) (actual time=5349.427..5392.110 rows=35000 loops=1) Sort Key: eh_subj.header_body -> Nested Loop (cost=15576.58..1919555.05 rows=11220 width=184) (actual time=537.938..5094.377 rows=35000 loops=1) -> Nested Loop (cost=15576.58..475387.23 rows=11020 width=120) (actual time=537.858..4404.330 rows=35000loops=1) -> Nested Loop (cost=15576.58..430265.44 rows=11092 width=112) (actual time=537.768..4024.184 rows=35000loops=1) -> Bitmap Heap Scan on email_header eh_from (cost=15576.58..16041.55 rows=107156 width=104)(actual time=537.621..1801.032 rows=280990 loops=1) Recheck Cond: ((mime_part_id = 0) AND (header_name = 'from'::text)) -> BitmapAnd (cost=15576.58..15576.58 rows=160 width=0) (actual time=500.006..500.006rows=0 loops=1) -> Bitmap Index Scan on dummy_index (cost=0.00..3724.22 rows=107156 width=0) (actualtime=85.025..85.025 rows=280990 loops=1) -> Bitmap Index Scan on idx__email_header__from_local (cost=0.00..5779.24 rows=107156width=0) (actual time=173.006..173.006 rows=280990 loops=1) -> Bitmap Index Scan on dummy2_index (cost=0.00..5992.25 rows=107156 width=0) (actualtime=174.463..174.463 rows=280990 loops=1) -> Index Scan using email_pkey on email (cost=0.00..3.85 rows=1 width=8) (actual time=0.005..0.005rows=0 loops=280990) Index Cond: (email.email_id = eh_from.email_id) Filter: (("time" >= '2007-05-05 17:01:59'::timestamp without time zone) AND ("time" < '2007-05-0518:01:59'::timestamp without time zone)) -> Index Scan using mime_part_pkey on mime_part (cost=0.00..4.06 rows=1 width=12) (actual time=0.005..0.006rows=1 loops=35000) Index Cond: ((email.email_id = mime_part.email_id) AND (mime_part.mime_part_id = 0)) -> Index Scan using idx__email_header__email_id__mime_part_id on email_header eh_subj (cost=0.00..130.89rows=13 width=104) (actual time=0.009..0.015 rows=1 loops=35000) Index Cond: ((email.email_id = eh_subj.email_id) AND (0 = eh_subj.mime_part_id)) Filter: (header_name = 'subject'::text) Total runtime: 5625.024 ms I'm wondering what it wants to achieve with these three index scans: -> Bitmap Index Scan on dummy_index (cost=0.00..3724.22 rows=107156 width=0) (actual time=85.025..85.025 rows=280990loops=1) -> Bitmap Index Scan on idx__email_header__from_local (cost=0.00..5779.24 rows=107156 width=0) (actual time=173.006..173.006rows=280990 loops=1) -> Bitmap Index Scan on dummy2_index (cost=0.00..5992.25 rows=107156 width=0) (actual time=174.463..174.463 rows=280990loops=1) The indexes in question are: CREATE INDEX dummy_index ON email_header ((555)) WHERE mime_part_id = 0 AND header_name = 'from'; CREATE INDEX dummy2_index ON email_header (substr(header_body,5)) WHERE mime_part_id = 0 AND header_name = 'from'; CREATE INDEX idx__email_header__from_local ON email_header (get_localpart(header_body)) WHERE mime_part_id = 0 AND header_name= 'from'; It appears to want to use these indexes to get the restriction AND eh_from.header_name = 'from' AND mime_part_id = 0 from the query, but why does it need three of them to do it, when all of them have the same predicate and none of them has an indexed expression that appears in the query? There are more partial indexes with the same predicate, but it appears to always use three. (The two "dummy" indexes are just leftovers from these experiments.) -- Peter Eisentraut http://developer.postgresql.org/~petere/
Peter Eisentraut wrote: > There's another odd thing about this plan from yesterday. Is this still 8.2.1? The logic to choose bitmap indexes was rewritten just before 8.2.4, 2007-04-17 16:03 tgl * src/backend/optimizer/path/indxpath.c: Rewrite choose_bitmap_and() to make it more robust in the presence of competing alternatives for indexes to use in a bitmap scan. The former coding took estimated selectivity as an overriding factor, causing it to sometimes choose indexes that were much slower to scan than ones with a slightly worse selectivity. It was also too narrow-minded about which combinations of indexes to consider ANDing. The rewrite makes it pay more attention to index scan cost than selectivity; this seems sane since it's impossible to have very bad selectivity with low cost, whereas the reverse isn't true. Also, we now consider each index alone, as well as adding each index to an AND-group led by each prior index, for a total of about O(N^2) rather than O(N) combinations considered. This makes the results much less dependent on the exact order in which the indexes are considered. It's still a lot cheaper than an O(2^N) exhaustive search. A prefilter step eliminates all but the cheapest of those indexes using the same set of WHERE conditions, to keep the effective value of N down in scenarios where the DBA has created lots of partially-redundant indexes. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Peter Eisentraut <peter_e@gmx.net> writes: > I'm wondering what it wants to achieve with these three index scans: See if you still get that with 8.2.4. choose_bitmap_and was fairly far out in left field before that :-( ... particularly for cases with partially redundant indexes available. regards, tom lane
Am Mittwoch, 9. Mai 2007 16:29 schrieb Alvaro Herrera: > Peter Eisentraut wrote: > > There's another odd thing about this plan from yesterday. > > Is this still 8.2.1? The logic to choose bitmap indexes was rewritten > just before 8.2.4, OK, upgrading to 8.2.4 fixes this odd plan choice. The query does run a bit faster too, but the cost estimate has actually gone up! 8.2.1: QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- GroupAggregate (cost=87142.18..87366.58 rows=11220 width=184) (actual time=7883.541..8120.647 rows=35000 loops=1) -> Sort (cost=87142.18..87170.23 rows=11220 width=184) (actual time=7883.471..7926.031 rows=35000 loops=1) Sort Key: eh_subj.header_body -> Hash Join (cost=46283.30..86387.42 rows=11220 width=184) (actual time=5140.182..7635.615 rows=35000 loops=1) Hash Cond: (eh_subj.email_id = email.email_id) -> Bitmap Heap Scan on email_header eh_subj (cost=11853.68..50142.87 rows=272434 width=104) (actual time=367.956..1719.736rows=280989 loops=1) Recheck Cond: ((mime_part_id = 0) AND (header_name = 'subject'::text)) -> BitmapAnd (cost=11853.68..11853.68 rows=27607 width=0) (actual time=326.507..326.507 rows=0 loops=1) -> Bitmap Index Scan on idx__email_header__header_body_subject (cost=0.00..5836.24 rows=272434width=0) (actual time=178.041..178.041 rows=280989 loops=1) -> Bitmap Index Scan on idx__email_header__header_name (cost=0.00..5880.97 rows=281247 width=0)(actual time=114.574..114.574 rows=280989 loops=1) Index Cond: (header_name = 'subject'::text) -> Hash (cost=34291.87..34291.87 rows=11020 width=120) (actual time=4772.148..4772.148 rows=35000 loops=1) -> Hash Join (cost=24164.59..34291.87 rows=11020 width=120) (actual time=3131.067..4706.997 rows=35000loops=1) Hash Cond: (mime_part.email_id = email.email_id) -> Seq Scan on mime_part (cost=0.00..8355.81 rows=265804 width=12) (actual time=0.038..514.291rows=267890 loops=1) Filter: (mime_part_id = 0) -> Hash (cost=24025.94..24025.94 rows=11092 width=112) (actual time=3130.982..3130.982 rows=35000loops=1) -> Hash Join (cost=22244.54..24025.94 rows=11092 width=112) (actual time=996.556..3069.280rows=35000 loops=1) Hash Cond: (eh_from.email_id = email.email_id) -> Bitmap Heap Scan on email_header eh_from (cost=15576.58..16041.55 rows=107156width=104) (actual time=569.762..1932.017 rows=280990 loops=1) Recheck Cond: ((mime_part_id = 0) AND (header_name = 'from'::text)) -> BitmapAnd (cost=15576.58..15576.58 rows=160 width=0) (actual time=532.217..532.217rows=0 loops=1) -> Bitmap Index Scan on dummy_index (cost=0.00..3724.22 rows=107156width=0) (actual time=116.386..116.386 rows=280990 loops=1) -> Bitmap Index Scan on idx__email_header__from_local (cost=0.00..5779.24rows=107156 width=0) (actual time=174.883..174.883 rows=280990 loops=1) -> Bitmap Index Scan on dummy2_index (cost=0.00..5992.25 rows=107156width=0) (actual time=173.575..173.575 rows=280990 loops=1) -> Hash (cost=6321.79..6321.79 rows=27694 width=8) (actual time=426.739..426.739rows=35000 loops=1) -> Index Scan using idx__email__time on email (cost=0.00..6321.79 rows=27694width=8) (actual time=50.000..375.021 rows=35000 loops=1) Index Cond: (("time" >= '2007-05-05 17:01:59'::timestamp without timezone) AND ("time" < '2007-05-05 18:01:59'::timestamp without time zone)) Total runtime: 8160.442 ms 8.2.4: QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- GroupAggregate (cost=100086.52..100658.46 rows=28597 width=182) (actual time=6063.766..6281.818 rows=35000 loops=1) -> Sort (cost=100086.52..100158.01 rows=28597 width=182) (actual time=6063.697..6105.215 rows=35000 loops=1) Sort Key: eh_subj.header_body -> Hash Join (cost=36729.27..97969.83 rows=28597 width=182) (actual time=3690.316..5790.094 rows=35000 loops=1) Hash Cond: (eh_subj.email_id = email.email_id) -> Bitmap Heap Scan on email_header eh_subj (cost=5903.20..63844.68 rows=267832 width=103) (actual time=214.699..1564.804rows=280989 loops=1) Recheck Cond: ((mime_part_id = 0) AND (header_name = 'subject'::text)) -> Bitmap Index Scan on idx__email_header__header_body_subject (cost=0.00..5836.24 rows=267832 width=0)(actual time=172.188..172.188 rows=280989 loops=1) -> Hash (cost=30468.98..30468.98 rows=28567 width=119) (actual time=3475.484..3475.484 rows=35000 loops=1) -> Hash Join (cost=13773.73..30468.98 rows=28567 width=119) (actual time=1260.579..3409.443 rows=35000loops=1) Hash Cond: (eh_from.email_id = email.email_id) -> Index Scan using dummy_index on email_header eh_from (cost=0.00..13286.00 rows=277652 width=103)(actual time=0.076..1391.974 rows=280990 loops=1) -> Hash (cost=13429.63..13429.63 rows=27528 width=20) (actual time=1260.422..1260.422 rows=35000loops=1) -> Hash Join (cost=1799.41..13429.63 rows=27528 width=20) (actual time=114.765..1206.500rows=35000 loops=1) Hash Cond: (mime_part.email_id = email.email_id) -> Seq Scan on mime_part (cost=0.00..8355.81 rows=266589 width=12) (actual time=0.036..407.539rows=267890 loops=1) Filter: (mime_part_id = 0) -> Hash (cost=1454.07..1454.07 rows=27627 width=8) (actual time=114.644..114.644rows=35000 loops=1) -> Index Scan using idx__email__time on email (cost=0.00..1454.07 rows=27627width=8) (actual time=0.144..63.017 rows=35000 loops=1) Index Cond: (("time" >= '2007-05-05 17:01:59'::timestamp without timezone) AND ("time" < '2007-05-05 18:01:59'::timestamp without time zone)) Total runtime: 6320.790 ms (21 Zeilen) The only significant change is that the first Bitmap Heap Scan (line 6) became more expensive. You will notice that in the old plan, you had a pretty good correspondence of 10 cost units to 1 millisecond throughout, whereas in the new plan that does not apply to said Bitmap Heap Scan. I'm not sure whether that is cause for concern. -- Peter Eisentraut http://developer.postgresql.org/~petere/
Peter Eisentraut <peter_e@gmx.net> writes: > OK, upgrading to 8.2.4 fixes this odd plan choice. The query does run > a bit faster too, but the cost estimate has actually gone up! Yeah, because the former code was making an unrealistically small estimate of the number of tuples found by BitmapAnd (due to double-counting the selectivities of redundant indexes), and of course that means a smaller estimate of the cost to fetch them in the bitmap heap scan. regards, tom lane