Thread: Apparently useless bitmap scans

Apparently useless bitmap scans

From

Peter Eisentraut

Date:

09 May 2007, 09:11:07

There's another odd thing about this plan from yesterday.

Query:

        SELECT
                eh_subj.header_body AS subject,
                count(distinct eh_from.header_body)
        FROM
                email JOIN mime_part USING (email_id)
                JOIN email_header eh_subj USING (email_id, mime_part_id)
                JOIN email_header eh_from USING (email_id, mime_part_id)
        WHERE
                eh_subj.header_name = 'subject'
                AND eh_from.header_name = 'from'
                AND mime_part_id = 0
                AND (time >= timestamp '2007-05-05 17:01:59' AND time < timestamp '2007-05-05 17:01:59' + interval '60
min')
        GROUP BY
                eh_subj.header_body;

Plan:

                                                                                        QUERY PLAN
                                                                   

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=1920309.81..1920534.21 rows=11220 width=184) (actual time=5349.493..5587.536 rows=35000 loops=1)
   ->  Sort  (cost=1920309.81..1920337.86 rows=11220 width=184) (actual time=5349.427..5392.110 rows=35000 loops=1)
         Sort Key: eh_subj.header_body
         ->  Nested Loop  (cost=15576.58..1919555.05 rows=11220 width=184) (actual time=537.938..5094.377 rows=35000
loops=1)
               ->  Nested Loop  (cost=15576.58..475387.23 rows=11020 width=120) (actual time=537.858..4404.330
rows=35000loops=1) 
                     ->  Nested Loop  (cost=15576.58..430265.44 rows=11092 width=112) (actual time=537.768..4024.184
rows=35000loops=1) 
                           ->  Bitmap Heap Scan on email_header eh_from  (cost=15576.58..16041.55 rows=107156
width=104)(actual time=537.621..1801.032 rows=280990 loops=1) 
                                 Recheck Cond: ((mime_part_id = 0) AND (header_name = 'from'::text))
                                 ->  BitmapAnd  (cost=15576.58..15576.58 rows=160 width=0) (actual
time=500.006..500.006rows=0 loops=1) 
                                       ->  Bitmap Index Scan on dummy_index  (cost=0.00..3724.22 rows=107156 width=0)
(actualtime=85.025..85.025 rows=280990 loops=1) 
                                       ->  Bitmap Index Scan on idx__email_header__from_local  (cost=0.00..5779.24
rows=107156width=0) (actual time=173.006..173.006 rows=280990 loops=1) 
                                       ->  Bitmap Index Scan on dummy2_index  (cost=0.00..5992.25 rows=107156 width=0)
(actualtime=174.463..174.463 rows=280990 loops=1) 
                           ->  Index Scan using email_pkey on email  (cost=0.00..3.85 rows=1 width=8) (actual
time=0.005..0.005rows=0 loops=280990) 
                                 Index Cond: (email.email_id = eh_from.email_id)
                                 Filter: (("time" >= '2007-05-05 17:01:59'::timestamp without time zone) AND ("time" <
'2007-05-0518:01:59'::timestamp without time zone)) 
                     ->  Index Scan using mime_part_pkey on mime_part  (cost=0.00..4.06 rows=1 width=12) (actual
time=0.005..0.006rows=1 loops=35000) 
                           Index Cond: ((email.email_id = mime_part.email_id) AND (mime_part.mime_part_id = 0))
               ->  Index Scan using idx__email_header__email_id__mime_part_id on email_header eh_subj
(cost=0.00..130.89rows=13 width=104) (actual time=0.009..0.015 rows=1 loops=35000) 
                     Index Cond: ((email.email_id = eh_subj.email_id) AND (0 = eh_subj.mime_part_id))
                     Filter: (header_name = 'subject'::text)
 Total runtime: 5625.024 ms


I'm wondering what it wants to achieve with these three index scans:

    ->  Bitmap Index Scan on dummy_index  (cost=0.00..3724.22 rows=107156 width=0) (actual time=85.025..85.025
rows=280990loops=1) 
    ->  Bitmap Index Scan on idx__email_header__from_local  (cost=0.00..5779.24 rows=107156 width=0) (actual
time=173.006..173.006rows=280990 loops=1) 
    ->  Bitmap Index Scan on dummy2_index  (cost=0.00..5992.25 rows=107156 width=0) (actual time=174.463..174.463
rows=280990loops=1) 

The indexes in question are:

CREATE INDEX dummy_index ON email_header ((555)) WHERE mime_part_id = 0 AND header_name = 'from';
CREATE INDEX dummy2_index ON email_header (substr(header_body,5)) WHERE mime_part_id = 0 AND header_name = 'from';
CREATE INDEX idx__email_header__from_local ON email_header (get_localpart(header_body)) WHERE mime_part_id = 0 AND
header_name= 'from'; 

It appears to want to use these indexes to get the restriction

                AND eh_from.header_name = 'from'
                AND mime_part_id = 0

from the query, but why does it need three of them to do it, when all
of them have the same predicate and none of them has an indexed
expression that appears in the query?

There are more partial indexes with the same predicate, but it appears
to always use three.  (The two "dummy" indexes are just leftovers from
these experiments.)

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: Apparently useless bitmap scans

From

Alvaro Herrera

Date:

09 May 2007, 11:29:29

Peter Eisentraut wrote:
> There's another odd thing about this plan from yesterday.

Is this still 8.2.1?  The logic to choose bitmap indexes was rewritten
just before 8.2.4,

2007-04-17 16:03  tgl

    * src/backend/optimizer/path/indxpath.c:

Rewrite choose_bitmap_and() to make it more robust in the presence of
competing alternatives for indexes to use in a bitmap scan.  The former
coding took estimated selectivity as an overriding factor, causing it to
sometimes choose indexes that were much slower to scan than ones with a
slightly worse selectivity.  It was also too narrow-minded about which
combinations of indexes to consider ANDing.  The rewrite makes it pay more
attention to index scan cost than selectivity; this seems sane since it's
impossible to have very bad selectivity with low cost, whereas the reverse
isn't true.  Also, we now consider each index alone, as well as adding
each index to an AND-group led by each prior index, for a total of about
O(N^2) rather than O(N) combinations considered.  This makes the results
much less dependent on the exact order in which the indexes are
considered.  It's still a lot cheaper than an O(2^N) exhaustive search.
A prefilter step eliminates all but the cheapest of those indexes using
the same set of WHERE conditions, to keep the effective value of N down in
scenarios where the DBA has created lots of partially-redundant indexes.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: Apparently useless bitmap scans

From

Tom Lane

Date:

09 May 2007, 11:49:15

Peter Eisentraut <peter_e@gmx.net> writes:
> I'm wondering what it wants to achieve with these three index scans:

See if you still get that with 8.2.4.  choose_bitmap_and was fairly far
out in left field before that :-( ... particularly for cases with
partially redundant indexes available.

            regards, tom lane

Re: Apparently useless bitmap scans

From

Peter Eisentraut

Date:

09 May 2007, 12:26:13

Am Mittwoch, 9. Mai 2007 16:29 schrieb Alvaro Herrera:
> Peter Eisentraut wrote:
> > There's another odd thing about this plan from yesterday.
>
> Is this still 8.2.1?  The logic to choose bitmap indexes was rewritten
> just before 8.2.4,

OK, upgrading to 8.2.4 fixes this odd plan choice.  The query does run
a bit faster too, but the cost estimate has actually gone up!

8.2.1:

                                                                                              QUERY PLAN
                                                                               

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=87142.18..87366.58 rows=11220 width=184) (actual time=7883.541..8120.647 rows=35000 loops=1)
   ->  Sort  (cost=87142.18..87170.23 rows=11220 width=184) (actual time=7883.471..7926.031 rows=35000 loops=1)
         Sort Key: eh_subj.header_body
         ->  Hash Join  (cost=46283.30..86387.42 rows=11220 width=184) (actual time=5140.182..7635.615 rows=35000
loops=1)
               Hash Cond: (eh_subj.email_id = email.email_id)
               ->  Bitmap Heap Scan on email_header eh_subj  (cost=11853.68..50142.87 rows=272434 width=104) (actual
time=367.956..1719.736rows=280989 loops=1) 
                     Recheck Cond: ((mime_part_id = 0) AND (header_name = 'subject'::text))
                     ->  BitmapAnd  (cost=11853.68..11853.68 rows=27607 width=0) (actual time=326.507..326.507 rows=0
loops=1)
                           ->  Bitmap Index Scan on idx__email_header__header_body_subject  (cost=0.00..5836.24
rows=272434width=0) (actual time=178.041..178.041 rows=280989 loops=1) 
                           ->  Bitmap Index Scan on idx__email_header__header_name  (cost=0.00..5880.97 rows=281247
width=0)(actual time=114.574..114.574 rows=280989 loops=1) 
                                 Index Cond: (header_name = 'subject'::text)
               ->  Hash  (cost=34291.87..34291.87 rows=11020 width=120) (actual time=4772.148..4772.148 rows=35000
loops=1)
                     ->  Hash Join  (cost=24164.59..34291.87 rows=11020 width=120) (actual time=3131.067..4706.997
rows=35000loops=1) 
                           Hash Cond: (mime_part.email_id = email.email_id)
                           ->  Seq Scan on mime_part  (cost=0.00..8355.81 rows=265804 width=12) (actual
time=0.038..514.291rows=267890 loops=1) 
                                 Filter: (mime_part_id = 0)
                           ->  Hash  (cost=24025.94..24025.94 rows=11092 width=112) (actual time=3130.982..3130.982
rows=35000loops=1) 
                                 ->  Hash Join  (cost=22244.54..24025.94 rows=11092 width=112) (actual
time=996.556..3069.280rows=35000 loops=1) 
                                       Hash Cond: (eh_from.email_id = email.email_id)
                                       ->  Bitmap Heap Scan on email_header eh_from  (cost=15576.58..16041.55
rows=107156width=104) (actual time=569.762..1932.017 rows=280990 loops=1) 
                                             Recheck Cond: ((mime_part_id = 0) AND (header_name = 'from'::text))
                                             ->  BitmapAnd  (cost=15576.58..15576.58 rows=160 width=0) (actual
time=532.217..532.217rows=0 loops=1) 
                                                   ->  Bitmap Index Scan on dummy_index  (cost=0.00..3724.22
rows=107156width=0) (actual time=116.386..116.386 rows=280990 loops=1) 
                                                   ->  Bitmap Index Scan on idx__email_header__from_local
(cost=0.00..5779.24rows=107156 width=0) (actual time=174.883..174.883 rows=280990 loops=1) 
                                                   ->  Bitmap Index Scan on dummy2_index  (cost=0.00..5992.25
rows=107156width=0) (actual time=173.575..173.575 rows=280990 loops=1) 
                                       ->  Hash  (cost=6321.79..6321.79 rows=27694 width=8) (actual
time=426.739..426.739rows=35000 loops=1) 
                                             ->  Index Scan using idx__email__time on email  (cost=0.00..6321.79
rows=27694width=8) (actual time=50.000..375.021 rows=35000 loops=1) 
                                                   Index Cond: (("time" >= '2007-05-05 17:01:59'::timestamp without
timezone) AND ("time" < '2007-05-05 18:01:59'::timestamp without time zone)) 
 Total runtime: 8160.442 ms


8.2.4:

                                                                                            QUERY PLAN
                                                                          

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=100086.52..100658.46 rows=28597 width=182) (actual time=6063.766..6281.818 rows=35000 loops=1)
   ->  Sort  (cost=100086.52..100158.01 rows=28597 width=182) (actual time=6063.697..6105.215 rows=35000 loops=1)
         Sort Key: eh_subj.header_body
         ->  Hash Join  (cost=36729.27..97969.83 rows=28597 width=182) (actual time=3690.316..5790.094 rows=35000
loops=1)
               Hash Cond: (eh_subj.email_id = email.email_id)
               ->  Bitmap Heap Scan on email_header eh_subj  (cost=5903.20..63844.68 rows=267832 width=103) (actual
time=214.699..1564.804rows=280989 loops=1) 
                     Recheck Cond: ((mime_part_id = 0) AND (header_name = 'subject'::text))
                     ->  Bitmap Index Scan on idx__email_header__header_body_subject  (cost=0.00..5836.24 rows=267832
width=0)(actual time=172.188..172.188 rows=280989 loops=1) 
               ->  Hash  (cost=30468.98..30468.98 rows=28567 width=119) (actual time=3475.484..3475.484 rows=35000
loops=1)
                     ->  Hash Join  (cost=13773.73..30468.98 rows=28567 width=119) (actual time=1260.579..3409.443
rows=35000loops=1) 
                           Hash Cond: (eh_from.email_id = email.email_id)
                           ->  Index Scan using dummy_index on email_header eh_from  (cost=0.00..13286.00 rows=277652
width=103)(actual time=0.076..1391.974 rows=280990 loops=1) 
                           ->  Hash  (cost=13429.63..13429.63 rows=27528 width=20) (actual time=1260.422..1260.422
rows=35000loops=1) 
                                 ->  Hash Join  (cost=1799.41..13429.63 rows=27528 width=20) (actual
time=114.765..1206.500rows=35000 loops=1) 
                                       Hash Cond: (mime_part.email_id = email.email_id)
                                       ->  Seq Scan on mime_part  (cost=0.00..8355.81 rows=266589 width=12) (actual
time=0.036..407.539rows=267890 loops=1) 
                                             Filter: (mime_part_id = 0)
                                       ->  Hash  (cost=1454.07..1454.07 rows=27627 width=8) (actual
time=114.644..114.644rows=35000 loops=1) 
                                             ->  Index Scan using idx__email__time on email  (cost=0.00..1454.07
rows=27627width=8) (actual time=0.144..63.017 rows=35000 loops=1) 
                                                   Index Cond: (("time" >= '2007-05-05 17:01:59'::timestamp without
timezone) AND ("time" < '2007-05-05 18:01:59'::timestamp without time zone)) 
 Total runtime: 6320.790 ms
(21 Zeilen)


The only significant change is that the first Bitmap Heap Scan (line 6)
became more expensive.  You will notice that in the old plan, you had a
pretty good correspondence of 10 cost units to 1 millisecond throughout,
whereas in the new plan that does not apply to said Bitmap Heap Scan.
I'm not sure whether that is cause for concern.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: Apparently useless bitmap scans

From

Tom Lane

Date:

09 May 2007, 12:56:23

Peter Eisentraut <peter_e@gmx.net> writes:
> OK, upgrading to 8.2.4 fixes this odd plan choice.  The query does run
> a bit faster too, but the cost estimate has actually gone up!

Yeah, because the former code was making an unrealistically small
estimate of the number of tuples found by BitmapAnd (due to
double-counting the selectivities of redundant indexes), and of course
that means a smaller estimate of the cost to fetch them in the bitmap
heap scan.

            regards, tom lane