Migration study, step 2: rewriting queries - Mailing list pgsql-performance

From Mikael Carneholm
Subject Migration study, step 2: rewriting queries
Date
Msg-id 7F10D26ECFA1FB458B89C5B4B0D72C2B3E3CF0@sesrv12.wirelesscar.com
Whole thread Raw
Responses Re: Migration study, step 2: rewriting queries
List pgsql-performance
Since my first post in this series, data has been successfully exported
from the proprietary db (in CSV format) and imported into postgres
(PostgreSQL 8.1.3 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.4
20050721 (Red Hat 3.4.4-2)) using
COPY. The tablespace holding the tables+indexes is now 60Gb, total time
for exporting+transfering over  network+importing+indexing+vacuuming is
about 7 hours.

I now have a query (originally using an analytical function) that
behaves strangely as soon as I go from 2 person_information__id:s to 3
person_information__id:s (see below):

//
// 2 PID:s
//
mica@TEST=> explain analyze select distinct on
(driver_activity.person_information__id)
TEST-> driver_activity.person_information__id,
TEST-> driver_activity.end_time as prev_timestamp
TEST-> from
TEST-> driver_activity
TEST-> where
TEST-> driver_activity.person_information__id in (5398,5399)
TEST-> and driver_activity.driver_activity_type__id = 5
TEST-> and driver_activity.start_time < '2006-04-01'
TEST-> order by
TEST-> driver_activity.person_information__id,
driver_activity.start_time desc;


QUERY PLAN

------------------------------------------------------------------------
------------------------------------------------------------------------
----------------------------
 Unique  (cost=1909.90..1912.32 rows=1 width=20) (actual
time=1.153..1.623 rows=2 loops=1)
   ->  Sort  (cost=1909.90..1911.11 rows=485 width=20) (actual
time=1.151..1.389 rows=213 loops=1)
         Sort Key: person_information__id, start_time
         ->  Bitmap Heap Scan on driver_activity  (cost=6.91..1888.26
rows=485 width=20) (actual time=0.141..0.677 rows=213 loops=1)
               Recheck Cond: (((person_information__id = 5398) AND
(driver_activity_type__id = 5)) OR ((person_information__id = 5399) AND
(driver_activity_type__id = 5)))
               Filter: (start_time < '2006-04-01 00:00:00'::timestamp
without time zone)
               ->  BitmapOr  (cost=6.91..6.91 rows=485 width=0) (actual
time=0.102..0.102 rows=0 loops=1)
                     ->  Bitmap Index Scan on xda_pi_dat
(cost=0.00..3.46 rows=243 width=0) (actual time=0.042..0.042 rows=56
loops=1)
                           Index Cond: ((person_information__id = 5398)
AND (driver_activity_type__id = 5))
                     ->  Bitmap Index Scan on xda_pi_dat
(cost=0.00..3.46 rows=243 width=0) (actual time=0.055..0.055 rows=157
loops=1)
                           Index Cond: ((person_information__id = 5399)
AND (driver_activity_type__id = 5))
 Total runtime: 1.693 ms
(12 rows)

//
// 3 PID:s
//
mica@TEST=> explain analyze select distinct on
(driver_activity.person_information__id)
TEST-> driver_activity.person_information__id,
TEST-> driver_activity.end_time as prev_timestamp
TEST-> from
TEST-> driver_activity
TEST-> where
TEST-> driver_activity.person_information__id in (5398,5399,5400)
TEST-> and driver_activity.driver_activity_type__id = 5
TEST-> and driver_activity.start_time < '2006-04-01'
TEST-> order by
TEST-> driver_activity.person_information__id,
driver_activity.start_time desc;

QUERY PLAN

------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
-----------------------------------------------------------------
 Unique  (cost=2808.35..2811.98 rows=1 width=20) (actual
time=5450.281..5450.948 rows=3 loops=1)
   ->  Sort  (cost=2808.35..2810.17 rows=727 width=20) (actual
time=5450.278..5450.607 rows=305 loops=1)
         Sort Key: person_information__id, start_time
         ->  Bitmap Heap Scan on driver_activity  (cost=2713.82..2773.80
rows=727 width=20) (actual time=5436.259..5449.043 rows=305 loops=1)
               Recheck Cond: ((((person_information__id = 5398) AND
(driver_activity_type__id = 5)) OR ((person_information__id = 5399) AND
(driver_activity_type__id = 5)) OR ((person_information__id = 5400) AND
(driver_activity_type__id = 5))) AND (driver_activity_type__id = 5))
               Filter: (start_time < '2006-04-01 00:00:00'::timestamp
without time zone)
               ->  BitmapAnd  (cost=2713.82..2713.82 rows=15 width=0)
(actual time=5436.148..5436.148 rows=0 loops=1)
                     ->  BitmapOr  (cost=10.37..10.37 rows=728 width=0)
(actual time=0.384..0.384 rows=0 loops=1)
                           ->  Bitmap Index Scan on xda_pi_dat
(cost=0.00..3.46 rows=243 width=0) (actual time=0.135..0.135 rows=56
loops=1)
                                 Index Cond: ((person_information__id =
5398) AND (driver_activity_type__id = 5))
                           ->  Bitmap Index Scan on xda_pi_dat
(cost=0.00..3.46 rows=243 width=0) (actual time=0.115..0.115 rows=157
loops=1)
                                 Index Cond: ((person_information__id =
5399) AND (driver_activity_type__id = 5))
                           ->  Bitmap Index Scan on xda_pi_dat
(cost=0.00..3.46 rows=243 width=0) (actual time=0.126..0.126 rows=93
loops=1)
                                 Index Cond: ((person_information__id =
5400) AND (driver_activity_type__id = 5))
                     ->  Bitmap Index Scan on xda_dat
(cost=0.00..2703.21 rows=474916 width=0) (actual time=5435.431..5435.431
rows=451541 loops=1)
                           Index Cond: (driver_activity_type__id = 5)
 Total runtime: 5451.094 ms
(17 rows)

Question: why is the extra step (Bitmap Index Scan on xda_dat)
introduced in the latter case? I can see it is originating from

((((person_information__id = 5398) AND (driver_activity_type__id = 5))
OR ((person_information__id = 5399) AND (driver_activity_type__id = 5))
OR ((person_information__id = 5400) AND (driver_activity_type__id = 5)))
AND (driver_activity_type__id = 5))

..which, indented for readability, looks like this:

(
    (
        (
            (person_information__id = 5398) AND
(driver_activity_type__id = 5)
        )
        OR
        (
            (person_information__id = 5399) AND
(driver_activity_type__id = 5)
        )
        OR
        (
            (person_information__id = 5400) AND
(driver_activity_type__id = 5)
        )
    )
    AND
    (driver_activity_type__id = 5)
)

..and this last AND seems unnessecary, since the predicate on
(driver_activity_type__id = 5) is included in each of the above
conditions.

Can this be a bug in the planner?

/Mikael

pgsql-performance by date:

Previous
From: Francisco Reyes
Date:
Subject: Re: Inserts optimization?
Next
From: Tom Lane
Date:
Subject: Re: Migration study, step 2: rewriting queries