Thread: [8.1.4] Help optimizing query

[8.1.4] Help optimizing query

From: Hayes
Hi all,

I need help making the query below go faster.  There are about 5 of
these being done in one report.  Some use joins instead of subselects,
but they all run at about the same speed.  I'm not much of a SQL person,
so I just threw indices on everything involved, but it still takes at
least 19 seconds and sometimes as much as 60 seconds.  I'd be happy to
get it down to about 5 seconds.

Other info:  random_page_cost is 4, since I/O on this box is kinda slow
(it's a converted desktop with a single IDE drive).  shared_buffers is
12500 and effective_cache_size is 100M.  I upped the statistics target
on entry_date and session_id to 1000 and analyzed the tables after
modifying the statistics.  The actual deployment platform is a lot
beefier, but I'd like these queries to at least be tolerable on this
machine.
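
For what it's worth, the statistics change was done with statements
along these lines:

  ALTER TABLE user_tracking ALTER COLUMN entry_date SET STATISTICS 1000;
  ALTER TABLE user_tracking ALTER COLUMN session_id SET STATISTICS 1000;
  ANALYZE user_tracking;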

I can see that the estimate for the GroupAggregate is off, if I'm
interpreting things correctly, but I don't know what to do about it.

tia,
arturo

Query:

SELECT subscription_id,
       to_char(sum(session_length), 'HH24:MI:SS') as session_length,
       sum(hits) as hits,
       2006 as theYear,
       2 as theQuarter,
       sum(count) as count
FROM (
     SELECT subscription_id,
            count(distinct session_id) as count,
            age(MAX(entry_date), MIN(entry_date)) as session_length,
            COUNT(action) as hits
     FROM extended_user JOIN user_tracking USING (user_id)
     WHERE subscription_id > 0 AND
           EXTRACT(year from entry_date) = 2006 AND
           EXTRACT(quarter from entry_date) = 2
     GROUP BY session_id, subscription_id
) as session_stuff
WHERE subscription_id > 0
GROUP BY subscription_id
ORDER BY subscription_id;


Sort  (cost=123305.88..123306.38 rows=200 width=36) (actual time=75039.706..75040.500 rows=258 loops=1)
   Sort Key: session_stuff.subscription_id
   ->  HashAggregate  (cost=123294.24..123298.24 rows=200 width=36) (actual time=75036.487..75038.360 rows=258 loops=1)
         ->  GroupAggregate  (cost=108839.34..118475.94 rows=240915 width=72) (actual time=68016.583..74702.710 rows=38369 loops=1)
               ->  Sort  (cost=108839.34..109441.63 rows=240915 width=72) (actual time=67978.193..68982.962 rows=245727 loops=1)
                     Sort Key: user_tracking.session_id, extended_user.subscription_id
                     ->  Hash Join  (cost=7746.59..75492.37 rows=240915 width=72) (actual time=16944.487..50737.230 rows=245727 loops=1)
                           Hash Cond: ("outer".user_id = "inner".user_id)
                           ->  Bitmap Heap Scan on user_tracking  (cost=7524.10..68644.10 rows=240950 width=72) (actual time=16843.695..48306.383 rows=258923 loops=1)
                                 Recheck Cond: ((date_part('quarter'::text, entry_date) = 2::double precision) AND (date_part('year'::text, entry_date) = 2006::double precision))
                                 ->  BitmapAnd  (cost=7524.10..7524.10 rows=240950 width=0) (actual time=16779.178..16779.178 rows=0 loops=1)
                                       ->  Bitmap Index Scan on user_tracking_quarter_idx  (cost=0.00..3331.51 rows=533288 width=0) (actual time=9079.545..9079.545 rows=533492 loops=1)
                                             Index Cond: (date_part('quarter'::text, entry_date) = 2::double precision)
                                       ->  Bitmap Index Scan on user_tracking_year_idx  (cost=0.00..4192.34 rows=671239 width=0) (actual time=7685.906..7685.906 rows=671787 loops=1)
                                             Index Cond: (date_part('year'::text, entry_date) = 2006::double precision)
                           ->  Hash  (cost=206.42..206.42 rows=6428 width=8) (actual time=100.754..100.754 rows=6411 loops=1)
                                 ->  Seq Scan on extended_user  (cost=0.00..206.42 rows=6428 width=8) (actual time=0.020..28.873 rows=6411 loops=1)
                                       Filter: ((subscription_id > 0) AND (subscription_id > 0))
 Total runtime: 75069.453 ms

Tables:

This one has about 6-7k rows.

               Table "public.extended_user"
      Column       |           Type           | Modifiers
-------------------+--------------------------+-----------
 create_date       | timestamp with time zone | not null
 email             | character varying(99)    |
 first_name        | character varying(99)    | not null
 last_name         | character varying(99)    | not null
 license_agreement | boolean                  | not null
 license_date      | timestamp with time zone |
 password          | character varying(32)    | not null
 subscription_id   | integer                  | not null
 user_id           | integer                  | not null
 user_name         | character varying(99)    | not null
Indexes:
    "extended_user_pkey" PRIMARY KEY, btree (user_id)
    "extended_user_subscription_id_idx" btree (subscription_id)
    "extended_user_subscription_idx" btree (subscription_id)
Foreign-key constraints:
    "extended_user_subscription_id_fkey" FOREIGN KEY (subscription_id)
REFERENCES subscription(subscription_id) DEFERRABLE INITIALLY DEFERRED

This one has about 2k rows.

               Table "public.subscription"
      Column      |           Type           | Modifiers
------------------+--------------------------+-----------
 allow_printing   | boolean                  | not null
 company_id       | character varying(50)    | not null
 company_name     | character varying(100)   | not null
 end_date         | timestamp with time zone |
 licenses         | integer                  | not null
 pass_through_key | character varying(50)    |
 start_date       | timestamp with time zone | not null
 subscription_id  | integer                  | not null
Indexes:
    "subscription_pkey" PRIMARY KEY, btree (subscription_id)


This one has about 1.4M rows.  It's kind of a log of pages visited.

                              Table "public.user_tracking"
      Column      |            Type             |                         Modifiers
------------------+-----------------------------+---------------------------------------------------------------------------
 action           | character varying(255)      | not null
 entry_date       | timestamp without time zone | not null
 note             | text                        |
 report_id        | integer                     |
 session_id       | character varying(255)      | not null
 user_id          | integer                     |
 user_tracking_id | integer                     | not null default nextval('user_tracking_user_tracking_id_seq'::regclass)
Indexes:
    "user_tracking_pkey" PRIMARY KEY, btree (user_tracking_id)
    "user_tracking_entry_date_idx" btree (entry_date)
    "user_tracking_month_idx" btree (date_part('month'::text, entry_date))
    "user_tracking_quarter_idx" btree (date_part('quarter'::text, entry_date))
    "user_tracking_report_id_idx" btree (report_id)
    "user_tracking_session_idx" btree (session_id)
    "user_tracking_user_id_idx" btree (user_id)
    "user_tracking_year_idx" btree (date_part('year'::text, entry_date))
Foreign-key constraints:
    "user_tracking_report_id_fkey" FOREIGN KEY (report_id) REFERENCES article(article_id) DEFERRABLE INITIALLY DEFERRED
    "user_tracking_user_id_fkey" FOREIGN KEY (user_id) REFERENCES extended_user(user_id) DEFERRABLE INITIALLY DEFERRED

Re: [8.1.4] Help optimizing query

From: "George Pavlov"
Without having looked at this in detail, my first suggestion would be to
do away with those date_part indices.  I have found that indexes with few
distinct values usually hurt more than help, and the PG optimizer is not
always smart enough to ignore them.  The BitmapAnd over the two date
scans also seems like a waste, since you can consolidate that information
from the get-go.  Could you rewrite your WHERE clause to be something like:

  WHERE date_trunc('quarter', entry_date) = '2006-04-01'  -- for 2nd quarter of '06

or

  WHERE entry_date >= '2006-04-01'
    AND entry_date <  '2006-07-01'

You could try an index on either the date_trunc or the entry_date
itself, as appropriate.
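
For example, with a plain index on entry_date (you already seem to have
user_tracking_entry_date_idx), the inner part of your query could become
something like this -- an untested sketch, adjust as needed:

     SELECT subscription_id,
            count(distinct session_id) as count,
            age(MAX(entry_date), MIN(entry_date)) as session_length,
            COUNT(action) as hits
       FROM extended_user JOIN user_tracking USING (user_id)
      WHERE subscription_id > 0
        AND entry_date >= '2006-04-01'
        AND entry_date <  '2006-07-01'
      GROUP BY session_id, subscription_id

That should let the planner do a single range scan on entry_date instead
of BitmapAnd-ing the quarter and year indexes.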

I assume your user_tracking table is being inserted into on each visit,
so you may want to be cautious about how many indexes you keep on it
anyway and pare those down -- a sketch follows.
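
Going by the \d output you posted, something like this is what I have in
mind (just a sketch; double-check that nothing else relies on these
before dropping them):

  DROP INDEX user_tracking_month_idx;
  DROP INDEX user_tracking_quarter_idx;
  DROP INDEX user_tracking_year_idx;
  -- extended_user also appears to have two identical indexes on
  -- subscription_id; one of them is redundant:
  DROP INDEX extended_user_subscription_idx;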

BTW, you have a redundant "WHERE subscription_id > 0" in the outer
query, not that that affects much.

From there you may want to materialize the subselect so that all five of
your queries use it, or, if possible, consolidate those five queries into
one, or at least fewer than five.  You could go even further -- this
seems like a good candidate for an aggregated reporting table
(essentially your subselect as a separate table, updated by triggers, or
every night, or whatever), especially since the situation will only get
worse as you accumulate more data.
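
As a very rough sketch of that reporting-table idea (the table and column
names here are made up, and note that filtering on a session's first
entry_date is not quite the same as your current per-row quarter filter
if a session straddles a quarter boundary):

  -- one row per (subscription, session), refreshed nightly or by trigger
  CREATE TABLE session_stats AS
  SELECT subscription_id,
         session_id,
         MIN(entry_date)                        AS session_start,
         age(MAX(entry_date), MIN(entry_date))  AS session_length,
         COUNT(action)                          AS hits
    FROM extended_user JOIN user_tracking USING (user_id)
   WHERE subscription_id > 0
   GROUP BY subscription_id, session_id;

  -- the report then only has to scan the much smaller summary table;
  -- count(*) works because each row is exactly one session
  SELECT subscription_id,
         to_char(sum(session_length), 'HH24:MI:SS') AS session_length,
         sum(hits) AS hits,
         count(*)  AS count
    FROM session_stats
   WHERE session_start >= '2006-04-01'
     AND session_start <  '2006-07-01'
   GROUP BY subscription_id
   ORDER BY subscription_id;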

George

