Thread: Query tuning
Hi all;

We've been fighting this query for a few days now. We bumped up the
statistics target for the a.id, c.url_hits_id and b.id columns below to 250
and ran an analyze on the relevant tables. We killed it after 8 hrs.

Note the url_hits table has > 1.4 billion rows.

Any suggestions?

$ psql -ef expl.sql pwreport
explain
select
    a.id,
    ident_id,
    time,
    customer_name,
    extract('day' from timezone(e.name, to_timestamp(a.time))) as day,
    category_id
from
    pwreport.url_hits a
    left outer join pwreport.url_hits_category_jt c on (a.id = c.url_hits_id),
    pwreport.ident b,
    pwreport.timezone e
where
    a.ident_id = b.id
    and b.timezone_id = e.id
    and time >= extract ('epoch' from timestamp '2009-08-12')
    and time < extract ('epoch' from timestamp '2009-08-13')
    and direction = 'REQUEST'
;

                                   QUERY PLAN
--------------------------------------------------------------------------------
 Merge Right Join  (cost=47528508.61..180424544.59 rows=10409251 width=53)
   Merge Cond: (c.url_hits_id = a.id)
   ->  Index Scan using mt_url_hits_category_jt_url_hits_id_index on url_hits_category_jt c  (cost=0.00..122162596.63 rows=4189283233 width=8)
   ->  Sort  (cost=47528508.61..47536931.63 rows=3369210 width=49)
         Sort Key: a.id
         ->  Hash Join  (cost=2565.00..47163219.21 rows=3369210 width=49)
               Hash Cond: (b.timezone_id = e.id)
               ->  Hash Join  (cost=2553.49..47116881.07 rows=3369210 width=37)
                     Hash Cond: (a.ident_id = b.id)
                     ->  Seq Scan on url_hits a  (cost=0.00..47051154.89 rows=3369210 width=12)
                           Filter: ((direction = 'REQUEST'::proxy_direction_enum) AND (("time")::double precision >= 1250035200::double precision) AND (("time")::double precision < 1250121600::double precision))
                     ->  Hash  (cost=2020.44..2020.44 rows=42644 width=29)
                           ->  Seq Scan on ident b  (cost=0.00..2020.44 rows=42644 width=29)
               ->  Hash  (cost=6.78..6.78 rows=378 width=20)
                     ->  Seq Scan on timezone e  (cost=0.00..6.78 rows=378 width=20)
(15 rows)
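The two extract('epoch' ...) bounds in the WHERE clause evaluate to plain
integer constants; a quick check (the results shown match the constants in
the plan's Filter line):

    select extract('epoch' from timestamp '2009-08-12');  -- 1250035200
    select extract('epoch' from timestamp '2009-08-13');  -- 1250121600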
that seems to be the killer:

    and time >= extract ('epoch' from timestamp '2009-08-12')
    and time < extract ('epoch' from timestamp '2009-08-13' )

You probably need an index on time/epoch:

    CREATE INDEX foo ON table(extract ('epoch' from timestamp time );

or something like that, vacuum analyze and retry.
On 8/19/09 9:28 AM, "Kevin Kempter" <kevink@consistentstate.com> wrote:

> Hi all;
>
> we've been fighting this query for a few days now. we bumped up the
> statistics target for the a.id, c.url_hits_id and the b.id columns below
> to 250 and ran an analyze on the relevant tables. we killed it after 8hrs.
>
> Note the url_hits table has > 1.4 billion rows
>
> Any suggestions?

Have you tried setting work_mem higher for just this query?

The big estimated cost is the sequential scan on url_hits. But in reality,
if the estimates are off, the sort and index scan at the end might be your
bottleneck. Larger work_mem might make it choose another plan there.

But if the true cost is the sequential scan on url_hits, then only an index
there will help.
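Raising work_mem for just one session is a one-liner; a sketch (the 500MB
figure is illustrative, not a recommendation):

    SET work_mem = '500MB';
    -- re-run EXPLAIN and the query in the same session
    RESET work_mem;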
2009/8/19 Grzegorz Jaśkiewicz <gryzman@gmail.com>

> that seems to be the killer:
>
>    and time >= extract ('epoch' from timestamp '2009-08-12')
>    and time < extract ('epoch' from timestamp '2009-08-13' )
>
> You probably need an index on time/epoch:
>
>    CREATE INDEX foo ON table(extract ('epoch' from timestamp time );
>
> or something like that, vacuum analyze and retry.

It looks like those extracts just make constant integer times. You can
probably just create an index on the time column.

Also, why not store times as timestamps?
--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
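A concrete version of that suggestion (a sketch; the index name is made up,
and on a table this size CREATE INDEX CONCURRENTLY may be worth considering
to avoid blocking writes during the build):

    CREATE INDEX url_hits_time_idx ON pwreport.url_hits ("time");
    ANALYZE pwreport.url_hits;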
On Wednesday 19 August 2009 11:17:26 Scott Carey wrote:
> On 8/19/09 9:28 AM, "Kevin Kempter" <kevink@consistentstate.com> wrote:
> > we've been fighting this query for a few days now. [...]
> >
> > Note the url_hits table has > 1.4 billion rows
> >
> > Any suggestions?
>
> Have you tried setting work_mem higher for just this query?

Yes, we upped it to 500Meg

> The big estimated cost is the sequential scan on url_hits. But in reality,
> if the estimates are off, the sort and index scan at the end might be your
> bottleneck. Larger work_mem might make it choose another plan there.
>
> But if the true cost is the sequential scan on url_hits, then only an index
> there will help.
On Wednesday 19 August 2009 11:31:30 Nikolas Everett wrote:
> 2009/8/19 Grzegorz Jaśkiewicz <gryzman@gmail.com>
> > that seems to be the killer:
> >
> >    and time >= extract ('epoch' from timestamp '2009-08-12')
> >    and time < extract ('epoch' from timestamp '2009-08-13' )
> >
> > You probably need an index on time/epoch:
> >
> >    CREATE INDEX foo ON table(extract ('epoch' from timestamp time );
> >
> > or something like that, vacuum analyze and retry.
>
> It looks like those extracts just make constant integer times. You can
> probably just create an index on the time column.
>
> Also, why not store times as timestamps?

We do have an index on url_hits.time

not sure why timestamps were not used, I was not here for the design phase.

Thx
2009/8/19 Kevin Kempter <kevink@consistentstate.com>

> We do have an index on url_hits.time
>
> not sure why timestamps were not used, I was not here for the design phase.
What is the type of the time column? I don't like that it gets cast to
double precision in the explain. If it is integer, you may need to change
and time >= extract ('epoch' from timestamp '2009-08-12')
and time < extract ('epoch' from timestamp '2009-08-13' )
to
and time >= extract ('epoch' from timestamp '2009-08-12')::int4
and time < extract ('epoch' from timestamp '2009-08-13' )::int4
for the index to be used?
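If time really is int4, a sketch of the rewritten predicate in context
(table and column names as in the original query); with the casts on the
constants instead of the column, the comparison stays int4-to-int4, so the
existing index on "time" can at least be considered by the planner:

    select count(*)
    from pwreport.url_hits
    where direction = 'REQUEST'
      and time >= extract ('epoch' from timestamp '2009-08-12')::int4
      and time <  extract ('epoch' from timestamp '2009-08-13')::int4;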