Thread: Query tuning
Hi all;

We've been fighting this query for a few days now. We bumped up the
statistics target for the a.id, c.url_hits_id and b.id columns below to 250
and ran an analyze on the relevant tables. We killed it after 8 hrs.

Note the url_hits table has > 1.4 billion rows.

Any suggestions?

$ psql -ef expl.sql pwreport
explain
select
    a.id,
    ident_id,
    time,
    customer_name,
    extract('day' from timezone(e.name, to_timestamp(a.time))) as day,
    category_id
from
    pwreport.url_hits a
    left outer join pwreport.url_hits_category_jt c on (a.id = c.url_hits_id),
    pwreport.ident b,
    pwreport.timezone e
where
    a.ident_id = b.id
    and b.timezone_id = e.id
    and time >= extract ('epoch' from timestamp '2009-08-12')
    and time < extract ('epoch' from timestamp '2009-08-13')
    and direction = 'REQUEST'
;

                                   QUERY PLAN
--------------------------------------------------------------------------------
 Merge Right Join  (cost=47528508.61..180424544.59 rows=10409251 width=53)
   Merge Cond: (c.url_hits_id = a.id)
   ->  Index Scan using mt_url_hits_category_jt_url_hits_id_index on url_hits_category_jt c  (cost=0.00..122162596.63 rows=4189283233 width=8)
   ->  Sort  (cost=47528508.61..47536931.63 rows=3369210 width=49)
         Sort Key: a.id
         ->  Hash Join  (cost=2565.00..47163219.21 rows=3369210 width=49)
               Hash Cond: (b.timezone_id = e.id)
               ->  Hash Join  (cost=2553.49..47116881.07 rows=3369210 width=37)
                     Hash Cond: (a.ident_id = b.id)
                     ->  Seq Scan on url_hits a  (cost=0.00..47051154.89 rows=3369210 width=12)
                           Filter: ((direction = 'REQUEST'::proxy_direction_enum) AND (("time")::double precision >= 1250035200::double precision) AND (("time")::double precision < 1250121600::double precision))
                     ->  Hash  (cost=2020.44..2020.44 rows=42644 width=29)
                           ->  Seq Scan on ident b  (cost=0.00..2020.44 rows=42644 width=29)
               ->  Hash  (cost=6.78..6.78 rows=378 width=20)
                     ->  Seq Scan on timezone e  (cost=0.00..6.78 rows=378 width=20)
(15 rows)
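The two extract('epoch' ...) bounds in the WHERE clause evaluate to plain
integer constants; a quick check (the results shown match the constants in
the plan's Filter line):

    select extract('epoch' from timestamp '2009-08-12');  -- 1250035200
    select extract('epoch' from timestamp '2009-08-13');  -- 1250121600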
that seems to be the killer:

    and time >= extract ('epoch' from timestamp '2009-08-12')
    and time < extract ('epoch' from timestamp '2009-08-13' )

You probably need an index on time/epoch:

    CREATE INDEX foo ON table(extract ('epoch' from timestamp time );

or something like that, vacuum analyze and retry.
On 8/19/09 9:28 AM, "Kevin Kempter" <kevink@consistentstate.com> wrote:

> Hi all;
>
> we've been fighting this query for a few days now. we bumped up the
> statistics target for the a.id, c.url_hits_id and the b.id columns below
> to 250 and ran an analyze on the relevant tables. we killed it after 8hrs.
>
> Note the url_hits table has > 1.4 billion rows
>
> Any suggestions?

Have you tried setting work_mem higher for just this query?

The big estimated cost is the sequential scan on url_hits. But in reality,
if the estimates are off, the sort and index scan at the end might be your
bottleneck. Larger work_mem might make it choose another plan there.

But if the true cost is the sequential scan on url_hits, then only an index
there will help.
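Raising work_mem for just one session is a one-liner; a sketch (the 500MB
figure is illustrative, not a recommendation):

    SET work_mem = '500MB';
    -- re-run EXPLAIN and the query in the same session
    RESET work_mem;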
2009/8/19 Grzegorz Jaśkiewicz <gryzman@gmail.com>

> that seems to be the killer:
>
>    and time >= extract ('epoch' from timestamp '2009-08-12')
>    and time < extract ('epoch' from timestamp '2009-08-13' )
>
> You probably need an index on time/epoch:
>
>    CREATE INDEX foo ON table(extract ('epoch' from timestamp time );
>
> or something like that, vacuum analyze and retry.

It looks like those extracts just make constant integer times. You can
probably just create an index on the time column.

Also, why not store times as timestamps?
--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
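A concrete version of that suggestion (a sketch; the index name is made up,
and on a table this size CREATE INDEX CONCURRENTLY may be worth considering
to avoid blocking writes during the build):

    CREATE INDEX url_hits_time_idx ON pwreport.url_hits ("time");
    ANALYZE pwreport.url_hits;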
On Wednesday 19 August 2009 11:17:26 Scott Carey wrote:
> On 8/19/09 9:28 AM, "Kevin Kempter" <kevink@consistentstate.com> wrote:
> > we've been fighting this query for a few days now. [...]
> >
> > Note the url_hits table has > 1.4 billion rows
> >
> > Any suggestions?
>
> Have you tried setting work_mem higher for just this query?

Yes, we upped it to 500Meg

> The big estimated cost is the sequential scan on url_hits. But in reality,
> if the estimates are off, the sort and index scan at the end might be your
> bottleneck. Larger work_mem might make it choose another plan there.
>
> But if the true cost is the sequential scan on url_hits, then only an index
> there will help.
On Wednesday 19 August 2009 11:31:30 Nikolas Everett wrote:
> 2009/8/19 Grzegorz Jaśkiewicz <gryzman@gmail.com>
> > that seems to be the killer:
> >
> >    and time >= extract ('epoch' from timestamp '2009-08-12')
> >    and time < extract ('epoch' from timestamp '2009-08-13' )
> >
> > You probably need an index on time/epoch:
> >
> >    CREATE INDEX foo ON table(extract ('epoch' from timestamp time );
> >
> > or something like that, vacuum analyze and retry.
>
> It looks like those extracts just make constant integer times. You can
> probably just create an index on the time column.
>
> Also, why not store times as timestamps?

We do have an index on url_hits.time

not sure why timestamps were not used, I was not here for the design phase.

Thx
2009/8/19 Kevin Kempter <kevink@consistentstate.com>

> We do have an index on url_hits.time
>
> not sure why timestamps were not used, I was not here for the design phase.
What is the type of the time column? I don't like that it gets cast to
double precision in the explain. If it is integer, you may need to change
and time >= extract ('epoch' from timestamp '2009-08-12')
and time < extract ('epoch' from timestamp '2009-08-13' )
to
and time >= extract ('epoch' from timestamp '2009-08-12')::int4
and time < extract ('epoch' from timestamp '2009-08-13' )::int4
for the index to be used?
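If time really is int4, a sketch of the rewritten predicate in context
(table and column names as in the original query); with the casts on the
constants instead of the column, the comparison stays int4-to-int4, so the
existing index on "time" can at least be considered by the planner:

    select count(*)
    from pwreport.url_hits
    where direction = 'REQUEST'
      and time >= extract ('epoch' from timestamp '2009-08-12')::int4
      and time <  extract ('epoch' from timestamp '2009-08-13')::int4;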