Thread: Help with a seq scan on multi-million row table
Hello,

I have a little 2-table JOIN, GROUP BY, ORDER BY query that does a sequential scan on a multi-million row table. I _thought_ I had all the appropriate indices, but apparently I do not. I was wondering if anyone can spot a way I can speed up this query.
The query currently takes... *gulp*: 381119.201 ms :(

There are only 2 tables in the game: user_url and user_url_tag. The latter has FKs pointing to the former. The sequential scan happens on the latter - user_url_tag:

EXPLAIN ANALYZE select DISTINCT userurltag0_.tag as x0_0_, COUNT(*) as x1_0_ from user_url_tag userurltag0_, user_url userurl1_ WHERE (((userurl1_.user_id=1 )) AND ((userurltag0_.user_url_id=userurl1_.id ))) GROUP BY userurltag0_.tag ORDER BY count(*) DESC;

                                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=140972.22..140996.28 rows=3207 width=10) (actual time=381082.868..381110.094 rows=2546 loops=1)
   ->  Sort  (cost=140972.22..140980.24 rows=3207 width=10) (actual time=381082.858..381091.733 rows=2546 loops=1)
         Sort Key: count(*), userurltag0_.tag
         ->  HashAggregate  (cost=140777.45..140785.46 rows=3207 width=10) (actual time=381032.844..381064.068 rows=2546 loops=1)
               ->  Hash Join  (cost=2797.65..140758.50 rows=3790 width=10) (actual time=248.530..380635.132 rows=8544 loops=1)
                     Hash Cond: ("outer".user_url_id = "inner".id)
                     ->  Seq Scan on user_url_tag userurltag0_  (cost=0.00..106650.30 rows=6254530 width=14) (actual time=0.017..212256.630 rows=6259553 loops=1)
                     ->  Hash  (cost=2795.24..2795.24 rows=962 width=4) (actual time=199.840..199.840 rows=0 loops=1)
                           ->  Index Scan using ix_user_url_user_id_url_id on user_url userurl1_  (cost=0.00..2795.24 rows=962 width=4) (actual time=0.048..193.707 rows=1666 loops=1)
                                 Index Cond: (user_id = 1)
 Total runtime: 381119.201 ms
(11 rows)

This is what the two tables look like (extra columns removed):

                    Table "public.user_url_tag"
   Column    |         Type          |                           Modifiers
-------------+-----------------------+---------------------------------------------------------------
 id          | integer               | not null default nextval('public.user_url_tag_id_seq'::text)
 user_url_id | integer               |
 tag         | character varying(64) |
Indexes:
    "pk_user_url_tag_id" PRIMARY KEY, btree (id)
    "ix_user_url_tag_tag" btree (tag)
    "ix_user_url_tag_user_url_id" btree (user_url_id)
Foreign-key constraints:
    "fk_user_url_tag_user_url_id" FOREIGN KEY (user_url_id) REFERENCES user_url(id)

                   Table "public.user_url"
 Column  |  Type   |                         Modifiers
---------+---------+-----------------------------------------------------------
 id      | integer | not null default nextval('public.user_url_id_seq'::text)
 user_id | integer |
 url_id  | integer |
Indexes:
    "pk_user_url_id" PRIMARY KEY, btree (id)
    "ix_user_url_url_id_user_id" UNIQUE, btree (url_id, user_id)
    "ix_user_url_user_id_url_id" UNIQUE, btree (user_id, url_id)

Does anyone see a way to speed up this s-l-o-w query? I cache DB results, but I'd love to get rid of that sequential scan.

Thanks,
Otis
On Wed, May 10, 2006 at 13:13:59 -0500, ogjunk-pgjedan@yahoo.com wrote:
> Hello,
>
> I have a little 2-table JOIN, GROUP BY, ORDER BY query that does a sequential scan on a multi-million row table. I _thought_ I had all the appropriate indices, but apparently I do not. I was wondering if anyone can spot a way I can speed up this query.
> The query currently takes... *gulp*: 381119.201 ms :(
>
> There are only 2 tables in the game: user_url and user_url_tag. The latter has FKs pointing to the former. The sequential scan happens on the latter - user_url_tag:
>
> EXPLAIN ANALYZE select DISTINCT userurltag0_.tag as x0_0_, COUNT(*) as x1_0_ from user_url_tag userurltag0_, user_url userurl1_ WHERE (((userurl1_.user_id=1 )) AND ((userurltag0_.user_url_id=userurl1_.id ))) GROUP BY userurltag0_.tag ORDER BY count(*) DESC;

While this isn't a big issue, it looks like DISTINCT is redundant in your query and seems to be adding some extra work.
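For reference, the query with the redundant DISTINCT dropped would look roughly like this (just a sketch of Bruno's suggestion; the GROUP BY already guarantees one output row per tag, so DISTINCT only adds a Unique step on top):

-- Same query, minus the DISTINCT that the GROUP BY makes unnecessary.
SELECT userurltag0_.tag AS x0_0_, COUNT(*) AS x1_0_
FROM user_url_tag userurltag0_, user_url userurl1_
WHERE userurl1_.user_id = 1
  AND userurltag0_.user_url_id = userurl1_.id
GROUP BY userurltag0_.tag
ORDER BY count(*) DESC;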
Hi,

Thanks Bruno. That was indeed a redundant DISTINCT. It did reduce the amount of work, but as you said it doesn't get rid of the sequential scan, which is the real problem with this query.

Otis
<ogjunk-pgjedan@yahoo.com> writes:
>                ->  Hash Join  (cost=2797.65..140758.50 rows=3790 width=10) (actual time=248.530..380635.132 rows=8544 loops=1)
>                      Hash Cond: ("outer".user_url_id = "inner".id)
>                      ->  Seq Scan on user_url_tag userurltag0_  (cost=0.00..106650.30 rows=6254530 width=14) (actual time=0.017..212256.630 rows=6259553 loops=1)
>                      ->  Hash  (cost=2795.24..2795.24 rows=962 width=4) (actual time=199.840..199.840 rows=0 loops=1)
>                            ->  Index Scan using ix_user_url_user_id_url_id on user_url userurl1_  (cost=0.00..2795.24 rows=962 width=4) (actual time=0.048..193.707 rows=1666 loops=1)
>                                  Index Cond: (user_id = 1)

Hm, I'm not sure why it's choosing that join plan. A nestloop indexscan wouldn't be terribly cheap, but just counting on my fingers it seems like it ought to come in at less than 100000 cost units. What do you get if you set enable_hashjoin off? (Then try disabling its second-choice join type too --- I'm interested to see EXPLAIN ANALYZE output for all three join types.)

What PG version is this exactly?

			regards, tom lane
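In practice, Tom's suggestion amounts to turning off the hash join method for the current session only and re-running the plan; a minimal sketch (the setting reverts with RESET or when the session ends):

-- Disable hash joins for this session, then look at the plan the planner picks instead.
SET enable_hashjoin = off;

EXPLAIN ANALYZE
SELECT userurltag0_.tag AS x0_0_, COUNT(*) AS x1_0_
FROM user_url_tag userurltag0_, user_url userurl1_
WHERE userurl1_.user_id = 1
  AND userurltag0_.user_url_id = userurl1_.id
GROUP BY userurltag0_.tag
ORDER BY count(*) DESC;

-- Put the planner back to its default behaviour afterwards.
RESET enable_hashjoin;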
Aha! set enable_hashjoin=off did the trick. The PG version is: 8.0.3

NB: I removed that redundant "DISTINCT" after the SELECT.

EXPLAIN ANALYZE select userurltag0_.tag as x0_0_, COUNT(*) as x1_0_ from user_url_tag userurltag0_, user_url userurl1_ where (((userurl1_.user_id=1 )) AND ((userurltag0_.user_url_id=userurl1_.id ))) group by userurltag0_.tag order by count(*) DESC;

                                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=155766.79..155774.81 rows=3207 width=10) (actual time=2387.756..2396.578 rows=2546 loops=1)
   Sort Key: count(*)
   ->  HashAggregate  (cost=155572.02..155580.03 rows=3207 width=10) (actual time=2365.643..2376.626 rows=2546 loops=1)
         ->  Nested Loop  (cost=0.00..155552.68 rows=3867 width=10) (actual time=0.135..2222.028 rows=8544 loops=1)
               ->  Index Scan using ix_user_url_user_id_url_id on user_url userurl1_  (cost=0.00..2798.12 rows=963 width=4) (actual time=0.067..9.744 rows=1666 loops=1)
                     Index Cond: (user_id = 1)
               ->  Index Scan using ix_user_url_tag_user_url_id on user_url_tag userurltag0_  (cost=0.00..157.34 rows=103 width=14) (actual time=1.223..1.281 rows=5 loops=1666)
                     Index Cond: (userurltag0_.user_url_id = "outer".id)
 Total runtime: 2405.691 ms
(9 rows)

Are you still interested in the other, "second-choice" join types? If you are, please tell me what join types those are, this is a bit beyond me. :(

Is there a way to force PG to use the index automatically? This query is executed from something called Hibernate, and I'm not sure if that will let me set enable_hashjoin=off through its API...

Thanks,
Otis
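For the record, the planner has three join methods - nested loop, merge join, and hash join - and each has its own session-level enable_* setting, so the experiment Tom is asking for looks roughly like this (a sketch; only disable what you need to, and restore the defaults when done):

-- Already tried: rules out the hash join.
SET enable_hashjoin = off;

-- If the planner's second choice is a merge join, disable that too and
-- re-run EXPLAIN ANALYZE to see the remaining (nested-loop) plan.
SET enable_mergejoin = off;
-- SET enable_nestloop = off;   -- only if you want to force one of the other two instead

-- Restore the defaults once the experiment is over.
RESET enable_hashjoin;
RESET enable_mergejoin;

As for Hibernate: these are ordinary session-level SET statements, so anything that can push a plain SQL statement down the same connection could in principle issue one before the query - but fixing the planner's estimates, as discussed further down the thread, is the cleaner long-term answer.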
<ogjunk-pgjedan@yahoo.com> writes:
> Aha! set enable_hashjoin=off did the trick.

>                ->  Index Scan using ix_user_url_tag_user_url_id on user_url_tag userurltag0_  (cost=0.00..157.34 rows=103 width=14) (actual time=1.223..1.281 rows=5 loops=1666)
>                      Index Cond: (userurltag0_.user_url_id = "outer".id)

This seems to be the problem right here: the estimate of matching rows is off by a factor of 20, and that inflates the overall cost estimate for this plan about the same, causing the planner to think the other way is cheaper. What does the pg_stats row for user_url_tag.user_url_id contain? Have you analyzed that table recently?

			regards, tom lane
Not sure if I'm showing you what you asked for, but here it is:

select * from pg_stats where tablename='user_url_tag' and attname='user_url_id';

 schemaname |  tablename   |   attname   | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation
------------+--------------+-------------+-----------+-----------+------------+------------------+-------------------+------------------+-------------
 public     | user_url_tag | user_url_id |         0 |         4 |      60825 | {458321,1485346,16304,68027,125417,153465,182503,201175,202973,218423} | {0.00133333,0.001,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667} | {195,195993,325311,480323,647778,782598,1014527,1201726,1424822,1614712,1853719} | 0.795521

You asked if the table has been analyzed recently. I think so - I run ANALYZE on the whole DB every night, like this:

$ psql -U me -c "ANALYZE;" mydb

For good measure, I just analyzed the table now:

$ psql -U me -c "ANALYZE user_url_tag;" mydb

Then I set enable_hashjoin back to ON and re-ran the EXPLAIN ANALYZE. I still get the sequential scan, even after analyzing the table :(

I'm not sure which numbers you are referring to when you said the estimate is off, but here are some numbers:
The whole table has 6-7 M rows.
That query matches about 2500 rows.

If there are other things I can play with to help narrow this down, please let me know.

Thanks,
Otis
Hi,

For example I have a table like this:

 ID  Name
  1  A
  1  B
  2  C
  2  D

And I have a query like this:

SELECT * FROM table ORDER BY ID;

Would my result always come back in the same order, or could I get a different result? For example:

  1  B
  1  A
  2  D
  2  C
On May 11, 2006, at 21:00 , Christian Paul Cosinas wrote:

> For example I have a table like this
>
> ID Name
> 1 A
> 1 B
> 2 C
> 2 D
>
> And I have a Query like this:
> SELECT * FROM table ORDER BY ID.
>
> Would my result always give me the same order or is there may be a possible different result?

It may give you a different result: for rows whose order is not fully determined by the ORDER BY clause (here, rows sharing the same ID), you are not guaranteed to get them back in any particular relative order. So for cases where you may not be ordering on a unique column (such as above), you may want to do something like:

SELECT * FROM table ORDER BY "ID", "Name";

Hope this helps.

Michael Glaesemann
grzm seespotcode net
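A quick way to see this for yourself (a sketch with a throwaway table named t, since "table" itself is a reserved word):

-- Hypothetical demo table mirroring the example above.
CREATE TABLE t (id integer, name text);
INSERT INTO t VALUES (1, 'A');
INSERT INTO t VALUES (1, 'B');
INSERT INTO t VALUES (2, 'C');
INSERT INTO t VALUES (2, 'D');

-- Rows with the same id may come back in either relative order.
SELECT * FROM t ORDER BY id;

-- Adding a tiebreaker column makes the full ordering deterministic.
SELECT * FROM t ORDER BY id, name;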
Hi, Otis,

ogjunk-pgjedan@yahoo.com wrote:
> I'm not sure which numbers you are referring to when you said the estimate is off, but here are some numbers:
> The whole table has 6-7 M rows.
> That query matches about 2500 rows.
>
> If there are other things I can play with to help narrow this down, please let me know.

Did you try setting a higher statistics target for this column? For experimenting, I'd try setting it to 100 or even higher, then ANALYZE the table, and then retest the query.

HTH,
Markus

--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org
Markus Schaber <schabi@logix-tt.com> writes:
> Did you try setting a higher statistics target for this column?

Yeah, I was about to suggest the same after looking at the pg_stats entry. It shows estimated n_distinct as 60825, ie only about 60k distinct values in the column, which sounds kinda low (do you know the true figure?). A larger statistics target would probably produce a better number and hence a better join estimate.

			regards, tom lane
Hi Markus & Tom,

A higher statistics target for this column... hm, I'd love to try changing it to see how that changes things, but I'm afraid I don't know how to do that. How can I change the statistics target value for this column?

Ah, I think I found the place:

=> select * from pg_attribute where attname='user_url_id';

 attrelid |   attname   | atttypid | attstattarget | attlen | attnum | attndims | attcacheoff | atttypmod | attbyval | attstorage | attalign | attnotnull | atthasdef | attisdropped | attislocal | attinhcount
----------+-------------+----------+---------------+--------+--------+----------+-------------+-----------+----------+------------+----------+------------+-----------+--------------+------------+-------------
  6124839 | user_url_id |       23 |            -1 |      4 |      1 |        0 |          -1 |        -1 | t        | p          | i        | f          | f         | f            | t          |           0
  1646081 | user_url_id |       23 |            -1 |      4 |      2 |        0 |          -1 |        -1 | t        | p          | i        | f          | f         | f            | t          |           0
 10048109 | user_url_id |       23 |            -1 |      4 |      3 |        0 |          -1 |        -1 | t        | p          | i        | f          | f         | f            | t          |           0
 10048123 | user_url_id |       23 |            -1 |      4 |      2 |        0 |          -1 |        -1 | t        | p          | i        | f          | f         | f            | t          |           0

Hm, 4 rows. I need to change the value of the 'attstattarget' column, but for which of these rows? Only attrelid is different. I tried looking at pg_class, but didn't find anything with the above attrelids. I used:

=> select * from pg_class where relname like 'user_url%';

Tom: you asked about distinct values. pg_stats shows circa 60K distinct values, but the real number is:

select count(distinct user_url_id) from user_url_tag;
  count
---------
 1505933

This number grows daily by... not sure how much, probably 5k a day currently.

Thanks,
Otis
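As an aside, attrelid is the OID of the relation each pg_attribute row belongs to, so joining against pg_class (on pg_class.oid) shows which of those four rows is the user_url_tag column; a sketch:

-- Map each pg_attribute row for user_url_id back to its owning relation.
SELECT a.attrelid, n.nspname, c.relname, a.attnum, a.attstattarget
FROM pg_attribute a
JOIN pg_class c     ON c.oid = a.attrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE a.attname = 'user_url_id';

In any case there is no need to edit pg_attribute by hand; the ALTER TABLE ... SET STATISTICS command in the next message updates attstattarget for you.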
On Thu, May 11, 2006 at 10:09:44AM -0700, ogjunk-pgjedan@yahoo.com wrote:
> Hi Markus & Tom,
>
> A higher statistics target for this column... hm, I'd love to try changing
> it to see how that changes things, but I'm afraid I don't know how
> to do that. How can I change the statistics target value for this
> column?
>
> Ah, I think I found the place:

No. Just ALTER TABLE [name] ALTER [column] SET STATISTICS. See

http://www.postgresql.org/docs/8.1/interactive/sql-altertable.html

for more. You'll need to ANALYSE afterwards.

A

--
Andrew Sullivan  | ajs@crankycanuck.ca
In the future this spectacle of the middle classes shocking the avant-
garde will probably become the textbook definition of Postmodernism.
                --Brad Holland
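Spelled out for this particular column, and using the target of 100 that Markus suggested as a starting point (just a sketch; tune the number as needed):

-- Raise the per-column statistics target, then re-gather statistics.
ALTER TABLE user_url_tag ALTER COLUMN user_url_id SET STATISTICS 100;
ANALYZE user_url_tag;

-- Check the estimate the planner will now work with.
SELECT n_distinct
FROM pg_stats
WHERE tablename = 'user_url_tag' AND attname = 'user_url_id';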
Hello Andrew, Markus, and Tom - thanks for all the help! You've just helped a large Simpy community! :) I'll try to post some performance charts to http://blog.simpy.com/ shortly. In short, this immediately dropped the load from 2-3-4-5-6-7+ to circa 0.25.

Thanks!
Otis