Thread: Help with a seq scan on multi-million row table

Help with a seq scan on multi-million row table

From: ogjunk-pgjedan@yahoo.com
Date: Wed, May 10, 2006 13:13:59 -0500
Hello,

I have a little 2-table JOIN, GROUP BY, ORDER BY query that does a sequential scan on a multi-million row table.  I
_thought_ I had all the appropriate indices, but apparently I do not.  I was wondering if anyone can spot a way I can
speed up this query.
 
The query currently takes... *gulp*: 381119.201 ms :( 
There are only 2 tables in the game: user_url and user_url_tag.  The latter has FKs pointing to the former.  The
sequential scan happens on the latter - user_url_tag:
 

EXPLAIN ANALYZE select DISTINCT userurltag0_.tag as x0_0_, COUNT(*) as x1_0_
from user_url_tag userurltag0_, user_url userurl1_
WHERE (((userurl1_.user_id=1)) AND ((userurltag0_.user_url_id=userurl1_.id)))
GROUP BY userurltag0_.tag
ORDER BY count(*) DESC;

                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=140972.22..140996.28 rows=3207 width=10) (actual time=381082.868..381110.094 rows=2546 loops=1)
   ->  Sort  (cost=140972.22..140980.24 rows=3207 width=10) (actual time=381082.858..381091.733 rows=2546 loops=1)
         Sort Key: count(*), userurltag0_.tag
         ->  HashAggregate  (cost=140777.45..140785.46 rows=3207 width=10) (actual time=381032.844..381064.068 rows=2546 loops=1)
               ->  Hash Join  (cost=2797.65..140758.50 rows=3790 width=10) (actual time=248.530..380635.132 rows=8544 loops=1)
                     Hash Cond: ("outer".user_url_id = "inner".id)
                     ->  Seq Scan on user_url_tag userurltag0_  (cost=0.00..106650.30 rows=6254530 width=14) (actual time=0.017..212256.630 rows=6259553 loops=1)
                     ->  Hash  (cost=2795.24..2795.24 rows=962 width=4) (actual time=199.840..199.840 rows=0 loops=1)
                           ->  Index Scan using ix_user_url_user_id_url_id on user_url userurl1_  (cost=0.00..2795.24 rows=962 width=4) (actual time=0.048..193.707 rows=1666 loops=1)
                                 Index Cond: (user_id = 1)
 Total runtime: 381119.201 ms
 
(11 rows)

This is what the two tables look like (extra columns removed):
                                  Table "public.user_url_tag"
   Column    |         Type          |                          Modifiers
-------------+-----------------------+---------------------------------------------------------------
 id          | integer               | not null default nextval('public.user_url_tag_id_seq'::text)
 user_url_id | integer               |
 tag         | character varying(64) |
Indexes:
    "pk_user_url_tag_id" PRIMARY KEY, btree (id)
    "ix_user_url_tag_tag" btree (tag)
    "ix_user_url_tag_user_url_id" btree (user_url_id)
Foreign-key constraints:
    "fk_user_url_tag_user_url_id" FOREIGN KEY (user_url_id) REFERENCES user_url(id)
                              Table "public.user_url"
      Column      |            Type             |                        Modifiers
------------------+-----------------------------+-----------------------------------------------------------
 id               | integer                     | not null default nextval('public.user_url_id_seq'::text)
 user_id          | integer                     |
 url_id           | integer                     |
Indexes:
    "pk_user_url_id" PRIMARY KEY, btree (id)
    "ix_user_url_url_id_user_id" UNIQUE, btree (url_id, user_id)
    "ix_user_url_user_id_url_id" UNIQUE, btree (user_id, url_id)

Does anyone see a way to speed up this s-l-o-w query?
I cache DB results, but I'd love to get rid of that sequential scan.

Thanks,
Otis





Re: Help with a seq scan on multi-million row table

From: Bruno Wolff III
Date: Wednesday, May 10, 2006 3:23:29 PM
On Wed, May 10, 2006 at 13:13:59 -0500, ogjunk-pgjedan@yahoo.com wrote:
>   Hello,
> 
> I have a little 2-table JOIN, GROUP BY, ORDER BY query that does a
> sequential scan on a multi-million row table.  I _thought_ I had all the
> appropriate indices, but apparently I do not.  I was wondering if anyone
> can spot a way I can speed up this query.
> 
> The query currently takes... *gulp*: 381119.201 ms :(
> 
> There are only 2 tables in the game: user_url and user_url_tag.  The latter
> has FKs pointing to the former.  The sequential scan happens on the
> latter - user_url_tag:
> 
> EXPLAIN ANALYZE select DISTINCT userurltag0_.tag as x0_0_, COUNT(*) as x1_0_
> from user_url_tag userurltag0_, user_url userurl1_
> WHERE (((userurl1_.user_id=1)) AND ((userurltag0_.user_url_id=userurl1_.id)))
> GROUP BY userurltag0_.tag
> ORDER BY count(*) DESC;
 

While this isn't a big issue, it looks like DISTINCT is redundant in your
query and seems to be adding some extra work.
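
For reference, that rewrite is simply the same query with DISTINCT dropped - GROUP BY tag already returns each tag once, so the DISTINCT only adds the Unique/Sort steps visible at the top of the plan:

select userurltag0_.tag as x0_0_, COUNT(*) as x1_0_
from user_url_tag userurltag0_, user_url userurl1_
WHERE (((userurl1_.user_id=1)) AND ((userurltag0_.user_url_id=userurl1_.id)))
GROUP BY userurltag0_.tag
ORDER BY count(*) DESC;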


Re: Help with a seq scan on multi-million row table

From: ogjunk-pgjedan@yahoo.com
Hi,

Thanks Bruno.  That was indeed a redundant DISTINCT.  It did reduce the amount of work, but as you said it doesn't get
rid of the sequential scan, which is the real problem with this query.
 

Otis

----- Original Message ----
From: Bruno Wolff III <bruno@wolff.to>
To: ogjunk-pgjedan@yahoo.com
Cc: pgsql-sql@postgresql.org
Sent: Wednesday, May 10, 2006 3:23:29 PM
Subject: Re: [SQL] Help with a seq scan on multi-million row table

On Wed, May 10, 2006 at 13:13:59 -0500, ogjunk-pgjedan@yahoo.com wrote:
>   Hello,
> 
> I have a little 2-table JOIN, GROUP BY, ORDER BY query that does a
> sequential scan on a multi-million row table.  I _thought_ I had all the
> appropriate indices, but apparently I do not.  I was wondering if anyone
> can spot a way I can speed up this query.
> 
> The query currently takes... *gulp*: 381119.201 ms :(
> 
> There are only 2 tables in the game: user_url and user_url_tag.  The latter
> has FKs pointing to the former.  The sequential scan happens on the
> latter - user_url_tag:
> 
> EXPLAIN ANALYZE select DISTINCT userurltag0_.tag as x0_0_, COUNT(*) as x1_0_
> from user_url_tag userurltag0_, user_url userurl1_
> WHERE (((userurl1_.user_id=1)) AND ((userurltag0_.user_url_id=userurl1_.id)))
> GROUP BY userurltag0_.tag
> ORDER BY count(*) DESC;
 

While this isn't a big issue, it looks like DISTINCT is redundant in your
query and seems to be adding some extra work.






Re: Help with a seq scan on multi-million row table

From: Tom Lane
Date: Wednesday, May 10, 2006 8:27:01 PM
<ogjunk-pgjedan@yahoo.com> writes:
>                ->  Hash Join  (cost=2797.65..140758.50 rows=3790 width=10) (actual time=248.530..380635.132 rows=8544 loops=1)
>                      Hash Cond: ("outer".user_url_id = "inner".id)
>                      ->  Seq Scan on user_url_tag userurltag0_  (cost=0.00..106650.30 rows=6254530 width=14) (actual time=0.017..212256.630 rows=6259553 loops=1)
>                      ->  Hash  (cost=2795.24..2795.24 rows=962 width=4) (actual time=199.840..199.840 rows=0 loops=1)
>                            ->  Index Scan using ix_user_url_user_id_url_id on user_url userurl1_  (cost=0.00..2795.24 rows=962 width=4) (actual time=0.048..193.707 rows=1666 loops=1)
>                                  Index Cond: (user_id = 1)

Hm, I'm not sure why it's choosing that join plan.  A nestloop indexscan
wouldn't be terribly cheap, but just counting on my fingers it seems
like it ought to come in at less than 100000 cost units.  What do you
get if you set enable_hashjoin off?  (Then try disabling its
second-choice join type too --- I'm interested to see EXPLAIN ANALYZE
output for all three join types.)
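
(For anyone following along: PostgreSQL has three join methods - hash join, merge join, and nested loop - each with a session-local planner switch, so the experiment looks roughly like this:

SET enable_hashjoin = off;
EXPLAIN ANALYZE <the query>;   -- planner falls back to its second-choice join type
SET enable_mergejoin = off;    -- or enable_nestloop, whichever it picked second
EXPLAIN ANALYZE <the query>;   -- now the third choice
RESET enable_hashjoin;
RESET enable_mergejoin;

These switches affect only the current session, so they are safe to flip for testing.)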

What PG version is this exactly?
        regards, tom lane


Re: Help with a seq scan on multi-million row table

From: ogjunk-pgjedan@yahoo.com
Aha!  set enable_hashjoin=off did the trick.
The PG version is: 8.0.3

NB: I removed that redundant "DISTINCT" after the SELECT.

EXPLAIN ANALYZE select userurltag0_.tag as x0_0_, COUNT(*) as x1_0_
from user_url_tag userurltag0_, user_url userurl1_
where (((userurl1_.user_id=1)) AND ((userurltag0_.user_url_id=userurl1_.id)))
group by userurltag0_.tag
order by count(*) DESC;

                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=155766.79..155774.81 rows=3207 width=10) (actual time=2387.756..2396.578 rows=2546 loops=1)
   Sort Key: count(*)
   ->  HashAggregate  (cost=155572.02..155580.03 rows=3207 width=10) (actual time=2365.643..2376.626 rows=2546 loops=1)
         ->  Nested Loop  (cost=0.00..155552.68 rows=3867 width=10) (actual time=0.135..2222.028 rows=8544 loops=1)
               ->  Index Scan using ix_user_url_user_id_url_id on user_url userurl1_  (cost=0.00..2798.12 rows=963 width=4) (actual time=0.067..9.744 rows=1666 loops=1)
                     Index Cond: (user_id = 1)
               ->  Index Scan using ix_user_url_tag_user_url_id on user_url_tag userurltag0_  (cost=0.00..157.34 rows=103 width=14) (actual time=1.223..1.281 rows=5 loops=1666)
                     Index Cond: (userurltag0_.user_url_id = "outer".id)
 Total runtime: 2405.691 ms

(9 rows)


Are you still interested in the output for its "second-choice join type"?  If you are, please tell me what join types
those are - this is a bit beyond me. :(
 

Is there a way to force PG to use the index automatically?  This query is executed from something called Hibernate, and
I'm not sure if that will let me set enable_hashjoin=off through its API...
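
(One untested possibility, in case Hibernate can't be coaxed into issuing a SET: PostgreSQL also lets you attach such a setting to a user or database, e.g.

ALTER USER some_user SET enable_hashjoin = off;    -- hypothetical role name
ALTER DATABASE mydb SET enable_hashjoin = off;

though turning hash joins off wholesale is a blunt instrument; getting the planner's estimates right would be the better fix.)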
 

Thanks,
Otis


----- Original Message ----
From: Tom Lane <tgl@sss.pgh.pa.us>
To: ogjunk-pgjedan@yahoo.com
Cc: pgsql-sql@postgresql.org
Sent: Wednesday, May 10, 2006 8:27:01 PM
Subject: Re: [SQL] Help with a seq scan on multi-million row table 

<ogjunk-pgjedan@yahoo.com> writes:
>                ->  Hash Join  (cost=2797.65..140758.50 rows=3790 width=10) (actual time=248.530..380635.132 rows=8544 loops=1)
>                      Hash Cond: ("outer".user_url_id = "inner".id)
>                      ->  Seq Scan on user_url_tag userurltag0_  (cost=0.00..106650.30 rows=6254530 width=14) (actual time=0.017..212256.630 rows=6259553 loops=1)
>                      ->  Hash  (cost=2795.24..2795.24 rows=962 width=4) (actual time=199.840..199.840 rows=0 loops=1)
>                            ->  Index Scan using ix_user_url_user_id_url_id on user_url userurl1_  (cost=0.00..2795.24 rows=962 width=4) (actual time=0.048..193.707 rows=1666 loops=1)
>                                  Index Cond: (user_id = 1)

Hm, I'm not sure why it's choosing that join plan.  A nestloop indexscan
wouldn't be terribly cheap, but just counting on my fingers it seems
like it ought to come in at less than 100000 cost units.  What do you
get if you set enable_hashjoin off?  (Then try disabling its
second-choice join type too --- I'm interested to see EXPLAIN ANALYZE
output for all three join types.)

What PG version is this exactly?
           regards, tom lane






Re: Help with a seq scan on multi-million row table

From: Tom Lane
Date: Wednesday, May 10, 2006 9:53:49 PM
<ogjunk-pgjedan@yahoo.com> writes:
> Aha!  set enable_hashjoin=off did the trick.

>                ->  Index Scan using ix_user_url_tag_user_url_id on user_url_tag userurltag0_  (cost=0.00..157.34 rows=103 width=14) (actual time=1.223..1.281 rows=5 loops=1666)
>                      Index Cond: (userurltag0_.user_url_id = "outer".id)

This seems to be the problem right here: the estimate of matching rows
is off by a factor of 20, and that inflates the overall cost estimate
for this plan about the same, causing the planner to think the other way
is cheaper.
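
(To spell out the arithmetic from the two plans: the nestloop's inner index scan was estimated at 103 rows and ~157 cost units per probe, versus the actual ~5 rows per probe - the factor of 20.  Over the ~963 expected outer rows that is roughly 963 * 157 ≈ 151,000 units, which pushed the nestloop's total estimate (155,552) above the hash join's 140,758, so the hash join looked cheaper on paper even though the nestloop actually runs in about 2.4 seconds.)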

What does the pg_stats row for user_url_tag.user_url_id contain?
Have you analyzed that table recently?
        regards, tom lane


Re: Help with a seq scan on multi-million row table

From: ogjunk-pgjedan@yahoo.com
Not sure if I'm showing you what you asked for, but here it is:

select * from pg_stats where tablename='user_url_tag' and attname='user_url_id';

-[ RECORD 1 ]-----+--------------------------------------------------------------------------------------------------------------------
schemaname        | public
tablename         | user_url_tag
attname           | user_url_id
null_frac         | 0
avg_width         | 4
n_distinct        | 60825
most_common_vals  | {458321,1485346,16304,68027,125417,153465,182503,201175,202973,218423}
most_common_freqs | {0.00133333,0.001,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667}
histogram_bounds  | {195,195993,325311,480323,647778,782598,1014527,1201726,1424822,1614712,1853719}
correlation       | 0.795521
 

You asked if the table has been analyzed recently.
I think so - I run ANALYZE on the whole DB every night, like this:

$ psql -U me -c "ANALYZE;" mydb

For good measure, I just analyzed the table again now:

$ psql -U me -c "ANALYZE user_url_tag;" mydb

Then I set enable_hashjoin back to ON and re-ran the EXPLAIN ANALYZE.
I still get the sequential scan, even after analyzing the table :(

I'm not sure which numbers you are referring to when you said the estimate is off, but here are some numbers:
  The whole table has 6-7 M rows.
  That query matches about 2500 rows.
 

If there are other things I can play with and help narrow this down, please let me know.

Thanks,
Otis


----- Original Message ----
From: Tom Lane <tgl@sss.pgh.pa.us>
To: ogjunk-pgjedan@yahoo.com
Cc: pgsql-sql@postgresql.org
Sent: Wednesday, May 10, 2006 9:53:49 PM
Subject: Re: [SQL] Help with a seq scan on multi-million row table 

<ogjunk-pgjedan@yahoo.com> writes:
> Aha!  set enable_hashjoin=off did the trick.

>                ->  Index Scan using ix_user_url_tag_user_url_id on user_url_tag userurltag0_  (cost=0.00..157.34 rows=103 width=14) (actual time=1.223..1.281 rows=5 loops=1666)
>                      Index Cond: (userurltag0_.user_url_id = "outer".id)

This seems to be the problem right here: the estimate of matching rows
is off by a factor of 20, and that inflates the overall cost estimate
for this plan about the same, causing the planner to think the other way
is cheaper.

What does the pg_stats row for user_url_tag.user_url_id contain?
Have you analyzed that table recently?
           regards, tom lane






ORDER BY question

From: "Christian Paul Cosinas"
Hi,

For example I have a table like this

ID    Name
1    A
1    B
2    C
2    D

And I have a Query like this:
SELECT * FROM table ORDER BY ID.

Would my result always give me the same order, or could the result possibly
be different?

For example:

1    B
1    A
2    D
2    C



Re: ORDER BY question

From: Michael Glaesemann
On May 11, 2006, at 21:00 , Christian Paul Cosinas wrote:

> For example I have a table like this
>
> ID    Name
> 1    A
> 1    B
> 2    C
> 2    D
>
> And I have a Query like this:
> SELECT * FROM table ORDER BY ID.
>
> Would my result always give me the same order, or could the result
> possibly be different?

It may give you a different result: if the order is not explicitly  
specified by the ORDER BY clause, you are not guaranteed to get the  
rows back in any particular order. So for cases where you may not be  
ordering on a unique column (such as above), you may want to do  
something like:

SELECT *
FROM table
ORDER BY "ID", "Name";

Hope this helps.

Michael Glaesemann
grzm seespotcode net





Re: Help with a seq scan on multi-million row table

From: Markus Schaber
Date: Thursday, May 11, 2006 6:33:55 AM
Hi, Otis,

ogjunk-pgjedan@yahoo.com wrote:

> I'm not sure which numbers you are referring to when you said the estimate is off, but here are some numbers:
>   The whole table has 6-7 M rows.
>   That query matches about 2500 rows.
> 
> If there are other things I can play with and help narrow this down, please let me know.

Did you try setting a higher statistics target for this column?

For experimenting, I'd try to set it to 100 or even higher, then ANALYZE
the table, and then retest the query.

HTH,
Markus

-- 
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org


Re: Help with a seq scan on multi-million row table

From: Tom Lane
Markus Schaber <schabi@logix-tt.com> writes:
> Did you try to set higher statistics targets for this columns?

Yeah, I was about to suggest the same after looking at the pg_stats
entry.  It shows estimated n_distinct as 60825, i.e. only about 60k
distinct values in the column, which sounds kinda low (do you know
the true figure?).  A larger statistics target would probably produce
a better number and hence a better join estimate.
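
(Checking is straightforward, if slow on a table this size - compare the estimate against an exact count:

SELECT n_distinct FROM pg_stats
 WHERE tablename = 'user_url_tag' AND attname = 'user_url_id';

SELECT count(DISTINCT user_url_id) FROM user_url_tag;  -- exact, but scans the whole table
)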
        regards, tom lane


Re: Help with a seq scan on multi-million row table

From: ogjunk-pgjedan@yahoo.com
Date: Thu, May 11, 2006 10:09:44 -0700
Hi Markus & Tom,

Higher statistics for this column.... hm, I'd love to try changing it to see how that changes things, but I'm afraid I
don't know how to do that.  How can I change the statistics target value for this column?
 

Ah, I think I found the place:

 => select * from pg_attribute where attname='user_url_id';

 attrelid |   attname   | atttypid | attstattarget | attlen | attnum | attndims | attcacheoff | atttypmod | attbyval | attstorage | attalign | attnotnull | atthasdef | attisdropped | attislocal | attinhcount
----------+-------------+----------+---------------+--------+--------+----------+-------------+-----------+----------+------------+----------+------------+-----------+--------------+------------+-------------
  6124839 | user_url_id |       23 |            -1 |      4 |      1 |        0 |          -1 |        -1 | t        | p          | i        | f          | f         | f            | t          |           0
  1646081 | user_url_id |       23 |            -1 |      4 |      2 |        0 |          -1 |        -1 | t        | p          | i        | f          | f         | f            | t          |           0
 10048109 | user_url_id |       23 |            -1 |      4 |      3 |        0 |          -1 |        -1 | t        | p          | i        | f          | f         | f            | t          |           0
 10048123 | user_url_id |       23 |            -1 |      4 |      2 |        0 |          -1 |        -1 | t        | p          | i        | f          | f         | f            | t          |           0
 

Hm, 4 rows.  I need to change the value of the 'attstattarget' column, but for which of these rows?  Only attrelid is
different.
I tried looking at pg_class, but didn't find anything with the above attrelid's.  I used:
 => select * from pg_class where relname like 'user_url%';
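
(A likely reason that lookup came up empty-handed: a relation's OID is a hidden system column, so "select *" on pg_class doesn't display it.  Something along these lines - a sketch, not run against this database - resolves each attrelid to its relation, and also shows why there are several rows: indexes have pg_attribute entries too:

SELECT a.attrelid, c.relname, c.relkind, a.attstattarget
  FROM pg_attribute a
  JOIN pg_class c ON c.oid = a.attrelid
 WHERE a.attname = 'user_url_id';
)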

Tom: you asked about distinct values.  pg_stats shows circa 60K distinct values, but the real number is:

select count(distinct user_url_id) from user_url_tag;

  count
---------
 1505933

This number grows daily by... not sure how much, probably 5K a day currently.

Thanks,
Otis

----- Original Message ----
From: Markus Schaber <schabi@logix-tt.com>
To: ogjunk-pgjedan@yahoo.com
Cc: pgsql-sql@postgresql.org
Sent: Thursday, May 11, 2006 6:33:55 AM
Subject: Re: [SQL] Help with a seq scan on multi-million row table

Hi, Otis,

ogjunk-pgjedan@yahoo.com wrote:

> I'm not sure which numbers you are referring to when you said the estimate is off, but here are some numbers:
>   The whole table has 6-7 M rows.
>   That query matches about 2500 rows.
> 
> If there are other things I can play with and help narrow this down, please let me know.

Did you try setting a higher statistics target for this column?

For experimenting, I'd try to set it to 100 or even higher, then ANALYZE
the table, and then retest the query.

HTH,
Markus

-- 
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org





Re: Help with a seq scan on multi-million row table

From: Andrew Sullivan
Date: Thursday, May 11, 2006 1:18:08 PM
On Thu, May 11, 2006 at 10:09:44AM -0700, ogjunk-pgjedan@yahoo.com wrote:
> Hi Markus & Tom,
> 
> Higher statistics for this column.... hm, I'd love to try changing
> it to see how that changes things, but I'm afraid I don't know how
> to do that.  How can I change the statistics target value for this
> column?
> 
> Ah, I think I found the place:

No.  Just ALTER TABLE [name] ALTER [column] SET STATISTICS.  See
http://www.postgresql.org/docs/8.1/interactive/sql-altertable.html
for more.  You'll need to ANALYSE afterwards.
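
Applied to the case upthread, that would be something like:

ALTER TABLE user_url_tag ALTER user_url_id SET STATISTICS 100;
ANALYZE user_url_tag;

(This sets the attstattarget you found in pg_attribute; the -1 there means "use the default", which is 10 in this era of PostgreSQL, so 100 buys a considerably larger sample.)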

A


-- 
Andrew Sullivan  | ajs@crankycanuck.ca
In the future this spectacle of the middle classes shocking the avant-garde
will probably become the textbook definition of Postmodernism.
                --Brad Holland


Re: Help with a seq scan on multi-million row table

From: ogjunk-pgjedan@yahoo.com
Hello Andrew, Markus, and Tom - thanks for all the help!  You've just helped a large Simpy community! :)  I'll try to
post some performance charts to http://blog.simpy.com/ shortly.  In short, this immediately dropped the load from
2-3-4-5-6-7+ to circa 0.25.
 

Thanks!
Otis

----- Original Message ----
From: Andrew Sullivan <ajs@crankycanuck.ca>
To: pgsql-sql@postgresql.org
Sent: Thursday, May 11, 2006 1:18:08 PM
Subject: Re: [SQL] Help with a seq scan on multi-million row table

On Thu, May 11, 2006 at 10:09:44AM -0700, ogjunk-pgjedan@yahoo.com wrote:
> Hi Markus & Tom,
> 
> Higher statistics for this column.... hm, I'd love to try changing
> it to see how that changes things, but I'm afraid I don't know how
> to do that.  How can I change the statistics target value for this
> column?
> 
> Ah, I think I found the place:

No.  Just ALTER TABLE [name] ALTER [column] SET STATISTICS.  See
http://www.postgresql.org/docs/8.1/interactive/sql-altertable.html
for more.  You'll need to ANALYSE afterwards.

A
-- 
Andrew Sullivan  | ajs@crankycanuck.ca