Thread: avoiding seqscan?

avoiding seqscan?

From: Palle Girgensohn

Hi!

I have a SQL statement that I cannot get to use the index. PostgreSQL
insists on using a seqscan and performance is very poor. set enable_seqscan
= false boosts performance drastically, as you can see below. Since a
seqscan is not always bad, I'd rather not turn it off completely, but
rather get the planner to do the right thing here. Is there another way to
do this, apart from setting enable_seqscan=false?

Thanks,
Palle


the tables are:
person, about 30000 tuples
dyn_field_content_person, 331156 tuples
dyn_field_person, just 15 tuples
course, about 700 tuples
participant, ~7800 tuples, of which ~60 have course_id=707...

uu=# explain analyze
uu-#  select lower(substr(p.last_name,1,1)) as letter, count(*)
uu-#  FROM course c join group_data gd on (c.active_group_id = gd.this_group_id)
uu-#       join person p on (gd.item_text = p.userid)
uu-#       join dyn_field_person dfp on (dfp.extern_item_id = 10 and dfp.giver=c.giver)
uu-#       join dyn_field_content_person dfcp on (dfp.id = dfcp.dyn_field_id and dfcp.userid=p.userid)
uu-#       left outer join participant pt on (pt.userid = p.userid and pt.course_id = 707)
uu-#  WHERE c.id = 707
uu-#  group by 1
uu-# ;

                                   QUERY PLAN
--------------------------------------------------------------------------------
 Aggregate  (cost=10496.30..10498.35 rows=27 width=106) (actual time=4166.01..4167.23 rows=19 loops=1)
   ->  Group  (cost=10496.30..10497.67 rows=273 width=106) (actual time=4165.92..4166.80 rows=60 loops=1)
         ->  Sort  (cost=10496.30..10496.98 rows=273 width=106) (actual time=4165.91..4166.10 rows=60 loops=1)
               Sort Key: lower(substr(p.last_name, 1, 1))
               ->  Merge Join  (cost=10443.75..10485.23 rows=273 width=106) (actual time=4094.42..4165.20 rows=60 loops=1)
                     Merge Cond: ("outer".userid = "inner".userid)
                     Join Filter: ("inner".course_id = 707)
                     ->  Sort  (cost=9803.86..9804.54 rows=273 width=88) (actual time=3823.78..3823.97 rows=60 loops=1)
                           Sort Key: dfcp.userid
                           ->  Hash Join  (cost=2444.22..9792.79 rows=273 width=88) (actual time=1140.50..3822.60 rows=60 loops=1)
                                 Hash Cond: ("outer".userid = "inner".item_text)
                                 Join Filter: ("inner".id = "outer".dyn_field_id)
                                 ->  Seq Scan on dyn_field_content_person dfcp  (cost=0.00..5643.56 rows=331156 width=16) (actual time=0.01..2028.31 rows=331156 loops=1)
                                 ->  Hash  (cost=2443.54..2443.54 rows=272 width=72) (actual time=340.24..340.24 rows=0 loops=1)
                                       ->  Nested Loop  (cost=1401.84..2443.54 rows=272 width=72) (actual time=338.76..339.91 rows=60 loops=1)
                                             Join Filter: ("outer".giver = "inner".giver)
                                             ->  Seq Scan on dyn_field_person dfp  (cost=0.00..1.19 rows=1 width=16) (actual time=0.06..0.09 rows=1 loops=1)
                                                   Filter: (extern_item_id = 10)
                                             ->  Materialize  (cost=2437.67..2437.67 rows=374 width=56) (actual time=338.64..338.82 rows=60 loops=1)
                                                   ->  Hash Join  (cost=1401.84..2437.67 rows=374 width=56) (actual time=7.74..338.36 rows=60 loops=1)
                                                         Hash Cond: ("outer".userid = "inner".item_text)
                                                         ->  Seq Scan on person p  (cost=0.00..806.09 rows=30009 width=23) (actual time=0.01..203.67 rows=30009 loops=1)
                                                         ->  Hash  (cost=1400.89..1400.89 rows=378 width=33) (actual time=1.60..1.60 rows=0 loops=1)
                                                               ->  Nested Loop  (cost=0.00..1400.89 rows=378 width=33) (actual time=0.12..1.28 rows=60 loops=1)
                                                                     ->  Index Scan using course_pkey on course c  (cost=0.00..5.08 rows=1 width=16) (actual time=0.06..0.06 rows=1 loops=1)
                                                                           Index Cond: (id = 707)
                                                                     ->  Index Scan using group_data_this_idx on group_data gd  (cost=0.00..1390.80 rows=402 width=17) (actual time=0.04..0.66 rows=60 loops=1)
                                                                           Index Cond: ("outer".active_group_id = gd.this_group_id)
                     ->  Sort  (cost=639.90..659.42 rows=7808 width=18) (actual time=266.55..290.81 rows=7722 loops=1)
                           Sort Key: pt.userid
                           ->  Seq Scan on participant pt  (cost=0.00..135.08 rows=7808 width=18) (actual time=0.02..50.24 rows=7808 loops=1)
 Total runtime: 4170.16 msec
(32 rows)

Time: 4184.68 ms
uu=# set enable_seqscan = false;
SET
Time: 1.20 ms
uu=# explain analyze
uu-#  select lower(substr(p.last_name,1,1)) as letter, count(*)
uu-#  FROM course c join group_data gd on (c.active_group_id = gd.this_group_id)
uu-#       join person p on (gd.item_text = p.userid)
uu-#       join dyn_field_person dfp on (dfp.extern_item_id = 10 and dfp.giver=c.giver)
uu-#       join dyn_field_content_person dfcp on (dfp.id = dfcp.dyn_field_id and dfcp.userid=p.userid)
uu-#       left outer join participant pt on (pt.userid = p.userid and pt.course_id = 707)
uu-#  WHERE c.id = 707
uu-#  group by 1
uu-# ;

                                   QUERY PLAN
--------------------------------------------------------------------------------
 Aggregate  (cost=17928.32..17930.37 rows=27 width=106) (actual time=171.37..172.58 rows=19 loops=1)
   ->  Group  (cost=17928.32..17929.68 rows=273 width=106) (actual time=171.27..172.14 rows=60 loops=1)
         ->  Sort  (cost=17928.32..17929.00 rows=273 width=106) (actual time=171.26..171.45 rows=60 loops=1)
               Sort Key: lower(substr(p.last_name, 1, 1))
               ->  Merge Join  (cost=17545.53..17917.25 rows=273 width=106) (actual time=36.64..170.53 rows=60 loops=1)
                     Merge Cond: ("outer".userid = "inner".userid)
                     Join Filter: ("inner".course_id = 707)
                     ->  Sort  (cost=17545.53..17546.22 rows=273 width=88) (actual time=28.62..28.84 rows=60 loops=1)
                           Sort Key: dfcp.userid
                           ->  Nested Loop  (cost=0.00..17534.46 rows=273 width=88) (actual time=7.99..27.49 rows=60 loops=1)
                                 Join Filter: ("outer".id = "inner".dyn_field_id)
                                 ->  Nested Loop  (cost=0.00..3685.31 rows=272 width=72) (actual time=7.67..8.95 rows=60 loops=1)
                                       Join Filter: ("outer".giver = "inner".giver)
                                       ->  Index Scan using df_person_giver_id_idx on dyn_field_person dfp  (cost=0.00..6.20 rows=1 width=16) (actual time=0.14..0.17 rows=1 loops=1)
                                             Filter: (extern_item_id = 10)
                                       ->  Materialize  (cost=3674.43..3674.43 rows=374 width=56) (actual time=7.49..7.69 rows=60 loops=1)
                                             ->  Nested Loop  (cost=0.00..3674.43 rows=374 width=56) (actual time=0.24..7.22 rows=60 loops=1)
                                                   ->  Nested Loop  (cost=0.00..1400.89 rows=378 width=33) (actual time=0.10..1.34 rows=60 loops=1)
                                                         ->  Index Scan using course_pkey on course c  (cost=0.00..5.08 rows=1 width=16) (actual time=0.04..0.05 rows=1 loops=1)
                                                               Index Cond: (id = 707)
                                                         ->  Index Scan using group_data_this_idx on group_data gd  (cost=0.00..1390.80 rows=402 width=17) (actual time=0.04..0.70 rows=60 loops=1)
                                                               Index Cond: ("outer".active_group_id = gd.this_group_id)
                                                   ->  Index Scan using person_pkey on person p  (cost=0.00..6.01 rows=1 width=23) (actual time=0.07..0.08 rows=1 loops=60)
                                                         Index Cond: ("outer".item_text = p.userid)
                                 ->  Index Scan using df_content_person_userid_id_idx on dyn_field_content_person dfcp  (cost=0.00..50.75 rows=12 width=16) (actual time=0.08..0.23 rows=11 loops=60)
                                       Index Cond: (dfcp.userid = "outer".item_text)
                     ->  Index Scan using participant_uid_cid_idx on participant pt  (cost=0.00..349.76 rows=7808 width=18) (actual time=0.07..84.34 rows=7722 loops=1)
 Total runtime: 173.37 msec
(28 rows)

Time: 183.37 ms


Re: avoiding seqscan?

From: Josh Berkus

Palle,

> I have a SQL statement that I cannot get to use the index. PostgreSQL
> insists on using a seqscan and performance is very poor. set enable_seqscan
> = false boosts performance drastically, as you can see below. Since a
> seqscan is not always bad, I'd rather not turn it off completely, but
> rather get the planner to do the right thing here. Is there another way to
> do this, apart from setting enable_seqscan=false?

In your postgresql.conf, try setting effective_cache_size to something like
50% of your system's RAM, and lowering random_page_cost to 2.0 or even 1.5.
Then restart PostgreSQL and try your query again.
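
For example, something like this (a sketch, assuming a machine with 1 GB of
RAM; effective_cache_size is counted in 8 kB disk pages, so 50% of 1 GB is
roughly 65536 pages, and both settings can be tried per-session with SET
before touching postgresql.conf):

    SET effective_cache_size = 65536;  -- ~512 MB, expressed in 8 kB pages
    SET random_page_cost = 2.0;        -- default is 4.0
    EXPLAIN ANALYZE ... ;              -- rerun the query and compare plans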

What version, btw?


--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: avoiding seqscan?

From: Palle Girgensohn

Hi,

Indeed, setting random_page_cost does the trick. Thanks!

It seems to make sense to set random_page_cost to this value. Are there any
drawbacks?

postgresql-7.3.4

postgresql.conf:

tcpip_socket = true
max_connections = 100
superuser_reserved_connections = 2

#       Performance
#
shared_buffers = 12000
sort_mem = 8192
vacuum_mem = 32768
effective_cache_size = 64000
random_page_cost = 2

...
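
(With 8 kB pages, that effective_cache_size of 64000 works out to about
500 MB, and shared_buffers = 12000 to about 94 MB.)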

--On Sunday, September 28, 2003 14.34.25 -0700 Josh Berkus
<josh@agliodbs.com> wrote:

> Palle,
>
>> I have a SQL statement that I cannot get to use the index. PostgreSQL
>> insists on using a seqscan and performance is very poor. set
>> enable_seqscan = false boosts performance drastically, as you can see
>> below. Since a seqscan is not always bad, I'd rather not turn it off
>> completely, but rather get the planner to do the right thing here. Is
>> there another way to do this, apart from setting enable_seqscan=false?
>
> In your postgresql.conf, try setting effective_cache_size to something
> like 50% of your system's RAM, and lowering random_page_cost to 2.0 or
> even 1.5.  Then restart PostgreSQL and try your query again.
>
> What version, btw?
>
>
> --
> Josh Berkus
> Aglio Database Solutions
> San Francisco
>





Re: avoiding seqscan?

From: Josh Berkus

Palle,

> Indeed, setting random_page_cost does the trick. Thanks!
>
> It seems to make sense to set random_page_cost to this value. Are there any
> drawbacks?

Only if your server was heavily multi-tasking, and as a result had little
RAM+CPU available.  Then you'd want to raise the value again.

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: avoiding seqscan?

From: Palle Girgensohn

Will that make a difference? From what I've seen, it does not make much
difference, but I have seen queries speed up when rewritten with explicit
joins. I guess it depends on other things, but is it really so that
explicit joins are bad somehow? Do you have any pointers to documentation
about it, if so?

Thanks,
Palle

--On Monday, September 29, 2003 00.54.43 +0200 Gaetano Mendola
<mendola@bigfoot.com> wrote:

> Palle Girgensohn wrote:
>> uu=# explain analyze
>> uu-#  select lower(substr(p.last_name,1,1)) as letter, count(*)
>> uu-#  FROM course c join group_data gd on (c.active_group_id = gd.this_group_id)
>> uu-#       join person p on (gd.item_text = p.userid)
>> uu-#       join dyn_field_person dfp on (dfp.extern_item_id = 10 and dfp.giver=c.giver)
>> uu-#       join dyn_field_content_person dfcp on (dfp.id = dfcp.dyn_field_id and dfcp.userid=p.userid)
>> uu-#       left outer join participant pt on (pt.userid = p.userid and pt.course_id = 707)
>> uu-#  WHERE c.id = 707
>> uu-#  group by 1
>> uu-# ;
>
> Why are you using this form of join? When it is not necessary, use the
> implicit form.
>
>
> Regards
> Gaetano Mendola
>
>
>





Re: avoiding seqscan?

From: Palle Girgensohn

--On Monday, September 29, 2003 15.32.31 +0200 Gaetano Mendola
<mendola@bigfoot.com> wrote:

> They are not absolutely bad, but sometimes the path you choose is not
> optimal. In PostgreSQL 7.4, explicit joins will be less limiting for
> the planner.
>
> Regards
> Gaetano Mendola

Ah, OK. True! In this case, though, the SQL queries are crafted with great
care, since we have a lot of data in a few of the tables while others are
almost empty, so we try to limit the amount of data as early as possible.
Our experience says that we often do a better job than the planner, since
we know which tables are "fat". Hence, we have actually moved to explicit
joins in our queries and sometimes gained speed.

But, in the general case, implicit might be better, I guess.

Regards,
Palle




Re: avoiding seqscan?

From: Gaetano Mendola

Palle Girgensohn wrote:
> Will that make a difference? From what I've seen, it does not make much
> difference, but I have seen queries speed up when rewritten with explicit
> joins. I guess it depends on other things, but is it really so that
> explicit joins are bad somehow? Do you have any pointers to
> documentation about it, if so?
>
> Thanks,
> Palle


They are not absolutely bad, but sometimes the path you choose is not
optimal. In PostgreSQL 7.4, explicit joins will be less limiting for
the planner.
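
For example, with two hypothetical tables a and b (in 7.3 the explicit form
fixes the join order exactly as written, while 7.4 can reorder explicit
JOINs much like the implicit form, governed by the new join_collapse_limit
setting):

    -- implicit form: the planner is free to pick the join order
    SELECT * FROM a, b WHERE a.id = b.a_id;

    -- explicit form: 7.3 joins a to b exactly as written
    SELECT * FROM a JOIN b ON (a.id = b.a_id);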

Regards
Gaetano Mendola


Re: avoiding seqscan?

From: Christopher Browne

girgen@pingpong.net (Palle Girgensohn) writes:
> Will that make a difference? From what I've seen, it does not make
> much difference, but I have seen queries speed up when rewritten with
> explicit joins. I guess it depends on other things, but is it really
> so that explicit joins are bad somehow? Do you have any pointers
> to documentation about it, if so?

The problem is that if you expressly specify the joins, the query
optimizer can't choose its own paths.  And while that may not be
better at the moment, it is quite possible that when you upgrade to a
newer version, those queries, if "not join-specified," could
immediately get faster.

I would expect that the query that uses implicit joins will be clearer
to read, which adds a little further merit to that direction.

That goes along with the usual way that it is preferable to optimize
things, namely that you should start by solving the problem as simply
as you can, and only proceed to further optimization if that actually
proves necessary.  Optimization efforts commonly add complexity and
make code more difficult to maintain; that's not the place to start if
you don't even know the effort is necessary.
--
(format nil "~S@~S" "cbbrowne" "libertyrms.info")
<http://dev6.int.libertyrms.com/>
Christopher Browne
(416) 646 3304 x124 (land)

Re: avoiding seqscan?

From: Gaetano Mendola

Palle Girgensohn wrote:
> uu=# explain analyze
> uu-#  select lower(substr(p.last_name,1,1)) as letter, count(*)
> uu-#  FROM course c join group_data gd on (c.active_group_id = gd.this_group_id)
> uu-#       join person p on (gd.item_text = p.userid)
> uu-#       join dyn_field_person dfp on (dfp.extern_item_id = 10 and dfp.giver=c.giver)
> uu-#       join dyn_field_content_person dfcp on (dfp.id = dfcp.dyn_field_id and dfcp.userid=p.userid)
> uu-#       left outer join participant pt on (pt.userid = p.userid and pt.course_id = 707)
> uu-#  WHERE c.id = 707
> uu-#  group by 1
> uu-# ;

Why are you using this form of join? When it is not necessary, use the
implicit form.
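
For instance, the inner joins in your query could move into the WHERE
clause (a sketch; the left outer join has no implicit equivalent, so it
must stay explicit):

    select lower(substr(p.last_name,1,1)) as letter, count(*)
    FROM course c, group_data gd, dyn_field_person dfp,
         dyn_field_content_person dfcp,
         person p left outer join participant pt
             on (pt.userid = p.userid and pt.course_id = 707)
    WHERE c.id = 707
      and c.active_group_id = gd.this_group_id
      and gd.item_text = p.userid
      and dfp.extern_item_id = 10
      and dfp.giver = c.giver
      and dfp.id = dfcp.dyn_field_id
      and dfcp.userid = p.userid
    group by 1;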


Regards
Gaetano Mendola




Re: avoiding seqscan?

From: Palle Girgensohn

--On Monday, September 29, 2003 11.12.55 -0400 Christopher Browne
<cbbrowne@libertyrms.info> wrote:

> girgen@pingpong.net (Palle Girgensohn) writes:
>> Will that make a difference? From what I've seen, it does not make
>> much difference, but I have seen queries speed up when rewritten with
>> explicit joins. I guess it depends on other things, but is it really
>> so that explicit joins are bad somehow? Do you have any pointers
>> to documentation about it, if so?
>
> The problem is that if you expressly specify the joins, the query
> optimizer can't choose its own paths.  And while that may not be
> better at the moment, it is quite possible that when you upgrade to a
> newer version, those queries, if "not join-specified," could
> immediately get faster.

You've got a point here. Still, with some queries, since the data is pretty
static and we know a lot about its distribution over the tables, we had to
explicitly tell PostgreSQL how to optimize the queries to get them fast
enough. We cannot afford any queries to take more than fractions of a
second, really.

> I would expect that the query that uses implicit joins will be clearer
> to read, which adds a little further merit to that direction.

I actually don't agree on this, but I guess it depends on which syntax
you're used to.

> That goes along with the usual way that it is preferable to optimize
> things, namely that you should start by solving the problem as simply
> as you can, and only proceed to further optimization if that actually
> proves necessary.  Optimization efforts commonly add complexity and
> make code more difficult to maintain; that's not the place to start if
> you don't even know the effort is necessary.

Oh, but of course. For the queries I refer to, optimization actually proved
necessary, believe me :-)

Cheers,
Palle