Thread: [repost] partial index / funxtional idx or bad sql?

[repost] partial index / funxtional idx or bad sql?

From
csajl
Date:
my apologies - a strange key combination sent the message early.

----
greetings.

i have a query that is taking a rather long time to execute and have been
looking into setting up a partial index to help, although i'm not sure if this
is what i want.

here is the (simplified) table "posts":

id serial
type_id int
areacode smallint
content text

and the other table (areacodes) referenced:

site_id   smallint
areacode  smallint


the query is:

SELECT p.id, p.areacode, p.content
FROM posts p
WHERE p.type_id = ?
AND p.areacode in (
  select areacode from areacodes
   where site_id = ?
 )


the "posts" table has 100,000 rows of varying data, across areacodes and types.
given the type_id and site_id, the query is currently taking ~4 seconds to
return 8500 rows (on a dual proc/ gig ram linux box).

indexes on table "posts" are:
primary key (id)
and another on both (type_id, areacode)

index on the table "areacodes" is (site_id, areacode).

would a parital index help in speeding up this query?
are my current indexes counter productive?
or is it just my sql that need help?


thanks much for any help or pointers to information.

- seth

__________________________________
Do you Yahoo!?
The New Yahoo! Search - Faster. Easier. Bingo.
http://search.yahoo.com


Re: [repost] partial index / funxtional idx or bad sql?

From
Josh Berkus
Date:
Seth,

> SELECT p.id, p.areacode, p.content
> FROM posts p
> WHERE p.type_id = ?
> AND p.areacode in (
>   select areacode from areacodes
>    where site_id = ?
>  )

Unless you're using 7.4 from CVS, you want to get rid of that IN:

 SELECT p.id, p.areacode, p.content
 FROM posts p
 WHERE p.type_id = ?
 AND EXISTS (
  select areacode from areacodes
    where site_id = ?
    and p.areacode = areacodes.areacode
  );

See how that works, and if it's still slow, post the EXPLAIN ANALYZE.


--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: [repost] partial index / funxtional idx or bad sql?

From
csajl
Date:
hi josh.

i'm using 7.3.2.  i tried using EXISTS instead of the IN, but the same query
now returns in seven sceonds as opposed to four with the IN.


cmdb=# EXPLAIN ANALYZE
cmdb-# select c.class_id, c.areacode, c.title from classifieds c
cmdb-# where c.class_cat_id = '1'
cmdb-# and c.areacode IN (
cmdb(# select areacode from cm_areacode where site_id = '10')
cmdb-# ;
                                                                    QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using classifieds_dual_idx on classifieds c  (cost=0.00..26622.14
rows=1837 width=39) (actual time=345.48..2305.04 rows=8460 loops=1)
   Index Cond: (class_cat_id = 1)
   Filter: (subplan)
   SubPlan
     ->  Materialize  (cost=3.46..3.46 rows=4 width=2) (actual time=0.00..0.01
rows=5 loops=61966)
           ->  Index Scan using site_cm_areacode_idx on cm_areacode
(cost=0.00..3.46 rows=4 width=2) (actual time=0.14..0.22 rows=5 loops=1)
                 Index Cond: (site_id = 10)
 Total runtime: 2314.14 msec
(8 rows)
----------------------------------

classifieds_dual_idx  is the btree index on (class_type_id, areacode)
and site_cm_areacode_idx is the btree index on (site_id) only.
there is an index in the areacode table that has both (site_id, areacode) but
it's apparently not being used.  would it help the query to use that index
instead?

thanks for your help.



--- Josh Berkus <josh@agliodbs.com> wrote:
> Seth,
>
> > SELECT p.id, p.areacode, p.content
> > FROM posts p
> > WHERE p.type_id = ?
> > AND p.areacode in (
> >   select areacode from areacodes
> >    where site_id = ?
> >  )
>
> Unless you're using 7.4 from CVS, you want to get rid of that IN:
>
>  SELECT p.id, p.areacode, p.content
>  FROM posts p
>  WHERE p.type_id = ?
>  AND EXISTS (
>   select areacode from areacodes
>     where site_id = ?
>     and p.areacode = areacodes.areacode
>   );
>
> See how that works, and if it's still slow, post the EXPLAIN ANALYZE.


__________________________________
Do you Yahoo!?
The New Yahoo! Search - Faster. Easier. Bingo.
http://search.yahoo.com


Re: [repost] partial index / funxtional idx or bad sql?

From
Josh Berkus
Date:
Csajl,

> i'm using 7.3.2.  i tried using EXISTS instead of the IN, but the same
> query now returns in seven sceonds as opposed to four with the IN.
<snip>
> classifieds_dual_idx  is the btree index on (class_type_id, areacode)
> and site_cm_areacode_idx is the btree index on (site_id) only.
> there is an index in the areacode table that has both (site_id, areacode)
> but it's apparently not being used.  would it help the query to use that
> index instead?

No.
From the look of things, it's not the index scan that's taking time ... it's
the subplan, which is doing 61,000 loops.   Which is normal for IN, but not
for EXISTS.   You run VACUUM ANALYZE?

--
Josh Berkus
Aglio Database Solutions
San Francisco


Re: [repost] partial index / funxtional idx or bad sql?

From
Stephan Szabo
Date:
On Mon, 12 May 2003, csajl wrote:

> i'm using 7.3.2.  i tried using EXISTS instead of the IN, but the same query
> now returns in seven sceonds as opposed to four with the IN.
>
>
> cmdb=# EXPLAIN ANALYZE
> cmdb-# select c.class_id, c.areacode, c.title from classifieds c
> cmdb-# where c.class_cat_id = '1'
> cmdb-# and c.areacode IN (
> cmdb(# select areacode from cm_areacode where site_id = '10')
> cmdb-# ;

How about something like:

select c.class_id, c.areacode, c.title from
 classifieds c,
 (select distinct areacode from cm_areacode where site_id='10') a
 where c.class_cat_id='1' and c.areacode=a.areacode;


Re: [repost] partial index / funxtional idx or bad sql?

From
csajl
Date:
hi josh.

thanks for your help and time with this.

ran vacuum analyze, still timed in around 3seconds.
i dropped the site_id only index on the areacodes table in favor of the dual
site_id and areacode index and seemingly gained 1/2 second.

by using the IN, i gain another .3 of a second.  (i thought EXISTS was supposed
to be more efficient?)

the loop on the subplan (~62k) is killing me.  any alternatives to what i
thought would be a seemingly innocuous lookup?  the cm_Areacode table is
nothing more than two columns, associating each areacode into a site_id. (292
rows if i remember correctly)


cmdb=# EXPLAIN ANALYZE
cmdb-# select c.class_id, c.areacode, c.title from classifieds c
cmdb-# where c.class_cat_id = '1'
cmdb-# and EXISTS (
cmdb(# select areacode from cm_areacode cm where site_id = '10' and c.areacode
= cm.areacode)
cmdb-# ;
                                                                       QUERY
PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using classifieds_dual_idx on classifieds c  (cost=0.00..493277.77
rows=28413 width=39) (actual time=360.23..2523.08 rows=8460 loops=1)
   Index Cond: (class_cat_id = 1)
   Filter: (subplan)
   SubPlan
     ->  Index Scan using areacode_site_dual_cmareacode on cm_areacode cm
(cost=0.00..4.96 rows=1 width=2) (actual time=0.01..0.01 rows=0 loops=61966)
           Index Cond: ((site_id = 10) AND ($0 = areacode))
 Total runtime: 2533.93 msec
(7 rows)

cmdb=#
------------------------------------

cmdb=# EXPLAIN ANALYZE
cmdb-# select c.class_id, c.areacode, c.title from classifieds c
cmdb-# where c.class_cat_id = '1'
cmdb-# and c.areacode IN (
cmdb(# select areacode from cm_areacode where site_id = '10')
cmdb-# ;
                                                                     QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using classifieds_dual_idx on classifieds c  (cost=0.00..632183.80
rows=28413 width=39) (actual time=344.70..2287.93 rows=8460 loops=1)
   Index Cond: (class_cat_id = 1)
   Filter: (subplan)
   SubPlan
     ->  Materialize  (cost=7.40..7.40 rows=4 width=2) (actual time=0.00..0.00
rows=5 loops=61966)
           ->  Seq Scan on cm_areacode  (cost=0.00..7.40 rows=4 width=2)
(actual time=0.20..0.73 rows=5 loops=1)
                 Filter: (site_id = 10)
 Total runtime: 2296.83 msec
(8 rows)



--- Josh Berkus <josh@agliodbs.com> wrote:
> Csajl,
>
> > i'm using 7.3.2.  i tried using EXISTS instead of the IN, but the same
> > query now returns in seven sceonds as opposed to four with the IN.
> <snip>
> > classifieds_dual_idx  is the btree index on (class_type_id, areacode)
> > and site_cm_areacode_idx is the btree index on (site_id) only.
> > there is an index in the areacode table that has both (site_id, areacode)
> > but it's apparently not being used.  would it help the query to use that
> > index instead?
>
> No.
> From the look of things, it's not the index scan that's taking time ... it's
> the subplan, which is doing 61,000 loops.   Which is normal for IN, but not
> for EXISTS.   You run VACUUM ANALYZE?


__________________________________
Do you Yahoo!?
The New Yahoo! Search - Faster. Easier. Bingo.
http://search.yahoo.com


Re: [repost] partial index / funxtional idx or bad sql?

From
csajl
Date:
wow.

that did it.  so much for my knowing SQL...

unbelievable - thanks much.


cmdb=# EXPLAIN ANALYZE
cmdb-# select c.class_id, c.areacode, c.title from classifieds c
cmdb-# , (select distinct areacode from cm_areacode where site_id='10') a
cmdb-# where c.class_cat_id='1' and c.areacode=a.areacode;
                                                                    QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=7.44..1107.53 rows=279 width=41) (actual time=1.13..258.11
rows=8460 loops=1)
   ->  Subquery Scan a  (cost=7.44..7.46 rows=1 width=2) (actual
time=0.86..0.92 rows=5 loops=1)
         ->  Unique  (cost=7.44..7.46 rows=1 width=2) (actual time=0.85..0.88
rows=5 loops=1)
               ->  Sort  (cost=7.44..7.45 rows=4 width=2) (actual
time=0.85..0.86 rows=5 loops=1)
                     Sort Key: areacode
                     ->  Seq Scan on cm_areacode  (cost=0.00..7.40 rows=4
width=2) (actual time=0.20..0.73 rows=5 loops=1)
                           Filter: (site_id = 10)
   ->  Index Scan using classifieds_dual_idx on classifieds c
(cost=0.00..1096.59 rows=279 width=39) (actual time=0.22..44.28 rows=1692
loops=5)
         Index Cond: ((c.class_cat_id = 1) AND (c.areacode = "outer".areacode))
 Total runtime: 267.71 msec
(10 rows)





--- Stephan Szabo <sszabo@megazone23.bigpanda.com> wrote:
>
> On Mon, 12 May 2003, csajl wrote:
>
> > i'm using 7.3.2.  i tried using EXISTS instead of the IN, but the same
> query
> > now returns in seven sceonds as opposed to four with the IN.
> >
> >
> > cmdb=# EXPLAIN ANALYZE
> > cmdb-# select c.class_id, c.areacode, c.title from classifieds c
> > cmdb-# where c.class_cat_id = '1'
> > cmdb-# and c.areacode IN (
> > cmdb(# select areacode from cm_areacode where site_id = '10')
> > cmdb-# ;
>
> How about something like:
>
> select c.class_id, c.areacode, c.title from
>  classifieds c,
>  (select distinct areacode from cm_areacode where site_id='10') a
>  where c.class_cat_id='1' and c.areacode=a.areacode;
>


__________________________________
Do you Yahoo!?
The New Yahoo! Search - Faster. Easier. Bingo.
http://search.yahoo.com


Re: [repost] partial index / funxtional idx or bad sql?

From
"Jim C. Nasby"
Date:
On Mon, May 12, 2003 at 09:03:38PM -0700, csajl wrote:
>
> wow.
>
> that did it.  so much for my knowing SQL...
> >
> > How about something like:
> >
> > select c.class_id, c.areacode, c.title from
> >  classifieds c,
> >  (select distinct areacode from cm_areacode where site_id='10') a
> >  where c.class_cat_id='1' and c.areacode=a.areacode;
> >

Wow, I'll have to keep that in mind. Shouldn't the optimizer be able to
handle that? Could this get added to the TODO?
--
Jim C. Nasby (aka Decibel!)                    jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"


Re: [repost] partial index / funxtional idx or bad sql?

From
Tom Lane
Date:
"Jim C. Nasby" <jim@nasby.net> writes:
> Wow, I'll have to keep that in mind. Shouldn't the optimizer be able to
> handle that? Could this get added to the TODO?

No, 'cause it's done (in CVS tip).

I'm actually a bit hesitant now to recommend that people do such things,
because the 7.4 optimizer is likely to produce a better plan from the
unmodified IN query than it will from any explicitly "improved" version.
The 7.4 code knows several ways to do IN efficiently, but when you
hand-transform the query you are forcing the choice; perhaps wrongly.

An example from CVS tip and the regression database in which hand
transformation forces a less efficient plan choice:

regression=# explain analyze select * from tenk1 a where unique1 in (select ten from tenk1);
                                                             QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=483.17..484.91 rows=10 width=248) (actual time=407.14..409.16 rows=10 loops=1)
   Merge Cond: ("outer".unique1 = "inner".ten)
   ->  Index Scan using tenk1_unique1 on tenk1 a  (cost=0.00..1571.97 rows=10000 width=244) (actual time=0.41..1.60
rows=11loops=1) 
   ->  Sort  (cost=483.17..483.19 rows=10 width=4) (actual time=406.57..406.65 rows=10 loops=1)
         Sort Key: tenk1.ten
         ->  HashAggregate  (cost=483.00..483.00 rows=10 width=4) (actual time=406.08..406.26 rows=10 loops=1)
               ->  Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=4) (actual time=0.19..261.84 rows=10000
loops=1)
 Total runtime: 410.74 msec
(8 rows)

regression=# explain analyze select * from tenk1 a, (select distinct ten from tenk1) b
regression-# where a.unique1 = b.ten;
                                                           QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=1122.39..1232.59 rows=10 width=248) (actual time=476.67..666.02 rows=10 loops=1)
   ->  Subquery Scan b  (cost=1122.39..1172.39 rows=10 width=4) (actual time=475.94..662.00 rows=10 loops=1)
         ->  Unique  (cost=1122.39..1172.39 rows=10 width=4) (actual time=475.89..661.65 rows=10 loops=1)
               ->  Sort  (cost=1122.39..1147.39 rows=10000 width=4) (actual time=475.85..559.27 rows=10000 loops=1)
                     Sort Key: ten
                     ->  Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=4) (actual time=0.37..274.87 rows=10000
loops=1)
   ->  Index Scan using tenk1_unique1 on tenk1 a  (cost=0.00..6.01 rows=1 width=244) (actual time=0.27..0.31 rows=1
loops=10)
         Index Cond: (a.unique1 = "outer".ten)
 Total runtime: 687.53 msec
(9 rows)

So, for now, make the transformation ... but keep a note about the IN
version to try whenever you update to 7.4.

            regards, tom lane