need some help understanding slow query (followup) - Mailing list pgsql-sql

From: Esger Abbink
Subject: need some help understanding slow query (followup)
Msg-id: 200111301920.UAA27238@fikkie.vesc.nl
List: pgsql-sql
after doing some more investigation with permutations and shortened queries, i think the problem is definitely due to
the small subquery, as demonstrated below:
 


explain select * from ents e
 where e.set_id = 45
   and exists (select set_id, entity_id from qtys q
                where e.set_id = q.set_id and e.entity_id = q.entity_id);
 
NOTICE:  QUERY PLAN:

Index Scan using entities_set_id_idx on P_Trails_Entities e  (cost=0.00..207.85 rows=37 width=136)
  SubPlan
    ->  Index Scan using quantities_e_id_idx on P_Trails_Quantities q  (cost=0.00..4.25 rows=1 width=8)
 

that query is using the indices. however, if i use an IN or EXISTS subquery instead of set_id = some_number, the planner
immediately starts using seq scans again. that may be more efficient in the general case, but here it isn't, because the
result of that subquery is typically very small (almost always 1 row). using index scans is definitely more efficient here.
so how do i tell the planner to keep using the indices?
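(sketch only, not something tried above: postgres has a per-session enable_seqscan setting that can be switched off to
check whether the index plan really is the faster one for the IN/EXISTS form; the datasets/current sub-select is taken
from the original query quoted below)

-- sketch: steer the planner away from seq scans for this session, then put it back
set enable_seqscan to off;

explain select * from ents e
 where e.set_id in (select set_id from datasets
                     where update_id in (select cur_update_id from current))
   and exists (select 1 from qtys q
                where e.set_id = q.set_id and e.entity_id = q.entity_id);

set enable_seqscan to on;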
 


-----------------------
Hi,  
i have a little performance problem.

db (simplified):

table current:
current_update_id, ...

table datasets:
set_id, update_id, ....

table ents:
e_id, set_id, ...

table qtys:
set_id, e_id, ...

indexes are defined on all set_id's & datasets.update_id.
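
for reference, a rough DDL sketch of this simplified schema (column types are guesses, the "..." columns are omitted,
and the datasets index names are made up; ents_set_id_idx and qtys_set_id_idx are the names that show up in the plans
below):

create table current  (cur_update_id integer);  -- written current_update_id above, cur_update_id in the queries
create table datasets (set_id integer, update_id integer);
create table ents     (e_id integer, set_id integer);
create table qtys     (set_id integer, e_id integer);

create index datasets_set_id_idx    on datasets (set_id);
create index datasets_update_id_idx on datasets (update_id);
create index ents_set_id_idx        on ents (set_id);
create index qtys_set_id_idx        on qtys (set_id);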

an update consists of several sets, which in turn consist of several ents; for a specific ent in a set, multiple qtys
may exist.
 

(normal: 1 update - 1 set - few hundred ents - 1 qty per ent)

now i want to do the following: get some specific qty values for the ents of the last update only.

so i do a query like:

select some_other_fields from ents e, qtys q
 where e.set_id = q.set_id and e.e_id = q.e_id
   and e.set_id in (select set_id from datasets
                     where update_id in (select cur_update_id from current))
   and q.other_field = some_const;
 

this query takes ages :(

the query plan looks like this:

Merge Join  (cost=0.00..69979.50 rows=252 width=20)
  ->  Index Scan using qtys_set_id_idx on qtys q  (cost=0.00..2054.57 rows=30653
  ->  Index Scan using ents_set_id_idx on ents e  (cost=0.00..66847.20 rows=4119
        SubPlan
          ->  Materialize  (cost=1.52..1.52 rows=1 width=4)
                ->  Seq Scan on datasets  (cost=0.00..1.52 rows=1 width=4)
                      SubPlan
                        ->  Seq Scan on current  (cost=0.00..1.01 rows=1 width=4)
 

after i created an index on the e_id fields, the estimated cost went up even higher (to about 90,000 for plan lines 1 & 3).

this performance isn't acceptable, so i tried a different approach: instead of joining the larger tables and then
checking for the interesting set_id's, i'd first select the appropriate set_id's (into a temp table or view).
 

so i tried something like:

select set_id, e_id, some_fields from ents
 where set_id in (select set_id from datasets
                   where update_id in (select cur_update_id from current));
 

this query is planned as follows:

Seq Scan on ents  (cost=0.00..43331.89 rows=41197 width=136)
  SubPlan
    ->  Materialize  (cost=1.02..1.02 rows=1 width=4)
          ->  Seq Scan on datasets  (cost=0.00..1.02 rows=1 width=4)
                InitPlan
                  ->  Seq Scan on current  (cost=0.00..1.01 rows=1 width=4)
 

this query isn't using the created index (!?), and yes, vacuum analyze was done. the 2nd IN can safely be changed to =,
as that subselect always returns 1 row. the first IN subselect returns 1 row about 95% if not 100% of the time, but that
can't be guaranteed. if i ignore that and use = for that one as well, i get this:
 

Index Scan using ents_set_id_idx on ents  (cost=0.00..50.03 rows=37 width=136)
  InitPlan
    ->  Seq Scan on datasets  (cost=0.00..1.02 rows=1 width=4)
          InitPlan
            ->  Seq Scan on current  (cost=0.00..1.01 rows=1 width=4)
 

which is more like the performance we want..

using exists instead of in doesn't speed up the 1st query & results in the same plan.
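
(the exists form meant here is roughly the following; this is a reconstruction, the exact statement isn't in this mail,
and some_other_fields / some_const are the placeholders from the 1st query above)

select some_other_fields
  from ents e, qtys q
 where e.set_id = q.set_id
   and e.e_id = q.e_id
   and exists (select 1 from datasets d, current c
                where d.set_id = e.set_id
                  and d.update_id = c.cur_update_id)
   and q.other_field = some_const;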

so now i'm stumped (and stuck): the 1st approach uses the indices but is way too slow (likely because of the large
join?). the 2nd one doesn't use the indices and is (therefore?) way too slow as well.
 

clearly i'm doing something stupid here.

but what?


these tests were done on a db with 30k - 40k rows in the ents & qtys tables, but the production db must be able to run
with millions of rows and not take minutes (or hours) to cough up a specific set.
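
(one more sketch, untested and not part of the timings above: since the in sub-selects seem to be what pushes the
planner away from the indices, the current/datasets lookups could be folded straight into the join list; placeholders
as in the 1st query)

-- sketch: same result as the 1st query, but with the sub-selects flattened into plain joins
select some_other_fields
  from current c, datasets d, ents e, qtys q
 where d.update_id = c.cur_update_id
   and e.set_id = d.set_id
   and q.set_id = e.set_id
   and q.e_id = e.e_id
   and q.other_field = some_const;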
 

any help appreciated.

Esger Abbink

PS. this was on postgres 7.0.3


-- 
NeoMail - Webmail that doesn't suck... as much.
http://neomail.sourceforge.net

