Thread: need some help understanding sloq query

need some help understanding sloq query

From
Esger Abbink
Date:
Hi,

i have a little performance problem.

db (simplified):

table current:
current_update_id, ...

table datasets:
set_id, update_id, ....

table ents:
e_id, set_id, ...

table qtys:
set_id, e_id, ...

indexes are defined on all set_id's & datasets.update_id.

an update consists of several sets which in turn consists of several ents, for
a specific ent in a set multiple qtys may exist.
(normal: 1 update - 1 set - few hundred ents - 1 qty per ent)

now i want to do the following: i want to get some specific qty values for the
ents of the last update only.

so i do a query like:

select some_other_fields from ents e, qtys q where e.set_id = q.set_id and
e.e_id = q.e_id and e.set_id in (select set_id from datasets
where update_id in (select cur_update_id from current)) and
q.other_field = some_const ;

this query takes ages :(

the query plan looks like this:

Merge Join  (cost=0.00..69979.50 rows=252 width=20) ->  Index Scan using qtys_set_id_idx on qtys q (cost=0.00..2054.57
rows=30653width=8) ->  Index Scan using ents_set_id_idx on ents e (cost=0.00..66847.20 rows=41196 width=12)
SubPlan         -> Materialize  (cost=1.52..1.52 rows=1 width=4)               ->  Seq Scan on datasets
(cost=0.00..1.52rows=1 width=4)                     SubPlan                        ->  Seq Scan on current
(cost=0.00..1.01rows=1                    width=4)
 

after i created an index on the e_id fields the cost went up even higher (about
90.000 for line 1 & 3).

this performance isnt acceptable so i started a different approach. instead of
joining the larger tables and then checking for interesting set_id's
i'd first select the appropriate set_id's (into temp table or view).

so i tried something like:

select set_id, e_id, some_fields from ents where set_id in (select set_id from
datasets where update_id in (select cur_update_id from current)) ;

this query is being planned as followed:

Seq Scan on ents  (cost=0.00..43331.89 rows=41197 width=136) SubPlan   ->  Materialize  (cost=1.02..1.02 rows=1
width=4)        ->  Seq Scan on datasets  (cost=0.00..1.02 rows=1 width=4)               InitPlan                 ->
SeqScan on current  (cost=0.00..1.01 rows=1 width=4)
 

this query isnt using the created index (!?) and yes, vacuum analyze was done.
the 2nd in can be safely changed to = as that is always 1 row. the first in
sub-select is returning 1 row about 95% if not 100% of the time, but cant be
guaranteed to. if i ignore that and use = for that one as well i get this:

Index Scan using ents_set_id_idx on ents  (cost=0.00..50.03 rows=37 width=136) InitPlan    ->  Seq Scan on datasets
(cost=0.00..1.02rows=1 width=4)      InitPlan       ->  Seq Scan on current  (cost=0.00..1.01 rows=1 width=4)
 

which is more like the performance we want..

using exists instead of in doesnt speed up the 1st query & results in the same
plan.


so now i'm stumped (and stuck), the 1st approach uses the indices but is way
too slow (likely because of the large join?). the 2nd one doesnt use the
indices and is (therefore?) way too slow also.

clearly i'm doing something stupid here.

but what?


these tests were done on a db with 30k - 40k rows in the ents & qtys tables but
the production db must be able to run with millions of rows and not take
minutes (or hours) to cough up a specific set.

any help appreciated.

Esger Abbink

PS. this was on postgres 7.0.3








Re: need some help understanding sloq query

From
"Josh Berkus"
Date:
Esger,

> select some_other_fields from ents e, qtys q where e.set_id =
> q.set_id and
> e.e_id = q.e_id and e.set_id in (select set_id from datasets
> where update_id in (select cur_update_id from current)) and
> q.other_field = some_const ;
> 
> this query takes ages :(

I'm not surprised.  You're doing a nested subselect with the IN
operator; frankly, you couldn't make a relatively simple query any
slower than the above.  It looks like the query parser is doing its best
to optimize, but you've forced it to compare every row in ents JOIN qtys
against the whole datasets table.  What's wrong with:

SELECT some_fields 
FROM ents JOIN qtys USING (set_id)JOIN datasets USING (set_id)JOIN current ON datasets.update_id =
current.cur_update_id
WHERE other_field = some_constant;

That allows Tom's magic query engine to do its work optimizing.
Assuming, of course, that there *are* indexes on update_id and
datasets.set_id ...

I heartily recommend that you read Celko's "SQL for Smarties" (book
reviews: http://techdocs.postgresql.org/bookreviews.php ).  You seem to
have fallen into the trap of using complex queries to answer simple
questions, and your database performance is suffering because of it.

-Josh Berkus


Re: need some help understanding sloq query

From
"Esger Abbink"
Date:

> Esger,
> 
> > select some_other_fields from ents e, qtys q where e.set_id =
> > q.set_id and
> > e.e_id = q.e_id and e.set_id in (select set_id from datasets
> > where update_id in (select cur_update_id from current)) and
> > q.other_field = some_const ;
> > 
> > this query takes ages :(
> 
> I'm not surprised.  You're doing a nested subselect with the IN
> operator; frankly, you couldn't make a relatively simple query any
> slower than the above.  It looks like the query parser is doing its best
> to optimize, but you've forced it to compare every row in ents JOIN qtys
> against the whole datasets table.  What's wrong with:
> 
> SELECT some_fields 
> FROM ents JOIN qtys USING (set_id)
>     JOIN datasets USING (set_id)
>     JOIN current ON datasets.update_id = current.cur_update_id
> WHERE other_field = some_constant;
> 

i figured i wasnt doing things "right" ;)

unfortunately the query above takes down my postgres (7.0.3 tried on 2 servers). 1 join no problem, 2 joins results
in:

DEBUG:  Data Base System is in production state at Thu Dec  6 11:42:18 2001
Server process (pid 1095) exited with status 11 at Thu Dec  6 11:45:08 2001
Terminating any active server processes...
Server processes were terminated at Thu Dec  6 11:45:08 2001
Reinitializing shared memory and semaphores
The Data Base System is starting up

this is directly related to executing the query (or just the part with the first 2 joins) and is 100% reproducible
sofar.


secondly, from the costs explain gives for the query part i can get through (1st join: merge join costs 7800, index
scancosts 2000, index scan costs 4200) it seems that doing a simpler variant of my query (see the other post
-followup-)with IN replaced by '=' is a lot faster. the resulting query would be something like:
 

select some_other_fields from ents e, qtys q where 
e.set_id = q.set_id and
e.e_id = q.e_id and 
e.set_id = X and
q.other_field = some_const ;

this query has negligable costs (nested loop cost 200, index scan cost 50, index scan cost 4 on test db) but should be
envelopedin some for-loop construct to feed it the values for X after which the answers should be "glued" together. the
setof X's has no theoritical maximum that i can recall atm but in practice it currently is always size 1 and i dont
expectmuch if any growth. i cant however assume its size 1.
 


(the above does sound like a complex answer to a simple problem again though..)

> That allows Tom's magic query engine to do its work optimizing.
> Assuming, of course, that there *are* indexes on update_id and
> datasets.set_id ...
> 
> I heartily recommend that you read Celko's "SQL for Smarties" (book
> reviews: http://techdocs.postgresql.org/bookreviews.php ).  You seem to
> have fallen into the trap of using complex queries to answer simple
> questions, and your database performance is suffering because of it.

will investigate.

> 
> -Josh Berkus
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 3: if posting/reading through Usenet, please send an appropriate
> subscribe-nomail command to majordomo@postgresql.org so that your
> message can get through to the mailing list cleanly
> 
> 
> 

-- 
NeoMail - Webmail that doesn't suck... as much.
http://neomail.sourceforge.net


Re: need some help understanding sloq query

From
Arian Prins
Date:
Hello Esger,

Esger Abbink schreef:

> i have a little performance problem.
>
> db (simplified):
>
> table current:
> current_update_id, ...
>
> table datasets:
> set_id, update_id, ....
>
> table ents:
> e_id, set_id, ...
>
> table qtys:
> set_id, e_id, ...
>
> indexes are defined on all set_id's & datasets.update_id.
>
> an update consists of several sets which in turn consists of several ents, for
> a specific ent in a set multiple qtys may exist.
> (normal: 1 update - 1 set - few hundred ents - 1 qty per ent)
>
> now i want to do the following: i want to get some specific qty values for the
> ents of the last update only.
>
> so i do a query like:
>
> select some_other_fields from ents e, qtys q where e.set_id = q.set_id and
> e.e_id = q.e_id and e.set_id in (select set_id from datasets
> where update_id in (select cur_update_id from current)) and
> q.other_field = some_const ;

[SNIP]

If I understand correctly, the table current allways contains 1 row, namely the
current_update_id, identifying the update that you want to query on. That would mean you could
start out from this table and Inner Join all your tables... this way your result-set would be
all data for current_update_id:

select some_other_fields
from ents e,    qtys q,    datasets s,    current c
where e.set_id = q.set_id
and   e.e_id = q.e_id
and   s.set_id = e.set_id
and   s.update_id = c.current_update_id
and   q.other_field = some_const;

If you used primary keys and indexes this query should use indexes allways.
Other ideas:

select some_other_fields
from  ents e,        qtys q
where e.set_id = q.set_id
and   e.e_id = q.e_id
and   e.set_id in (select set_id                  from datasets                  where update_id = (select
max(current_update_id)from current))
 

Given your first example and queryplan, I think the problem is in the Mergejoin; both from
qtys and from ents an enormous amount of data is selected that is only narowed down in the
last step, the merge-join.

I think using my first query, the optimizer will be much more efficient. I couldn't reproduce
your query to test though... too much  vagueness.... If this doesn't help then post some
sql-definitions of your database.

Succes,
A. Prins.



Re: need some help understanding sloq query

From
Stephan Szabo
Date:
On Thu, 6 Dec 2001, Esger Abbink wrote:

> > Esger,
> >
> > > select some_other_fields from ents e, qtys q where e.set_id =
> > > q.set_id and
> > > e.e_id = q.e_id and e.set_id in (select set_id from datasets
> > > where update_id in (select cur_update_id from current)) and
> > > q.other_field = some_const ;
> > >
> > > this query takes ages :(
> >
> > I'm not surprised.  You're doing a nested subselect with the IN
> > operator; frankly, you couldn't make a relatively simple query any
> > slower than the above.  It looks like the query parser is doing its best
> > to optimize, but you've forced it to compare every row in ents JOIN qtys
> > against the whole datasets table.  What's wrong with:
> >
> > SELECT some_fields
> > FROM ents JOIN qtys USING (set_id)
> >     JOIN datasets USING (set_id)
> >     JOIN current ON datasets.update_id = current.cur_update_id
> > WHERE other_field = some_constant;
> >
>
> i figured i wasnt doing things "right" ;)
>
> unfortunately the query above takes down my postgres (7.0.3 tried on 2
> servers). 1 join no problem, 2 joins results in:

If you want to use JOIN, you'll need to upgrade. IIRC, 7.0.x had bugs that
caused this kind of failure that were fixed for 7.1.

I think the above is equivalent to:
select some_fields froments, qtys, datasets, currentwhere ents.set_id=qtys.set_id and      ents.set_id=datasets.set_id
and     datasets.update_id = current.cur_update_id and      other_field = some_constant;