Thread: Dealing with complex queries

Dealing with complex queries

From: Francisco Reyes
Any hints/suggestions on dealing with complex queries?
Most of the queries I have done with pgsql up to this point have been
fairly straight forward compared to a current project I have.

Also, I find the 7.3 explain output harder to read. I never quite fully
understood the 7.2 explain output, and 7.3 is even more informational/chatty.

Is there an "explain tutorial" somewhere? Especially one covering the 7.3
details.

As an example of the query and explain output:
select  jc.type, jc.id,
        jc.last, jc.first,
        jc.track, jc.date,
        jc.race, jc.day,
        ppl.carried_as, pe.jt_id
from    jc_people jc, hraces hr,
        rkeys rk, pplkeys ppl,
        people pe
where   jc.type = 'j' and
        jc.track = rk.track and
        jc.date  = rk.date  and
        jc.race  = rk.race  and
        hr.race_key = rk.race_key and
        ppl.ppl_key = hr.jockey_key and
        substring(ppl.carried_as from 1 for 3)
                = substring(jc.last from 1 for 3) and
        pe.type = 'j' and
        ppl.type= 'j' and
        pe.jt_id = 0 and
        pe.ppl_key = ppl.ppl_key
limit 10;



Limit  (cost=0.00..34349.28 rows=1 width=141)
   ->  Merge Join  (cost=0.00..34349.28 rows=1 width=141)
         Merge Cond: ("outer".ppl_key = "inner".jockey_key)
         Join Filter: ("inner".race_key = "outer".race_key)
         ->  Nested Loop  (cost=0.00..62872423.96 rows=5149 width=133)
               Join Filter: ("outer".race = "inner".race)
               ->  Nested Loop  (cost=0.00..62540586.27 rows=42980 width=116)
                     Join Filter: ("substring"(("outer".carried_as)::text, 1, 3) = "substring"(("inner".last)::text, 1,
3))
                     ->  Nested Loop  (cost=0.00..61156275.74 rows=1280 width=51)
                           Join Filter: ("outer".ppl_key = "inner".ppl_key)
                           ->  Index Scan using people_pplkey on people pe  (cost=0.00..2904.19 rows=6275 width=8)
                                 Filter: ("type" = 'j'::bpchar)
                           ->  Seq Scan on pplkeys ppl  (cost=0.00..8929.70 rows=65324 width=43)
                                 Filter: ("type" = 'j'::bpchar)
                     ->  Seq Scan on jc_people jc  (cost=0.00..963.96 rows=6716 width=65)
                           Filter: ("type" = 'j'::bpchar)
               ->  Index Scan using rk_track_date_eve_race on rkeys rk  (cost=0.00..7.70 rows=1 width=17)
                     Index Cond: (("outer".track = rk.track) AND ("outer".date = rk.date))
         ->  Index Scan using hr_jockey_key on hraces hr  (cost=0.00..26712.29 rows=492390 width=8)




Re: Dealing with complex queries

From: Bruno Wolff III
On Mon, Feb 03, 2003 at 16:32:10 -0500,
  Francisco Reyes <lists@natserv.com> wrote:
> Any hints/suggestions on dealing with complex queries?

>         substring(ppl.carried_as from 1 for 3)
>                 = substring(jc.last from 1 for 3) and

Joining on substrings is a good sign that your design needs rethinking.
This indicates that your data isn't normalized, and it isn't going to
be a particularly fast test (though I didn't check whether it is a
bottleneck for the particular query you are doing).

Re: Dealing with complex queries

From: Francisco Reyes
On Tue, 4 Feb 2003, Bruno Wolff III wrote:

> On Mon, Feb 03, 2003 at 16:32:10 -0500,
>   Francisco Reyes <lists@natserv.com> wrote:
> > Any hints/suggestions on dealing with complex queries?
>
> >         substring(ppl.carried_as from 1 for 3)
> >                 = substring(jc.last from 1 for 3) and
>
> Joining on substrings is a good sign that your design needs rethinking.

The tables in the substring comparison are some of our tables compared
to tables supplied by another company. The names don't always match
exactly and have variations.

The whole exercise is to match by names so we can populate our table with
their keys; in the future we can then match by keys in the feeds we receive
from them.
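
The back-fill step would look something like this simplified sketch
(with jt_id standing in for wherever their key ends up; the real
matching will have to be stricter than a three-character prefix):

update people
set    jt_id = jc.id
from   jc_people jc, pplkeys ppl
where  people.ppl_key = ppl.ppl_key and
       people.type = 'j' and
       ppl.type = 'j' and
       people.jt_id = 0 and
       substring(ppl.carried_as from 1 for 3)
               = substring(jc.last from 1 for 3);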


Re: Dealing with complex queries

From: Bruno Wolff III
On Tue, Feb 04, 2003 at 09:54:59 -0500,
  Francisco Reyes <lists@natserv.com> wrote:
> On Tue, 4 Feb 2003, Bruno Wolff III wrote:
>
> > On Mon, Feb 03, 2003 at 16:32:10 -0500,
> >   Francisco Reyes <lists@natserv.com> wrote:
> > > Any hints/suggestions on dealing with complex queries?
> >
> > >         substring(ppl.carried_as from 1 for 3)
> > >                 = substring(jc.last from 1 for 3) and
> >
> > Joining on substrings is a good sign that your design needs rethinking.
>
> The tables in the substring comparison are some of our tables compared
> to tables supplied by another company. The names don't always match
> exactly and have variations.
>
> The whole exercise is to match by names so we can populate our table with
> their keys; in the future we can then match by keys in the feeds we receive
> from them.

I think you might be able to coerce use of an index by using LIKE and
only applying substring() to one side. You will want to try it both ways
to see which generates a faster query. You probably also want to use
a C locale if possible. You will need an index on the column that is
not in the substring call, and you will need to concatenate a '%' onto
the substring being used for the search.

So you might have a clause like:
ppl.carried_as like substring(jc.last from 1 for 3) || '%' and
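
A minimal sketch of the supporting index (the index name is made up;
the table and column come from your query):

create index pplkeys_carried_as_idx on pplkeys (carried_as);

Under a C locale the planner can then turn the LIKE prefix into a range
scan over that index.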

Another idea, if the list doesn't have too many keys, is to just get a
sorted distinct list and have a human scan it for matches.
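
A minimal sketch of such a list, reusing the names from your query:

select distinct carried_as
from   pplkeys
where  type = 'j'
order  by carried_as;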

Re: Dealing with complex queries

From: Jeff Eckermann
Try creating an index on the substrings: you will need
to wrap the substring in a function marked "immutable"
(or "with (iscachable)" for versions prior to 7.3) for
the index to be of any use.
I have done a lot of the sort of thing you describe,
with good results using that method.
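
A sketch of that approach with 7.3 syntax (the function and index names
here are made up):

create function first3(text) returns text as
    'select substring($1 from 1 for 3)'
    language sql immutable;

create index pplkeys_carried_as_first3 on pplkeys (first3(carried_as));
create index jc_people_last_first3 on jc_people (first3(last));

Then write the join condition as first3(ppl.carried_as) = first3(jc.last)
so the planner can consider the indexes. If the columns are varchar rather
than text, 7.3 may want a matching varchar version of the function.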

--- Francisco Reyes <lists@natserv.com> wrote:
> On Tue, 4 Feb 2003, Bruno Wolff III wrote:
>
> > On Mon, Feb 03, 2003 at 16:32:10 -0500,
> >   Francisco Reyes <lists@natserv.com> wrote:
> > > Any hints/suggestions on dealing with complex queries?
> >
> > >         substring(ppl.carried_as from 1 for 3)
> > >                 = substring(jc.last from 1 for 3) and
> >
> > Joining on substrings is a good sign that your design needs rethinking.
>
> The tables in the substring comparison are some of our tables compared
> to tables supplied by another company. The names don't always match
> exactly and have variations.
>
> The whole exercise is to match by names so we can populate our table with
> their keys; in the future we can then match by keys in the feeds we receive
> from them.

Re: Dealing with complex queries

From: Francisco J Reyes
On Mon, 3 Feb 2003, Francisco Reyes wrote:

Follow-up to my question: here is an explain analyze of the query (same
as before, except the limit was upped to 200).

One thing which I don't understand, and which seems likely to be a big
problem, is that some of the query elements seem to be doing thousands
of loops (i.e. 28K loops).

> As an example of the query and explain output:
> select  jc.type, jc.id,
>         jc.last, jc.first,
>         jc.track, jc.date,
>         jc.race, jc.day,
>         ppl.carried_as, pe.jt_id
> from    jc_people jc, hraces hr,
>         rkeys rk, pplkeys ppl,
>         people pe
> where   jc.type = 'j' and
>         jc.track = rk.track and
>         jc.date  = rk.date  and
>         jc.race  = rk.race  and
>         hr.race_key = rk.race_key and
>         ppl.ppl_key = hr.jockey_key and
>         substring(ppl.carried_as from 1 for 3)
>                 = substring(jc.last from 1 for 3) and
>         pe.type = 'j' and
>         ppl.type= 'j' and
>         pe.jt_id = 0 and
>         pe.ppl_key = ppl.ppl_key
> limit 10;


 Limit  (cost=0.00..30224.00 rows=1 width=141) (actual time=356090.83..1018637.83 rows=44 loops=1)
   ->  Merge Join  (cost=0.00..30224.00 rows=1 width=141) (actual time=356090.81..1018637.26 rows=44 loops=1)
         Merge Cond: ("outer".ppl_key = "inner".jockey_key)
         Join Filter: ("inner".race_key = "outer".race_key)
         ->  Nested Loop  (cost=0.00..22384540.45 rows=1833 width=133) (actual time=532.06..1014419.54 rows=21951 loops=1)
               Join Filter: ("outer".race = "inner".race)
               ->  Nested Loop  (cost=0.00..22266406.22 rows=15301 width=116) (actual time=531.78..1005708.99 rows=28723 loops=1)
                     Join Filter: ("substring"(("outer".carried_as)::text, 1, 3) = "substring"(("inner".last)::text, 1, 3))
                     ->  Nested Loop  (cost=0.00..21773591.67 rows=456 width=51) (actual time=463.34..737215.23 rows=1591 loops=1)
                           Join Filter: ("outer".ppl_key = "inner".ppl_key)
                           ->  Index Scan using people_pplkey on people pe  (cost=0.00..2991.02 rows=2234 width=8) (actual time=0.19..397.73 rows=1591 loops=1)
                                 Filter: (("type" = 'j'::bpchar) AND (jt_id = 0))
                           ->  Seq Scan on pplkeys ppl  (cost=0.00..8929.70 rows=65324 width=43) (actual time=0.06..421.59 rows=6770 loops=1591)
                                 Filter: ("type" = 'j'::bpchar)
                     ->  Seq Scan on jc_people jc  (cost=0.00..963.96 rows=6716 width=65) (actual time=0.18..113.88 rows=6946 loops=1591)
                           Filter: ("type" = 'j'::bpchar)
               ->  Index Scan using rk_track_date_eve_race on rkeys rk  (cost=0.00..7.70 rows=1 width=17) (actual time=0.11..0.22 rows=8 loops=28723)
                     Index Cond: (("outer".track = rk.track) AND ("outer".date = rk.date))
         ->  Index Scan using hr_jockey_key on hraces hr  (cost=0.00..26712.29 rows=492390 width=8) (actual time=0.22..2569.24 rows=207341 loops=1)
 Total runtime: 1018638.45 msec


Re: Dealing with complex queries

From: Tom Lane
Francisco J Reyes <fran@natserv.net> writes:
>                      ->  Nested Loop  (cost=0.00..21773591.67 rows=456 width=51) (actual time=463.34..737215.23 rows=1591 loops=1)
>                            Join Filter: ("outer".ppl_key = "inner".ppl_key)
>                            ->  Index Scan using people_pplkey on people pe  (cost=0.00..2991.02 rows=2234 width=8) (actual time=0.19..397.73 rows=1591 loops=1)
>                                  Filter: (("type" = 'j'::bpchar) AND (jt_id = 0))
>                            ->  Seq Scan on pplkeys ppl  (cost=0.00..8929.70 rows=65324 width=43) (actual time=0.06..421.59 rows=6770 loops=1591)
>                                  Filter: ("type" = 'j'::bpchar)

Seems the main problem is this innermost join --- it's using the
stupidest possible form of join.  At the very least you'd like it
to use an index on pplkeys.  Are people.ppl_key and pplkeys.ppl_key
the same datatype?  Does the latter have an index?
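
A quick way to check both from psql (the index name below is
hypothetical, in case one turns out to be missing):

\d people
\d pplkeys

create index pplkeys_ppl_key_idx on pplkeys (ppl_key);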

            regards, tom lane

Re: Dealing with complex queries

From: Francisco Reyes
Jeff Eckermann <jeff_eckermann@yahoo.com> wrote:
>>Try creating an index on the substrings: you will need
>>to wrap the substring in a function marked "immutable"
>>(or "with (iscachable)" for versions prior to 7.3) for
....


Bruno Wolff III <bruno@wolff.to> wrote:
>>I think you might be able to coerce use of an
>>index by using like and only using substring
>>on one side.
...

Sorry for the delay. I wanted to give a response after I had tried
different approaches.

Two points worth noting.
The first speedup I experienced with the query came from changing
jc.type = 'j' and
pe.type = 'j' and
ppl.type= 'j' and


to

jc.type = 'j' and
pe.type = jc.type and
ppl.type= jc.type and

I didn't measure specific numbers, but the query went from something like
10 minutes to under a minute.

The second point is that after I did a vacuum full that night, the query
was even faster, to the point that it took less than 10 seconds.
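
For the record, the maintenance step was simply:

vacuum full;

Adding analyze (vacuum full analyze;) would also refresh the planner's
row estimates at the same time.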

Even though this query is now speedy after the vacuum full, I wonder if
there are any suggested numbers for when one should break up a query.
Was the query I posted within the parameters of what the server should
be able to handle easily?

What I am wondering is whether there are numbers others have bumped into
which are problematic for the server, for example along the lines of "10
tables and 30 conditions". I pulled those numbers out of thin air just to
show the kind of threshold I mean, not because I have bumped into any
limits myself. The query I posted is about as complex as I have written
them.

For reference my final query was:
select  jc.type, jc.jc_id,
        jc.last_name, jc.first_name,
        jc.track, jc.date,
        jc.race, jc.day,
        ppl.carried_as, pe.jc_id
from    jc_people jc, hraces hr,
        rkeys rk, pplkeys ppl,
        people pe
where   jc.type = 'j' and
        pe.type = jc.type and
        ppl.type= jc.type and
        pe.jc_id = 0 and
        pe.ppl_key = ppl.ppl_key and
        jc.track = rk.track and
        jc.date  = rk.date  and
        jc.race  = rk.race  and
        hr.race_key = rk.race_key and
        ppl.ppl_key = hr.jockey_key and
        substring(ppl.carried_as from 1 for 3)
                = substring(jc.last_name from 1 for 3)
limit 200;