Thread: Dealing with complex queries
Any hints or suggestions on dealing with complex queries? Most of the queries I have done with pgsql up to this point have been fairly straightforward compared to a current project I have.

Also, I find the 7.3 explain output harder to read. I never quite fully understood the 7.2 explain output, but 7.3 is even more informational/chatty. Is there an "explain tutorial" somewhere? Especially one covering the 7.3 details.

As an example of the query and explain output:

select jc.type, jc.id,
       jc.last, jc.first,
       jc.track, jc.date,
       jc.race, jc.day,
       ppl.carried_as, pe.jt_id
from jc_people jc, hraces hr,
     rkeys rk, pplkeys ppl,
     people pe
where jc.type = 'j' and
      jc.track = rk.track and
      jc.date = rk.date and
      jc.race = rk.race and
      hr.race_key = rk.race_key and
      ppl.ppl_key = hr.jockey_key and
      substring(ppl.carried_as from 1 for 3)
        = substring(jc.last from 1 for 3) and
      pe.type = 'j' and
      ppl.type = 'j' and
      pe.jt_id = 0 and
      pe.ppl_key = ppl.ppl_key
limit 10;

Limit  (cost=0.00..34349.28 rows=1 width=141)
  ->  Merge Join  (cost=0.00..34349.28 rows=1 width=141)
        Merge Cond: ("outer".ppl_key = "inner".jockey_key)
        Join Filter: ("inner".race_key = "outer".race_key)
        ->  Nested Loop  (cost=0.00..62872423.96 rows=5149 width=133)
              Join Filter: ("outer".race = "inner".race)
              ->  Nested Loop  (cost=0.00..62540586.27 rows=42980 width=116)
                    Join Filter: ("substring"(("outer".carried_as)::text, 1, 3) = "substring"(("inner".last)::text, 1, 3))
                    ->  Nested Loop  (cost=0.00..61156275.74 rows=1280 width=51)
                          Join Filter: ("outer".ppl_key = "inner".ppl_key)
                          ->  Index Scan using people_pplkey on people pe  (cost=0.00..2904.19 rows=6275 width=8)
                                Filter: ("type" = 'j'::bpchar)
                          ->  Seq Scan on pplkeys ppl  (cost=0.00..8929.70 rows=65324 width=43)
                                Filter: ("type" = 'j'::bpchar)
                    ->  Seq Scan on jc_people jc  (cost=0.00..963.96 rows=6716 width=65)
                          Filter: ("type" = 'j'::bpchar)
              ->  Index Scan using rk_track_date_eve_race on rkeys rk  (cost=0.00..7.70 rows=1 width=17)
                    Index Cond: (("outer".track = rk.track) AND ("outer".date = rk.date))
        ->  Index Scan using hr_jockey_key on hraces hr  (cost=0.00..26712.29 rows=492390 width=8)
On Mon, Feb 03, 2003 at 16:32:10 -0500, Francisco Reyes <lists@natserv.com> wrote:
> Any hints suggestions on dealing with complex queries.

>       substring(ppl.carried_as from 1 for 3)
>         = substring(jc.last from 1 for 3) and

Joining on substrings is a good sign that your design needs rethinking. It indicates that your data isn't normalized, and a substring comparison isn't going to be a particularly fast test (though I didn't check whether it is the bottleneck for the particular query you are doing).
On Tue, 4 Feb 2003, Bruno Wolff III wrote:
> On Mon, Feb 03, 2003 at 16:32:10 -0500, Francisco Reyes <lists@natserv.com> wrote:
> > Any hints suggestions on dealing with complex queries.
> >
> >       substring(ppl.carried_as from 1 for 3)
> >         = substring(jc.last from 1 for 3) and
>
> Joining on substrings is a good sign that your design needs rethinking.

The tables in the substring comparison are some of our tables compared against tables supplied by another company. It is a fact that the names don't always match and have variations. The whole exercise is to match by names and then populate our table with their keys, so that in the future we can match by keys in the feeds we receive from them.
On Tue, Feb 04, 2003 at 09:54:59 -0500, Francisco Reyes <lists@natserv.com> wrote:
> The tables in the substring comparison are some of our tables compared
> against tables supplied by another company. It is a fact that the names
> don't always match and have variations.
>
> The whole exercise is to match by names and then populate our table with
> their keys, so that in the future we can match by keys in the feeds we
> receive from them.

I think you might be able to coerce use of an index by using "like" and only using substring on one side. You will want to try it both ways to see which generates a faster query. You probably also want to use a C locale if possible. You will need an index on the column that is not in the substring call, and you will need to concatenate a '%' onto the substring being used for the search. So you might have a clause like:

    ppl.carried_as like substring(jc.last from 1 for 3) || '%' and

Another idea, if the list doesn't have too many keys, is to just get a sorted distinct list and have a human scan it for matches.
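Bruno's suggestion can be sketched as follows. The table and column names come from the thread; the index name is hypothetical, and the comment about locales reflects the general rule that a plain btree index can only serve a left-anchored LIKE in the C locale (later releases added text_pattern_ops for other locales):

```sql
-- Hypothetical index name; this is the column NOT wrapped in substring().
-- In the C locale a plain btree index can satisfy a left-anchored LIKE.
CREATE INDEX pplkeys_carried_as_idx ON pplkeys (carried_as);

-- Keep substring() on one side only, so the other side stays indexable:
SELECT ppl.ppl_key, ppl.carried_as, jc.last
FROM pplkeys ppl, jc_people jc
WHERE ppl.carried_as LIKE substring(jc.last from 1 for 3) || '%';
```

The point of the rewrite is that `substring(ppl.carried_as ...) = ...` hides the column inside a function call, forcing a sequential scan, while `carried_as LIKE 'abc%'` can be turned into an indexable range condition by the planner.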
Try creating an index on the substrings: you will need to wrap the substring call in a function marked "immutable" (or "with (iscachable)" for versions prior to 7.3) for the index to be of any use. I have done a lot of the sort of thing you describe, with good results using that method.

--- Francisco Reyes <lists@natserv.com> wrote:
> The tables in the substring comparison are some of our tables compared
> against tables supplied by another company. It is a fact that the names
> don't always match and have variations.
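Jeff's functional-index approach might look like the following on 7.3. The function and index names are hypothetical; on versions before 7.3 the function would be declared WITH (iscachable) instead of IMMUTABLE:

```sql
-- Hypothetical immutable wrapper, so the expression can be indexed:
CREATE FUNCTION first3(text) RETURNS text AS
  'SELECT substring($1 from 1 for 3)'
  LANGUAGE sql IMMUTABLE;

-- Functional indexes on both sides of the join condition:
CREATE INDEX pplkeys_first3_idx  ON pplkeys   (first3(carried_as));
CREATE INDEX jc_people_first3_idx ON jc_people (first3(last));

-- The join condition is then written against the indexed expression:
--   ... WHERE first3(ppl.carried_as) = first3(jc.last) ...
```

The IMMUTABLE marking is what allows the planner to use the index: it promises that the function always returns the same result for the same input, so the indexed values cannot go stale.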
On Mon, 3 Feb 2003, Francisco Reyes wrote:

Follow-up to my question: here is an explain analyze of the query (same as before, except I upped the limit to 200). One thing which I don't understand, and which seems likely to be a big problem, is that some of the query elements seem to be doing thousands of loops (e.g. 28K loops).

> As an example of the query and explain output:
> select jc.type, jc.id,
>        jc.last, jc.first,
>        jc.track, jc.date,
>        jc.race, jc.day,
>        ppl.carried_as, pe.jt_id
> from jc_people jc, hraces hr,
>      rkeys rk, pplkeys ppl,
>      people pe
> where jc.type = 'j' and
>       jc.track = rk.track and
>       jc.date = rk.date and
>       jc.race = rk.race and
>       hr.race_key = rk.race_key and
>       ppl.ppl_key = hr.jockey_key and
>       substring(ppl.carried_as from 1 for 3)
>         = substring(jc.last from 1 for 3) and
>       pe.type = 'j' and
>       ppl.type = 'j' and
>       pe.jt_id = 0 and
>       pe.ppl_key = ppl.ppl_key
> limit 10;

Limit  (cost=0.00..30224.00 rows=1 width=141) (actual time=356090.83..1018637.83 rows=44 loops=1)
  ->  Merge Join  (cost=0.00..30224.00 rows=1 width=141) (actual time=356090.81..1018637.26 rows=44 loops=1)
        Merge Cond: ("outer".ppl_key = "inner".jockey_key)
        Join Filter: ("inner".race_key = "outer".race_key)
        ->  Nested Loop  (cost=0.00..22384540.45 rows=1833 width=133) (actual time=532.06..1014419.54 rows=21951 loops=1)
              Join Filter: ("outer".race = "inner".race)
              ->  Nested Loop  (cost=0.00..22266406.22 rows=15301 width=116) (actual time=531.78..1005708.99 rows=28723 loops=1)
                    Join Filter: ("substring"(("outer".carried_as)::text, 1, 3) = "substring"(("inner".last)::text, 1, 3))
                    ->  Nested Loop  (cost=0.00..21773591.67 rows=456 width=51) (actual time=463.34..737215.23 rows=1591 loops=1)
                          Join Filter: ("outer".ppl_key = "inner".ppl_key)
                          ->  Index Scan using people_pplkey on people pe  (cost=0.00..2991.02 rows=2234 width=8) (actual time=0.19..397.73 rows=1591 loops=1)
                                Filter: (("type" = 'j'::bpchar) AND (jt_id = 0))
                          ->  Seq Scan on pplkeys ppl  (cost=0.00..8929.70 rows=65324 width=43) (actual time=0.06..421.59 rows=6770 loops=1591)
                                Filter: ("type" = 'j'::bpchar)
                    ->  Seq Scan on jc_people jc  (cost=0.00..963.96 rows=6716 width=65) (actual time=0.18..113.88 rows=6946 loops=1591)
                          Filter: ("type" = 'j'::bpchar)
              ->  Index Scan using rk_track_date_eve_race on rkeys rk  (cost=0.00..7.70 rows=1 width=17) (actual time=0.11..0.22 rows=8 loops=28723)
                    Index Cond: (("outer".track = rk.track) AND ("outer".date = rk.date))
        ->  Index Scan using hr_jockey_key on hraces hr  (cost=0.00..26712.29 rows=492390 width=8) (actual time=0.22..2569.24 rows=207341 loops=1)
Total runtime: 1018638.45 msec
Francisco J Reyes <fran@natserv.net> writes:
> ->  Nested Loop  (cost=0.00..21773591.67 rows=456 width=51) (actual time=463.34..737215.23 rows=1591 loops=1)
>       Join Filter: ("outer".ppl_key = "inner".ppl_key)
>       ->  Index Scan using people_pplkey on people pe  (cost=0.00..2991.02 rows=2234 width=8) (actual time=0.19..397.73 rows=1591 loops=1)
>             Filter: (("type" = 'j'::bpchar) AND (jt_id = 0))
>       ->  Seq Scan on pplkeys ppl  (cost=0.00..8929.70 rows=65324 width=43) (actual time=0.06..421.59 rows=6770 loops=1591)
>             Filter: ("type" = 'j'::bpchar)

Seems the main problem is this innermost join --- it's using the stupidest possible form of join. At the very least you'd like it to use an index on pplkeys. Are people.ppl_key and pplkeys.ppl_key the same datatype? Does the latter have an index?

			regards, tom lane
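Tom's two checks can be made from psql; the index name below is hypothetical, and this is only a sketch of the obvious fix if the index turns out to be missing:

```sql
-- In psql, \d lists each table's column types and indexes:
--   \d people
--   \d pplkeys
-- If pplkeys.ppl_key has no index (and its datatype matches
-- people.ppl_key), adding one gives the planner an alternative
-- to the repeated inner sequential scan:
CREATE INDEX pplkeys_ppl_key_idx ON pplkeys (ppl_key);
VACUUM ANALYZE pplkeys;  -- refresh planner statistics
```

A datatype mismatch matters because, on the PostgreSQL versions discussed here, a cross-type comparison can prevent the planner from using an otherwise suitable index.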
Jeff Eckermann <jeff_eckermann@yahoo.com> wrote:
> Try creating an index on the substrings: you will need
> to wrap the substring in a function marked "immutable"
> (or "with (iscachable)" for versions prior to 7.3) for
> ...

Bruno Wolff III <bruno@wolff.to> wrote:
> I think you might be able to coerce use of an
> index by using like and only using substring
> on one side.
> ...

Sorry for the delay; I wanted to respond after I had tried different approaches. Two points are worth noting.

The first speedup I got with the query was from changing

    jc.type = 'j' and
    pe.type = 'j' and
    ppl.type = 'j' and

to

    jc.type = 'j' and
    pe.type = jc.type and
    ppl.type = jc.type and

I didn't measure specific numbers, but the query went from something like 10 minutes to under a minute.

The second point is that after I did a vacuum full that night, the query was even faster, to the point that it ran in under 10 seconds.

Even though this query is now speedy after the vacuum full, I wonder if there are any suggested numbers for when one should break up a query. Was the query I posted within the parameters of what the server should be able to handle easily? What I am wondering is whether there is a threshold others have bumped into that is problematic for the server, for example along the lines of "10 tables and 30 conditions". I pulled those numbers out of thin air just to show the type of answer I mean, not because I have bumped into any limits myself. The query I posted is about as complex as I have written them.
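The dramatic effect of vacuum full suggests the tables had accumulated dead tuples and stale statistics. A routine maintenance sketch, using the table names from the thread (note that VACUUM FULL compacts the table but takes an exclusive lock, so the plain form is usually preferable for regular upkeep):

```sql
-- Reclaim dead-tuple space and refresh planner statistics:
VACUUM ANALYZE pplkeys;
VACUUM ANALYZE people;
VACUUM ANALYZE jc_people;
```

Keeping statistics current matters here because the planner's row estimates in the earlier EXPLAIN ANALYZE output were off by large factors, which is exactly what pushes it toward poor join strategies.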
For reference, my final query was:

select jc.type, jc.jc_id,
       jc.last_name, jc.first_name,
       jc.track, jc.date,
       jc.race, jc.day,
       ppl.carried_as, pe.jc_id
from jc_people jc, hraces hr,
     rkeys rk, pplkeys ppl,
     people pe
where jc.type = 'j' and
      pe.type = jc.type and
      ppl.type = jc.type and
      pe.jc_id = 0 and
      pe.ppl_key = ppl.ppl_key and
      jc.track = rk.track and
      jc.date = rk.date and
      jc.race = rk.race and
      hr.race_key = rk.race_key and
      ppl.ppl_key = hr.jockey_key and
      substring(ppl.carried_as from 1 for 3)
        = substring(jc.last_name from 1 for 3)
limit 200;