Thread: Query optimizing
Hi all,

I was hoping to receive some advice on a slow-running query in our business' Issue Tracking System. To shed some light on the queries below, here is a brief summary of how users interact with the system. The two main components in the system are Issues and FollowUps. An Issue is created by our technical support staff when some form of action needs to be taken in order to resolve a problem. FollowUps are entries recorded against an Issue depicting the sequence of events taken to resolve it. There are about 15,000 Issues in the system at present and about 95,000 FollowUps. As we need the system to be very responsive, each query should ideally run in under 1 second.

A lot of the reports our technical officers submit to us include a listing of all actioned issues for a given day along with the last modified followup of each said issue. With the number of rows in our database increasing at a high rate, these queries are starting to run too slowly.

Here is a condensed version of the two tables:

Issues:
=================================
id - integer
dt_created - timestamp
dt_modified - timestamp
t_title - varchar
t_description - varchar

FollowUps:
=================================
id - integer
dt_created - timestamp
dt_modified - timestamp
t_description - varchar
n_issue - foreign key to issues

We have identified that the slowness in our queries comes from returning the latest followup for each actioned issue that day. Without further ado, here are two variations I have tried within the system (neither of which is making the cut):

V1 (correlated subquery - very bad performance)

 (SELECT
     fu.*
  FROM
     manage_followup fu,
     manage_issue i
  WHERE
     i.id = fu.n_issue
  AND
     fu.id = (SELECT
                  id
              FROM
                  manage_followup
              WHERE
                  n_issue = i.id
              ORDER BY
                  dt_modified DESC
              LIMIT 1)) AS latestfu,

V2 (using GROUP BY, the "max" aggregate function and DISTINCT - better performance, but still bad because of DISTINCT)

 SELECT DISTINCT ON (fu.n_issue)
     fu.id,
     fu.dt_created,
     fu.dt_modified,
     fu.t_description,
     fu.n_issue AS issue_id
 FROM
     manage_followup fu,
     (SELECT
          n_issue,
          max(dt_modified) AS dt_modified
      FROM
          manage_followup
      GROUP BY
          n_issue) AS max_modified
 WHERE
     max_modified.n_issue = fu.n_issue
 AND
     fu.dt_modified = max_modified.dt_modified) AS latestfu ON (latestfu.issue_id = i.id),

We must use DISTINCT here as we sometimes use batch scripts to enter followups, which will give them all similar, if not equal, modification dates. We also can't use followup ids as an indicator of the latest followup because users of the system can retrospectively go back and change older followups.

I was hoping someone could provide a solution that does not require a correlated subquery or make use of the DISTINCT keyword. Any help would be much appreciated.

Kind regards,
Sebastian
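[For reference, here is a sketch of the same "latest followup per issue" lookup written with a window function rather than DISTINCT ON or a correlated subquery. This is not from the thread: it assumes PostgreSQL 8.4 or later (window functions were not available in the 8.3 release current at the time of this discussion), uses the table and column names from the schema above, and the date filter is only illustrative.]

-- Hypothetical window-function variant; requires PostgreSQL 8.4+.
-- Ties on dt_modified (e.g. from batch-entered followups) are broken
-- arbitrarily here by id DESC.
SELECT latestfu.*
FROM manage_issue i
JOIN (
    SELECT fu.*,
           row_number() OVER (PARTITION BY fu.n_issue
                              ORDER BY fu.dt_modified DESC, fu.id DESC) AS rn
    FROM manage_followup fu
) AS latestfu
  ON latestfu.n_issue = i.id
 AND latestfu.rn = 1
WHERE i.dt_modified >= DATE '2008-11-10'   -- illustrative "actioned that day" filter
  AND i.dt_modified <  DATE '2008-11-11';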
Sebastian Ritter wrote:
> A lot of the reports our technical officers submit to us include a listing
> of all actioned issues for a given day along with the last modified followup
> of each said issue. With the number of rows in our database increasing at a
> high rate, these queries are starting to run too slowly.
>
> We have identified that the slowness in our queries comes from returning the
> latest followup for each actioned issue that day. Without further ado, here
> are two variations I have tried within the system (neither of which is
> making the cut):
>
> V1 (correlated subquery - very bad performance)
>
>  (SELECT
>      fu.*
>   FROM
>      manage_followup fu,
>      manage_issue i
>   WHERE
>      i.id = fu.n_issue
>   AND
>      fu.id = (SELECT
>                   id
>               FROM
>                   manage_followup
>               WHERE
>                   n_issue = i.id
>               ORDER BY
>                   dt_modified DESC
>               LIMIT 1)) AS latestfu,

Do you have an index on (id, dt_modified) for manage_followup? Can you
provide an EXPLAIN ANALYSE for this?

--
Richard Huxton
Archonet Ltd
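[For reference, creating the composite index Richard is hinting at and timing the inner lookup might look like the sketch below. The exact column pair is an assumption on my part, chosen to match the (n_issue, dt_modified) index his test script creates later in the thread; the issue id in the probe query is a placeholder.]

-- Hypothetical index to support the inner "latest followup" lookup.
CREATE INDEX manage_followup_nissue_dtmod_idx
    ON manage_followup (n_issue, dt_modified);

ANALYSE manage_followup;

-- Probe the inner query against a single (placeholder) issue id.
EXPLAIN ANALYSE
SELECT id
FROM manage_followup
WHERE n_issue = 42            -- substitute any real issue id
ORDER BY dt_modified DESC
LIMIT 1;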
Hi Sebastian,

 - First of all I think there is an open parenthesis missing in the query V2.
Maybe in the V2 version you could restrict the results in the INNER query a bit more if you use a restriction clause like "WHERE n_issue = i.id" in it. That will certainly lower the number of rows returned by it to only 1 result.

Regards

--
Helio Campos Mello de Andrade
Hi Helio,
Sorry about the parenthesis - bad copy/pasting skills! To further discuss your suggestion: wouldn't adding n_issue = i.id as a WHERE clause filter cause the sub-query to become correlated, and thus much less efficient? I may be wrong, or may have misunderstood your suggestion.
Thanks for your help,
Sebastian
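[For reference, the distinction Sebastian is asking about can be sketched with the thread's own tables. Neither query below appears in the thread; the second rewrites the aggregate as a scalar subquery purely to illustrate what "correlated" means here.]

-- Uncorrelated: the derived table from V2 is computed once for all issues.
SELECT n_issue, max(dt_modified) AS dt_modified
FROM manage_followup
GROUP BY n_issue;

-- Correlated: referencing the outer i.id means the subquery is
-- re-evaluated for every row of manage_issue.
SELECT i.id,
       (SELECT max(fu.dt_modified)
        FROM manage_followup fu
        WHERE fu.n_issue = i.id) AS dt_modified
FROM manage_issue i;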
Richard Huxton wrote:
> Do you have an index on (id, dt_modified) for manage_followup? Can you
> provide an EXPLAIN ANALYSE for this?

> Hi Richard,
>
> Firstly, thank you very much for your swift reply. To answer your question,
> I had not been using an index on dt_modified. I have added it now and
> ran explain analyse on the function snippet. I am almost too embarrassed
> to show the result....
>
> QUERY PLAN
[snip]
> Total runtime: 412464.804 ms!!!!

Something wrong here. I've attached a small script that generates 10,000
issues and 10 follow-ups for each. It then pulls off the most recent
follow-ups for all issues occurring on a given date.

The explain analyse should show both indexes being used and a runtime of
a few milliseconds.

--
Richard Huxton
Archonet Ltd

BEGIN;

CREATE SCHEMA issuefup;
SET search_path = issuefup;

CREATE TABLE issues (
    id            integer,
    dt_created    timestamp(0) with time zone,
    dt_modified   timestamp(0) with time zone,
    t_title       varchar(100),
    t_description text
);

CREATE TABLE followups (
    id            integer,
    dt_created    timestamp(0) with time zone,
    dt_modified   timestamp(0) with time zone,
    t_description text,
    n_issue       integer NOT NULL
);

INSERT INTO issues (id, dt_created, dt_modified, t_title, t_description)
SELECT
    (d*100 + i),
    '2008-01-01'::date + (d * '1 day'::interval),
    '2008-01-01'::date + (d * '1 day'::interval),
    'issue title ' || d || '/' || i,
    'issue description ' || d || '/' || i
FROM generate_series(0,99) AS d, generate_series(0,99) AS i;

INSERT INTO followups (id, dt_created, dt_modified, t_description, n_issue)
SELECT
    (i.id * 10) + d,
    '2008-01-01'::date + ((i.id + d) * '1 day'::interval),
    '2008-01-01'::date + ((i.id + d) * '1 day'::interval),
    'followup description ' || ((i.id * 10) + d),
    i.id
FROM generate_series(0,9) AS d, issues AS i;

ALTER TABLE issues ADD PRIMARY KEY (id);
ALTER TABLE followups ADD PRIMARY KEY (id);
ALTER TABLE followups ADD CONSTRAINT n_issue_fkey
    FOREIGN KEY (n_issue) REFERENCES issues (id);

CREATE INDEX issues_dt_idx ON issues (dt_modified);
CREATE INDEX followups_nissue_dt_idx ON followups (n_issue, dt_modified);

ANALYSE;

EXPLAIN ANALYSE
SELECT fu.*
FROM issues i, followups fu
WHERE i.dt_modified = '2008-01-07 00:00:00+00'
  AND fu.id = (
      SELECT f.id
      FROM followups f
      WHERE f.n_issue = i.id
      ORDER BY f.dt_modified DESC
      LIMIT 1
  );

ROLLBACK;
Cheers for this Richard. The more I think about it, I believe the join is being made against ALL issues and followups first and then filtered by my WHERE clause conditions afterwards. This would incur a scan against all 15,000 issues and 95,000 followups. Set theory tells me that I should not use the entire issue table but rather the subset of interest and then join it to the followup table, instead of joining the two tables and then filtering the results. I was under the impression that the PostgreSQL optimizer would have done this logically by itself. Could it have something to do with the fact that it is a subquery and thus the planner cannot deduce filtering conditions from the outer query against it? My apologies if that made no sense.

In summary, what I'm trying to understand is the following: Will there be a performance difference between filtering query sets first and then joining them together, as opposed to joining first and then filtering? Does the optimiser not choose the best course of action either way, yielding the same result?

This might be a stupid question.
Sebastian
Sebastian Ritter wrote:
> Could it have something to do with the fact that it is a subquery and
> thus the planner cannot deduce filtering conditions from the outer query
> against it? My apologies if that made no sense.

Could make a difference.

> In summary, what I'm trying to understand is the following: Will there be
> a performance difference between filtering query sets first and then
> joining them together, as opposed to joining first and then filtering?
> Does the optimiser not choose the best course of action either way,
> yielding the same result?

There obviously is a performance difference between joining all of the
issues table versus joining 1% of it to followups.

In most cases the planner can push the condition into the subquery, but
not in all cases because:
1. It's not provably correct to do so
2. The planner isn't smart enough to figure out that it can

It's impossible to say which applies to you without knowing the full query.

--
Richard Huxton
Archonet Ltd
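[For reference, a sketch of the "filter first, then join" shape the thread converges on. The full report query was never posted, so the surrounding structure and the date bounds are assumptions; the inner lookup mirrors the indexed ORDER BY ... LIMIT 1 form from Richard's test script, which the (n_issue, dt_modified) index supports.]

-- Restrict issues to the day of interest first, then fetch only the
-- latest followup for each of those issues.
SELECT fu.*
FROM (
    SELECT id
    FROM manage_issue
    WHERE dt_modified >= DATE '2008-11-10'      -- illustrative bounds
      AND dt_modified <  DATE '2008-11-11'
) AS day_issues,
     manage_followup fu
WHERE fu.id = (
    SELECT f.id
    FROM manage_followup f
    WHERE f.n_issue = day_issues.id
    ORDER BY f.dt_modified DESC
    LIMIT 1
);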
Cheers for your help guys. Having filtered and then joined has substantially reduced the run time.

Much obliged,
Sebastian