Thread: Query optimizing
Hi all,

I was hoping to receive some advice on a slow-running query in our business' Issue Tracking System. To shed some light on the queries below, here is a brief summary of how users interact with the system. The two main components in the system are Issues and FollowUps. An Issue is created by our technical support staff when some form of action needs to be taken in order to resolve a problem. FollowUps are entries recorded against an Issue depicting the sequence of events taken to resolve it. There are about 15,000 Issues in the system at present and about 95,000 FollowUps. As we need the system to be very responsive, each query should ideally run in under 1 second.

A lot of the reports our technical officers submit to us include a listing of all actioned issues for a given day along with the last modified followup of each said issue. With the number of rows in our database increasing at a high rate, these queries are starting to run too slowly.

Here is a condensed version of the two tables:

Issues:
=================================
id - integer
dt_created - timestamp
dt_modified - timestamp
t_title - varchar
t_description - varchar

FollowUps:
=================================
id - integer
dt_created - timestamp
dt_modified - timestamp
t_description - varchar
n_issue - foreign key to issues

We have identified that the slowness in our queries comes from returning the latest followup for each actioned issue that day. Without further ado, here are two variations I have tried within the system (neither of which is making the cut):

V1 (correlated subquery - very bad performance)

 (SELECT
     fu.*
  FROM
     manage_followup fu,
     manage_issue i
  WHERE
     i.id = fu.n_issue
  AND
     fu.id = (SELECT
                  id
              FROM
                  manage_followup
              WHERE
                  n_issue = i.id
              ORDER BY
                  dt_modified DESC
              LIMIT 1)) AS latestfu,

V2 (using GROUP BY, the "max" aggregate function and DISTINCT - better performance, but still bad because of DISTINCT)

 SELECT DISTINCT ON (fu.n_issue)
     fu.id,
     fu.dt_created,
     fu.dt_modified,
     fu.t_description,
     fu.n_issue AS issue_id
 FROM
     manage_followup fu,
     (SELECT
          n_issue,
          max(dt_modified) AS dt_modified
      FROM
          manage_followup
      GROUP BY
          n_issue) AS max_modified
 WHERE
     max_modified.n_issue = fu.n_issue
 AND
     fu.dt_modified = max_modified.dt_modified) AS latestfu ON (latestfu.issue_id = i.id),

We must use DISTINCT here as we sometimes use batch scripts to enter followups, which will give them all similar, if not equal, modification dates. We also can't use followup ids as an indicator of the latest followup because users of the system can retrospectively go back and change older followups.

I was hoping someone could provide a solution that does not require a correlated subquery or make use of the DISTINCT keyword. Any help would be much appreciated.

Kind regards,
Sebastian
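[For reference, here is a sketch of the same "latest followup per issue" lookup written with a window function rather than DISTINCT ON or a correlated subquery. This is not from the thread: it assumes PostgreSQL 8.4 or later (window functions were not available in the 8.3 release current at the time of this discussion), uses the table and column names from the schema above, and the date filter is only illustrative.]

-- Hypothetical window-function variant; requires PostgreSQL 8.4+.
-- Ties on dt_modified (e.g. from batch-entered followups) are broken
-- arbitrarily here by id DESC.
SELECT latestfu.*
FROM manage_issue i
JOIN (
    SELECT fu.*,
           row_number() OVER (PARTITION BY fu.n_issue
                              ORDER BY fu.dt_modified DESC, fu.id DESC) AS rn
    FROM manage_followup fu
) AS latestfu
  ON latestfu.n_issue = i.id
 AND latestfu.rn = 1
WHERE i.dt_modified >= DATE '2008-11-10'   -- illustrative "actioned that day" filter
  AND i.dt_modified <  DATE '2008-11-11';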
Sebastian Ritter wrote:
> A lot of the reports our technical officers submit to us include a listing
> of all actioned issues for a given day along with the last modified followup
> of each said issue. With the number of rows in our database increasing at a
> high rate, these queries are starting to run too slowly.
>
> We have identified that the slowness in our queries comes from returning the
> latest followup for each actioned issue that day. Without further ado, here
> are two variations I have tried within the system (neither of which is
> making the cut):
>
> V1 (correlated subquery - very bad performance)
>
>  (SELECT
>      fu.*
>   FROM
>      manage_followup fu,
>      manage_issue i
>   WHERE
>      i.id = fu.n_issue
>   AND
>      fu.id = (SELECT
>                   id
>               FROM
>                   manage_followup
>               WHERE
>                   n_issue = i.id
>               ORDER BY
>                   dt_modified DESC
>               LIMIT 1)) AS latestfu,

Do you have an index on (id, dt_modified) for manage_followup? Can you
provide an EXPLAIN ANALYSE for this?

--
Richard Huxton
Archonet Ltd
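[For reference, creating the composite index Richard is hinting at and timing the inner lookup might look like the sketch below. The exact column pair is an assumption on my part, chosen to match the (n_issue, dt_modified) index his test script creates later in the thread; the issue id in the probe query is a placeholder.]

-- Hypothetical index to support the inner "latest followup" lookup.
CREATE INDEX manage_followup_nissue_dtmod_idx
    ON manage_followup (n_issue, dt_modified);

ANALYSE manage_followup;

-- Probe the inner query against a single (placeholder) issue id.
EXPLAIN ANALYSE
SELECT id
FROM manage_followup
WHERE n_issue = 42            -- substitute any real issue id
ORDER BY dt_modified DESC
LIMIT 1;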
Hi Sebastian,

 - First of all I think there is an open parenthesis missing in the query V2.
Maybe in the V2 version you could restrict the results in the INNER query a bit more if you use a restriction clause like "WHERE n_issue = i.id" in it. That will certainly lower the number of rows returned by it to only 1 result.

Regards

--
Helio Campos Mello de Andrade
Hi Helio,
Sorry about the parenthesis - bad copy/pasting skills! To further discuss your suggestion: wouldn't adding n_issue = i.id as a WHERE clause filter cause the sub-query to become correlated, and thus much less efficient? I may be wrong, or may have misunderstood your suggestion.
Thanks for your help,
Sebastian
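[For reference, the distinction Sebastian is asking about can be sketched with the thread's own tables. Neither query below appears in the thread; the second rewrites the aggregate as a scalar subquery purely to illustrate what "correlated" means here.]

-- Uncorrelated: the derived table from V2 is computed once for all issues.
SELECT n_issue, max(dt_modified) AS dt_modified
FROM manage_followup
GROUP BY n_issue;

-- Correlated: referencing the outer i.id means the subquery is
-- re-evaluated for every row of manage_issue.
SELECT i.id,
       (SELECT max(fu.dt_modified)
        FROM manage_followup fu
        WHERE fu.n_issue = i.id) AS dt_modified
FROM manage_issue i;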
Richard Huxton wrote:
> Do you have an index on (id, dt_modified) for manage_followup? Can you
> provide an EXPLAIN ANALYSE for this?

> Hi Richard,
>
> Firstly, thank you very much for your swift reply. To answer your question,
> I had not been using an index on dt_modified. I have added it now and
> ran explain analyse on the function snippet. I am almost too embarrassed
> to show the result....
>
> QUERY PLAN
[snip]
> Total runtime: 412464.804 ms!!!!

Something wrong here. I've attached a small script that generates 10,000
issues and 10 follow-ups for each. It then pulls off the most recent
follow-ups for all issues occurring on a given date.

The explain analyse should show both indexes being used and a runtime of
a few milliseconds.

--
Richard Huxton
Archonet Ltd

BEGIN;

CREATE SCHEMA issuefup;
SET search_path = issuefup;

CREATE TABLE issues (
    id            integer,
    dt_created    timestamp(0) with time zone,
    dt_modified   timestamp(0) with time zone,
    t_title       varchar(100),
    t_description text
);

CREATE TABLE followups (
    id            integer,
    dt_created    timestamp(0) with time zone,
    dt_modified   timestamp(0) with time zone,
    t_description text,
    n_issue       integer NOT NULL
);

INSERT INTO issues (id, dt_created, dt_modified, t_title, t_description)
SELECT
    (d*100 + i),
    '2008-01-01'::date + (d * '1 day'::interval),
    '2008-01-01'::date + (d * '1 day'::interval),
    'issue title ' || d || '/' || i,
    'issue description ' || d || '/' || i
FROM generate_series(0,99) AS d, generate_series(0,99) AS i;

INSERT INTO followups (id, dt_created, dt_modified, t_description, n_issue)
SELECT
    (i.id * 10) + d,
    '2008-01-01'::date + ((i.id + d) * '1 day'::interval),
    '2008-01-01'::date + ((i.id + d) * '1 day'::interval),
    'followup description ' || ((i.id * 10) + d),
    i.id
FROM generate_series(0,9) AS d, issues AS i;

ALTER TABLE issues ADD PRIMARY KEY (id);
ALTER TABLE followups ADD PRIMARY KEY (id);
ALTER TABLE followups ADD CONSTRAINT n_issue_fkey
    FOREIGN KEY (n_issue) REFERENCES issues (id);

CREATE INDEX issues_dt_idx ON issues (dt_modified);
CREATE INDEX followups_nissue_dt_idx ON followups (n_issue, dt_modified);

ANALYSE;

EXPLAIN ANALYSE
SELECT fu.*
FROM issues i, followups fu
WHERE i.dt_modified = '2008-01-07 00:00:00+00'
  AND fu.id = (
      SELECT f.id
      FROM followups f
      WHERE f.n_issue = i.id
      ORDER BY f.dt_modified DESC
      LIMIT 1
  );

ROLLBACK;
Cheers for this Richard. The more I think about it, I believe the join is being made against ALL issues and followups first and then filtered by my WHERE clause conditions afterwards. This would incur a scan against all 15,000 issues and 95,000 followups. Set theory tells me that I should not use the entire issue table but rather the subset of interest and then join it to the followup table, instead of joining the two tables and then filtering the results. I was under the impression that the PostgreSQL optimizer would have done this logically by itself. Could it have something to do with the fact that it is a subquery and thus the planner cannot deduce filtering conditions from the outer query against it? My apologies if that made no sense.

In summary, what I'm trying to understand is the following: Will there be a performance difference between filtering query sets first and then joining them together, as opposed to joining first and then filtering? Does the optimiser not choose the best course of action either way, yielding the same result?

This might be a stupid question.
Sebastian
Sebastian Ritter wrote:
> Could it have something to do with the fact that it is a subquery and
> thus the planner cannot deduce filtering conditions from the outer query
> against it? My apologies if that made no sense.

Could make a difference.

> In summary, what I'm trying to understand is the following: Will there be
> a performance difference between filtering query sets first and then
> joining them together, as opposed to joining first and then filtering?
> Does the optimiser not choose the best course of action either way,
> yielding the same result?

There obviously is a performance difference between joining all of the
issues table versus joining 1% of it to followups.

In most cases the planner can push the condition into the subquery, but
not in all cases because:
1. It's not provably correct to do so
2. The planner isn't smart enough to figure out that it can

It's impossible to say which applies to you without knowing the full query.

--
Richard Huxton
Archonet Ltd
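[For reference, a sketch of the "filter first, then join" shape the thread converges on. The full report query was never posted, so the surrounding structure and the date bounds are assumptions; the inner lookup mirrors the indexed ORDER BY ... LIMIT 1 form from Richard's test script, which the (n_issue, dt_modified) index supports.]

-- Restrict issues to the day of interest first, then fetch only the
-- latest followup for each of those issues.
SELECT fu.*
FROM (
    SELECT id
    FROM manage_issue
    WHERE dt_modified >= DATE '2008-11-10'      -- illustrative bounds
      AND dt_modified <  DATE '2008-11-11'
) AS day_issues,
     manage_followup fu
WHERE fu.id = (
    SELECT f.id
    FROM manage_followup f
    WHERE f.n_issue = day_issues.id
    ORDER BY f.dt_modified DESC
    LIMIT 1
);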
Cheers for your help guys. Having filtered and then joined has substantially reduced the run time.

Much obliged,
Sebastian