Thread: Mysterious performance of query because of plsql function in where condition

Mysterious performance of query because of plsql function in where condition

From
"Peter Alberer"
Date:
Hi there,

i have a problem with a query that uses the result of a plsql function
In
the where clause:

SELECT
   assignments.assignment_id,
   assignments.package_id AS package_id,
   assignments.title AS title,
   COUNT(*) AS Count
FROM
   assignments INNER JOIN submissions ON
   (assignments.assignment_id=submissions.assignment_id)
WHERE
   package_id=949589 AND
   submission_status(submissions.submission_id)='closed'
GROUP BY
   assignments.assignment_id, assignments.package_id, assignments.title
ORDER BY
   assignments.title;

Postgres seems to execute the function "submission_status" for every row
of
the submissions table (~1500 rows). The query therefore takes quite a
lot
time, although in fact no row is returned from the assignments table
when
the condition package_id=949589 is used.

                                                          QUERY PLAN
------------------------------------------------------------------------
---
---------------------------------------------------
 Sort  (cost=41.21..41.21 rows=1 width=35) (actual
time=4276.978..4276.978
rows=0 loops=1)
   Sort Key: assignments.title
   ->  HashAggregate  (cost=41.19..41.20 rows=1 width=35) (actual
time=4276.970..4276.970 rows=0 loops=1)
         ->  Hash Join  (cost=2.40..41.18 rows=1 width=35) (actual
time=4276.966..4276.966 rows=0 loops=1)
               Hash Cond: ("outer".assignment_id =
"inner".assignment_id)
               ->  Seq Scan on submissions  (cost=0.00..38.73 rows=9
width=4) (actual time=10.902..4276.745 rows=38 loops=1)
                     Filter: (submission_status(submission_id) =
'closed'::text)
               ->  Hash  (cost=2.40..2.40 rows=2 width=35) (actual
time=0.058..0.058 rows=0 loops=1)
                     ->  Seq Scan on assignments  (cost=0.00..2.40
rows=2
width=35) (actual time=0.015..0.052 rows=2 loops=1)
                           Filter: (package_id = 949589)
 Total runtime: 4277.078 ms
(11 rows)

I therefore tried to rephrase the query, to make sure that the function
is
only used for the rows returned by the join but not even the following
does
help (the subselect t1 does not return a single row):

select * from (
    SELECT
      a.assignment_id, a.package_id, a.title, s.submission_id,
      COUNT(*) AS Count
    FROM
      assignments a INNER JOIN submissions s ON
(a.assignment_id=s.assignment_id)
    WHERE
        a.package_id=949589
    GROUP BY
        a.assignment_id, a.package_id, a.title, s.submission_id
) t1
where
   submission_status(t1.submission_id)='closed'
order by
   title;

                                                              QUERY PLAN
------------------------------------------------------------------------
---
-----------------------------------------------------------
 Sort  (cost=41.21..41.22 rows=1 width=188) (actual
time=4114.251..4114.251
rows=0 loops=1)
   Sort Key: title
   ->  Subquery Scan t1  (cost=41.20..41.20 rows=1 width=188) (actual
time=4114.242..4114.242 rows=0 loops=1)
         ->  HashAggregate  (cost=41.20..41.20 rows=1 width=39) (actual
time=4114.238..4114.238 rows=0 loops=1)
               ->  Hash Join  (cost=2.40..41.18 rows=1 width=39) (actual
time=4114.235..4114.235 rows=0 loops=1)
                     Hash Cond: ("outer".assignment_id =
"inner".assignment_id)
                     ->  Seq Scan on submissions s  (cost=0.00..38.73
rows=9 width=8) (actual time=7.179..4113.984 rows=38 loops=1)
                           Filter: (submission_status(submission_id) =
'closed'::text)
                     ->  Hash  (cost=2.40..2.40 rows=2 width=35) (actual
time=0.100..0.100 rows=0 loops=1)
                           ->  Seq Scan on assignments a
(cost=0.00..2.40
rows=2 width=35) (actual time=0.045..0.094 rows=2 loops=1)
                                 Filter: (package_id = 949589)
 Total runtime: 4114.356 ms
(12 rows)

The function is nevertheless executed for every row in the submissions
table. A simple "select *, submission_status(submission_id) from
submissions" takes about the same time as the 2 queries stated above.

The whole database has been vacuum analysed right before the explain
analyse output has been captured.

What can I do to reduce the time this query takes? And why is the
function
executed although there is no row in the result set of t1 in my
rephrased
query?

TIA, peter

Ps: table definitions:

                      Table "public.assignments"
    Column     |            Type             |       Modifiers
---------------+-----------------------------+------------------------
 assignment_id | integer                     | not null
 title         | character varying(100)      | not null
 max_grade     | smallint                    | not null
 start_date    | timestamp without time zone | not null default now()
 end_date      | timestamp without time zone | not null
 over_due_date | timestamp without time zone |
 score_release | smallint                    | not null default 1
 package_id    | integer                     | not null
 cal_item_id   | integer                     |
Indexes:
    "assignments_pk" primary key, btree (assignment_id)
Check constraints:
    "assignments_sr_ck" CHECK (score_release = 1 OR score_release = 2 OR
score_release = 3)
Foreign-key constraints:
    "cal_item_id" FOREIGN KEY (cal_item_id) REFERENCES
cal_items(cal_item_id) ON DELETE SET NULL
    "package_id_fk" FOREIGN KEY (package_id) REFERENCES
apm_packages(package_id)
    "assignment_id_fk" FOREIGN KEY (assignment_id) REFERENCES
acs_objects(object_id) ON DELETE CASCADE

               Table "public.submissions"
    Column     |            Type             | Modifiers
---------------+-----------------------------+-----------
 submission_id | integer                     | not null
 person_id     | integer                     | not null
 assignment_id | integer                     | not null
 last_modified | timestamp without time zone | not null
 recovery_date | timestamp without time zone |
 grading       | smallint                    |
 grading_date  | timestamp without time zone |
Indexes:
    "submissions_pk" primary key, btree (submission_id)
    "submissions_person_ass_un" unique, btree (person_id, assignment_id)
Foreign-key constraints:
    "assignment_id_fk" FOREIGN KEY (assignment_id) REFERENCES
assignments(assignment_id)
    "person_id_fk" FOREIGN KEY (person_id) REFERENCES persons(person_id)

--
peter.alberer@wu-wien.ac.at  Tel: +43/1/31336/4341
Abteilung für Wirtschaftsinformatik, Wirtschaftsuniversitaet Wien,
Austria



On Jul 2, 2004, at 3:48 AM, Peter Alberer wrote:
>
> Postgres seems to execute the function "submission_status" for every
> row
> of
> the submissions table (~1500 rows). The query therefore takes quite a
> lot
> time, although in fact no row is returned from the assignments table
> when
> the condition package_id=949589 is used.
>

Well, you need to think of it this way - PG has no idea what the
function does so it treats it as a "black box" - thus it has to run it
for each row to see what evaluates too - especially since it is in a
where clause.

If you really want a function there you can use a SQL function instead
of plpgsql - PG has smart enough to push that function up into your
query and let the optimizer look at the whole thing.

You can also take a look at the various flags you can use while
creating functions such as immutable, strict, etc. they can help

--
Jeff Trout <jeff@jefftrout.com>
http://www.jefftrout.com/
http://www.stuarthamm.net/


On Fri, Jul 02, 2004 at 09:48:48 +0200,
  Peter Alberer <h9351252@obelix.wu-wien.ac.at> wrote:
>
> Postgres seems to execute the function "submission_status" for every row
> of
> the submissions table (~1500 rows). The query therefore takes quite a
> lot
> time, although in fact no row is returned from the assignments table
> when
> the condition package_id=949589 is used.

If submission_status is invertable you might want to create the
inverse function, mark it immutable and call it with 'closed'.
That would allow the optimizer to compare submissions.submission_id
to a constant.

Another option would be be to create an index on
submission_status(submissions.submission_id).

hi,

Peter Alberer wrote:

> Hi there,
>
> i have a problem with a query that uses the result of a plsql function
> In
> the where clause:
>
> SELECT
>    assignments.assignment_id,
>    assignments.package_id AS package_id,
>    assignments.title AS title,
>    COUNT(*) AS Count
> FROM
>    assignments INNER JOIN submissions ON
>    (assignments.assignment_id=submissions.assignment_id)
> WHERE
>    package_id=949589 AND
>    submission_status(submissions.submission_id)='closed'
> GROUP BY
>    assignments.assignment_id, assignments.package_id, assignments.title
> ORDER BY
>    assignments.title;
>
> Postgres seems to execute the function "submission_status" for every row
> of
> the submissions table (~1500 rows).

what is submission_status actualy?
\df submission_status
Is the function submission_status  called stable?

C.