Thread: Performance problems with multiple layers of functions
Hi there. I have hit an edge case in the planner and I hope you can help. The system uses a lot of stored procedures to move as much of the intelligence into the database layer as possible. My (development) query looks like this and runs reasonably fast:

explain analyze select dataset_id, entity, sum(amount)
from entrydata_current
where flow_direction in (select * from outflow_direction(dataset_id))
and dataset_id in (122,112,125,89,111)
group by dataset_id, entity;

                                                                           QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=918171.00..918171.30 rows=24 width=19) (actual time=11533.297..11533.340 rows=50 loops=1)
   ->  Bitmap Heap Scan on entrydata_current  (cost=676.72..917736.04 rows=57994 width=19) (actual time=23.921..11425.373 rows=37870 loops=1)
         Recheck Cond: ((dataset_id = 122) OR (dataset_id = 112) OR (dataset_id = 125) OR (dataset_id = 89) OR (dataset_id = 111))
         Filter: (subplan)
         ->  BitmapOr  (cost=676.72..676.72 rows=117633 width=0) (actual time=15.765..15.765 rows=0 loops=1)
               ->  Bitmap Index Scan on entrydata_current_dataset_idx  (cost=0.00..83.97 rows=14563 width=0) (actual time=1.881..1.881 rows=13728 loops=1)
                     Index Cond: (dataset_id = 122)
               ->  Bitmap Index Scan on entrydata_current_dataset_idx  (cost=0.00..156.12 rows=27176 width=0) (actual time=3.508..3.508 rows=25748 loops=1)
                     Index Cond: (dataset_id = 112)
               ->  Bitmap Index Scan on entrydata_current_dataset_idx  (cost=0.00..124.24 rows=21498 width=0) (actual time=2.729..2.729 rows=20114 loops=1)
                     Index Cond: (dataset_id = 125)
               ->  Bitmap Index Scan on entrydata_current_dataset_idx  (cost=0.00..102.20 rows=17771 width=0) (actual time=2.351..2.351 rows=17344 loops=1)
                     Index Cond: (dataset_id = 89)
               ->  Bitmap Index Scan on entrydata_current_dataset_idx  (cost=0.00..210.19 rows=36625 width=0) (actual time=5.292..5.292 rows=37118 loops=1)
                     Index Cond: (dataset_id = 111)
         SubPlan
           ->  Function Scan on outflow_direction  (cost=0.00..12.50 rows=1000 width=4) (actual time=0.093..0.095 rows=4 loops=114052)
 Total runtime: 11540.506 ms
(18 rows)

The problem is that the application should not need to know the five dataset_ids (it will always know one - its own). So I made a function to return the five ids, and then the query looks like:

explain select dataset_id, entity, sum(amount)
from entrydata_current
where flow_direction in (select * from outflow_direction(dataset_id))
and dataset_id in (select * from get_dataset_ids(122))
group by dataset_id, entity;

                                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=24672195.68..24672203.88 rows=656 width=19)
   ->  Hash IN Join  (cost=15.00..24660005.45 rows=1625364 width=19)
         Hash Cond: ("outer".dataset_id = "inner".get_dataset_ids)
         ->  Index Scan using entrydata_current_dataset_idx on entrydata_current  (cost=0.00..24558405.20 rows=1625364 width=19)
               Filter: (subplan)
               SubPlan
                 ->  Function Scan on outflow_direction  (cost=0.00..12.50 rows=1000 width=4)
         ->  Hash  (cost=12.50..12.50 rows=1000 width=4)
               ->  Function Scan on get_dataset_ids  (cost=0.00..12.50 rows=1000 width=4)
(9 rows)

which does not return within 10 minutes - which is unacceptable.

Is there any way to get a better plan for the second query? The planner should really see the two queries as equal, as there are no dependencies between the outer query and get_dataset_ids (isn't it called constant folding?).

Thanks in advance

Svenne
On Fri, Mar 24, 2006 at 01:49:17PM +0100, Svenne Krap wrote:
> explain select dataset_id, entity, sum(amount) from entrydata_current
> where flow_direction in (select * from outflow_direction(dataset_id))
> and dataset_id in (select * from get_dataset_ids(122)) group by
> dataset_id, entity;
<snip>
> which does not return within 10 minutes - which is unacceptable.

The issue is that the planner has no way to know what's coming back from get_dataset_ids. I think your best bet will be to wrap that select into its own function and have that function prepare the query statement, going back to hard-coded values. So you could do something like:

SQL := 'SELECT ... AND dataset_id IN (''' || get_dataset_ids(122) || ''');'

(yeah, I know that won't work as written, but you get the idea).
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
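Jim's idea - expand the ids into literal constants so the planner sees them at plan time - can be sketched as a plpgsql set-returning function that builds the statement dynamically. This is only a sketch under assumptions: the function name sum_by_entity and its return columns are made up, and it uses modern syntax (string_agg, RETURN QUERY EXECUTE); on an 8.1-era server you would concatenate the list in a FOR loop and emit rows with RETURN NEXT instead.

```sql
-- Hypothetical wrapper: expand get_dataset_ids() into a literal IN list
-- so the planner works with constants instead of an opaque function.
CREATE OR REPLACE FUNCTION sum_by_entity(p_dataset_id int)
RETURNS TABLE (dataset_id int, entity text, total numeric)
LANGUAGE plpgsql AS $$
DECLARE
    id_list text;
BEGIN
    -- Flatten the function result into a comma-separated literal list.
    SELECT string_agg(id::text, ',') INTO id_list
    FROM get_dataset_ids(p_dataset_id) AS id;

    -- Plan the query with the ids hard-coded, as in the original fast form.
    RETURN QUERY EXECUTE
        'SELECT dataset_id, entity, sum(amount)
           FROM entrydata_current
          WHERE flow_direction IN (SELECT * FROM outflow_direction(dataset_id))
            AND dataset_id IN (' || id_list || ')
          GROUP BY dataset_id, entity';
END;
$$;
```

Note that EXECUTE plans the statement at run time, so each call re-plans with the current literal ids - which is exactly the point here.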
"Jim C. Nasby" <jnasby@pervasive.com> writes:
> On Fri, Mar 24, 2006 at 01:49:17PM +0100, Svenne Krap wrote:
>> explain select dataset_id, entity, sum(amount) from entrydata_current
>> where flow_direction in (select * from outflow_direction(dataset_id))
>> and dataset_id in (select * from get_dataset_ids(122)) group by
>> dataset_id, entity;

> The issue is that the planner has no way to know what's coming back
> from get_dataset_ids.

More specifically, the first IN is not optimizable into a join because the results of the sub-SELECT depend on the current row of the outer query. The second IN is being optimized fine, but the first one is what's killing you.

I'd suggest refactoring the functions into something that returns a set of outflow_direction/dataset_id pairs, and then phrasing the query as

    where (flow_direction, dataset_id) in (select * from new_func(122))

You could do it without refactoring:

    where (flow_direction, dataset_id) in
          (select outflow_direction(id), id from get_dataset_ids(122) id)

however this won't work if outflow_direction() is a plpgsql function, because of limitations in plpgsql's set-function support. (It will work if outflow_direction() is a SQL function, or you could kluge it as a SQL function wrapper around a plpgsql function.)

			regards, tom lane
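Tom's refactoring can be sketched as a single set-returning SQL function that emits the pairs. The body below is an assumption about how outflow_direction and get_dataset_ids compose (the thread never shows their definitions), and the type and function names flow_pair/new_func are illustrative:

```sql
-- Hypothetical pair-returning function, following Tom's suggestion.
-- Assumes get_dataset_ids returns SETOF int and outflow_direction
-- returns SETOF int for a given dataset id.
CREATE TYPE flow_pair AS (flow_direction int, dataset_id int);

CREATE OR REPLACE FUNCTION new_func(p_dataset_id int)
RETURNS SETOF flow_pair
LANGUAGE sql STABLE AS $$
    -- Set-returning function in the select list, per Tom's unrefactored form.
    SELECT outflow_direction(id) AS flow_direction, id AS dataset_id
    FROM get_dataset_ids(p_dataset_id) AS id;
$$;

-- The query then becomes a single optimizable multi-column IN:
SELECT dataset_id, entity, sum(amount)
FROM entrydata_current
WHERE (flow_direction, dataset_id) IN (SELECT * FROM new_func(122))
GROUP BY dataset_id, entity;
```

Because the subquery no longer references the outer row, the planner can turn the IN into a join, as it did for get_dataset_ids alone.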
Tom Lane wrote:
> where (flow_direction, dataset_id) in (select * from new_func(122))

Is this form of multi-column IN mentioned anywhere in the docs? I can't find it.

Svenne
Svenne Krap <svenne@krap.dk> writes:
> Tom Lane wrote:
>> where (flow_direction, dataset_id) in (select * from new_func(122))

> Is this form of multi-column IN mentioned anywhere in the docs? I can't
> find it.

Sure, look under "Subquery Expressions". 8.0 and later refer to it as a row_constructor, but it's documented at least as far back as 7.3.

			regards, tom lane
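For readers who, like Svenne, haven't met the row-constructor form before, a minimal generic illustration (table and column names here are invented, not from the thread):

```sql
-- Multi-column IN: the row constructor (o.customer_id, o.region) is
-- compared as a whole against each row produced by the subquery.
SELECT *
FROM orders o
WHERE (o.customer_id, o.region) IN
      (SELECT customer_id, region FROM preferred_customers);
```

This is equivalent to a semi-join on both columns at once, which is what lets the planner optimize it.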
Whoa!

bookmark_delta contains very few rows but is inserted/deleted very often... the effect is spectacular! I guess I'll have to vacuum analyze this table every minute...

annonces=# EXPLAIN ANALYZE SELECT id, priority FROM annonces WHERE id IN (SELECT annonce_id FROM bookmark_delta);
                                                        QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Hash IN Join  (cost=32.12..8607.08 rows=1770 width=6) (actual time=387.011..387.569 rows=1 loops=1)
   Hash Cond: ("outer".id = "inner".annonce_id)
   ->  Seq Scan on annonces  (cost=0.00..7796.00 rows=101500 width=6) (actual time=0.022..164.369 rows=101470 loops=1)
   ->  Hash  (cost=27.70..27.70 rows=1770 width=4) (actual time=0.013..0.013 rows=5 loops=1)
         ->  Seq Scan on bookmark_delta  (cost=0.00..27.70 rows=1770 width=4) (actual time=0.004..0.010 rows=5 loops=1)
 Total runtime: 387.627 ms
(6 lignes)

annonces=# EXPLAIN ANALYZE SELECT id, priority FROM annonces a, (SELECT annonce_id FROM bookmark_delta GROUP BY annonce_id) foo WHERE a.id=foo.annonce_id;
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=32.12..10409.31 rows=1770 width=6) (actual time=0.081..0.084 rows=1 loops=1)
   ->  HashAggregate  (cost=32.12..49.83 rows=1770 width=4) (actual time=0.038..0.040 rows=1 loops=1)
         ->  Seq Scan on bookmark_delta  (cost=0.00..27.70 rows=1770 width=4) (actual time=0.024..0.027 rows=5 loops=1)
   ->  Index Scan using annonces_pkey on annonces a  (cost=0.00..5.83 rows=1 width=6) (actual time=0.039..0.040 rows=1 loops=1)
         Index Cond: (a.id = "outer".annonce_id)
 Total runtime: 0.163 ms
(6 lignes)

annonces=# vacuum bookmark_delta;
VACUUM
annonces=# EXPLAIN ANALYZE SELECT id, priority FROM annonces WHERE id IN (SELECT annonce_id FROM bookmark_delta);
                                                        QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Hash IN Join  (cost=32.12..8607.08 rows=1770 width=6) (actual time=195.284..196.063 rows=1 loops=1)
   Hash Cond: ("outer".id = "inner".annonce_id)
   ->  Seq Scan on annonces  (cost=0.00..7796.00 rows=101500 width=6) (actual time=0.014..165.626 rows=101470 loops=1)
   ->  Hash  (cost=27.70..27.70 rows=1770 width=4) (actual time=0.008..0.008 rows=2 loops=1)
         ->  Seq Scan on bookmark_delta  (cost=0.00..27.70 rows=1770 width=4) (actual time=0.003..0.004 rows=2 loops=1)
 Total runtime: 196.122 ms
(6 lignes)

annonces=# vacuum analyze bookmark_delta;
VACUUM
annonces=# EXPLAIN ANALYZE SELECT id, priority FROM annonces WHERE id IN (SELECT annonce_id FROM bookmark_delta);
                                                           QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=1.02..6.88 rows=1 width=6) (actual time=0.025..0.027 rows=1 loops=1)
   ->  HashAggregate  (cost=1.02..1.03 rows=1 width=4) (actual time=0.011..0.012 rows=1 loops=1)
         ->  Seq Scan on bookmark_delta  (cost=0.00..1.02 rows=2 width=4) (actual time=0.004..0.006 rows=2 loops=1)
   ->  Index Scan using annonces_pkey on annonces  (cost=0.00..5.83 rows=1 width=6) (actual time=0.009..0.010 rows=1 loops=1)
         Index Cond: (annonces.id = "outer".annonce_id)
 Total runtime: 0.104 ms
(6 lignes)
On 24.03.2006, at 23:54, PFC wrote:

> bookmark_delta contains very few rows but is inserted/deleted very
> often... the effect is spectacular!
> I guess I'll have to vacuum analyze this table every minute...

What about using autovacuum?

cug
--
PharmaLine, Essen, GERMANY
Software and Database Development
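For context: on the 8.1-era servers in this thread, autovacuum is built in but disabled by default, and it needs row-level statistics collection. A minimal postgresql.conf sketch (the numeric values below are illustrative, not recommendations from the thread):

```
# postgresql.conf - enabling integrated autovacuum (8.1-era syntax)
stats_start_collector = on    # statistics collector must be running
stats_row_level      = on     # row-level stats are required by autovacuum
autovacuum           = on
autovacuum_naptime   = 60     # seconds between checks of each database
```

With per-table thresholds, a small hot table like bookmark_delta gets vacuumed and analyzed as soon as enough rows have changed, without a cron-driven "vacuum analyze every minute".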