Can query planner prefer a JOIN over a high-cost Function? - Mailing list pgsql-performance

From David McNett
Subject Can query planner prefer a JOIN over a high-cost Function?
Date
Msg-id 20130820180651.GA72756@macnugget.org
Whole thread Raw
Responses Re: Can query planner prefer a JOIN over a high-cost Function?  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-performance
I've got a normalized data table from which I'm trying to select a small
subset of rows determined by both traditional filtering as well as the
result of a cpu-expensive function that I've defined.  What I'm seeing
is that the query planner always attempts to defer the de-normalizing
JOIN over the function which causes the query to be much slower than it
would be if the JOIN were performed (for filtering) before the function
is run on the rows.

Is there any way for me to influence the query planner so that it can
know that the JOIN is far less expensive than the function for planning?
The COST attribute on the function appears to have no effect.

I'm testing on:
  PostgreSQL 9.2.4 on amd64-portbld-freebsd9.1, compiled by cc (GCC)
  4.2.1 20070831 patched [FreeBSD], 64-bit

Here's a synthetic example which demonstrates the issue.  A very simple
table with normalized codes in a secondary table.

  CREATE TABLE codes (
    code_id integer NOT NULL,
    code varchar NOT NULL,
    PRIMARY KEY(code_id)
  );

  INSERT INTO codes(code_id,code) SELECT 1,'one';
  INSERT INTO codes(code_id,code) SELECT 2,'two';
  INSERT INTO codes(code_id,code) SELECT 3,'three';
  INSERT INTO codes(code_id,code) SELECT 4,'four';
  INSERT INTO codes(code_id,code) SELECT 5,'five';

  CREATE TABLE examples (
    example_id serial NOT NULL,
    code_id integer NOT NULL REFERENCES codes(code_id),
    value varchar,
    PRIMARY KEY(example_id)
  );

  INSERT INTO examples (code_id,value) SELECT 1,'een';
  INSERT INTO examples (code_id,value) SELECT 2,'koe';
  INSERT INTO examples (code_id,value) SELECT 3,'doet';
  INSERT INTO examples (code_id,value) SELECT 4,'boe';

And a de-normalizing view for access:

  CREATE VIEW examples_view AS
    SELECT e.*,c.code FROM examples e LEFT JOIN codes c USING (code_id);


And a user-defined function which is painfully slow to run:

  CREATE FUNCTION painfully_slow_function(id integer,value varchar) RETURNS boolean AS $$
    BEGIN
        RAISE NOTICE 'Processing ID % (%)',id,value;
        PERFORM pg_sleep(10);
        RETURN TRUE;
    END;
$$ LANGUAGE plpgsql;

A simple SELECT not trying to de-normalize the data, only involving the basa
table does what we'd hope. Note that the function is only run on the code_id matching row because the planner rightly
filterson that first: 

[nugget@[local]|costtest] > EXPLAIN (ANALYZE,BUFFERS) SELECT * FROM examples WHERE code_id = 3 AND
painfully_slow_function(example_id,value)IS TRUE; 
NOTICE:  Processing ID 3 (doet)
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Seq Scan on examples  (cost=0.00..314.50 rows=2 width=40) (actual time=10010.925..10010.929 rows=1 loops=1)
   Filter: ((code_id = 3) AND (painfully_slow_function(example_id, value) IS TRUE))
   Rows Removed by Filter: 3
   Buffers: shared hit=2 read=2
 Total runtime: 10010.948 ms
(5 rows)

Time: 10011.328 ms

However, if the SELECT instead uses the VIEW which de-normalizes the data
the query planner defers the join and the result is running the function on
all rows in the table:

[nugget@[local]|costtest] > EXPLAIN (ANALYZE,BUFFERS) SELECT * FROM examples_view WHERE code = 'three' AND
painfully_slow_function(example_id,value)IS TRUE; 
NOTICE:  Processing ID 1 (een)
NOTICE:  Processing ID 2 (koe)
NOTICE:  Processing ID 3 (doet)
NOTICE:  Processing ID 4 (boe)
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=25.45..338.52 rows=2 width=72) (actual time=30053.776..40073.772 rows=1 loops=1)
   Hash Cond: (e.code_id = c.code_id)
   Buffers: shared hit=2
   ->  Seq Scan on examples e  (cost=0.00..311.60 rows=387 width=40) (actual time=10013.765..40073.708 rows=4 loops=1)
         Filter: (painfully_slow_function(example_id, value) IS TRUE)
         Buffers: shared hit=1
   ->  Hash  (cost=25.38..25.38 rows=6 width=36) (actual time=0.019..0.019 rows=1 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 1kB
         Buffers: shared hit=1
         ->  Seq Scan on codes c  (cost=0.00..25.38 rows=6 width=36) (actual time=0.013..0.014 rows=1 loops=1)
               Filter: ((code)::text = 'three'::text)
               Rows Removed by Filter: 4
               Buffers: shared hit=1
 Total runtime: 40073.813 ms
(14 rows)

Time: 40074.363 ms


Even if I juke the COST on the function to crank it up to a ridiculous execution cost, the query planner doesn't seem
tochange ( also at http://explain.depesz.com/s/WEh ): 

[nugget@[local]|costtest] > ALTER FUNCTION painfully_slow_function(integer,varchar) COST 2147483647;
ALTER FUNCTION
Time: 1.637 ms
[nugget@[local]|costtest] > EXPLAIN (ANALYZE,BUFFERS) SELECT * FROM examples_view WHERE code = 'three' AND
painfully_slow_function(example_id,value)IS TRUE; 
NOTICE:  Processing ID 1 (een)
NOTICE:  Processing ID 2 (koe)
NOTICE:  Processing ID 3 (doet)
NOTICE:  Processing ID 4 (boe)
                                                          QUERY PLAN
      

------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..6227702661.02 rows=2 width=72) (actual time=30056.425..40076.436 rows=1 loops=1)
   Join Filter: (e.code_id = c.code_id)
   Rows Removed by Join Filter: 3
   Buffers: shared hit=5
   ->  Seq Scan on examples e  (cost=0.00..6227702600.80 rows=387 width=40) (actual time=10016.458..40076.370 rows=4
loops=1)
         Filter: (painfully_slow_function(example_id, value) IS TRUE)
         Buffers: shared hit=4
   ->  Materialize  (cost=0.00..25.41 rows=6 width=36) (actual time=0.005..0.007 rows=1 loops=4)
         Buffers: shared hit=1
         ->  Seq Scan on codes c  (cost=0.00..25.38 rows=6 width=36) (actual time=0.011..0.012 rows=1 loops=1)
               Filter: ((code)::text = 'three'::text)
               Rows Removed by Filter: 4
               Buffers: shared hit=1
 Total runtime: 40076.475 ms
(14 rows)

Is there any COST level where PostgreSQL will properly determine that the
join is less expensive than the function?  Or, is there another knob that I
can turn which will influence the query planner in this way?

Thanks!  I hope I've just missed something obvious in the documentation.


pgsql-performance by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: queries with DISTINCT / GROUP BY giving different plans
Next
From: Tom Lane
Date:
Subject: Re: Can query planner prefer a JOIN over a high-cost Function?