Can query planner prefer a JOIN over a high-cost Function?

From: David McNett
I've got a normalized data table from which I'm trying to select a small
subset of rows determined both by traditional filtering and by the result
of a CPU-expensive function that I've defined.  What I'm seeing is that
the query planner always defers the de-normalizing JOIN until after the
function, which makes the query much slower than it would be if the JOIN
were performed (for filtering) before the function is run on the rows.

Is there any way for me to influence the query planner so that it can
know that the JOIN is far less expensive than the function for planning?
The COST attribute on the function appears to have no effect.

I'm testing on:
  PostgreSQL 9.2.4 on amd64-portbld-freebsd9.1, compiled by cc (GCC)
  4.2.1 20070831 patched [FreeBSD], 64-bit

Here's a synthetic example which demonstrates the issue: a very simple
table whose codes are normalized out into a secondary table.

  CREATE TABLE codes (
    code_id integer NOT NULL,
    code varchar NOT NULL,
    PRIMARY KEY(code_id)
  );

  INSERT INTO codes(code_id,code) SELECT 1,'one';
  INSERT INTO codes(code_id,code) SELECT 2,'two';
  INSERT INTO codes(code_id,code) SELECT 3,'three';
  INSERT INTO codes(code_id,code) SELECT 4,'four';
  INSERT INTO codes(code_id,code) SELECT 5,'five';

  CREATE TABLE examples (
    example_id serial NOT NULL,
    code_id integer NOT NULL REFERENCES codes(code_id),
    value varchar,
    PRIMARY KEY(example_id)
  );

  INSERT INTO examples (code_id,value) SELECT 1,'een';
  INSERT INTO examples (code_id,value) SELECT 2,'koe';
  INSERT INTO examples (code_id,value) SELECT 3,'doet';
  INSERT INTO examples (code_id,value) SELECT 4,'boe';

And a de-normalizing view for access:

  CREATE VIEW examples_view AS
    SELECT e.*,c.code FROM examples e LEFT JOIN codes c USING (code_id);


And a user-defined function which is painfully slow to run:

  CREATE FUNCTION painfully_slow_function(id integer,value varchar) RETURNS boolean AS $$
    BEGIN
        RAISE NOTICE 'Processing ID % (%)',id,value;
        PERFORM pg_sleep(10);
        RETURN TRUE;
    END;
$$ LANGUAGE plpgsql;

A simple SELECT that doesn't try to de-normalize the data and involves
only the base table does what we'd hope.  Note that the function is run
only on the row matching code_id, because the planner rightly filters on
that first:

[nugget@[local]|costtest] > EXPLAIN (ANALYZE,BUFFERS) SELECT * FROM examples WHERE code_id = 3 AND painfully_slow_function(example_id, value) IS TRUE;
NOTICE:  Processing ID 3 (doet)
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Seq Scan on examples  (cost=0.00..314.50 rows=2 width=40) (actual time=10010.925..10010.929 rows=1 loops=1)
   Filter: ((code_id = 3) AND (painfully_slow_function(example_id, value) IS TRUE))
   Rows Removed by Filter: 3
   Buffers: shared hit=2 read=2
 Total runtime: 10010.948 ms
(5 rows)

Time: 10011.328 ms

However, if the SELECT instead goes through the VIEW that de-normalizes
the data, the query planner defers the join and the function ends up
running on every row in the table:

[nugget@[local]|costtest] > EXPLAIN (ANALYZE,BUFFERS) SELECT * FROM examples_view WHERE code = 'three' AND painfully_slow_function(example_id, value) IS TRUE;
NOTICE:  Processing ID 1 (een)
NOTICE:  Processing ID 2 (koe)
NOTICE:  Processing ID 3 (doet)
NOTICE:  Processing ID 4 (boe)
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=25.45..338.52 rows=2 width=72) (actual time=30053.776..40073.772 rows=1 loops=1)
   Hash Cond: (e.code_id = c.code_id)
   Buffers: shared hit=2
   ->  Seq Scan on examples e  (cost=0.00..311.60 rows=387 width=40) (actual time=10013.765..40073.708 rows=4 loops=1)
         Filter: (painfully_slow_function(example_id, value) IS TRUE)
         Buffers: shared hit=1
   ->  Hash  (cost=25.38..25.38 rows=6 width=36) (actual time=0.019..0.019 rows=1 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 1kB
         Buffers: shared hit=1
         ->  Seq Scan on codes c  (cost=0.00..25.38 rows=6 width=36) (actual time=0.013..0.014 rows=1 loops=1)
               Filter: ((code)::text = 'three'::text)
               Rows Removed by Filter: 4
               Buffers: shared hit=1
 Total runtime: 40073.813 ms
(14 rows)

Time: 40074.363 ms


Even if I crank the COST on the function up to a ridiculous execution
cost, the query planner doesn't seem to change its plan (also at
http://explain.depesz.com/s/WEh):

[nugget@[local]|costtest] > ALTER FUNCTION painfully_slow_function(integer,varchar) COST 2147483647;
ALTER FUNCTION
Time: 1.637 ms
[nugget@[local]|costtest] > EXPLAIN (ANALYZE,BUFFERS) SELECT * FROM examples_view WHERE code = 'three' AND painfully_slow_function(example_id, value) IS TRUE;
NOTICE:  Processing ID 1 (een)
NOTICE:  Processing ID 2 (koe)
NOTICE:  Processing ID 3 (doet)
NOTICE:  Processing ID 4 (boe)
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..6227702661.02 rows=2 width=72) (actual time=30056.425..40076.436 rows=1 loops=1)
   Join Filter: (e.code_id = c.code_id)
   Rows Removed by Join Filter: 3
   Buffers: shared hit=5
   ->  Seq Scan on examples e  (cost=0.00..6227702600.80 rows=387 width=40) (actual time=10016.458..40076.370 rows=4 loops=1)
         Filter: (painfully_slow_function(example_id, value) IS TRUE)
         Buffers: shared hit=4
   ->  Materialize  (cost=0.00..25.41 rows=6 width=36) (actual time=0.005..0.007 rows=1 loops=4)
         Buffers: shared hit=1
         ->  Seq Scan on codes c  (cost=0.00..25.38 rows=6 width=36) (actual time=0.011..0.012 rows=1 loops=1)
               Filter: ((code)::text = 'three'::text)
               Rows Removed by Filter: 4
               Buffers: shared hit=1
 Total runtime: 40076.475 ms
(14 rows)

Is there any COST level where PostgreSQL will properly determine that the
join is less expensive than the function?  Or, is there another knob that I
can turn which will influence the query planner in this way?

Thanks!  I hope I've just missed something obvious in the documentation.


Re: Can query planner prefer a JOIN over a high-cost Function?

From: Tom Lane
David McNett <pglists@macnugget.org> writes:
> Is there any way for me to influence the query planner so that it can
> know that the JOIN is far less expensive than the function for planning?
> The COST attribute on the function appears to have no effect.

I think what you're missing is an index on examples.code_id, which
would allow for a plan like this one:

 Nested Loop  (cost=154.41..205263.18 rows=2185 width=16)
   ->  Seq Scan on codes c  (cost=0.00..1.06 rows=1 width=8)
         Filter: ((code)::text = 'three'::text)
   ->  Bitmap Heap Scan on examples e  (cost=154.41..205234.81 rows=2731 width=12)
         Recheck Cond: (code_id = c.code_id)
         Filter: (painfully_slow_function(example_id, value) IS TRUE)
         ->  Bitmap Index Scan on examples_code_id_idx  (cost=0.00..153.73 rows=8192 width=0)
               Index Cond: (code_id = c.code_id)
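
For reference, here is a minimal sketch of the index that would enable a
plan of that shape (the index name simply mirrors the one in the plan
above; treat it as illustrative):

  -- Index the join/filter column so the planner can restrict examples
  -- by code_id before it ever evaluates the expensive function.
  CREATE INDEX examples_code_id_idx ON examples (code_id);
  ANALYZE examples;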

If you really want to force the join to occur separately, you could
probably do something involving a sub-select with OFFSET 0, but I wouldn't
recommend pursuing that path unless you can't get a decent result without
contorting the query.
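
As a rough sketch of that kind of contortion (assuming OFFSET 0 still acts
as an optimization fence, so the outer filter on the function is not pushed
down into the subquery):

  -- Sketch only: do the join and the code filter in a subquery first,
  -- then apply the expensive function to the rows that survive it.
  SELECT *
  FROM (SELECT * FROM examples_view WHERE code = 'three' OFFSET 0) sub
  WHERE painfully_slow_function(example_id, value) IS TRUE;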

Another thing worth thinking about is whether you could precalculate the
expensive function via a functional index.  It'd have to be immutable,
but if it is, this is a useful way of changing the ground rules.
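
A bare-bones sketch of that approach, assuming the real function can
honestly be declared IMMUTABLE (the pg_sleep/RAISE NOTICE test version
cannot; the index name below is made up):

  -- Mark the function immutable so it is allowed in an index expression,
  -- then precompute its result for every row in an expression index.
  ALTER FUNCTION painfully_slow_function(integer, varchar) IMMUTABLE;
  CREATE INDEX examples_slow_fn_idx
    ON examples (painfully_slow_function(example_id, value));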

            regards, tom lane