Thread: function not called if part of aggregate

function not called if part of aggregate

From

"Craig A. James"

Date:

11 June 2006, 14:19:31

My application has a function, call it "foo()", that requires initialization from a table of about 800 values.  Rather
thanbuild these values into the C code, it seemed like a good idea to put them on a PG table and create a second
function,call it "foo_init()", which is called for each value, like this: 

   select foo_init(value) from foo_init_table order by value_id;

This works well, but it requires me to actually retrieve the function's value 800 times.  So I thought I'd be clever:

   select count(1) from (select foo_init(value) from foo_init_table order by value_id) as foo;

And indeed, it count() returns 800, as expected.  But my function foo_init() never gets called!  Apparently the
optimizerfigures out that foo_init() must return one value for each row, so it doesn't bother to actually call the
function.

db=> explain select count(1) from (select foo_init(value) from foo_init_table order by db_no) as foo;
                                             query plan
----------------------------------------------------------------------------------------------------
 aggregate  (cost=69.95..69.95 rows=1 width=0)
   ->  Subquery Scan foo  (cost=0.00..67.93 rows=806 width=0)
         ->  Index Scan using foo_init_table_pkey on foo_init_table  (cost=0.00..59.87 rows=806 width=30)

This doesn't seem right to me -- how can the optimizer possibly know that a function doesn't have a side effect, as in
mycase?  Functions could do all sorts of things, such as logging activity, filling in other tables, etc, etc. 

Am I missing something here?

Thanks,
Craig

Re: function not called if part of aggregate

From

"Steinar H. Gunderson"

Date:

11 June 2006, 14:32:10

On Sun, Jun 11, 2006 at 10:18:20AM -0700, Craig A. James wrote:
> This works well, but it requires me to actually retrieve the function's
> value 800 times.

Is this actually a problem?

> So I thought I'd be clever:
>
>   select count(1) from (select foo_init(value) from foo_init_table order by
>   value_id) as foo;

Why not just count(foo_init(value))?

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: function not called if part of aggregate

From

Tom Lane

Date:

11 June 2006, 14:39:36

"Craig A. James" <cjames@modgraph-usa.com> writes:
>    select count(1) from (select foo_init(value) from foo_init_table order by value_id) as foo;

> And indeed, it count() returns 800, as expected.  But my function foo_init() never gets called!

Really?  With the ORDER BY in there, it does get called, in my
experiments.  What PG version is this exactly?

However, the short answer to your question is that PG does not guarantee
to evaluate parts of the query not needed to determine the result.  You
could do something like

select count(x) from (select foo_init(value) as x from foo_init_table order by value_id) as foo;

to ensure that foo_init() must be evaluated.

            regards, tom lane

Re: function not called if part of aggregate

From

"Jim C. Nasby"

Date:

11 June 2006, 14:48:48

On Sun, Jun 11, 2006 at 10:18:20AM -0700, Craig A. James wrote:
> This doesn't seem right to me -- how can the optimizer possibly know that a
> function doesn't have a side effect, as in my case?  Functions could do all
> sorts of things, such as logging activity, filling in other tables, etc,
> etc.
>
> Am I missing something here?

Read about function stability in the docs.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: function not called if part of aggregate

From

Greg Stark

Date:

11 June 2006, 15:00:52

"Craig A. James" <cjames@modgraph-usa.com> writes:

> This doesn't seem right to me -- how can the optimizer possibly know that a
> function doesn't have a side effect, as in my case? Functions could do all
> sorts of things, such as logging activity, filling in other tables, etc, etc.

The optimizer can know this if the user tells it so by marking the function
IMMUTABLE. If the function is marked VOLATILE then the optimizer can know it
might have side effects.

However that's not enough to explain what you've shown. How about you show the
actual query and actual plan you're working with? The plan you've shown can't
result from the query you sent.

--
greg

Re: function not called if part of aggregate

From

"Craig A. James"

Date:

13 June 2006, 03:27:38

Greg Stark wrote:
> However that's not enough to explain what you've shown. How about you show the
> actual query and actual plan you're working with? The plan you've shown can't
> result from the query you sent.

Mea culpa, sort of.  But ... in fact, the plan I sent *was* from query I sent, with the table/column names changed for
clarity. This time I'll send the plan "raw".  (This is PG 8.0.1.) 

chm=> explain select count(1) from (select normalize_add_salt(smiles) from
chm(>     salt_smiles order by db_no) as foo;
                                             QUERY PLAN
----------------------------------------------------------------------------------------------------
 Aggregate  (cost=69.95..69.95 rows=1 width=0)
   ->  Subquery Scan foo  (cost=0.00..67.93 rows=806 width=0)
         ->  Index Scan using salt_smiles_pkey on salt_smiles  (cost=0.00..59.87 rows=806 width=30)
(3 rows)

As pointed out by Tom and others, this query DOES in fact call the normalize_add_salt() function.

Now here's the weird part. (And where my original posting went wrong -- sorry for the error!  I got the two queries
mixedup.) 

I originally had a more complex query, the purpose being to guarantee that the function was called on the strings in
theorder specified.  (More on this below.)  Here is the original query I used: 

chm=> explain select count(1) from (select normalize_add_salt(smiles)
chm(>   from (select smiles from salt_smiles order by db_no) as foo) as bar;
                                             QUERY PLAN
----------------------------------------------------------------------------------------------------
 Aggregate  (cost=67.94..67.94 rows=1 width=0)
   ->  Subquery Scan foo  (cost=0.00..65.92 rows=806 width=0)
         ->  Index Scan using salt_smiles_pkey on salt_smiles  (cost=0.00..57.86 rows=806 width=30)
(3 rows)

Notice that the plans are essentially identical, yet in this one the function does NOT get called.  I proved this by
bruteforce, inserting "char **p = NULL; *p = "foo";" into the C code to guarantee a segmentation violation if the
functiongets called.  In the first case it does SIGSEGV, and in the second case it does not. 

Now the reason for this more-complex query with an additional subselect is that the SMILES (which, by the way, are a
lexicalway of representing chemical structures - see www.daylight.com), must be passed to the function in a particular
order(hence the ORDER BY).  In retrospect I realize the optimizer apparently flattens this query anyway (hence the
identicalplans, above). 

But the weird thing is that, in spite of flattening, which would appear to make the queries equivalent, the function
getscalled in one case, and not in the other. 

Steinar H. Gunderson asked:
>>  select count(1) from (select foo_init(value) from foo_init_table order by
>>  value_id) as foo;
> Why not just count(foo_init(value))?

Because the SMILES must be processed in a specific order, hence the more complex queries.

The simple answer to this whole problem is what Steinar wrote:
>>This works well, but it requires me to actually retrieve the function's
>>value 800 times.
>
> Is this actually a problem?

No, it's just a nuisance.  It occurs to me that in spite of the ORDER BY expression, Postgres is free to evaluate the
functionfirst, THEN sort the results, which means the SMILES would be processed in random order anyway.  I.e. my ORDER
BYclause is useless for the intended purpose. 

So the only way I can see to get this right is to pull the SMILES into my application with the ORDER BY to ensure I
havethem in the correct order, then send them back one at a time via a "select normalize_add_salt(smiles)", meaning
I'llretrieve 800 strings and then send them back.  

I just thought there ought to be a way to do this all on the PG server instead of sending all these strings back and
forth. I'd like to say to Postgres, "Just do it this way, OK?"  But the optimizer can't be turned off, so I guess I
haveto do it the slow way.  The good news is that this is just an initialization step, after which I typically process
thousandsof molecules, so the extra overhead won't kill me. 

Thanks to all for your help.

Craig

Re: function not called if part of aggregate

From

Tom Lane

Date:

13 June 2006, 11:37:15

"Craig A. James" <cjames@modgraph-usa.com> writes:
> But the weird thing is that, in spite of flattening, which would appear to make the queries equivalent, the function
getscalled in one case, and not in the other. 

No, nothing particularly weird about it.  ORDER BY in a subselect
acts as an "optimization fence" that prevents flattening.  An
un-flattened subquery will evaluate all its output columns whether the
parent query reads them or not.  (This is not set in stone mind you,
but in the current planner implementation it's hard to avoid, because
such a sub-query gets planned before we've figured out which columns
the parent wants to reference.)  The cases in which you had the function
in a subquery without ORDER BY were flattenable, and in that case the
planner threw the function expression away as being unreferenced.

            regards, tom lane

explain analyze reports 20x more time than actual

From

"Craig A. James"

Date:

30 June 2006, 02:54:15

I have a query that needs to run faster, with the obvious solution being to add an index.  But to confirm this, I ran
explainanalyze.  When I run the actual query, it consistently takes 6-7 seconds by the wall clock.  My application with
a"verbose" mode enabled reports 6.6 seconds consistently.  However, when I run EXPLAIN ANALYZE, it takes 120 seconds!
Thisis 20x longer, and it leads me to distrust the plan that it claims to be executing.  How can the actual run time be
somuch faster than that claimed by EXPLAIN ANALYZE?  How can I find out the actual plan it's using? 

Thanks,
Craig


Details:
  Postgres 8.0.3
    shared_buffers = 20000
    work_mem = 500000
    effective_cache_size = 430000
  Dell w/ Xeon
    Linux kernel 2.6.9-1.667smp
    4 GB memory

=> explain analyze select SAMPLE.SAMPLE_ID, SAMPLE.VERSION_ID,SAMPLE.SUPPLIER_ID,SAMPLE.CATALOGUE_ID,SAMPLE.PREP_ID
fromHITLIST_ROWS_281430 join SAMPLE on (HITLIST_ROWS_281430.OBJECTID = SAMPLE.SAMPLE_ID) where  SAMPLE.VERSION_ID in
(7513672,7513650,7513634,7513620,7513592,7513590,7513582,7513576,7513562,7513560)order by
HITLIST_ROWS_281430.SortOrder;

                                          QUERY PLAN
                                                                                               

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=234964.38..234964.52 rows=58 width=24) (actual time=120510.842..120510.889 rows=10 loops=1)
   Sort Key: hitlist_rows_281430.sortorder
   ->  Hash Join  (cost=353.68..234962.68 rows=58 width=24) (actual time=81433.194..120510.753 rows=10 loops=1)
         Hash Cond: ("outer".objectid = "inner".sample_id)
         ->  Seq Scan on hitlist_rows_281430  (cost=0.00..177121.61 rows=11497361 width=8) (actual
time=0.008..64434.110rows=11497361 loops=1) 
         ->  Hash  (cost=353.48..353.48 rows=82 width=20) (actual time=0.293..0.293 rows=0 loops=1)
               ->  Index Scan using i_sample_version_id, i_sample_version_id, i_sample_version_id, i_sample_version_id,
i_sample_version_id,i_sample_version_id, i_sample_version_id, i_sample_version_id, i_sample_version_id,
i_sample_version_idon sample  (cost=0.00..353.48 rows=82 width=20) (actual time=0.042..0.201 rows=12 loops=1) 
                     Index Cond: ((version_id = 7513672) OR (version_id = 7513650) OR (version_id = 7513634) OR
(version_id= 7513620) OR (version_id = 7513592) OR (version_id = 7513590) OR (version_id = 7513582) OR (version_id =
7513576)OR (version_id = 7513562) OR (version_id = 7513560)) 
 Total runtime: 120511.485 ms
(9 rows)

Re: explain analyze reports 20x more time than actual

From

Tom Lane

Date:

30 June 2006, 03:05:08

"Craig A. James" <cjames@modgraph-usa.com> writes:
> I have a query that needs to run faster, with the obvious solution
> being to add an index.  But to confirm this, I ran explain analyze.
> When I run the actual query, it consistently takes 6-7 seconds by the
> wall clock.  My application with a "verbose" mode enabled reports 6.6
> seconds consistently.  However, when I run EXPLAIN ANALYZE, it takes
> 120 seconds!

See recent discussions --- if you've got duff PC hardware, it seems that
reading the clock takes forever :-(.  In this case I'd assume that the
cost of the seqscan (11497361 rows returned) is being overstated because
of the 2*11497361 gettimeofday calls involved.

            regards, tom lane