Thread: function not called if part of aggregate
My application has a function, call it "foo()", that requires initialization from a table of about 800 values. Rather thanbuild these values into the C code, it seemed like a good idea to put them on a PG table and create a second function,call it "foo_init()", which is called for each value, like this: select foo_init(value) from foo_init_table order by value_id; This works well, but it requires me to actually retrieve the function's value 800 times. So I thought I'd be clever: select count(1) from (select foo_init(value) from foo_init_table order by value_id) as foo; And indeed, it count() returns 800, as expected. But my function foo_init() never gets called! Apparently the optimizerfigures out that foo_init() must return one value for each row, so it doesn't bother to actually call the function. db=> explain select count(1) from (select foo_init(value) from foo_init_table order by db_no) as foo; query plan ---------------------------------------------------------------------------------------------------- aggregate (cost=69.95..69.95 rows=1 width=0) -> Subquery Scan foo (cost=0.00..67.93 rows=806 width=0) -> Index Scan using foo_init_table_pkey on foo_init_table (cost=0.00..59.87 rows=806 width=30) This doesn't seem right to me -- how can the optimizer possibly know that a function doesn't have a side effect, as in mycase? Functions could do all sorts of things, such as logging activity, filling in other tables, etc, etc. Am I missing something here? Thanks, Craig
On Sun, Jun 11, 2006 at 10:18:20AM -0700, Craig A. James wrote: > This works well, but it requires me to actually retrieve the function's > value 800 times. Is this actually a problem? > So I thought I'd be clever: > > select count(1) from (select foo_init(value) from foo_init_table order by > value_id) as foo; Why not just count(foo_init(value))? /* Steinar */ -- Homepage: http://www.sesse.net/
"Craig A. James" <cjames@modgraph-usa.com> writes: > select count(1) from (select foo_init(value) from foo_init_table order by value_id) as foo; > And indeed, it count() returns 800, as expected. But my function foo_init() never gets called! Really? With the ORDER BY in there, it does get called, in my experiments. What PG version is this exactly? However, the short answer to your question is that PG does not guarantee to evaluate parts of the query not needed to determine the result. You could do something like select count(x) from (select foo_init(value) as x from foo_init_table order by value_id) as foo; to ensure that foo_init() must be evaluated. regards, tom lane
On Sun, Jun 11, 2006 at 10:18:20AM -0700, Craig A. James wrote: > This doesn't seem right to me -- how can the optimizer possibly know that a > function doesn't have a side effect, as in my case? Functions could do all > sorts of things, such as logging activity, filling in other tables, etc, > etc. > > Am I missing something here? Read about function stability in the docs. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
"Craig A. James" <cjames@modgraph-usa.com> writes: > This doesn't seem right to me -- how can the optimizer possibly know that a > function doesn't have a side effect, as in my case? Functions could do all > sorts of things, such as logging activity, filling in other tables, etc, etc. The optimizer can know this if the user tells it so by marking the function IMMUTABLE. If the function is marked VOLATILE then the optimizer can know it might have side effects. However that's not enough to explain what you've shown. How about you show the actual query and actual plan you're working with? The plan you've shown can't result from the query you sent. -- greg
Greg Stark wrote: > However that's not enough to explain what you've shown. How about you show the > actual query and actual plan you're working with? The plan you've shown can't > result from the query you sent. Mea culpa, sort of. But ... in fact, the plan I sent *was* from query I sent, with the table/column names changed for clarity. This time I'll send the plan "raw". (This is PG 8.0.1.) chm=> explain select count(1) from (select normalize_add_salt(smiles) from chm(> salt_smiles order by db_no) as foo; QUERY PLAN ---------------------------------------------------------------------------------------------------- Aggregate (cost=69.95..69.95 rows=1 width=0) -> Subquery Scan foo (cost=0.00..67.93 rows=806 width=0) -> Index Scan using salt_smiles_pkey on salt_smiles (cost=0.00..59.87 rows=806 width=30) (3 rows) As pointed out by Tom and others, this query DOES in fact call the normalize_add_salt() function. Now here's the weird part. (And where my original posting went wrong -- sorry for the error! I got the two queries mixedup.) I originally had a more complex query, the purpose being to guarantee that the function was called on the strings in theorder specified. (More on this below.) Here is the original query I used: chm=> explain select count(1) from (select normalize_add_salt(smiles) chm(> from (select smiles from salt_smiles order by db_no) as foo) as bar; QUERY PLAN ---------------------------------------------------------------------------------------------------- Aggregate (cost=67.94..67.94 rows=1 width=0) -> Subquery Scan foo (cost=0.00..65.92 rows=806 width=0) -> Index Scan using salt_smiles_pkey on salt_smiles (cost=0.00..57.86 rows=806 width=30) (3 rows) Notice that the plans are essentially identical, yet in this one the function does NOT get called. I proved this by bruteforce, inserting "char **p = NULL; *p = "foo";" into the C code to guarantee a segmentation violation if the functiongets called. In the first case it does SIGSEGV, and in the second case it does not. Now the reason for this more-complex query with an additional subselect is that the SMILES (which, by the way, are a lexicalway of representing chemical structures - see www.daylight.com), must be passed to the function in a particular order(hence the ORDER BY). In retrospect I realize the optimizer apparently flattens this query anyway (hence the identicalplans, above). But the weird thing is that, in spite of flattening, which would appear to make the queries equivalent, the function getscalled in one case, and not in the other. Steinar H. Gunderson asked: >> select count(1) from (select foo_init(value) from foo_init_table order by >> value_id) as foo; > Why not just count(foo_init(value))? Because the SMILES must be processed in a specific order, hence the more complex queries. The simple answer to this whole problem is what Steinar wrote: >>This works well, but it requires me to actually retrieve the function's >>value 800 times. > > Is this actually a problem? No, it's just a nuisance. It occurs to me that in spite of the ORDER BY expression, Postgres is free to evaluate the functionfirst, THEN sort the results, which means the SMILES would be processed in random order anyway. I.e. my ORDER BYclause is useless for the intended purpose. So the only way I can see to get this right is to pull the SMILES into my application with the ORDER BY to ensure I havethem in the correct order, then send them back one at a time via a "select normalize_add_salt(smiles)", meaning I'llretrieve 800 strings and then send them back. I just thought there ought to be a way to do this all on the PG server instead of sending all these strings back and forth. I'd like to say to Postgres, "Just do it this way, OK?" But the optimizer can't be turned off, so I guess I haveto do it the slow way. The good news is that this is just an initialization step, after which I typically process thousandsof molecules, so the extra overhead won't kill me. Thanks to all for your help. Craig
"Craig A. James" <cjames@modgraph-usa.com> writes: > But the weird thing is that, in spite of flattening, which would appear to make the queries equivalent, the function getscalled in one case, and not in the other. No, nothing particularly weird about it. ORDER BY in a subselect acts as an "optimization fence" that prevents flattening. An un-flattened subquery will evaluate all its output columns whether the parent query reads them or not. (This is not set in stone mind you, but in the current planner implementation it's hard to avoid, because such a sub-query gets planned before we've figured out which columns the parent wants to reference.) The cases in which you had the function in a subquery without ORDER BY were flattenable, and in that case the planner threw the function expression away as being unreferenced. regards, tom lane
I have a query that needs to run faster, with the obvious solution being to add an index. But to confirm this, I ran explainanalyze. When I run the actual query, it consistently takes 6-7 seconds by the wall clock. My application with a"verbose" mode enabled reports 6.6 seconds consistently. However, when I run EXPLAIN ANALYZE, it takes 120 seconds! Thisis 20x longer, and it leads me to distrust the plan that it claims to be executing. How can the actual run time be somuch faster than that claimed by EXPLAIN ANALYZE? How can I find out the actual plan it's using? Thanks, Craig Details: Postgres 8.0.3 shared_buffers = 20000 work_mem = 500000 effective_cache_size = 430000 Dell w/ Xeon Linux kernel 2.6.9-1.667smp 4 GB memory => explain analyze select SAMPLE.SAMPLE_ID, SAMPLE.VERSION_ID,SAMPLE.SUPPLIER_ID,SAMPLE.CATALOGUE_ID,SAMPLE.PREP_ID fromHITLIST_ROWS_281430 join SAMPLE on (HITLIST_ROWS_281430.OBJECTID = SAMPLE.SAMPLE_ID) where SAMPLE.VERSION_ID in (7513672,7513650,7513634,7513620,7513592,7513590,7513582,7513576,7513562,7513560)order by HITLIST_ROWS_281430.SortOrder; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ Sort (cost=234964.38..234964.52 rows=58 width=24) (actual time=120510.842..120510.889 rows=10 loops=1) Sort Key: hitlist_rows_281430.sortorder -> Hash Join (cost=353.68..234962.68 rows=58 width=24) (actual time=81433.194..120510.753 rows=10 loops=1) Hash Cond: ("outer".objectid = "inner".sample_id) -> Seq Scan on hitlist_rows_281430 (cost=0.00..177121.61 rows=11497361 width=8) (actual time=0.008..64434.110rows=11497361 loops=1) -> Hash (cost=353.48..353.48 rows=82 width=20) (actual time=0.293..0.293 rows=0 loops=1) -> Index Scan using i_sample_version_id, i_sample_version_id, i_sample_version_id, i_sample_version_id, i_sample_version_id,i_sample_version_id, i_sample_version_id, i_sample_version_id, i_sample_version_id, i_sample_version_idon sample (cost=0.00..353.48 rows=82 width=20) (actual time=0.042..0.201 rows=12 loops=1) Index Cond: ((version_id = 7513672) OR (version_id = 7513650) OR (version_id = 7513634) OR (version_id= 7513620) OR (version_id = 7513592) OR (version_id = 7513590) OR (version_id = 7513582) OR (version_id = 7513576)OR (version_id = 7513562) OR (version_id = 7513560)) Total runtime: 120511.485 ms (9 rows)
"Craig A. James" <cjames@modgraph-usa.com> writes: > I have a query that needs to run faster, with the obvious solution > being to add an index. But to confirm this, I ran explain analyze. > When I run the actual query, it consistently takes 6-7 seconds by the > wall clock. My application with a "verbose" mode enabled reports 6.6 > seconds consistently. However, when I run EXPLAIN ANALYZE, it takes > 120 seconds! See recent discussions --- if you've got duff PC hardware, it seems that reading the clock takes forever :-(. In this case I'd assume that the cost of the seqscan (11497361 rows returned) is being overstated because of the 2*11497361 gettimeofday calls involved. regards, tom lane