From Mark Dilger
Subject Re: Function execution costs 'n all that
Date
Msg-id 45BD60EB.2060106@markdilger.com
In response to Re: Function execution costs 'n all that  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Tom Lane wrote:
> Mark Dilger <pgsql@markdilger.com> writes:
>> Tom Lane wrote:
>>> Would a simple constant value be workable, or do we need some more
>>> complex model (and if so what)?
> 
>> Consider:
>> ANALYZE myfunc(integer) ON (SELECT myfunc(7)) WITH RATIO 0.03;
>> ...
>> It seems to me that the above system would work perfectly well for
>> collecting the number of rows returned from a set returning function,
>> not just the run times.
> 
> I don't really think that data collection is the bottleneck.

Ahh, I'm not just thinking about data collection.  I'm thinking about usability
for non-hackers who know enough plpgsql to write a function and then want to
train the system to plan for it appropriately.  It's a much easier task for a
novice user to say "go away and figure out how expensive this thing is" than to
have to think about things like statistical variance, etc.  We don't demand that
users have that kind of knowledge to write queries or run ANALYZE on tables, so
why would they need it to write a function?

> If a
> constant estimate isn't good enough for you, then you need some kind of
> model of how the runtime or number of rows varies with the function's
> inputs ... and I hardly see how something like the above is likely to
> figure out how to fit a good model.  Or at least, if you think it can,
> then you skipped all the interesting bits.

I am (perhaps naively) imagining that the user will train the database over the
same query as the one that will actually be used most often in production.  If
the query modifies the table, the user could train the database over a copy of
that table.  The data collected by the analyze phase would just be constants
such as the average and standard deviation of the run time.  That would make the
job of the planner / cost estimator easier, right?  It could treat the function
as a constant-cost function.
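
For concreteness, I'm picturing a training session something like this (using
the proposed syntax from above, which of course doesn't exist today, and with
made-up table and column names):

CREATE TABLE orders_copy AS SELECT * FROM orders;  -- train against a copy, not the live table
ANALYZE myfunc(integer) ON (SELECT myfunc(qty) FROM orders_copy) WITH RATIO 0.03;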

> One other point is that we already know that sampling overhead and
> measurement error are significant problems when trying to measure
> intervals on the order of one Plan-node execution.  I'm afraid that
> would get a great deal worse if we try to use a similar approach to
> timing individual function calls.

The query could be run with the arguments passed to "myfunc" being recorded to a
temporary table.  After the query is complete (and the temporary table
populated), data from the temp table could be pulled into memory in batches, and
"myfunc" run on them again in a tight loop.  The loop itself could be timed,
rather than each iteration.  The sum of the timings for the various loops would
then be the final runtime, which would be divided by the total number of rows to
get the average runtime per call.  The downside is that I don't see how you
retrieve the standard deviation.  (I also don't know whether the planner knows
how to use standard deviation information, so perhaps this is a non-issue.)
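
Roughly, a sketch of that second pass in plpgsql, assuming the recorded
arguments landed in a temp table I'll call myfunc_args_tmp(arg integer)
(single batch shown; a real version would chunk the work):

create or replace function time_myfunc_batch() returns interval as $$
declare
    args    integer[];
    t0      timestamptz;
    elapsed interval;
begin
    -- pull the batch into memory first, so the timed loop below measures
    -- myfunc itself rather than the temp-table scan
    select array_agg(arg) into args from myfunc_args_tmp;

    t0 := clock_timestamp();
    for i in 1 .. array_length(args, 1) loop
        perform myfunc(args[i]);
    end loop;
    elapsed := clock_timestamp() - t0;

    raise notice 'average per call: %', elapsed / array_length(args, 1);
    return elapsed;
end;
$$ language plpgsql;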

A further refinement would be to batch the inputs based on properties of the
input data.  For text, you could run a batch of short text first, a batch of
medium-length text second, and a batch of long text last, and use a best-fit
linear regression to determine runtime cost as a function of input text length.
I'm not sure how such a refinement would be done for fixed-size datatypes.  And
for some text functions the runtime won't vary with length but with some other
property anyway.
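
To make that last step concrete: if we kept one row per batch with the average
input length and the average per-call time (say in an invented table
myfunc_batch_timings_tmp(avg_length float8, avg_ms_per_call float8)), the
built-in regression aggregates would give the fit directly:

select regr_slope(avg_ms_per_call, avg_length)     as ms_per_char,
       regr_intercept(avg_ms_per_call, avg_length) as fixed_overhead_ms
from   myfunc_batch_timings_tmp;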

mark

