Thread: Feature request: optimizer improvement

Feature request: optimizer improvement

From
Joe Love
Date:
In postgres 9.2 I have a function that is relatively expensive.  When I write a query such as:

select expensive_function(o.id),o.* from offeirng o where valid='Y' order by name limit 1;

the query runs slow and appears to be running the function on each ID, which in this case should be totally unnecessary as it really only needs to run on 1 row.

When I rewrite the query like so:

select expensive_function(o.id), o.*
from (select *offering where valid='Y' order by name limit 1) o;

the expensive function only runs once and thus, much faster. I would think that the optimizer could handle this situation, especially when limit or offset is used and the expensive function is not used in a group by, order by or where.
On Oct 31, 2013, at 11:04 AM, Joe Love <joe@primoweb.com> wrote:
In postgres 9.2 I have a function that is relatively expensive.  When I write a query such as:

select expensive_function(o.id),o.* from offeirng o where valid='Y' order by name limit 1;

the query runs slow and appears to be running the function on each ID, which in this case should be totally unnecessary as it really only needs to run on 1 row.

When I rewrite the query like so:

select expensive_function(o.id), o.*
from (select *offering where valid='Y' order by name limit 1) o;

the expensive function only runs once and thus, much faster. I would think that the optimizer could handle this situation, especially when limit or offset is used and the expensive function is not used in a group by, order by or where.

Does anyone know what the SQL standard says about this, if anything? I can't see any way that this would change the result set, but of course if the function has external effects this would make a difference...
Jim Nasby <jim@nasby.net> writes:
> On Oct 31, 2013, at 11:04 AM, Joe Love <joe@primoweb.com> wrote:
>> In postgres 9.2 I have a function that is relatively expensive.  When I write a query such as:
>> 
>> select expensive_function(o.id),o.* from offeirng o where valid='Y' order by name limit 1;

> Does anyone know what the SQL standard says about this, if anything?

The computational model is that you evaluate the SELECT list before
sorting; this must be so since you can write
 select somefunc(x) as y from tab order by y;

In the general case, therefore, it's impossible to avoid evaluating the
function at all rows.  I'm not sure what the spec says about whether it's
okay to skip evaluation of functions that would be evaluated in a naive
implementation of the computational model, so it's possible that what
the OP is asking for is directly contrary to spec.  But more likely they
permit implementations to skip "unnecessary" calls, if indeed they address
this at all.

As far as PG goes, I think the "excess" calls would only occur if the plan
includes an explicit sort step, since the select list would be evaluated
before the sort step.  If you've got an index on "name" (or maybe you'd
need (valid, name) if there aren't many rows with valid = 'Y') I'd expect
it to pick out the minimal "name" row with the index, avoiding any sort,
and then the function would only be evaluated on the single fetched row.
But these are implementation details not anything you're going to find
support for in the spec.
        regards, tom lane



On Sat, Nov 2, 2013 at 2:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jim Nasby <jim@nasby.net> writes:
>> On Oct 31, 2013, at 11:04 AM, Joe Love <joe@primoweb.com> wrote:
>>> In postgres 9.2 I have a function that is relatively expensive.  When I write a query such as:
>>>
>>> select expensive_function(o.id),o.* from offeirng o where valid='Y' order by name limit 1;
>
>> Does anyone know what the SQL standard says about this, if anything?
>
> The computational model is that you evaluate the SELECT list before
> sorting; this must be so since you can write
>
>          select somefunc(x) as y from tab order by y;
>
> In the general case, therefore, it's impossible to avoid evaluating the
> function at all rows.  I'm not sure what the spec says about whether it's
> okay to skip evaluation of functions that would be evaluated in a naive
> implementation of the computational model, so it's possible that what
> the OP is asking for is directly contrary to spec.  But more likely they
> permit implementations to skip "unnecessary" calls, if indeed they address
> this at all.
>
> As far as PG goes, I think the "excess" calls would only occur if the plan
> includes an explicit sort step, since the select list would be evaluated
> before the sort step.  If you've got an index on "name" (or maybe you'd
> need (valid, name) if there aren't many rows with valid = 'Y') I'd expect
> it to pick out the minimal "name" row with the index, avoiding any sort,
> and then the function would only be evaluated on the single fetched row.
> But these are implementation details not anything you're going to find
> support for in the spec.
>
>                         regards, tom lane

Doesnt that make the index mandatory here?
If I understand correctly, if an index is present, the sort will be
avoided altogether. IMHO, thats avoiding the problem. The question
here is that whether we can add planner heuristics for understanding
this case and executing the LIMIT part first (before executing the
funtion).

I understand the reasons for executing SELECT before the sort. But,
couldnt we get the planner to see the LIMIT part and push the sort
node above the select node for this specific case?

So, seeing the LIMIT, the dataset is first sorted, then LIMITed, then
the function applied.

Is this process possible?

Regards,

Atri
-- 
Regards,

Atri
l'apprenant



Atri Sharma <atri.jiit@gmail.com> writes:
> I understand the reasons for executing SELECT before the sort. But,
> couldnt we get the planner to see the LIMIT part and push the sort
> node above the select node for this specific case?

[ Shrug... ]  I don't see the point.  If the OP actually cares about the
speed of this query, he's going to want to avoid the sort step too,
which is what makes the index a good idea.

More generally, this is not a transformation we could apply
unconditionally --- at the very least it'd need to be avoided when
volatile functions are involved, and I don't think it's invariably
a win from a cost standpoint even if there aren't semantic blockers.
But right now the planner has no real ability to reason about placement
of SELECT-list evaluation: it's done in a fixed spot in any particular
plan structure.  I don't think it's practical to add such considerations
to the rat's nest that is grouping_planner and friends.  I have
ambitions to replace all that with a Path-generation-and-comparison
approach, and the Paths used for this would need to carry some
indication of which expressions would be evaluated where.  So maybe
once that's done we could think about whether this is worth doing.
I remain dubious though.
        regards, tom lane



I'm wondering what type of index would work for this as it is a volatile function. Not knowing how PGs optimizer runs, I'm at a loss as to why this wouldn't be possible or worth doing. It seems to me that all functions in the "select" part of the statement could be calculated at the end of the query after the results have been gathered, and even after the sorting had been done as long as the column wasn't part of the order by (or perhaps group by). 

I have an entire set of functions that perform in this way. For example, I'm selecting a list of all my products and the function does a complex calculation based on inventory in the warehouse + expected deliveries from the factory to determine how many of each item is available, and when they first become available.   What's helpful is for the users search criteria to initially limit the search result, and then I want to paginate the results and only show them a few at a time. In the verbose syntax I mentioned originally, the query performs well, in the most straightforward syntax, it does not. I'm not sure I even need to "hint" the optimizer to perform this type of an optimization as it seems it would be beneficial (or at least not detrimental) 100% of the time.


On Sat, Nov 2, 2013 at 10:06 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Atri Sharma <atri.jiit@gmail.com> writes:
> I understand the reasons for executing SELECT before the sort. But,
> couldnt we get the planner to see the LIMIT part and push the sort
> node above the select node for this specific case?

[ Shrug... ]  I don't see the point.  If the OP actually cares about the
speed of this query, he's going to want to avoid the sort step too,
which is what makes the index a good idea.

More generally, this is not a transformation we could apply
unconditionally --- at the very least it'd need to be avoided when
volatile functions are involved, and I don't think it's invariably
a win from a cost standpoint even if there aren't semantic blockers.
But right now the planner has no real ability to reason about placement
of SELECT-list evaluation: it's done in a fixed spot in any particular
plan structure.  I don't think it's practical to add such considerations
to the rat's nest that is grouping_planner and friends.  I have
ambitions to replace all that with a Path-generation-and-comparison
approach, and the Paths used for this would need to carry some
indication of which expressions would be evaluated where.  So maybe
once that's done we could think about whether this is worth doing.
I remain dubious though.

                        regards, tom lane



--
Joe's Computer Service
405-227-0951
Computer Running Slow? Call Joe!
$125, Your computer will run like new!
www.JoesComputerService.net
Joe Love <joe@primoweb.com> writes:
> I'm wondering what type of index would work for this as it is a volatile
> function. Not knowing how PGs optimizer runs, I'm at a loss as to why this
> wouldn't be possible or worth doing. It seems to me that all functions in
> the "select" part of the statement could be calculated at the end of the
> query after the results have been gathered, and even after the sorting had
> been done as long as the column wasn't part of the order by (or perhaps
> group by).

The short answer is that doing so directly contradicts the computational
model defined by the SQL standard, and will break applications that rely
on the current behavior.  Since there's already a clear way to write the
query in a way that specifies evaluating the functions after the
sort/limit steps (ie, put the order by/limit in a sub-select), IMHO that's
what you should do, not lobby to make the optimizer reinterpret what you
wrote.
        regards, tom lane



On Tue, Nov 5, 2013 at 11:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Joe Love <joe@primoweb.com> writes:
>> I'm wondering what type of index would work for this as it is a volatile
>> function. Not knowing how PGs optimizer runs, I'm at a loss as to why this
>> wouldn't be possible or worth doing. It seems to me that all functions in
>> the "select" part of the statement could be calculated at the end of the
>> query after the results have been gathered, and even after the sorting had
>> been done as long as the column wasn't part of the order by (or perhaps
>> group by).
>
> The short answer is that doing so directly contradicts the computational
> model defined by the SQL standard, and will break applications that rely
> on the current behavior.  Since there's already a clear way to write the
> query in a way that specifies evaluating the functions after the
> sort/limit steps (ie, put the order by/limit in a sub-select), IMHO that's
> what you should do, not lobby to make the optimizer reinterpret what you
> wrote.
>
>

+1.

I thought more about our earlier discussion on this, and I agree with
the point that making the planner push limit over select for this
specific case is not a good idea.



-- 
Regards,

Atri
l'apprenant