Thread: Ordering behavior for aggregates

Ordering behavior for aggregates

From
Vik Fearing
Date:
The standard only defines an ORDER BY clause inside of an aggregate for 
ARRAY_AGG().  As an extension to the standard, we allow it for all 
aggregates, which is very convenient for non-standard things like 
string_agg().

However, it is completely useless for things like AVG() or SUM().  If 
you include it, the aggregate will do the sort even though it is neither 
required nor desired.

I am proposing something like pg_aggregate.aggordering which would be an 
enum of behaviors such as f=Forbidden, a=Allowed, r=Required.  Currently 
all aggregates would have 'a' but I am thinking that a lot of them could 
be switched to 'f'.  In that case, if a user supplies an ordering, an 
error is raised.

My main motivation behind this is to be able to optimize aggregates that 
could stop early such as ANY_VALUE(), but also to self-optimize queries 
written in error (or ignorance).

There is recurring demand for a first_agg() of some sort, and that one 
(whether implemented in core or left to extensions) would use 'r' so 
that an error is raised if the user does not supply an ordering.

I have not started working on this because I envision quite a lot of 
bikeshedding, but this is the approach I am aiming for.

Thoughts?
-- 
Vik Fearing



Re: Ordering behavior for aggregates

From
Magnus Hagander
Date:
On Tue, Dec 13, 2022 at 1:51 PM Vik Fearing <vik@postgresfriends.org> wrote:
The standard only defines an ORDER BY clause inside of an aggregate for
ARRAY_AGG().  As an extension to the standard, we allow it for all
aggregates, which is very convenient for non-standard things like
string_agg().

However, it is completely useless for things like AVG() or SUM().  If
you include it, the aggregate will do the sort even though it is neither
required nor desired.

I am proposing something like pg_aggregate.aggordering which would be an
enum of behaviors such as f=Forbidden, a=Allowed, r=Required.  Currently
all aggregates would have 'a' but I am thinking that a lot of them could
be switched to 'f'.  In that case, if a user supplies an ordering, an
error is raised.

Should there perhaps also be an option for "ignored" where we'd allow the user to specify it, but not actually do the sort because we know it's pointless? Or maybe that should be the behaviour of "forbidden", which should then perhaps have a different name?


My main motivation behind this is to be able to optimize aggregates that
could stop early such as ANY_VALUE(), but also to self-optimize queries
written in error (or ignorance).

There is recurring demand for a first_agg() of some sort, and that one
(whether implemented in core or left to extensions) would use 'r' so
that an error is raised if the user does not supply an ordering.

I have not started working on this because I envision quite a lot of
bikeshedding, but this is the approach I am aiming for.

Thoughts?

 For consistency, should we have a similar flag for DISITNCT? That could be interesting to forbid for something like first_agg() wouldn't it? I'm not sure what the usecase would be to require it, but maybe there is one?

--

Re: Ordering behavior for aggregates

From
Vik Fearing
Date:
On 12/13/22 13:55, Magnus Hagander wrote:
> On Tue, Dec 13, 2022 at 1:51 PM Vik Fearing <vik@postgresfriends.org> wrote:
> 
>> The standard only defines an ORDER BY clause inside of an aggregate for
>> ARRAY_AGG().  As an extension to the standard, we allow it for all
>> aggregates, which is very convenient for non-standard things like
>> string_agg().
>>
>> However, it is completely useless for things like AVG() or SUM().  If
>> you include it, the aggregate will do the sort even though it is neither
>> required nor desired.
>>
>> I am proposing something like pg_aggregate.aggordering which would be an
>> enum of behaviors such as f=Forbidden, a=Allowed, r=Required.  Currently
>> all aggregates would have 'a' but I am thinking that a lot of them could
>> be switched to 'f'.  In that case, if a user supplies an ordering, an
>> error is raised.
>>
> 
> Should there perhaps also be an option for "ignored" where we'd allow the
> user to specify it, but not actually do the sort because we know it's
> pointless? Or maybe that should be the behaviour of "forbidden", which
> should then perhaps have a different name?


I did think about that but I can't think of any reason we would want to 
silently ignore something the user has written.  If the ordering doesn't 
make sense, we should forbid it.


> My main motivation behind this is to be able to optimize aggregates that
>> could stop early such as ANY_VALUE(), but also to self-optimize queries
>> written in error (or ignorance).
>>
>> There is recurring demand for a first_agg() of some sort, and that one
>> (whether implemented in core or left to extensions) would use 'r' so
>> that an error is raised if the user does not supply an ordering.
>>
>> I have not started working on this because I envision quite a lot of
>> bikeshedding, but this is the approach I am aiming for.
>>
>> Thoughts?
>>
> 
>   For consistency, should we have a similar flag for DISITNCT? That could be
> interesting to forbid for something like first_agg() wouldn't it? I'm not
> sure what the usecase would be to require it, but maybe there is one?


I thought about that too, but decided it could be a separate patch 
because far fewer aggregates would need it.
-- 
Vik Fearing




Re: Ordering behavior for aggregates

From
Isaac Morland
Date:
On Tue, 13 Dec 2022 at 07:50, Vik Fearing <vik@postgresfriends.org> wrote:

I am proposing something like pg_aggregate.aggordering which would be an
enum of behaviors such as f=Forbidden, a=Allowed, r=Required.  Currently
all aggregates would have 'a' but I am thinking that a lot of them could
be switched to 'f'.  In that case, if a user supplies an ordering, an
error is raised.

Although I find "r" attractive, I have two concerns about it:

1) Do we really want to require ordering? I know it's weird and partially undefined to call something like string_agg without an ordering, but what if in the specific application it doesn’t matter in what order the items appear?

2) There is a backward compatibility issue here; it’s not clear to me we could apply "r" to any existing aggregate.

Actually upon consideration, I think I have similar concerns about "f". We don’t usually forbid "dumb" things; e.g., I can write a function which ignores its inputs. And in some situations, "dumb" things make sense. For example, if I’m specifying a function to use as a filter, it could be reasonable in a particular instance to provide a function which ignores one or more of its inputs.

Re: Ordering behavior for aggregates

From
Vik Fearing
Date:
On 12/13/22 14:25, Isaac Morland wrote:
> On Tue, 13 Dec 2022 at 07:50, Vik Fearing <vik@postgresfriends.org> wrote:
> 
> I am proposing something like pg_aggregate.aggordering which would be an
>> enum of behaviors such as f=Forbidden, a=Allowed, r=Required.  Currently
>> all aggregates would have 'a' but I am thinking that a lot of them could
>> be switched to 'f'.  In that case, if a user supplies an ordering, an
>> error is raised.
>>
> 
> Although I find "r" attractive, I have two concerns about it:
> 
> 1) Do we really want to require ordering? I know it's weird and partially
> undefined to call something like string_agg without an ordering, but what
> if in the specific application it doesn’t matter in what order the items
> appear?
> 
> 2) There is a backward compatibility issue here; it’s not clear to me we
> could apply "r" to any existing aggregate.


I do not intend to add 'r' to any existing aggregate.  I included it in 
the hypothetical enum for future aggregates and extensions.  It isn't 
perfect either because first_value((x, y) ORDER BY x) can still give a 
semi-random result.


> Actually upon consideration, I think I have similar concerns about "f". We
> don’t usually forbid "dumb" things; e.g., I can write a function which
> ignores its inputs. And in some situations, "dumb" things make sense. For
> example, if I’m specifying a function to use as a filter, it could be
> reasonable in a particular instance to provide a function which ignores one
> or more of its inputs.


Sure, but this isn't a function; this is syntax.  And in your example 
you are ignoring the input whereas currently aggregates *do not* ignore 
the ordering when they could/should.
-- 
Vik Fearing




Re: Ordering behavior for aggregates

From
Ronan Dunklau
Date:
Le mardi 13 décembre 2022, 14:05:10 CET Vik Fearing a écrit :
> On 12/13/22 13:55, Magnus Hagander wrote:
> > On Tue, Dec 13, 2022 at 1:51 PM Vik Fearing <vik@postgresfriends.org>
wrote:
> >> However, it is completely useless for things like AVG() or SUM().  If
> >> you include it, the aggregate will do the sort even though it is neither
> >> required nor desired.

I'm not sure about this. For AVG and SUM, if you want reproducible results
with floating point numbers, you may want it. And if you disallow it for most
avg and sum implementations except for floating point types, it's not a very
consistent user experience.


> >>
> >> I am proposing something like pg_aggregate.aggordering which would be an
> >> enum of behaviors such as f=Forbidden, a=Allowed, r=Required.  Currently
> >> all aggregates would have 'a' but I am thinking that a lot of them could
> >> be switched to 'f'.  In that case, if a user supplies an ordering, an
> >> error is raised.
> >
> > Should there perhaps also be an option for "ignored" where we'd allow the
> > user to specify it, but not actually do the sort because we know it's
> > pointless? Or maybe that should be the behaviour of "forbidden", which
> > should then perhaps have a different name?
>
> I did think about that but I can't think of any reason we would want to
> silently ignore something the user has written.  If the ordering doesn't
> make sense, we should forbid it.

It is allowed as of now, and so it would be a compatibility issue for queries
existing in the wild. Ignoring it is just an optimization, just how we
optimize away some joins entirely.

--
Ronan Dunklau





Re: Ordering behavior for aggregates

From
Tom Lane
Date:
Ronan Dunklau <ronan.dunklau@aiven.io> writes:
> Le mardi 13 décembre 2022, 14:05:10 CET Vik Fearing a écrit :
>> On 12/13/22 13:55, Magnus Hagander wrote:
>>> On Tue, Dec 13, 2022 at 1:51 PM Vik Fearing <vik@postgresfriends.org> 
>>> wrote:
>>>> However, it is completely useless for things like AVG() or SUM().  If
>>>> you include it, the aggregate will do the sort even though it is neither
>>>> required nor desired.

> I'm not sure about this. For AVG and SUM, if you want reproducible results 
> with floating point numbers, you may want it.

Yeah, I was about to mention the floating-point issue.  IIRC, we went
over exactly this ground when we introduced aggregate ORDER BY, and
decided that it was not our business to legislate whether particular
aggregates need ordering or not.  We don't try to second-guess users'
inclusion of ORDER BY in subqueries either, and that's just about
the same thing.  (Indeed, if you're feeding the subquery output to
an aggregate, it's exactly the same thing.)

Accordingly, I find nothing at all attractive in this proposal.
I think the main thing it'd accomplish is to drive users back to
the bad old days of ordering-by-subquery, if they have a requirement
we failed to account for.

            regards, tom lane



Re: Ordering behavior for aggregates

From
Ronan Dunklau
Date:
Le mardi 13 décembre 2022, 16:13:34 CET Tom Lane a écrit :
> Accordingly, I find nothing at all attractive in this proposal.
> I think the main thing it'd accomplish is to drive users back to
> the bad old days of ordering-by-subquery, if they have a requirement
> we failed to account for.

I think the ability to mark certain aggregates as being able to completely
ignore the ordering because they produce exactly the same results is still a
useful optimization.

--
Ronan Dunklau





Re: Ordering behavior for aggregates

From
Tom Lane
Date:
Ronan Dunklau <ronan.dunklau@aiven.io> writes:
> Le mardi 13 décembre 2022, 16:13:34 CET Tom Lane a écrit :
>> Accordingly, I find nothing at all attractive in this proposal.
>> I think the main thing it'd accomplish is to drive users back to
>> the bad old days of ordering-by-subquery, if they have a requirement
>> we failed to account for.

> I think the ability to mark certain aggregates as being able to completely
> ignore the ordering because they produce exactly the same results is still a
> useful optimization.

That is *exactly* the position I do not accept.

I think it's fairly unlikely that a user would trouble to write ORDER BY
within an aggregate call if they didn't need it.  So my opinion of this
proposal is that it's a lot of work to create an optimization effect that
will be useless to nearly all users, and might actively break the queries
of some.

            regards, tom lane



Re: Ordering behavior for aggregates

From
"David G. Johnston"
Date:
On Tue, Dec 13, 2022 at 9:45 AM Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
Le mardi 13 décembre 2022, 16:13:34 CET Tom Lane a écrit :
> Accordingly, I find nothing at all attractive in this proposal.
> I think the main thing it'd accomplish is to drive users back to
> the bad old days of ordering-by-subquery, if they have a requirement
> we failed to account for.

I think the ability to mark certain aggregates as being able to completely
ignore the ordering because they produce exactly the same results is still a
useful optimization.


I seriously doubt that users are adding unnecessary ORDER BY clauses to their aggregates. The more compelling use case would be existing ORMs that produce such problematic SQL - are there any though?

I'm more keen on the idea of having the system understand when an ORDER BY is missing - that seems like what users are more likely to actually do.  But it doesn't seem all that useful given the lack of aggregates that would actually use it meaningfully.

David J.



Re: Ordering behavior for aggregates

From
Tom Lane
Date:
"David G. Johnston" <david.g.johnston@gmail.com> writes:
> I'm more keen on the idea of having the system understand when an ORDER BY
> is missing - that seems like what users are more likely to actually do.

That side of it could perhaps be useful, but not if it's an unintelligent
analysis.  If someone has a perfectly safe query written according to
the old-school method:

    SELECT string_agg(...) FROM (SELECT ... ORDER BY ...) ss;

they are not going to be too pleased with a nanny-ish warning (much
less an error) saying that the aggregate's input ordering is
underspecified.

I also wonder whether we'd accept any ORDER BY whatsoever, or try
to require one that produces a sufficiently-unique input ordering.

            regards, tom lane



Re: Ordering behavior for aggregates

From
Vik Fearing
Date:
On 12/13/22 18:22, Tom Lane wrote:
> "David G. Johnston" <david.g.johnston@gmail.com> writes:
>> I'm more keen on the idea of having the system understand when an ORDER BY
>> is missing - that seems like what users are more likely to actually do.
> 
> That side of it could perhaps be useful, but not if it's an unintelligent
> analysis.  If someone has a perfectly safe query written according to
> the old-school method:
> 
>     SELECT string_agg(...) FROM (SELECT ... ORDER BY ...) ss;
> 
> they are not going to be too pleased with a nanny-ish warning (much
> less an error) saying that the aggregate's input ordering is
> underspecified.

That is a good point

> I also wonder whether we'd accept any ORDER BY whatsoever, or try
> to require one that produces a sufficiently-unique input ordering.

I would accept anything.  agg(x order by y) is a common thing.

-- 
Vik Fearing