Thread: Idea: PostgreSQL equivalent to Oracle's KEEP clause

Idea: PostgreSQL equivalent to Oracle's KEEP clause

From
Ben Clements
Date:
I have an idea/request for enhancement to PostgreSQL (I'm new to PostgreSQL and this mailing list).
------------------------------------------------
Idea:

There's a technique in Oracle SQL that can be used to simplify aggregation queries:
Aggregate on a particular column, but get information from a different column, using a simple calculated column in the SELECT list.
 
--Oracle
--For a given country, what city has the highest population? (where the country has more than one city)
--Include the city name as a column.
select
   country,
   count(*),
   max(population),
   max(city) keep (dense_rank first order by population desc)
from
   cities
group by
   country
having
   count(*) > 1
 
As shown above, the following calculated column can bring in the city name, even though the city name isn't in the GROUP BY:
   max(city) keep (dense_rank first order by population desc)
 
There are a number of ways to achieve that kind of thing using PostgreSQL. I want a solution that lets me do it in a calculated column -- all within a single SELECT query (no subqueries, joins, WITH, etc.).
 
Could that functionality be added to PostgreSQL?
 
Related:
Thanks,

-Ben

Re: Idea: PostgreSQL equivalent to Oracle's KEEP clause

From
Tom Lane
Date:
Ben Clements <benhasgonewalking@gmail.com> writes:
> As shown above, the following calculated column can bring in the city name,
> even though the city name isn't in the GROUP BY:
>    max(city) keep (dense_rank first order by population desc)

You haven't really explained what this does, let alone why it can't
be implemented with existing features such as FILTER and ORDER BY.

            regards, tom lane



Re: Idea: PostgreSQL equivalent to Oracle's KEEP clause

From
David Rowley
Date:
On Tue, 7 Mar 2023 at 12:40, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Ben Clements <benhasgonewalking@gmail.com> writes:
> > As shown above, the following calculated column can bring in the city name,
> > even though the city name isn't in the GROUP BY:
> >    max(city) keep (dense_rank first order by population desc)
>
> You haven't really explained what this does, let alone why it can't
> be implemented with existing features such as FILTER and ORDER BY.

(It wasn't clear to me until I watched the youtube video.)

Likely KEEP is more flexible than just the given example but I think
that something similar to the example given could be done by inventing
a TOP() and BOTTOM() aggregate. Then you could write something like:

select
   country,
   count(*),
   max(population),
   bottom(city, population)
from
   cities
group by
   country
having
   count(*) > 1

the transfn for bottom() would need to remember the city and the
population for the highest yet seen value of the 2nd arg.  The
combinefn would need to find the aggregate state with the highest 2nd
arg value, the finalfn would just spit out the column that's stored in
the state.  Where this wouldn't work would be if multiple columns were
required to tiebreak the sort.

You could at least parallelize the aggregation this way. If there were
to be some form of ORDER BY in the aggregate then no parallelization
would be possible.  I'd assume since the whole thing can be done with
a subquery that the entire point of having special syntax for this
would be because we don't want to pay the price of looking at the
table twice, i.e. performance must matter, so the ability to have
parallel aggregates here seems good.

I can't quite think of a way to have parallel query and an arbitrarily
long list of columns to sort on...

For Ben, we do tend to shy away from copying other RDBMS's extensions
to the SQL language.  The problem is that copying these can cause
problems in the future if/when the standard adopts that syntax with
variations or invents something else that conflicts with the grammar
that we've added.  One example of something we didn't do was Oracle's
CONNECT BY.  Eventually, the SQL standard got WITH RECURSIVE to allow
queries on hierarchical data. Of course, we do have many of our own
extensions to the standard, so we certainly do make exceptions
sometimes. So, don't be too surprised that there's some discussion of
other methods which might make this work which don't involve copying
what someone else has done.

David



Re: Idea: PostgreSQL equivalent to Oracle's KEEP clause

From
"David G. Johnston"
Date:
On Mon, Mar 6, 2023 at 7:51 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Tue, 7 Mar 2023 at 12:40, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Ben Clements <benhasgonewalking@gmail.com> writes:
> > As shown above, the following calculated column can bring in the city name,
> > even though the city name isn't in the GROUP BY:
> >    max(city) keep (dense_rank first order by population desc)
>
> You haven't really explained what this does, let alone why it can't
> be implemented with existing features such as FILTER and ORDER BY.

(It wasn't clear to me until I watched the youtube video.) 

Likely KEEP is more flexible than just the given example but I think
that something similar to the example given could be done by inventing
a TOP() and BOTTOM() aggregate. Then you could write something like:

select
   country,
   count(*),
   max(population),
   bottom(city, population)
from
   cities
group by
   country
having
   count(*) > 1

the transfn for bottom() would need to remember the city and the
population for the highest yet seen value of the 2nd arg.

BOTTOM() remembers the highest value?
 
Where this wouldn't work would be if multiple columns were
required to tiebreak the sort.

TOP(city, ROW(population, land_area)) ?

I'd assume since the whole thing can be done with
a subquery that the entire point of having special syntax for this
would be because we don't want to pay the price of looking at the
table twice, i.e. performance must matter, so the ability to have
parallel aggregates here seems good.

SELECT country, city,
  rank() over (partition by country order by population desc), 
  count() OVER (partition by country)
FROM cities
WINDOW_HAVING count > 0 AND rank = 1;

That would be, IMO, the idiomatic query form to perform ranking - not abusing GROUP BY.  To add this encourages abusing GROUP BY.

Though I suppose if there is a sufficient performance gain to be had under GROUP BY the effort might make sense if further improvements to window function processing cannot be found.

David J.

Re: Idea: PostgreSQL equivalent to Oracle's KEEP clause

From
David Rowley
Date:
On Tue, 7 Mar 2023 at 16:11, David G. Johnston
<david.g.johnston@gmail.com> wrote:
>
> On Mon, Mar 6, 2023 at 7:51 PM David Rowley <dgrowleyml@gmail.com> wrote:
>> the transfn for bottom() would need to remember the city and the
>> population for the highest yet seen value of the 2nd arg.
>
>
> BOTTOM() remembers the highest value?

I was thinking in terms of a window with all values sorted in
ascending order. Maybe your mental modal differs from mine.  If Ben
wants to implement some new aggregate functions in an extension, then
he might think of better names.

> SELECT country, city,
>   rank() over (partition by country order by population desc),
>   count() OVER (partition by country)
> FROM cities
> WINDOW_HAVING count > 0 AND rank = 1;
>
> That would be, IMO, the idiomatic query form to perform ranking - not abusing GROUP BY.  To add this encourages
abusingGROUP BY. 
>
> Though I suppose if there is a sufficient performance gain to be had under GROUP BY the effort might make sense if
furtherimprovements to window function processing cannot be found. 

Ideally, we'd be able to just sort the top-1 value and not the entire
window by population desc.  Maybe SupportRequestWFuncMonotonic could
be extended to instruct WindowAgg to do that for certain functions.
Greg was talking about something like this in [1]. Likely that would
be easier for row_number() since any number of rows could have
rank==1.

David

[1] https://postgr.es/m/CAM-w4HN7D1wgTnKqUEnjie=E_6kJRC08CuGTLQgSirFPo3kY6A@mail.gmail.com



Re: Idea: PostgreSQL equivalent to Oracle's KEEP clause

From
Ben Clements
Date:
Thanks David.

Similar to your "TOP() and BOTTOM() aggregate" idea, you might find Erwin Brandstetter's solution using the LAST() aggregate function interesting: (https://dba.stackexchange.com/a/324646/100880)

If the FIRST_LAST_AGG extension is installed, then we can do something like this:
SELECT country     , count(*) AS ct_cities     , max(population) AS highest_population     , last(city ORDER BY population, city) AS biggest_city  -- !
FROM   cities
GROUP  BY country
HAVING count(*) > 1;

-Ben

On Mon, Mar 6, 2023 at 9:51 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Tue, 7 Mar 2023 at 12:40, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Ben Clements <benhasgonewalking@gmail.com> writes:
> > As shown above, the following calculated column can bring in the city name,
> > even though the city name isn't in the GROUP BY:
> >    max(city) keep (dense_rank first order by population desc)
>
> You haven't really explained what this does, let alone why it can't
> be implemented with existing features such as FILTER and ORDER BY.

(It wasn't clear to me until I watched the youtube video.)

Likely KEEP is more flexible than just the given example but I think
that something similar to the example given could be done by inventing
a TOP() and BOTTOM() aggregate. Then you could write something like:

select
   country,
   count(*),
   max(population),
   bottom(city, population)
from
   cities
group by
   country
having
   count(*) > 1

the transfn for bottom() would need to remember the city and the
population for the highest yet seen value of the 2nd arg.  The
combinefn would need to find the aggregate state with the highest 2nd
arg value, the finalfn would just spit out the column that's stored in
the state.  Where this wouldn't work would be if multiple columns were
required to tiebreak the sort.

You could at least parallelize the aggregation this way. If there were
to be some form of ORDER BY in the aggregate then no parallelization
would be possible.  I'd assume since the whole thing can be done with
a subquery that the entire point of having special syntax for this
would be because we don't want to pay the price of looking at the
table twice, i.e. performance must matter, so the ability to have
parallel aggregates here seems good.

I can't quite think of a way to have parallel query and an arbitrarily
long list of columns to sort on...

For Ben, we do tend to shy away from copying other RDBMS's extensions
to the SQL language.  The problem is that copying these can cause
problems in the future if/when the standard adopts that syntax with
variations or invents something else that conflicts with the grammar
that we've added.  One example of something we didn't do was Oracle's
CONNECT BY.  Eventually, the SQL standard got WITH RECURSIVE to allow
queries on hierarchical data. Of course, we do have many of our own
extensions to the standard, so we certainly do make exceptions
sometimes. So, don't be too surprised that there's some discussion of
other methods which might make this work which don't involve copying
what someone else has done.

David

Re: Idea: PostgreSQL equivalent to Oracle's KEEP clause

From
Alban Hertroys
Date:
> On 7 Mar 2023, at 4:11, David G. Johnston <david.g.johnston@gmail.com> wrote:
>
> On Mon, Mar 6, 2023 at 7:51 PM David Rowley <dgrowleyml@gmail.com> wrote:
> On Tue, 7 Mar 2023 at 12:40, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > Ben Clements <benhasgonewalking@gmail.com> writes:
> > > As shown above, the following calculated column can bring in the city name,
> > > even though the city name isn't in the GROUP BY:
> > >    max(city) keep (dense_rank first order by population desc)
> >
> > You haven't really explained what this does, let alone why it can't
> > be implemented with existing features such as FILTER and ORDER BY.
>
> (It wasn't clear to me until I watched the youtube video.)
>
> Likely KEEP is more flexible than just the given example but I think
> that something similar to the example given could be done by inventing
> a TOP() and BOTTOM() aggregate. Then you could write something like:
>
> select
>    country,
>    count(*),
>    max(population),
>    bottom(city, population)
> from
>    cities
> group by
>    country
> having
>    count(*) > 1
>
> the transfn for bottom() would need to remember the city and the
> population for the highest yet seen value of the 2nd arg.
>
> BOTTOM() remembers the highest value?
>
> Where this wouldn't work would be if multiple columns were
> required to tiebreak the sort.
>
> TOP(city, ROW(population, land_area)) ?

What should be the expected behaviour on a tie though?

Say that we count the number of districts or airfields or train stations per city and query for the one(s) with the
mostor least of them? There could well be multiple cities with the same max number, and there will be many cities with
thesame minimum number (namely 0). 

Should the result be just the first of the maximums (or minimums) through some selection criterium (such as their
alphabeticalorder), should that give each of the tied results, or should there be a means to define that behaviour? 

I suppose a combination with FIRST and LAST could solve that issue?

Regards,
Alban Hertroys
--
There is always an exception to always.







Re: Idea: PostgreSQL equivalent to Oracle's KEEP clause

From
David Rowley
Date:
On Tue, 14 Mar 2023 at 21:01, Alban Hertroys <haramrae@gmail.com> wrote:
> > On 7 Mar 2023, at 4:11, David G. Johnston <david.g.johnston@gmail.com> wrote:
> > TOP(city, ROW(population, land_area)) ?
>
> What should be the expected behaviour on a tie though?

Undefined.  Same as having an ORDER BY on a column that's not unique.
The sort implementation effectively defines the order.  David did
specify the ROW() idea as a means to add additional columns so that
the tiebreak could be done with some other deciding factor.

> Should the result be just the first of the maximums (or minimums) through some selection criterium (such as their
alphabeticalorder), should that give each of the tied results, or should there be a means to define that behaviour?
 

It's an aggregate function. There's only 1 return value per group. If
you didn't want that you'd likely want to use a window function such
as rank() and add an outer query and filter out anything that's not
rank 1.

David



Re: Idea: PostgreSQL equivalent to Oracle's KEEP clause

From
David Rowley
Date:
On Tue, 14 Mar 2023 at 16:07, Ben Clements <benhasgonewalking@gmail.com> wrote:
> Similar to your "TOP() and BOTTOM() aggregate" idea, you might find Erwin Brandstetter's solution using the LAST()
aggregatefunction interesting: (https://dba.stackexchange.com/a/324646/100880)
 

Interesting.  Just note that ORDER BY aggregates cannot be
parallelised and there are no shortcuts to just look for the highest /
lowest ordered row.  All rows in the group must be sorted and the
aggregate will just take the first or last of those once the sort is
done.  The difference there (unless using PG16 and an index provides
presorted input) is that there would be O(N log2 N) comparisons to
perform the sort, where as the TOP() / BOTTOM() idea both allows
parallelism and requires less memory and only requires O(N)
comparisons.

If performance is not too critical row now, then what you've found
looks great.  I just wanted to mention that as it may be a factor that
matters at some point, even if it does not right now.

David