Re: Questions regarding distinct operation implementation - Mailing list pgsql-hackers
From | David Rowley |
---|---|
Subject | Re: Questions regarding distinct operation implementation |
Date | |
Msg-id | CAApHDvrgi-J_3ATq9JhFm_LvJCVMH9MXRjFHEpy6Setue-XE=A@mail.gmail.com Whole thread Raw |
In response to | Re: Questions regarding distinct operation implementation (Ankit Kumar Pandey <itsankitkp@gmail.com>) |
Responses |
Re: Questions regarding distinct operation implementation
Re: Questions regarding distinct operation implementation |
List | pgsql-hackers |
On Fri, 2 Dec 2022 at 08:10, Ankit Kumar Pandey <itsankitkp@gmail.com> wrote: > select avg(distinct id) over (partition by name) from mytable (in oracle db) yields: > 2 > 2 > 2 > 2 > 10 > > From this, it is seen distinct is taken across the all rows in the partition. Due to the lack of ORDER BY clause, all rows in the partition are in the window frame at once. The question is, what *should* happen if you add an ORDER BY. Looking at the copy of the standard that I have, I see nothing explicitly mentioned about aggregates with DISTINCT used as window functions, however, I do see in the Window Function section: "The window aggregate functions compute an <aggregate function> (COUNT, SUM, AVG, etc.), the same as a group aggregate function, except that the computation aggregates over the window frame of a row rather than over a group of a grouped table. The hypothetical set functions are not permitted as window aggregate functions." So you could deduce that the DISTINCT would also need to be applied over the frame too. The question is, what do you want to make work? If you're not worried about supporting DISTINCT when there is an ORDER BY clause and the frame options are effectively ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING, then it's going to be much easier to make work. You never need to worry about rows dropping out of visibility in the frame. Simply all rows in the partition are in the frame. You do need to be careful as, if I remember correctly, we do support some non-standard things here. I believe the standard requires an ORDER BY when specifying frame options. I think we didn't see any technical reason to apply that limitation, so didn't. That means in Postgres, you can do things like: select avg(id) over (partition by name ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING) from mytable; but that's unlikely to work on most other databases without adding an ORDER BY. So if you are going to limit this to only being supported without an ORDER BY, then you'll need to ensure that the specified frame options don't cause your code to break. I'm unsure, but this might be a case of checking for FRAMEOPTION_NONDEFAULT unless it's FRAMEOPTION_START_UNBOUNDED_PRECEDING|FRAMEOPTION_END_UNBOUNDED_FOLLOWING. You'll need to study that a bit more than I just did though. One way to make that work might be to add code to eval_windowaggregates() around the call to advance_windowaggregate(), you can see the row being aggregated is set by: winstate->tmpcontext->ecxt_outertuple = agg_row_slot; what you'd need to do here is change the code so that you put all the rows to aggregate into a tuplesort then sort them by the distinct column and instead, feed the tuplesort rows to advance_windowaggregate(). You'd need to add code similar to what is in process_ordered_aggregate_single() in nodeAgg.c to have the duplicate consecutive rows skipped. Just a word of warning on this. This is a hugely complex area of Postgres. If I was you, I'd make sure and spend quite a bit of time reading nodeWindowAgg.c and likely much of nodeAgg.c. Any changes we accept in that area are going to have to be very carefully done. Make sure you're comfortable with the code before doing too much. It would be very easy to end up with a giant mess if you try to do this without fully understanding the implications of your changes. Also, you'll need to show you've not regressed the performance of the existing features with the code you've added. Good luck! David
pgsql-hackers by date: