Re: Questions regarding distinct operation implementation - Mailing list pgsql-hackers
From | Ankit Kumar Pandey |
---|---|
Subject | Re: Questions regarding distinct operation implementation |
Date | |
Msg-id | 8216e6e8-568d-f8c6-7f0e-3f698b10d855@gmail.com Whole thread Raw |
In response to | Re: Questions regarding distinct operation implementation (David Rowley <dgrowleyml@gmail.com>) |
List | pgsql-hackers |
On 02/12/22 03:07, David Rowley wrote: > On Fri, 2 Dec 2022 at 08:10, Ankit Kumar Pandey <itsankitkp@gmail.com> wrote: >> select avg(distinct id) over (partition by name) from mytable (in oracle db) yields: >> 2 >> 2 >> 2 >> 2 >> 10 >> >> From this, it is seen distinct is taken across the all rows in the partition. > Due to the lack of ORDER BY clause, all rows in the partition are in > the window frame at once. The question is, what *should* happen if > you add an ORDER BY. > > Looking at the copy of the standard that I have, I see nothing > explicitly mentioned about aggregates with DISTINCT used as window > functions, however, I do see in the Window Function section: > > "The window aggregate functions compute an <aggregate function> > (COUNT, SUM, AVG, etc.), the same as > a group aggregate function, except that the computation aggregates > over the window frame of a row rather than > over a group of a grouped table. The hypothetical set functions are > not permitted as window aggregate functions." > > So you could deduce that the DISTINCT would also need to be applied > over the frame too. > > The question is, what do you want to make work? If you're not worried > about supporting DISTINCT when there is an ORDER BY clause and the > frame options are effectively ROWS BETWEEN UNBOUNDED PRECEDING AND > UNBOUNDED FOLLOWING, then it's going to be much easier to make work. > You never need to worry about rows dropping out of visibility in the > frame. Simply all rows in the partition are in the frame. > > You do need to be careful as, if I remember correctly, we do support > some non-standard things here. I believe the standard requires an > ORDER BY when specifying frame options. I think we didn't see any > technical reason to apply that limitation, so didn't. That means in > Postgres, you can do things like: > > select avg(id) over (partition by name ROWS BETWEEN CURRENT ROW AND 3 > FOLLOWING) from mytable; > > but that's unlikely to work on most other databases without adding an ORDER BY. > > So if you are going to limit this to only being supported without an > ORDER BY, then you'll need to ensure that the specified frame options > don't cause your code to break. I'm unsure, but this might be a case > of checking for FRAMEOPTION_NONDEFAULT unless it's > FRAMEOPTION_START_UNBOUNDED_PRECEDING|FRAMEOPTION_END_UNBOUNDED_FOLLOWING. > You'll need to study that a bit more than I just did though. > > One way to make that work might be to add code to > eval_windowaggregates() around the call to advance_windowaggregate(), > you can see the row being aggregated is set by: > > winstate->tmpcontext->ecxt_outertuple = agg_row_slot; > > what you'd need to do here is change the code so that you put all the > rows to aggregate into a tuplesort then sort them by the distinct > column and instead, feed the tuplesort rows to > advance_windowaggregate(). You'd need to add code similar to what is > in process_ordered_aggregate_single() in nodeAgg.c to have the > duplicate consecutive rows skipped. > > Just a word of warning on this. This is a hugely complex area of > Postgres. If I was you, I'd make sure and spend quite a bit of time > reading nodeWindowAgg.c and likely much of nodeAgg.c. Any changes we > accept in that area are going to have to be very carefully done. Make > sure you're comfortable with the code before doing too much. It would > be very easy to end up with a giant mess if you try to do this without > fully understanding the implications of your changes. Also, you'll > need to show you've not regressed the performance of the existing > features with the code you've added. > > Good luck! > > David Thanks a lot David, this is of an immense help. I will go through mentioned pointers, biggest being that this is complex piece and will take its due course. Thanks again -- Regards, Ankit Kumar Pandey
pgsql-hackers by date: