Re: Questions regarding distinct operation implementation - Mailing list pgsql-hackers

From Ankit Kumar Pandey
Subject Re: Questions regarding distinct operation implementation
Date
Msg-id 8216e6e8-568d-f8c6-7f0e-3f698b10d855@gmail.com
Whole thread Raw
In response to Re: Questions regarding distinct operation implementation  (David Rowley <dgrowleyml@gmail.com>)
List pgsql-hackers
On 02/12/22 03:07, David Rowley wrote:
> On Fri, 2 Dec 2022 at 08:10, Ankit Kumar Pandey <itsankitkp@gmail.com> wrote:
>> select avg(distinct id) over (partition by name) from mytable (in oracle db) yields:
>> 2
>> 2
>> 2
>> 2
>> 10
>>
>>  From this, it is seen distinct is taken across the all rows in the partition.
> Due to the lack of ORDER BY clause, all rows in the partition are in
> the window frame at once.  The question is, what *should* happen if
> you add an ORDER BY.
>
> Looking at the copy of the standard that I have, I see nothing
> explicitly mentioned about aggregates with DISTINCT used as window
> functions, however, I do see in the Window Function section:
>
> "The window aggregate functions compute an <aggregate function>
> (COUNT, SUM, AVG, etc.), the same as
> a group aggregate function, except that the computation aggregates
> over the window frame of a row rather than
> over a group of a grouped table. The hypothetical set functions are
> not permitted as window aggregate functions."
>
> So you could deduce that the DISTINCT would also need to be applied
> over the frame too.
>
> The question is, what do you want to make work?  If you're not worried
> about supporting DISTINCT when there is an ORDER BY clause and the
> frame options are effectively ROWS BETWEEN UNBOUNDED PRECEDING AND
> UNBOUNDED FOLLOWING, then it's going to be much easier to make work.
> You never need to worry about rows dropping out of visibility in the
> frame. Simply all rows in the partition are in the frame.
>
> You do need to be careful as, if I remember correctly, we do support
> some non-standard things here. I believe the standard requires an
> ORDER BY when specifying frame options. I think we didn't see any
> technical reason to apply that limitation, so didn't.  That means in
> Postgres, you can do things like:
>
> select avg(id) over (partition by name ROWS BETWEEN CURRENT ROW AND 3
> FOLLOWING) from mytable;
>
> but that's unlikely to work on most other databases without adding an ORDER BY.
>
> So if you are going to limit this to only being supported without an
> ORDER BY, then you'll need to ensure that the specified frame options
> don't cause your code to break.  I'm unsure, but this might be a case
> of checking for FRAMEOPTION_NONDEFAULT unless it's
> FRAMEOPTION_START_UNBOUNDED_PRECEDING|FRAMEOPTION_END_UNBOUNDED_FOLLOWING.
> You'll need to study that a bit more than I just did though.
>
> One way to make that work might be to add code to
> eval_windowaggregates() around the call to advance_windowaggregate(),
> you can see the row being aggregated is set by:
>
> winstate->tmpcontext->ecxt_outertuple = agg_row_slot;
>
> what you'd need to do here is change the code so that you put all the
> rows to aggregate into a tuplesort then sort them by the distinct
> column and instead, feed the tuplesort rows to
> advance_windowaggregate(). You'd need to add code similar to what is
> in process_ordered_aggregate_single() in nodeAgg.c to have the
> duplicate consecutive rows skipped.
>
> Just a word of warning on this. This is a hugely complex area of
> Postgres. If I was you, I'd make sure and spend quite a bit of time
> reading nodeWindowAgg.c and likely much of nodeAgg.c. Any changes we
> accept in that area are going to have to be very carefully done. Make
> sure you're comfortable with the code before doing too much. It would
> be very easy to end up with a giant mess if you try to do this without
> fully understanding the implications of your changes.  Also, you'll
> need to show you've not regressed the performance of the existing
> features with the code you've added.
>
> Good luck!
>
> David

Thanks a lot David, this is of an immense help. I will go through 
mentioned pointers, biggest being that this is complex piece and will 
take its due course.

Thanks again

-- 
Regards,
Ankit Kumar Pandey




pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Removing another gen_node_support.pl special case
Next
From: Ankit Kumar Pandey
Date:
Subject: Re: Questions regarding distinct operation implementation