Thread: [PoC] Implementation of distinct in Window Aggregates

[PoC] Implementation of distinct in Window Aggregates

From

Ankit Pandey

Date:

24 December 2022, 15:52:03

Hi,

This is a PoC patch which implements distinct operation in window aggregates (without order by and for single column aggregation, final version may vary wrt these limitations). Purpose of this PoC is to get feedback on the approach used and corresponding implementation, any nitpicking as deemed reasonable.

Distinct operation is mirrored from implementation in nodeAgg. Existing partitioning logic determines if row is in partition and when distinct is required, all tuples for the aggregate column are stored in tuplesort. When finalize_windowaggregate gets called, tuples are sorted and duplicates are removed, followed by calling the transition function on each tuple.

When distinct is not required, the above process is skipped and the transition function gets called directly and nothing gets inserted into tuplesort.

Note: For each partition, in tuplesort_begin and tuplesort_end is involved to rinse tuplesort, so at any time, max tuples in tuplesort is equal to tuples in a particular partition.

I have verified it for interger and interval column aggregates (to rule out obvious issues related to data types).

Sample cases:

create table mytable(id int, name text);

insert into mytable values(1, 'A');

insert into mytable values(5, 'B');

insert into mytable values(3, 'A');

insert into mytable values(1, 'A');

select avg(distinct id) over (partition by name) from mytable;

avg
--------------------

2.0000000000000000

5.0000000000000000

select avg(id) over (partition by name) from mytable;

avg
--------------------
1.5000000000000000
1.5000000000000000
1.5000000000000000
1.5000000000000000
5.0000000000000000

select avg(distinct id) over () from mytable;
avg
--------------------
3.0000000000000000
3.0000000000000000
3.0000000000000000
3.0000000000000000
3.0000000000000000

select avg(distinct id) from mytable;
avg
--------------------
3.0000000000000000

This is my first-time contribution. Please let me know if anything can be

improved as I`m eager to learn.

Regards,

Ankit Kumar Pandey

Attachment

v1-0001-Implement-distinct-in-Window-Aggregates.patch

Re: [PoC] Implementation of distinct in Window Aggregates

From

Ankit Kumar Pandey

Date:

29 December 2022, 18:28:58

On 24/12/22 18:22, Ankit Pandey wrote:
> Hi,
>
> This is a PoC patch which implements distinct operation in window 
> aggregates (without order by and for single column aggregation, final 
> version may vary wrt these limitations). Purpose of this PoC is to get 
> feedback on the approach used and corresponding implementation, any 
> nitpicking as deemed reasonable.
>
> Distinct operation is mirrored from implementation in nodeAgg. 
> Existing partitioning logic determines if row is in partition and when 
> distinct is required, all tuples for the aggregate column are stored 
> in tuplesort. When finalize_windowaggregate gets called, tuples are 
> sorted and duplicates are removed, followed by calling the transition 
> function on each tuple.
> When distinct is not required, the above process is skipped and the 
> transition function gets called directly and nothing gets inserted 
> into tuplesort.
> Note: For each partition, in tuplesort_begin and tuplesort_end is 
> involved to rinse tuplesort, so at any time, max tuples in tuplesort 
> is equal to tuples in a particular partition.
>
> I have verified it for interger and interval column aggregates (to 
> rule out obvious issues related to data types).
>
> Sample cases:
>
> create table mytable(id int, name text);
> insert into mytable values(1, 'A');
> insert into mytable values(1, 'A');
> insert into mytable values(5, 'B');
> insert into mytable values(3, 'A');
> insert into mytable values(1, 'A');
>
> select avg(distinct id) over (partition by name) from mytable;
>         avg
> --------------------
> 2.0000000000000000
> 2.0000000000000000
> 2.0000000000000000
> 2.0000000000000000
> 5.0000000000000000
>
> select avg(id) over (partition by name) from mytable;
>         avg
> --------------------
>  1.5000000000000000
>  1.5000000000000000
>  1.5000000000000000
>  1.5000000000000000
>  5.0000000000000000
>
> select avg(distinct id) over () from mytable;
>         avg
> --------------------
>  3.0000000000000000
>  3.0000000000000000
>  3.0000000000000000
>  3.0000000000000000
>  3.0000000000000000
>
> select avg(distinct id)  from mytable;
>         avg
> --------------------
>  3.0000000000000000
>
> This is my first-time contribution. Please let me know if anything can be
> improved as I`m eager to learn.
>
> Regards,
> Ankit Kumar Pandey

Hi all,

I know everyone is busy with holidays (well, Happy Holidays!) but I will 
be glad if someone can take a quick look at this PoC and share thoughts.

This is my first time contribution so I am pretty sure there will be 
some very obvious feedbacks (which will help me to move forward with 
this change).


-- 
Regards,
Ankit Kumar Pandey

Re: [PoC] Implementation of distinct in Window Aggregates

From

Ankit Kumar Pandey

Date:

04 January 2023, 15:40:32

On 29/12/22 20:58, Ankit Kumar Pandey wrote:
>
> On 24/12/22 18:22, Ankit Pandey wrote:
>> Hi,
>>
>> This is a PoC patch which implements distinct operation in window 
>> aggregates (without order by and for single column aggregation, final 
>> version may vary wrt these limitations). Purpose of this PoC is to 
>> get feedback on the approach used and corresponding implementation, 
>> any nitpicking as deemed reasonable.
>>
>> Distinct operation is mirrored from implementation in nodeAgg. 
>> Existing partitioning logic determines if row is in partition and 
>> when distinct is required, all tuples for the aggregate column are 
>> stored in tuplesort. When finalize_windowaggregate gets called, 
>> tuples are sorted and duplicates are removed, followed by calling the 
>> transition function on each tuple.
>> When distinct is not required, the above process is skipped and the 
>> transition function gets called directly and nothing gets inserted 
>> into tuplesort.
>> Note: For each partition, in tuplesort_begin and tuplesort_end is 
>> involved to rinse tuplesort, so at any time, max tuples in tuplesort 
>> is equal to tuples in a particular partition.
>>
>> I have verified it for interger and interval column aggregates (to 
>> rule out obvious issues related to data types).
>>
>> Sample cases:
>>
>> create table mytable(id int, name text);
>> insert into mytable values(1, 'A');
>> insert into mytable values(1, 'A');
>> insert into mytable values(5, 'B');
>> insert into mytable values(3, 'A');
>> insert into mytable values(1, 'A');
>>
>> select avg(distinct id) over (partition by name) from mytable;
>>         avg
>> --------------------
>> 2.0000000000000000
>> 2.0000000000000000
>> 2.0000000000000000
>> 2.0000000000000000
>> 5.0000000000000000
>>
>> select avg(id) over (partition by name) from mytable;
>>         avg
>> --------------------
>>  1.5000000000000000
>>  1.5000000000000000
>>  1.5000000000000000
>>  1.5000000000000000
>>  5.0000000000000000
>>
>> select avg(distinct id) over () from mytable;
>>         avg
>> --------------------
>>  3.0000000000000000
>>  3.0000000000000000
>>  3.0000000000000000
>>  3.0000000000000000
>>  3.0000000000000000
>>
>> select avg(distinct id)  from mytable;
>>         avg
>> --------------------
>>  3.0000000000000000
>>
>> This is my first-time contribution. Please let me know if anything 
>> can be
>> improved as I`m eager to learn.
>>
>> Regards,
>> Ankit Kumar Pandey
>
> Hi all,
>
> I know everyone is busy with holidays (well, Happy Holidays!) but I 
> will be glad if someone can take a quick look at this PoC and share 
> thoughts.
>
> This is my first time contribution so I am pretty sure there will be 
> some very obvious feedbacks (which will help me to move forward with 
> this change).
>
>
Updated patch with latest master. Last patch was an year old.

-- 
Regards,
Ankit Kumar Pandey

Attachment

v1-0002-Implement-distinct-in-Window-Aggregates.patch

Re: [PoC] Implementation of distinct in Window Aggregates

From

Ankit Kumar Pandey

Date:

12 March 2023, 10:25:48

On 04/01/23 18:10, Ankit Kumar Pandey wrote:
> On 29/12/22 20:58, Ankit Kumar Pandey wrote:
> >
> > On 24/12/22 18:22, Ankit Pandey wrote:
> >> Hi,
> >>
> >> This is a PoC patch which implements distinct operation in window 
> >> aggregates (without order by and for single column aggregation, final 
> >> version may vary wrt these limitations). Purpose of this PoC is to 
> >> get feedback on the approach used and corresponding implementation, 
> >> any nitpicking as deemed reasonable.
> >>
> >> Distinct operation is mirrored from implementation in nodeAgg. 
> >> Existing partitioning logic determines if row is in partition and 
> >> when distinct is required, all tuples for the aggregate column are 
> >> stored in tuplesort. When finalize_windowaggregate gets called, 
> >> tuples are sorted and duplicates are removed, followed by calling the 
> >> transition function on each tuple.
> >> When distinct is not required, the above process is skipped and the 
> >> transition function gets called directly and nothing gets inserted 
> >> into tuplesort.
> >> Note: For each partition, in tuplesort_begin and tuplesort_end is 
> >> involved to rinse tuplesort, so at any time, max tuples in tuplesort 
> >> is equal to tuples in a particular partition.
> >>
> >> I have verified it for interger and interval column aggregates (to 
> >> rule out obvious issues related to data types).
> >>
> >> Sample cases:
> >>
> >> create table mytable(id int, name text);
> >> insert into mytable values(1, 'A');
> >> insert into mytable values(1, 'A');
> >> insert into mytable values(5, 'B');
> >> insert into mytable values(3, 'A');
> >> insert into mytable values(1, 'A');
> >>
> >> select avg(distinct id) over (partition by name) from mytable;
> >>         avg
> >> --------------------
> >> 2.0000000000000000
> >> 2.0000000000000000
> >> 2.0000000000000000
> >> 2.0000000000000000
> >> 5.0000000000000000
> >>
> >> select avg(id) over (partition by name) from mytable;
> >>         avg
> >> --------------------
> >>  1.5000000000000000
> >>  1.5000000000000000
> >>  1.5000000000000000
> >>  1.5000000000000000
> >>  5.0000000000000000
> >>
> >> select avg(distinct id) over () from mytable;
> >>         avg
> >> --------------------
> >>  3.0000000000000000
> >>  3.0000000000000000
> >>  3.0000000000000000
> >>  3.0000000000000000
> >>  3.0000000000000000
> >>
> >> select avg(distinct id)  from mytable;
> >>         avg
> >> --------------------
> >>  3.0000000000000000
> >>
> >> This is my first-time contribution. Please let me know if anything 
> >> can be
> >> improved as I`m eager to learn.
> >>
> >> Regards,
> >> Ankit Kumar Pandey
> >
> > Hi all,
> >
> > I know everyone is busy with holidays (well, Happy Holidays!) but I 
> > will be glad if someone can take a quick look at this PoC and share 
> > thoughts.
> >
> > This is my first time contribution so I am pretty sure there will be 
> > some very obvious feedbacks (which will help me to move forward with 
> > this change).
> >
> >
> Updated patch with latest master. Last patch was an year old.
>
Attaching patch with rebase from latest HEAD


Thanks,

Ankit

Attachment

distinct_windows.patch

Re: [PoC] Implementation of distinct in Window Aggregates

From

Ankit Kumar Pandey

Date:

12 March 2023, 11:17:50

Attaching updated patch with a fix for an issue in window function.

I have also fixed naming convention of patch as last patch had 
incompatible name.

Note:

1. Pending: Investigation of test cases failures.


Regards,

Ankit

Attachment

v1-0003-Implement-distinct-in-Window-Aggregates.patch

Re: [PoC] Implementation of distinct in Window Aggregates

From

Andreas Karlsson

Date:

11 July 2023, 02:06:07

On 3/12/23 09:17, Ankit Kumar Pandey wrote:
> Attaching updated patch with a fix for an issue in window function.
> 
> I have also fixed naming convention of patch as last patch had 
> incompatible name.

Hi,

This patch does not apply to master. Could you rebase it and submit it 
as one patch which applies directly to master? Maybe I am wrong but the 
latest version looks like it only applies on top of one of your previous 
patches which makes it hard for the reviewer.

Andreas

Re: [PoC] Implementation of distinct in Window Aggregates

From

Daniel Gustafsson

Date:

03 August 2023, 23:35:11

> On 11 Jul 2023, at 01:06, Andreas Karlsson <andreas@proxel.se> wrote:
>
> On 3/12/23 09:17, Ankit Kumar Pandey wrote:
>> Attaching updated patch with a fix for an issue in window function.
>> I have also fixed naming convention of patch as last patch had incompatible name.
>
> Hi,
>
> This patch does not apply to master. Could you rebase it and submit it as one patch which applies directly to master?
MaybeI am wrong but the latest version looks like it only applies on top of one of your previous patches which makes it
hardfor the reviewer. 

Since no update was posted, the patch was considered a PoC and the thread has
stalled, I will mark this returned with feedback.  Please feel free to reopen a
new CF entry when there is a new patch available which addresses Andreas'
feedback on patch structure.

--
Daniel Gustafsson