Thread: Questions regarding distinct operation implementation

Questions regarding distinct operation implementation

From
Ankit Kumar Pandey
Date:

Hello,


I have questions regarding the distinct operation and would be glad if someone could help me out.

Consider the following table (mytable):

id, name

1, A

1, A

2, B

3, A

1, A

If we do select avg(id) over (partition by name) from mytable, the partition logic goes like this:

for A: 1, 1, 3, 1

Suppose we want to implement something like select avg(distinct id) over (partition by name) from mytable

and remove duplicates by storing the last datum of the aggregate column (id) and comparing it with the current value. That fails here because the aggregate column is not sorted within the partition.
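To see the failure mode concretely, here is a small Python sketch of the two strategies (illustrative only, not PostgreSQL code), using the name = 'A' partition above:

```python
# Partition values for name = 'A', in storage order (unsorted).
partition = [1, 1, 3, 1]

def distinct_by_last_value(values):
    """Dedup by comparing each value with the previous one.
    Only correct when the input is sorted."""
    result = []
    last = object()  # sentinel that matches nothing
    for v in values:
        if v != last:
            result.append(v)
        last = v
    return result

def distinct_by_set(values):
    """Dedup with a lookup structure; input order does not matter."""
    seen = set()
    result = []
    for v in values:
        if v not in seen:
            seen.add(v)
            result.append(v)
    return result

# On the unsorted partition, last-value comparison keeps the trailing
# duplicate: [1, 3, 1]. The set-based version yields [1, 3], giving
# avg(distinct id) = 2.
```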

Questions:

1. Is sorting a prerequisite for finding distinct values?

2. Is it okay to sort the aggregate column (within the partition) for distinct to work in the case of a window function?

3. Does an alternative way exist to handle this scenario (because here the sort is of no use in the aggregation)?


Thanks


-- 
Regards,
Ankit Kumar Pandey

Re: Questions regarding distinct operation implementation

From
Ankit Kumar Pandey
Date:


On 23/11/22 23:48, Ankit Kumar Pandey wrote:


Hi,

After a little more digging, I can see that aggregation in window functions is of the running type. It would be a bit more effective if a lookup hash table were created into which every value in the current aggregate column gets inserted. Whenever the frame moves ahead, a lookup is performed to check for a duplicate.

From a performance standpoint, this might be a bad idea though.
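A minimal sketch of that idea (hypothetical Python, not executor code): the transition step runs only when the incoming value is absent from the lookup table.

```python
def running_avg_distinct(values):
    """Running avg(DISTINCT ...) as the frame grows one row at a time.
    Returns the aggregate result after each row is consumed."""
    seen = set()
    total = 0
    count = 0
    results = []
    for v in values:
        if v not in seen:   # duplicate check via hash lookup
            seen.add(v)
            total += v      # the "transition" step
            count += 1
        results.append(total / count)
    return results
```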

Please let me know any opinions on this.

-- 
Regards,
Ankit Kumar Pandey

Re: Questions regarding distinct operation implementation

From
David Rowley
Date:
On Fri, 25 Nov 2022 at 06:57, Ankit Kumar Pandey <itsankitkp@gmail.com> wrote:
> Please let me know any opinions on this.

I think if you're planning on working on this then step 1 would have
to be checking the SQL standard to see which set of rows it asks
implementations to consider for duplicate checks when deciding if the
transition should be performed or not.  Having not looked, I don't
know if this is the entire partition or just the rows in the current
frame.

Depending on what you want, an alternative today would be to run a
subquery to uniquify the rows the way you want and then do the window
function stuff.

David



Re: Questions regarding distinct operation implementation

From
Ankit Kumar Pandey
Date:
On 25/11/22 02:14, David Rowley wrote:
> On Fri, 25 Nov 2022 at 06:57, Ankit Kumar Pandey <itsankitkp@gmail.com> wrote:
>> Please let me know any opinions on this.
> I think if you're planning on working on this then step 1 would have
> to be checking the SQL standard to see which set of rows it asks
> implementations to consider for duplicate checks when deciding if the
> transition should be performed or not.  Having not looked, I don't
> know if this is the entire partition or just the rows in the current
> frame.
>
> Depending on what you want, an alternative today would be to run a
> subquery to uniquify the rows the way you want and then do the window
> function stuff.
>
> David
Thanks David, these are excellent pointers. I will look into the SQL
standard first and go from there.

-- 
Regards,
Ankit Kumar Pandey




Re: Questions regarding distinct operation implementation

From
Ankit Kumar Pandey
Date:


On 25/11/22 11:00, Ankit Kumar Pandey wrote:


Hi,

Looking further into it, I am a bit clearer now about the expectations for having distinct in window aggregates (although I couldn't get hold of the SQL standard, as it is not in the public domain, distinct in window aggregates is supported by Oracle, and I am using that as a reference).

For table (mytable):

id, name

1, A

1, A

10, B

3, A

1, A


select avg(distinct id) over (partition by name) from mytable (in oracle db) yields:

2

2

2

2

10


From this, it is seen that distinct is taken across all the rows in the partition.

I also thought of using a subquery approach: select avg(id) over (partition by name) from (select distinct(id), name from mytable)

but this obviously doesn't yield the right answer, because the result should contain the same number of rows as the input. This implies we need to find the partition first and then remove duplicates within the partition.

Can we avoid any ordering/sort until the existing logic finds whether a value is in the frame (so as to respect any order by clause given by the user), and once it is determined that the tuple is in the frame, skip the tuple if it is a duplicate? If the aforementioned approach is right, the question is how we check whether it is a duplicate. Should we create a lookup table (as tuples coming to advance_windowaggregate can arrive in arbitrary order)? Or would some other approach be better?

Any opinion on this will be appreciated.

-- 
Regards,
Ankit Kumar Pandey

Re: Questions regarding distinct operation implementation

From
David Rowley
Date:
On Fri, 2 Dec 2022 at 08:10, Ankit Kumar Pandey <itsankitkp@gmail.com> wrote:
> select avg(distinct id) over (partition by name) from mytable (in oracle db) yields:
> 2
> 2
> 2
> 2
> 10
>
> From this, it is seen distinct is taken across the all rows in the partition.

Due to the lack of ORDER BY clause, all rows in the partition are in
the window frame at once.  The question is, what *should* happen if
you add an ORDER BY.

Looking at the copy of the standard that I have, I see nothing
explicitly mentioned about aggregates with DISTINCT used as window
functions, however, I do see in the Window Function section:

"The window aggregate functions compute an <aggregate function>
(COUNT, SUM, AVG, etc.), the same as
a group aggregate function, except that the computation aggregates
over the window frame of a row rather than
over a group of a grouped table. The hypothetical set functions are
not permitted as window aggregate functions."

So you could deduce that the DISTINCT would also need to be applied
over the frame too.

The question is, what do you want to make work?  If you're not worried
about supporting DISTINCT when there is an ORDER BY clause and the
frame options are effectively ROWS BETWEEN UNBOUNDED PRECEDING AND
UNBOUNDED FOLLOWING, then it's going to be much easier to make work.
You never need to worry about rows dropping out of visibility in the
frame. Simply all rows in the partition are in the frame.

You do need to be careful as, if I remember correctly, we do support
some non-standard things here. I believe the standard requires an
ORDER BY when specifying frame options. I think we didn't see any
technical reason to apply that limitation, so didn't.  That means in
Postgres, you can do things like:

select avg(id) over (partition by name ROWS BETWEEN CURRENT ROW AND 3
FOLLOWING) from mytable;

but that's unlikely to work on most other databases without adding an ORDER BY.

So if you are going to limit this to only being supported without an
ORDER BY, then you'll need to ensure that the specified frame options
don't cause your code to break.  I'm unsure, but this might be a case
of checking for FRAMEOPTION_NONDEFAULT unless it's
FRAMEOPTION_START_UNBOUNDED_PRECEDING|FRAMEOPTION_END_UNBOUNDED_FOLLOWING.
You'll need to study that a bit more than I just did though.
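As a rough illustration of that check (a Python sketch; the flag values below are invented, and the real FRAMEOPTION_* constants and their exact semantics live in PostgreSQL's headers and need the study mentioned above):

```python
# Hypothetical flag values, for illustration only; the real ones are
# bitmask constants in PostgreSQL's parsenodes.h.
FRAMEOPTION_NONDEFAULT = 0x1
FRAMEOPTION_START_UNBOUNDED_PRECEDING = 0x2
FRAMEOPTION_END_UNBOUNDED_FOLLOWING = 0x4

def distinct_supported(frame_options):
    """Allow DISTINCT only when the frame effectively covers the whole
    partition: either the default frame (no ORDER BY means all rows are
    peers) or an explicit UNBOUNDED PRECEDING .. UNBOUNDED FOLLOWING."""
    whole_partition = (FRAMEOPTION_START_UNBOUNDED_PRECEDING
                       | FRAMEOPTION_END_UNBOUNDED_FOLLOWING)
    if not (frame_options & FRAMEOPTION_NONDEFAULT):
        return True
    return (frame_options & whole_partition) == whole_partition
```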

One way to make that work might be to add code to
eval_windowaggregates() around the call to advance_windowaggregate(),
you can see the row being aggregated is set by:

winstate->tmpcontext->ecxt_outertuple = agg_row_slot;

what you'd need to do here is change the code so that you put all the
rows to aggregate into a tuplesort then sort them by the distinct
column and instead, feed the tuplesort rows to
advance_windowaggregate(). You'd need to add code similar to what is
in process_ordered_aggregate_single() in nodeAgg.c to have the
duplicate consecutive rows skipped.
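The sort-then-skip flow can be sketched like this (a Python stand-in for the tuplesort plus the consecutive-duplicate skip done in process_ordered_aggregate_single; all names here are illustrative, not the real executor API):

```python
def agg_distinct_via_sort(rows, transfn, init):
    """Sort the frame's rows by the DISTINCT column, then feed only the
    first of each run of equal values to the transition function."""
    state = init
    prev = None
    have_prev = False
    for v in sorted(rows):           # stand-in for the tuplesort
        if have_prev and v == prev:  # consecutive duplicate: skip
            continue
        state = transfn(state, v)
        prev = v
        have_prev = True
    return state

# avg(DISTINCT id) over the whole partition for name = 'A':
# sorted [1, 1, 1, 3] collapses to [1, 3], so the (sum, count)
# transition state ends up as (4, 2).
```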

Just a word of warning on this. This is a hugely complex area of
Postgres. If I was you, I'd make sure and spend quite a bit of time
reading nodeWindowAgg.c and likely much of nodeAgg.c. Any changes we
accept in that area are going to have to be very carefully done. Make
sure you're comfortable with the code before doing too much. It would
be very easy to end up with a giant mess if you try to do this without
fully understanding the implications of your changes.  Also, you'll
need to show you've not regressed the performance of the existing
features with the code you've added.

Good luck!

David



Re: Questions regarding distinct operation implementation

From
"David G. Johnston"
Date:
On Thu, Dec 1, 2022 at 2:37 PM David Rowley <dgrowleyml@gmail.com> wrote:

The question is, what do you want to make work?  If you're not worried
about supporting DISTINCT when there is an ORDER BY clause and the
frame options are effectively ROWS BETWEEN UNBOUNDED PRECEDING AND
UNBOUNDED FOLLOWING, then it's going to be much easier to make work.
You never need to worry about rows dropping out of visibility in the
frame. Simply all rows in the partition are in the frame.

I would definitely want the ability to have the output ordered and distinct at the same time.

array_agg(distinct col) over (order by whatever)

Conceptually this seems like it can be trivially accomplished with a simple lookup table, the key being the distinct column(s) and the value being a counter - with the entry being removed when the counter goes to zero (decreases happening each time a row goes out of scope).  The main concern, I suspect, isn't implementation ability, it is speed and memory consumption.
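That counter-based lookup table could look like this (a hypothetical Python sketch of just the data structure; spilling to disk and work_mem accounting are ignored):

```python
from collections import defaultdict

class DistinctFrame:
    """Multiset of the values currently in the window frame. A value
    participates in the DISTINCT aggregate while its count > 0."""
    def __init__(self):
        self.counts = defaultdict(int)

    def add(self, v):
        """Row enters the frame; True if v is newly distinct."""
        self.counts[v] += 1
        return self.counts[v] == 1

    def remove(self, v):
        """Row leaves the frame; drop the entry when its count hits 0.
        True means v is no longer present in the frame at all."""
        self.counts[v] -= 1
        if self.counts[v] == 0:
            del self.counts[v]
            return True
        return False
```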

I would expect the distinct output to be identical to the non-distinct output except for duplicates removed.  Using array_agg as an example makes seeing the distinction quite easy.

Thinking over the above a bit more, is something like this possible?

array_agg(distinct col order by col) over (order by whatever)

i.e., can we add order by within the aggregate to control its internal ordering separately from the ordering needed for the window framing?

David J.

Re: Questions regarding distinct operation implementation

From
Ankit Kumar Pandey
Date:


On 02/12/22 00:40, Ankit Kumar Pandey wrote:



Hi,

I am still looking at this but am unable to move ahead, as I am not able to use the prior implementation (normal aggregates) to implement distinct in window functions, because the two differ in design (and window functions are a bit unique in general). One approach (among others) I thought of was, during spool_tuples, to rescan the tuplestore and add new tuples only if they are not already present. This is not very efficient because of the multiple read operations on the tuplestore just to check whether a tuple already exists, and because of other issues (like tuplestore_in_memory forcing the entire partition to get spooled in one go).

Any ideas will be much appreciated.

Thanks.

-- 
Regards,
Ankit Kumar Pandey

Re: Questions regarding distinct operation implementation

From
Ankit Kumar Pandey
Date:
On 02/12/22 03:07, David Rowley wrote:
> On Fri, 2 Dec 2022 at 08:10, Ankit Kumar Pandey <itsankitkp@gmail.com> wrote:
>> select avg(distinct id) over (partition by name) from mytable (in oracle db) yields:
>> 2
>> 2
>> 2
>> 2
>> 10
>>
>>  From this, it is seen distinct is taken across the all rows in the partition.
> Due to the lack of ORDER BY clause, all rows in the partition are in
> the window frame at once.  The question is, what *should* happen if
> you add an ORDER BY.
>
> Looking at the copy of the standard that I have, I see nothing
> explicitly mentioned about aggregates with DISTINCT used as window
> functions, however, I do see in the Window Function section:
>
> "The window aggregate functions compute an <aggregate function>
> (COUNT, SUM, AVG, etc.), the same as
> a group aggregate function, except that the computation aggregates
> over the window frame of a row rather than
> over a group of a grouped table. The hypothetical set functions are
> not permitted as window aggregate functions."
>
> So you could deduce that the DISTINCT would also need to be applied
> over the frame too.
>
> The question is, what do you want to make work?  If you're not worried
> about supporting DISTINCT when there is an ORDER BY clause and the
> frame options are effectively ROWS BETWEEN UNBOUNDED PRECEDING AND
> UNBOUNDED FOLLOWING, then it's going to be much easier to make work.
> You never need to worry about rows dropping out of visibility in the
> frame. Simply all rows in the partition are in the frame.
>
> You do need to be careful as, if I remember correctly, we do support
> some non-standard things here. I believe the standard requires an
> ORDER BY when specifying frame options. I think we didn't see any
> technical reason to apply that limitation, so didn't.  That means in
> Postgres, you can do things like:
>
> select avg(id) over (partition by name ROWS BETWEEN CURRENT ROW AND 3
> FOLLOWING) from mytable;
>
> but that's unlikely to work on most other databases without adding an ORDER BY.
>
> So if you are going to limit this to only being supported without an
> ORDER BY, then you'll need to ensure that the specified frame options
> don't cause your code to break.  I'm unsure, but this might be a case
> of checking for FRAMEOPTION_NONDEFAULT unless it's
> FRAMEOPTION_START_UNBOUNDED_PRECEDING|FRAMEOPTION_END_UNBOUNDED_FOLLOWING.
> You'll need to study that a bit more than I just did though.
>
> One way to make that work might be to add code to
> eval_windowaggregates() around the call to advance_windowaggregate(),
> you can see the row being aggregated is set by:
>
> winstate->tmpcontext->ecxt_outertuple = agg_row_slot;
>
> what you'd need to do here is change the code so that you put all the
> rows to aggregate into a tuplesort then sort them by the distinct
> column and instead, feed the tuplesort rows to
> advance_windowaggregate(). You'd need to add code similar to what is
> in process_ordered_aggregate_single() in nodeAgg.c to have the
> duplicate consecutive rows skipped.
>
> Just a word of warning on this. This is a hugely complex area of
> Postgres. If I was you, I'd make sure and spend quite a bit of time
> reading nodeWindowAgg.c and likely much of nodeAgg.c. Any changes we
> accept in that area are going to have to be very carefully done. Make
> sure you're comfortable with the code before doing too much. It would
> be very easy to end up with a giant mess if you try to do this without
> fully understanding the implications of your changes.  Also, you'll
> need to show you've not regressed the performance of the existing
> features with the code you've added.
>
> Good luck!
>
> David

Thanks a lot David, this is of immense help. I will go through the
mentioned pointers, the biggest takeaway being that this is a complex
piece and will take its due course.

Thanks again

-- 
Regards,
Ankit Kumar Pandey




Re: Questions regarding distinct operation implementation

From
Ankit Kumar Pandey
Date:
On 02/12/22 03:21, David G. Johnston wrote:
>  The main concern, I suspect, isn't implementation ability, it is 
> speed and memory consumption.

Hi David,

Shouldn't this be an acceptable tradeoff if someone wants to perform
an extra operation on top of plain old aggregates? Although I am not sure
how much extra memory and compute usage is considered acceptable.

-- 
Regards,
Ankit Kumar Pandey




Re: Questions regarding distinct operation implementation

From
David Rowley
Date:
On Sat, 3 Dec 2022 at 20:36, Ankit Kumar Pandey <itsankitkp@gmail.com> wrote:
> Shouldn't this be an acceptable tradeoff if someone wants to perform
> extra operation in plain old aggregates? Although I am not sure how much
> this extra memory and compute usage is considered as acceptable.

We do our best to ensure that a given executor node never uses more
than work_mem.  Certainly, we still do have nodes that can exceed this
by a long way.  It would be unlikely that we'd accept anything new
that could do this.  Since nodeWindowAgg.c already can use up to
work_mem for the tuplestore, it does not seem unreasonable that if
there is a DISTINCT aggregate that you could use 50% of work_mem for
each,  that is, providing you can code it in such a way that you only
allocate one of these at once, i.e not allocate one per DISTINCT
aggregate all at once.

David



Re: Questions regarding distinct operation implementation

From
Ankit Kumar Pandey
Date:


On 04/12/22 00:50, David Rowley wrote:

We do our best to ensure that a given executor node never uses more
than work_mem.  Certainly, we still do have nodes that can exceed this
by a long way.  It would be unlikely that we'd accept anything new
that could do this.  
Makes sense; that would also definitely rule out any brute-force algorithms. Good point to know.
> providing you can code it in such a way that you only allocate one of these at once, i.e not allocate one per DISTINCT aggregate all at once.
I am not sure I understand this. Does it mean that at a given time we do the allocation for only one distinct aggregate instead of all of them, in the case of multiple aggregates using distinct?

-- 
Regards,
Ankit Kumar Pandey

Re: Questions regarding distinct operation implementation

From
David Rowley
Date:
On Sun, 4 Dec 2022 at 08:57, Ankit Kumar Pandey <itsankitkp@gmail.com> wrote:
> On 04/12/22 00:50, David Rowley wrote:
>> providing you can code it in such a way that you only
>> allocate one of these at once, i.e not allocate one per DISTINCT
>> aggregate all at once.
>
> I am not sure if I understand this, does it means at given time, do allocation for only one distinct aggregate
> instead of all, in case of multiple aggregates using distinct?

If you were to limit this to only working with the query you mentioned
in [1], i.e PARTITION BY without an ORDER BY, then you only need to
aggregate once per partition per aggregate and you only need to do
that once all of the tuples for the partition are in the tuplestore.
It seems to me like you could add all the records to a tuplesort and
then sort by the DISTINCT column then aggregate everything except for
consecutive duplicates. You can then aggregate any other aggregates
which share the same DISTINCT column, otherwise, you just destroy the
tuplesort and rinse and repeat for the next aggregate.
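Purely as an illustration of that per-partition flow (a hypothetical Python sketch; real code would use a tuplesort and the executor's aggregate machinery, and these names are made up):

```python
from itertools import groupby

def eval_distinct_aggs(partition_rows, aggs):
    """aggs: list of (distinct_column, aggregate_fn) pairs. Aggregates
    sharing a DISTINCT column reuse one sorted pass, so only one
    'tuplesort' is alive at any time."""
    results = {}
    aggs_by_col = {}
    for col, fn in aggs:
        aggs_by_col.setdefault(col, []).append(fn)
    for col, fns in aggs_by_col.items():
        # Build and sort the "tuplesort" for this DISTINCT column.
        ordered = sorted(row[col] for row in partition_rows)
        # Skip consecutive duplicates in the sorted stream.
        distinct_vals = [v for v, _ in groupby(ordered)]
        for fn in fns:
            results[(col, fn.__name__)] = fn(distinct_vals)
        # The sort for this column is now "destroyed"; the next
        # column builds a fresh one.
    return results
```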

To make this work when rows can exit the window frame seems
significantly harder. Likely a hash table would be a better data
structure to remove records from, but then how are you going to spill
the hash table to disk when it reaches work_mem? As David J mentions,
it seems like you'd need a hash table with a counter to track how many
times a given value appears and only remove it from the table once
that counter reaches 0.  Unsure how you're going to constrain that to
not use more than work_mem though.

Are there any other databases which support DISTINCT window aggregate
with an ORDER BY in the window clause?

David

[1] https://postgr.es/m/b10d2b78-a07e-e520-0cfc-e19f0ec685b2@gmail.com



Re: Questions regarding distinct operation implementation

From
Ankit Kumar Pandey
Date:
On 04/12/22 02:27, David Rowley wrote:
>
> To make this work when rows can exit the window frame seems
> significantly harder. Likely a hash table would be a better data
> structure to remove records from, but then how are you going to spill
> the hash table to disk when it reaches work_mem? As David J mentions,
> it seems like you'd need a hash table with a counter to track how many
> times a given value appears and only remove it from the table once
> that counter reaches 0.  Unsure how you're going to constrain that to
> not use more than work_mem though.
>
Interesting problem. The hash tables created in normal aggregates
(AGG_HASHED mode) may provide some reference, as they have
hashagg_spill_tuple, but I am not aware of any prior implementation of a
hash table with a counter and spilling. The major concern is: if we go
the tuplesort route (the no-order-by case), would we be handicapped in
the future if we want order by or more features?

> Are there any other databases which support DISTINCT window aggregate
> with an ORDER BY in the window clause?
>
Oracle DB supports distinct window aggregates, albeit without an order
by clause. The rest of the databases I've tried (MySQL / SQL Server
Express) don't even support distinct in window aggregates, so those get
ruled out as well.

> If you were to limit this to only working with the query you mentioned
> in [1], i.e PARTITION BY without an ORDER BY, then you only need to
> aggregate once per partition per aggregate and you only need to do
> that once all of the tuples for the partition are in the tuplestore.
> It seems to me like you could add all the records to a tuplesort and
> then sort by the DISTINCT column then aggregate everything except for
> consecutive duplicates. You can then aggregate any other aggregates
> which share the same DISTINCT column, otherwise, you just destroy the
> tuplesort and rinse and repeat for the next aggregate.
This looks like the way to go, as it would cover the main use case of
portability from Oracle.

> If you were to limit this to only working with the query you mentioned
> in [1], i.e PARTITION BY without an ORDER BY,

I need to dig deeper into the order by case.

-- 
Regards,
Ankit Kumar Pandey




Re: Questions regarding distinct operation implementation

From
Vik Fearing
Date:
On 12/4/22 14:34, Ankit Kumar Pandey wrote:
> 
> On 04/12/22 02:27, David Rowley wrote:
>>
> 
>> If you were to limit this to only working with the query you mentioned
>> in [1], i.e PARTITION BY without an ORDER BY, then you only need to
>> aggregate once per partition per aggregate and you only need to do
>> that once all of the tuples for the partition are in the tuplestore.
>> It seems to me like you could add all the records to a tuplesort and
>> then sort by the DISTINCT column then aggregate everything except for
>> consecutive duplicates. You can then aggregate any other aggregates
>> which share the same DISTINCT column, otherwise, you just destroy the
>> tuplesort and rinse and repeat for the next aggregate.
>
> This looks like way to go that would ensure main use case of portability 
> from Oracle.

The goal should not be portability from Oracle, but adherence to the 
standard.
-- 
Vik Fearing




Re: Questions regarding distinct operation implementation

From
Ankit Kumar Pandey
Date:
On 04/12/22 22:25, Vik Fearing wrote:
> On 12/4/22 14:34, Ankit Kumar Pandey wrote:
>
>> This looks like way to go that would ensure main use case of 
>> portability from Oracle.
>
> The goal should not be portability from Oracle, but adherence to the 
> standard.
Yes, Vik, you are right. That was a wrong remark on my side.

-- 
Regards,
Ankit Kumar Pandey




Re: Questions regarding distinct operation implementation

From
David Rowley
Date:
On Mon, 5 Dec 2022 at 02:34, Ankit Kumar Pandey <itsankitkp@gmail.com> wrote:
> Interesting problem, Hashtables created in normal aggregates (AGG_HASHED
> mode) may provide some reference as they have hashagg_spill_tuple but I
> am not sure of any prior implementation of hashtable with counter and
> spill.

I'm unsure if there's such guidance that can be gleaned from studying
nodeAgg.c. IIRC, that works by deciding up-front how many partitions
we're going to split the hash key space into and then writing out
tuples to files based on "hashkey MOD number-of-partitions".  At the
end of that, you can just aggregate tuples one partition at a time.
All groups are in the same file/partition.

The reason this does not seem useful to your case is that you need to
be able to quickly look up a given Datum or set of Datums to check if
they are unique or not. For that, you'd need to reload the hash table
every time your lookup lands on a different partition of the hashkey
space. I fail to see how that could ever be fast unless there happened
to only be 1 partition.   To make that worse, when a tuple goes out of
the frame and the counter that's tracking how many times the Datum(s)
appeared reaches 0, you need to write the entire file out again minus
that tuple.  Let's say you're window function is on a column which is
distinct or *very* close to it and the given window is moving the
window frame forward 1 tuple per input tuple. If each subsequent Datum
hashes to a different partition, then you're going to need to load the
file for that hash key space to check if that Datum has already been
seen, then you're going to have to evict that tuple from the file as
it moves out of frame, so that means reading and writing that entire
file per input tuple consumed.  That'll perform very poorly!   It's
possible that you could maybe speed it up a bit with some lossy hash
table that sits atop of this can only tell you if the given key
definitely does *not* exists.  You'd then be able to just write that
tuple out to the partition and you'd not have to read or write out the
file again.  It's going to slow down to a crawl when the lossy table
contains too many false positives though.
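That lossy structure is essentially a counting filter over hash buckets, something like this sketch (hypothetical Python; a real version would hash datums, size itself under work_mem, and fall back to scanning the stored tuples when a bucket is non-zero):

```python
class LossyCounter:
    """Counts tuples per bucket (hashkey MOD nbuckets). Can say a key
    *definitely does not exist*; a non-zero bucket only means the key
    *might* exist, since different keys can share a bucket."""
    def __init__(self, nbuckets=8):
        self.nbuckets = nbuckets
        self.buckets = [0] * nbuckets

    def _slot(self, key):
        return hash(key) % self.nbuckets

    def add(self, key):
        self.buckets[self._slot(key)] += 1

    def remove(self, key):
        self.buckets[self._slot(key)] -= 1

    def definitely_does_not_exist(self, key):
        return self.buckets[self._slot(key)] == 0
```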

> Major concern is, if we go through tuplesort route (without order
> by case), would we get handicapped in future if we want order by or more
> features?

Yeah, deciding that before you go down one of the paths is going to be
important. I imagine the reason that you've not found another database
that supports DISTINCT window functions in a window with an ORDER BY
clause is that it's very hard to make it in a way where it performs
well in all cases.

Maybe another way to go about it that will give you less lock-in if we
decide to make ORDER BY work later would be to design some new
tuple-store-like data structure that can be defined with a lookup key
so you could ask it if a given key is stored and it would return the
answer quickly without having to trawl through all stored tuples. It
would also need to support the same positional lookups as tuplestore
does today so that all evicting-tuples-from-the-window-frame stuff
works as it does today. If you made something like that, then the
changes required in nodeWindowAgg.c would be significantly reduced.
You'd also just have 1 work_mem limit to abide by instead of having to
consider sharing that between a tuplestore and a hashtable/tuplesort.

Maybe as step 1, you could invent keyedtuplestore.c and consume
tuplestore's functions but layer on the lossy hashtable idea that I
mentioned above. That'd have to be more than just a bloom filter as
you need a payload of the count of tuples matching the given hashkey
MOD nbuckets.  If you then had a function like
keyedtuplestore_key_definately_does_not_exist() (can't think of a
better name now) then you can just lookup the lossy table and if there
are 0 tuples at that lossy bucket, then you can
keyedtuplestore_puttupleslot() from nodeWindowAgg.c.
keyedtuplestore_key_definately_does_not_exist() would have to work
much harder if there were >0 tuples with the same lossy hashkey. You'd
need to trawl through the tuples and check each one.  Perhaps that
could be tuned a bit so if you get too many collisions then the lossy
table could be rehashed to a larger size. It's going to fall flat on
its face, performance-wise, when the hash table can't be made larger
due to work_mem constraints.

Anyway, that's a lot of only partially thought-through ideas above. If
you're working on a patch like this, you should expect to have to
rewrite it a dozen or 2 times as new ideas arrive. If you're good at
using the fact that the new patch is better than the old one as
motivation to continue, then you're onto an attitude that is
PostgreSQL-community-proof :-)  (thankfully) we're often not good at
"let's just commit it now and make it better later".

David