Hi Hackers
I’d like to start a patch series to add support for DISTINCT in plain aggregate window functions.
PostgreSQL currently rejects cases such as:
---------------------------------------------------------------------------------------------------------
count(DISTINCT x) OVER (PARTITION BY p)
sum(DISTINCT x) OVER ()
---------------------------------------------------------------------------------------------------------
My plan is to implement this incrementally, by frame class and by feature dimension, rather than trying to solve every case in a single patch.
For the first step, I’m posting patches 1-2 only and would appreciate your review on those.
Patch 1 is intentionally very small:
- add parse/deparse plumbing for DISTINCT in plain aggregate window functions
- carry the information through WindowFunc
- preserve it in ruleutils / deparse
- but still reject execution
Patch 1 by itself does not add user-visible execution support, so I think it is best reviewed together with patch 2.
Patch 2 adds the first real executor support:
- plain aggregate window functions only
- single-argument DISTINCT only
- whole-partition frames only
That means support for cases where the frame is effectively the entire partition, for example:
---------------------------------------------------------------------------------------------------------
count(DISTINCT x) OVER (PARTITION BY p)
sum(DISTINCT x) OVER ()
avg(DISTINCT x) OVER (
PARTITION BY p
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
---------------------------------------------------------------------------------------------------------
The executor approach in patch 2 is deliberately conservative:
- collect the partition’s aggregate inputs
- sort and deduplicate them
- feed the distinct values into the aggregate transition function
- finalize once
- reuse the cached result for all rows in the partition
This avoids the much harder moving-frame cases for now.
My proposed overall roadmap is below:
Patch 1
- parse/deparse plumbing only
- allow DISTINCT to be represented on plain aggregate window functions
- preserve it through deparse / view definition
- still reject execution
Patch 2
- executor support for whole-partition frames
- plain aggregate window functions only
- single-argument DISTINCT only
- sort-and-dedup implementation
Patch 3
- executor support for non-shrinking frames
- frames starting at UNBOUNDED PRECEDING with no EXCLUDE
- incremental hash-based seen-set
- covers default ORDER BY frame and supported ... CURRENT ROW / ... FOLLOWING cases
Patch 4
- executor support for sliding ROWS frames
- refcounted DISTINCT state
- add/remove distinct contributions as rows enter and leave the frame
- fallback to restart/recompute for aggregates without inverse transition support
Patch 5
- extend the sliding DISTINCT machinery to sliding RANGE and GROUPS
- keep the same refcounted model
- no EXCLUDE yet
Patch 6
- support EXCLUDE clauses
- likely correctness-first, with restart/recompute where incremental maintenance is too awkward
Patch 7
- support multi-argument DISTINCT
- upgrade DISTINCT keys from single datum to tuple/composite key representation
Patch 8
- support aggregate ORDER BY inside window aggregates
- left until last because it is orthogonal to frame-shape support and substantially complicates both parse representation and executor behavior
In short, the roadmap is:
- plumbing
- whole-partition
- non-shrinking
- sliding ROWS
- sliding RANGE / GROUPS
- EXCLUDE
- multi-arg DISTINCT
- aggregate ORDER BY
For this posting, I’d especially appreciate feedback on:
- whether patch 1 + patch 2 is a reasonable first split
- whether whole-partition-only executor support is a good first executable step
- whether the proposed long-term breakdown seems sensible
Thanks in advance for any review or comments.
Best regards,
Haibo Yan