[PATCH] DISTINCT in plain aggregate window functions - Mailing list pgsql-hackers

From Haibo Yan
Subject [PATCH] DISTINCT in plain aggregate window functions
Date
Msg-id CABXr29H2X+HtaPw-R3EheZUgv9fM7nSBjQCCaWCRv62mDYdM3w@mail.gmail.com
Whole thread Raw
List pgsql-hackers

Hi Hackers

I’d like to start a patch series to add support for DISTINCT in plain aggregate window functions.

PostgreSQL currently rejects cases such as:

---------------------------------------------------------------------------------------------------------

count(DISTINCT x) OVER (PARTITION BY p)

sum(DISTINCT x)   OVER ()

---------------------------------------------------------------------------------------------------------

My plan is to implement this incrementally, by frame class and by feature dimension, rather than trying to solve every case in a single patch.

For the first step, I’m posting patches 1-2 only and would appreciate your review on those.

Patch 1 is intentionally very small:

  • add parse/deparse plumbing for DISTINCT in plain aggregate window functions
  • carry the information through WindowFunc
  • preserve it in ruleutils / deparse
  • but still reject execution

Patch 1 by itself does not add user-visible execution support, so I think it is best reviewed together with patch 2.

Patch 2 adds the first real executor support:

  • plain aggregate window functions only
  • single-argument DISTINCT only
  • whole-partition frames only

That means support for cases where the frame is effectively the entire partition, for example:

---------------------------------------------------------------------------------------------------------

count(DISTINCT x) OVER (PARTITION BY p)
sum(DISTINCT x)   OVER ()
avg(DISTINCT x)   OVER (
    PARTITION BY p
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)

---------------------------------------------------------------------------------------------------------

The executor approach in patch 2 is deliberately conservative:

  • collect the partition’s aggregate inputs
  • sort and deduplicate them
  • feed the distinct values into the aggregate transition function
  • finalize once
  • reuse the cached result for all rows in the partition

This avoids the much harder moving-frame cases for now.

My proposed overall roadmap is below:

Patch 1

  • parse/deparse plumbing only
  • allow DISTINCT to be represented on plain aggregate window functions
  • preserve it through deparse / view definition
  • still reject execution

Patch 2

  • executor support for whole-partition frames
  • plain aggregate window functions only
  • single-argument DISTINCT only
  • sort-and-dedup implementation

Patch 3

  • executor support for non-shrinking frames
  • frames starting at UNBOUNDED PRECEDING with no EXCLUDE
  • incremental hash-based seen-set
  • covers default ORDER BY frame and supported ... CURRENT ROW / ... FOLLOWING cases

Patch 4

  • executor support for sliding ROWS frames
  • refcounted DISTINCT state
  • add/remove distinct contributions as rows enter and leave the frame
  • fallback to restart/recompute for aggregates without inverse transition support

Patch 5

  • extend the sliding DISTINCT machinery to sliding RANGE and GROUPS
  • keep the same refcounted model
  • no EXCLUDE yet

Patch 6

  • support EXCLUDE clauses
  • likely correctness-first, with restart/recompute where incremental maintenance is too awkward

Patch 7

  • support multi-argument DISTINCT
  • upgrade DISTINCT keys from single datum to tuple/composite key representation

Patch 8

  • support aggregate ORDER BY inside window aggregates
  • left until last because it is orthogonal to frame-shape support and substantially complicates both parse representation and executor behavior

In short, the roadmap is:

  1. plumbing
  2. whole-partition
  3. non-shrinking
  4. sliding ROWS
  5. sliding RANGE / GROUPS
  6. EXCLUDE
  7. multi-arg DISTINCT
  8. aggregate ORDER BY

For this posting, I’d especially appreciate feedback on:

  • whether patch 1 + patch 2 is a reasonable first split
  • whether whole-partition-only executor support is a good first executable step
  • whether the proposed long-term breakdown seems sensible

Thanks in advance for any review or comments.

Best regards,

Haibo Yan


Attachment

pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: [Proposal] Adding Log File Capability to pg_createsubscriber
Next
From: Tom Lane
Date:
Subject: Re: Do we still need MULE_INTERNAL?