Thread: NULL-handling in aggregate(DISTINCT ...)

NULL-handling in aggregate(DISTINCT ...)

From
Andrew Gierth
Date:
Quoth the comments in nodeAgg.c:
* We don't currently implement DISTINCT aggs for aggs having more* than one argument.  This isn't required for anything
inthe SQL* spec, but really it ought to be implemented for* feature-completeness.  FIXME someday.
 

and:
* DISTINCT always suppresses nulls, per SQL spec, regardless of the* transition function's strictness.

(What the SQL spec actually says is that aggregate calls which are
<general set operation> ignore all nulls regardless of whether they
are ALL or DISTINCT. Other kinds of aggregates are not permitted by
the spec to use ALL or DISTINCT.)

Currently we have this behaviour:

postgres=# select array_agg(all a) from (values (1),(null)) v(a);array_agg 
-----------{1,NULL}
(1 row)

postgres=# select array_agg(distinct a) from (values (1),(null)) v(a);array_agg 
-----------{1}
(1 row)

which personally I feel is somewhat wrong, since 1 and NULL are in
fact distinct, but which is due to the logic expressed in the second
comment above. (The spec does not allow array_agg(distinct a) so it
is no help here.)

Now the question: If the limit of one argument for DISTINCT aggs were
removed (which I'm considering doing as part of an update to the
aggregate ORDER BY patch I posted a while back), what should be the
behaviour of agg(distinct x,y) where one or both of x or y is null?
And should it depend on the strictness of the transition function?

-- 
Andrew (irc:RhodiumToad)


Re: NULL-handling in aggregate(DISTINCT ...)

From
Tom Lane
Date:
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> Now the question: If the limit of one argument for DISTINCT aggs were
> removed (which I'm considering doing as part of an update to the
> aggregate ORDER BY patch I posted a while back), what should be the
> behaviour of agg(distinct x,y) where one or both of x or y is null?
> And should it depend on the strictness of the transition function?

I think you could probably just change it: make DISTINCT work as per
regular DISTINCT (treat null like a value, keep one copy).  All the
spec-conforming aggregates are strict and would ignore the null in the
next step anyway.
        regards, tom lane


Re: NULL-handling in aggregate(DISTINCT ...)

From
Andrew Gierth
Date:
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
> Andrew Gierth <andrew@tao11.riddles.org.uk> writes:>> Now the question: If the limit of one argument for DISTINCT
aggswere>> removed (which I'm considering doing as part of an update to the>> aggregate ORDER BY patch I posted a while
back),what should be the>> behaviour of agg(distinct x,y) where one or both of x or y is null?>> And should it depend
onthe strictness of the transition function?
 
Tom> I think you could probably just change it: make DISTINCT work asTom> per regular DISTINCT (treat null like a
value,keep one copy).Tom> All the spec-conforming aggregates are strict and would ignoreTom> the null in the next step
anyway.

Change it for single-arg DISTINCT too? And the resulting change to the
established behaviour of array_agg is acceptable? Just want to be clear
here.

-- 
Andrew.


Re: NULL-handling in aggregate(DISTINCT ...)

From
Tom Lane
Date:
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
>  Tom> I think you could probably just change it: make DISTINCT work as
>  Tom> per regular DISTINCT (treat null like a value, keep one copy).
>  Tom> All the spec-conforming aggregates are strict and would ignore
>  Tom> the null in the next step anyway.

> Change it for single-arg DISTINCT too? And the resulting change to the
> established behaviour of array_agg is acceptable? Just want to be clear
> here.

I doubt that very many people are depending on the behavior of
array_agg(DISTINCT); and anyway it could be argued that the present
behavior is a bug, since it doesn't work like standard DISTINCT.
I don't see a problem with changing it, though it should be
release-noted.
        regards, tom lane


Re: NULL-handling in aggregate(DISTINCT ...)

From
Andrew Gierth
Date:
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
Tom> I think you could probably just change it: make DISTINCT work asTom> per regular DISTINCT (treat null like a
value,keep one copy).Tom> All the spec-conforming aggregates are strict and would ignoreTom> the null in the next step
anyway.
>> Change it for single-arg DISTINCT too? And the resulting change to the>> established behaviour of array_agg is
acceptable?Just want to be clear>> here.
 
Tom> I doubt that very many people are depending on the behavior ofTom> array_agg(DISTINCT); and anyway it could be
arguedthat theTom> present behavior is a bug, since it doesn't work like standardTom> DISTINCT.  I don't see a problem
withchanging it, though itTom> should be release-noted.
 

A followup question: currently the code uses the "datum" interface for
tuplesort. Obviously with multiple columns the slot interface is used
instead; but is there any performance advantage for staying with the
datum interface for the single-column case?

-- 
Andrew.


Re: NULL-handling in aggregate(DISTINCT ...)

From
Tom Lane
Date:
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> A followup question: currently the code uses the "datum" interface for
> tuplesort. Obviously with multiple columns the slot interface is used
> instead; but is there any performance advantage for staying with the
> datum interface for the single-column case?

No idea ... measure it and see.
        regards, tom lane