Thread: NULL-handling in aggregate(DISTINCT ...)
Quoth the comments in nodeAgg.c: * We don't currently implement DISTINCT aggs for aggs having more* than one argument. This isn't required for anything inthe SQL* spec, but really it ought to be implemented for* feature-completeness. FIXME someday. and: * DISTINCT always suppresses nulls, per SQL spec, regardless of the* transition function's strictness. (What the SQL spec actually says is that aggregate calls which are <general set operation> ignore all nulls regardless of whether they are ALL or DISTINCT. Other kinds of aggregates are not permitted by the spec to use ALL or DISTINCT.) Currently we have this behaviour: postgres=# select array_agg(all a) from (values (1),(null)) v(a);array_agg -----------{1,NULL} (1 row) postgres=# select array_agg(distinct a) from (values (1),(null)) v(a);array_agg -----------{1} (1 row) which personally I feel is somewhat wrong, since 1 and NULL are in fact distinct, but which is due to the logic expressed in the second comment above. (The spec does not allow array_agg(distinct a) so it is no help here.) Now the question: If the limit of one argument for DISTINCT aggs were removed (which I'm considering doing as part of an update to the aggregate ORDER BY patch I posted a while back), what should be the behaviour of agg(distinct x,y) where one or both of x or y is null? And should it depend on the strictness of the transition function? -- Andrew (irc:RhodiumToad)
Andrew Gierth <andrew@tao11.riddles.org.uk> writes: > Now the question: If the limit of one argument for DISTINCT aggs were > removed (which I'm considering doing as part of an update to the > aggregate ORDER BY patch I posted a while back), what should be the > behaviour of agg(distinct x,y) where one or both of x or y is null? > And should it depend on the strictness of the transition function? I think you could probably just change it: make DISTINCT work as per regular DISTINCT (treat null like a value, keep one copy). All the spec-conforming aggregates are strict and would ignore the null in the next step anyway. regards, tom lane
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes: > Andrew Gierth <andrew@tao11.riddles.org.uk> writes:>> Now the question: If the limit of one argument for DISTINCT aggswere>> removed (which I'm considering doing as part of an update to the>> aggregate ORDER BY patch I posted a while back),what should be the>> behaviour of agg(distinct x,y) where one or both of x or y is null?>> And should it depend onthe strictness of the transition function? Tom> I think you could probably just change it: make DISTINCT work asTom> per regular DISTINCT (treat null like a value,keep one copy).Tom> All the spec-conforming aggregates are strict and would ignoreTom> the null in the next step anyway. Change it for single-arg DISTINCT too? And the resulting change to the established behaviour of array_agg is acceptable? Just want to be clear here. -- Andrew.
Andrew Gierth <andrew@tao11.riddles.org.uk> writes: > "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes: > Tom> I think you could probably just change it: make DISTINCT work as > Tom> per regular DISTINCT (treat null like a value, keep one copy). > Tom> All the spec-conforming aggregates are strict and would ignore > Tom> the null in the next step anyway. > Change it for single-arg DISTINCT too? And the resulting change to the > established behaviour of array_agg is acceptable? Just want to be clear > here. I doubt that very many people are depending on the behavior of array_agg(DISTINCT); and anyway it could be argued that the present behavior is a bug, since it doesn't work like standard DISTINCT. I don't see a problem with changing it, though it should be release-noted. regards, tom lane
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes: Tom> I think you could probably just change it: make DISTINCT work asTom> per regular DISTINCT (treat null like a value,keep one copy).Tom> All the spec-conforming aggregates are strict and would ignoreTom> the null in the next step anyway. >> Change it for single-arg DISTINCT too? And the resulting change to the>> established behaviour of array_agg is acceptable?Just want to be clear>> here. Tom> I doubt that very many people are depending on the behavior ofTom> array_agg(DISTINCT); and anyway it could be arguedthat theTom> present behavior is a bug, since it doesn't work like standardTom> DISTINCT. I don't see a problem withchanging it, though itTom> should be release-noted. A followup question: currently the code uses the "datum" interface for tuplesort. Obviously with multiple columns the slot interface is used instead; but is there any performance advantage for staying with the datum interface for the single-column case? -- Andrew.
Andrew Gierth <andrew@tao11.riddles.org.uk> writes: > A followup question: currently the code uses the "datum" interface for > tuplesort. Obviously with multiple columns the slot interface is used > instead; but is there any performance advantage for staying with the > datum interface for the single-column case? No idea ... measure it and see. regards, tom lane