Thread: Contradictory behavior of array_agg(distinct) aggregate.
Hi hackers! Is it only me who consider that current behavior of array_agg(distinct) contradicts to interpretation of nulls in other cases ("null" is something like "unknown" which means that we can not say weather two nulls are the same or not). This is why it is allowed to insert multiple nulls in the unique column. postgres=# create table t(x integer unique); CREATE TABLE postgres=# insert into t values (null),(null); INSERT 0 2 postgres=# select count(distinct x) from t; count ------- 0 (1 row) postgres=# select array_agg(distinct x) from t; array_agg ----------- {NULL} (1 row) postgres=# select array_agg(x) from t; array_agg ------------- {NULL,NULL} (1 row) So what is the number of distinct "x" values in this case? I think that according to SQL model - 0 (as count(distinct) returns). Why in this case array_agg(distinct x) returns non-empty array? Yes, unlike most other aggregates, `array_agg` is not ignoring null values. But is it correct to treat two null values as the same (non-distinct)? IMHO correct result in this case should be either {} or NULL, either {NULL,NULL}.
Konstantin Knizhnik <knizhnik@garret.ru> writes: > postgres=# create table t(x integer unique); > CREATE TABLE > postgres=# insert into t values (null),(null); > INSERT 0 2 > postgres=# select count(distinct x) from t; > count > ------- > 0 > (1 row) > postgres=# select array_agg(distinct x) from t; > array_agg > ----------- > {NULL} > (1 row) > postgres=# select array_agg(x) from t; > array_agg > ------------- > {NULL,NULL} > (1 row) I see nothing contradictory here. "array_agg(distinct x)" combines the two NULLs into one, which is the normal behavior of DISTINCT. "count(distinct x)" does the same thing --- but count() only counts non-null inputs, so you end with zero. regards, tom lane
On 04/12/2024 9:03 am, Tom Lane wrote: > Konstantin Knizhnik <knizhnik@garret.ru> writes: >> postgres=# create table t(x integer unique); >> CREATE TABLE >> postgres=# insert into t values (null),(null); >> INSERT 0 2 >> postgres=# select count(distinct x) from t; >> count >> ------- >> 0 >> (1 row) >> postgres=# select array_agg(distinct x) from t; >> array_agg >> ----------- >> {NULL} >> (1 row) >> postgres=# select array_agg(x) from t; >> array_agg >> ------------- >> {NULL,NULL} >> (1 row) > I see nothing contradictory here. "array_agg(distinct x)" > combines the two NULLs into one, which is the normal > behavior of DISTINCT. Sorry. It is actually inconsistency in basic SQL model in interpretation of NULL by UNIQUE and DISTINCT, and array_agg just follows this rule.
On Tuesday, December 3, 2024, Konstantin Knizhnik <knizhnik@garret.ru> wrote:
Is it only me who consider that current behavior of array_agg(distinct) contradicts to interpretation of nulls in other cases ("null" is something like "unknown" which means that we can not say weather two nulls are the same or not).
Null value handling has a few varied behaviors related to it. This particular one is best thought of as desirably being consistent with group by.
This is why it is allowed to insert multiple nulls in the unique column.
Nowadays the DBA gets to choose which of the two behaviors a unique index applies, which allows indexes to get on the same page as group by et al., thus fixing your inconsistency claim here.
postgres=# create table t(x integer unique);
CREATE TABLE
postgres=# insert into t values (null),(null);
INSERT 0 2
postgres=# select count(distinct x) from t;
count
-------
0
(1 row)
postgres=# select array_agg(distinct x) from t;
array_agg
-----------
{NULL}
(1 row)
postgres=# select array_agg(x) from t;
array_agg
-------------
{NULL,NULL}
(1 row)
So what is the number of distinct "x" values in this case? I think that according to SQL model - 0 (as count(distinct) returns).
1, but getting this answer computed is a non-trivial expression as the count aggregate can’t do the counting.
Why in this case array_agg(distinct x) returns non-empty array?
I can be convinced to see an inconsistency here. In count though, not array_agg. The inability for count to see and thus count null values cannot be worked around while you can always apply a filter clause to ignore null values you don’t want.
Yes, unlike most other aggregates, `array_agg` is not ignoring null values.
But is it correct to treat two null values as the same (non-distinct)?
Yes, in most contexts where null values are forced to be compared to each other they do so by defining all null values to be representationally identical. See group by for the most authoritative reference case.
IMHO correct result in this case should be either {} or NULL, either {NULL,NULL}.
You have a typo here somewhere…
If you want the empty array use a filter clause to make it behave “strictly”. Producing a null output seems indefensible.
A policy that all nulls values are indistinct from one another, which is the most prevalent one in SQL, makes the most sense to me. My gut says that “group by col nulls [not] distinct” would be an undesirable thing to add to the language. It was probably added to unique indexes because they went the wrong way and needed a way to course-correct.
David J.
Konstantin Knizhnik <knizhnik@garret.ru> writes: > On 04/12/2024 9:03 am, Tom Lane wrote: >> Konstantin Knizhnik <knizhnik@garret.ru> writes: >>> postgres=# create table t(x integer unique); >>> CREATE TABLE >>> postgres=# insert into t values (null),(null); >>> INSERT 0 2 >>> postgres=# select count(distinct x) from t; >>> count >>> ------- >>> 0 >>> (1 row) >>> postgres=# select array_agg(distinct x) from t; >>> array_agg >>> ----------- >>> {NULL} >>> (1 row) >>> postgres=# select array_agg(x) from t; >>> array_agg >>> ------------- >>> {NULL,NULL} >>> (1 row) >> I see nothing contradictory here. "array_agg(distinct x)" >> combines the two NULLs into one, which is the normal >> behavior of DISTINCT. > > > Sorry. > It is actually inconsistency in basic SQL model in interpretation of > NULL by UNIQUE and DISTINCT, and array_agg just follows this rule. The behaviour of DISTINCT in aggregates matches the behaviour of the IS (NOT) DISTINCT FROM predicate, which considers NULLs NOT DISTINCT from eachother. UNIQUE constraints leave it implementation-defined whether NULLs are considered distinct (PostgreSQL defaults to NULLS DISTINCT), but that can be overridden in the constraint definition. - ilmari