Re: ANY_VALUE aggregate - Mailing list pgsql-hackers

From David Rowley
Subject Re: ANY_VALUE aggregate
Date
Msg-id CAApHDvoCMki1NWLyeijUj_P5uxymWejGf5r2zX7o_JOEku_55Q@mail.gmail.com
Whole thread Raw
In response to Re: ANY_VALUE aggregate  (Vik Fearing <vik@postgresfriends.org>)
Responses Re: ANY_VALUE aggregate
List pgsql-hackers
On Thu, 19 Jan 2023 at 06:01, Vik Fearing <vik@postgresfriends.org> wrote:
> Thank you for the review.  Attached is a new version rebased to d540a02a72.

I've only a bunch of nit-picks, personal preferences and random
thoughts to offer as a review:

1. I'd be inclined *not* to mention the possible future optimisation in:

+ * Currently this just returns the first value, but in the future it might be
+ * able to signal to the aggregate that it does not need to be called anymore.

I think it's unlikely that the transfn would "signal" such a thing. It
seems more likely if we did anything about it that nodeAgg.c would
maybe have some additional knowledge not to call that function if the
agg state already has a value. Just so we're not preempting how we
might do such a thing in the future, it seems best just to remove the
mention of it. I don't really think it serves as a good reminder that
we might want to do this one day anyway.

2. +any_value_trans(PG_FUNCTION_ARGS)

Many of transition function names end in "transfn", not "trans". I
think it's better to follow the existing (loosely followed) naming
pattern that a few aggregates seem to follow rather than invent a new
one.

3. I tend to try to copy the capitalisation of keywords from the
surrounding regression tests. I see the following breaks that.

+SELECT any_value(v) FILTER (WHERE v > 2) FROM (VALUES (1), (2), (3)) AS v (v);

(obviously, ideally, we'd always just follow the same capitalisation
of keywords everywhere in each .sql file, but we've long broken that
and the best way can do is be consistent with surrounding tests)

4. I think I'd use the word "Returns" instead of "Chooses" in:

+        Chooses a non-deterministic value from the non-null input values.

5. I've not managed to find a copy of the 2023 draft, so I'm assuming
you've got the ignoring of NULLs correct. I tried to see what other
databases do using https://www.db-fiddle.com/ . I was surprised to see
MySQL 8.0 returning NULL with:

create table a (a int, b int);
insert into a values(1,null),(1,2),(1,null);

select any_value(b) from a group by a;

I'd have expected "2" to be returned. (It gets even weirder without
the GROUP BY clause, so I'm not too hopeful any useful information can
be obtained from looking here)

I know MySQL doesn't follow the spec quite as closely as we do, so I
might not be that surprised if they didn't pay attention to the
wording when implementing this, however, I've not seen the spec, so I
can only speculate what value should be returned. Certainly not doing
any aggregation for any_value() when there is no GROUP BY seems
strange. I see they don't do the same with sum(). Perhaps this is just
a side effect of their loose standards when it came to columns in the
SELECT clause that are not in the GROUP BY clause.

6. Is it worth adding a WindowFunc test somewhere in window.sql with
an any_value(...) over (...)?  Is what any_value() returns as a
WindowFunc equally as non-deterministic as when it's used as an
Aggref? Can we assume there's no guarantee that it'll return the same
value for each partition in each row? Does the spec mention anything
about that?

7. I wondered if it's worth adding a
SupportRequestOptimizeWindowClause support function for this
aggregate. I'm thinking that it might not be as likely people would
use something more specific like first_value/nth_value/last_value
instead of using any_value as a WindowFunc. Also, I'm currently
thinking that a SupportRequestWFuncMonotonic for any_value() is not
worth the dozen or so lines of code it would take to write it. I'm
assuming it would always be a MONOTONICFUNC_BOTH function. It seems
unlikely that someone would have a subquery with a WHERE clause in the
upper-level query referencing the any_value() aggregate.  Thought I'd
mention both of these things anyway as someone else might think of
some good reason we should add them that I didn't think of.

David



pgsql-hackers by date:

Previous
From: Bharath Rupireddy
Date:
Subject: Re: Wasted Vacuum cycles when OldestXmin is not moving
Next
From: "houzj.fnst@fujitsu.com"
Date:
Subject: RE: Perform streaming logical transactions by background workers and parallel apply