Thread: onlyvalue aggregate (was: First Aggregate Funtion?)
Hi, Here's a patch for the aggregate function outlined by Corey Huinker in CADkLM=foA_oC_Ri23F9PbfLnfwXFbC3Lt8bBzRu3=CB77G9_qw@mail.gmail.com . I called it "onlyvalue", which is a horrible name, but I have nothing better to offer. (Corey called it "only", but that doesn't really work since ONLY is a fully reserved keyword.) I'll add this to September's commit fest, but if you want to bash me or the patch in the meanwhile, go ahead. .m
Attachment
2015-10-28 17:50 GMT+01:00 Marko Tiikkaja <marko@joh.to>:
Hi,
Here's a patch for the aggregate function outlined by Corey Huinker in CADkLM=foA_oC_Ri23F9PbfLnfwXFbC3Lt8bBzRu3=CB77G9_qw@mail.gmail.com . I called it "onlyvalue", which is a horrible name, but I have nothing better to offer. (Corey called it "only", but that doesn't really work since ONLY is a fully reserved keyword.)
I'll add this to September's commit fest, but if you want to bash me or the patch in the meanwhile, go ahead.
Hi
what is use case for this function and why it should be in core?
Regards
Pavel
.m
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 10/28/15 5:53 PM, Pavel Stehule wrote: > what is use case for this function and why it should be in core? Corey had one example in his email, but I can offer another one which came up this week at $work. The query looked something like this: SELECT a, sum(amount), onlyvalue(rolling_count) FROM ( SELECT a, amount, count(*) OVER (ORDER BY a) AS rolling_count FROM tbl ) ss GROUP BY a; We know that all the values for the column are going to be the same value for every "a", so we could use min() or max(). But the advantage of "onlyvalue" is that it actually checks that, so if someone went and changed the window frame to do something slightly different, the query would blow up instead of silently returning the (now likely incorrect) minimum or maximum value. It's also self-documenting for the reader of such queries. In my experience this problem comes up often enough that it would be make sense to have this aggregate in core. .m
2015-10-28 18:03 GMT+01:00 Marko Tiikkaja <marko@joh.to>:
On 10/28/15 5:53 PM, Pavel Stehule wrote:what is use case for this function and why it should be in core?
Corey had one example in his email, but I can offer another one which came up this week at $work. The query looked something like this:
SELECT a, sum(amount), onlyvalue(rolling_count)
FROM
(
SELECT a, amount, count(*) OVER (ORDER BY a) AS rolling_count
FROM tbl
) ss
GROUP BY a;
We know that all the values for the column are going to be the same value for every "a", so we could use min() or max(). But the advantage of "onlyvalue" is that it actually checks that, so if someone went and changed the window frame to do something slightly different, the query would blow up instead of silently returning the (now likely incorrect) minimum or maximum value. It's also self-documenting for the reader of such queries.
In my experience this problem comes up often enough that it would be make sense to have this aggregate in core.
This function is pretty inconsistent with any other builtin aggregate function, so it is one argument against to push it to core. Next, this function can be pretty simply implemented in PLpgSQL. And my last argument - I don't remember too often request for this functionality. It looks like module for PGXN much more, than PG core functionality.
Regards
Pavel
.m
Marko Tiikkaja <marko@joh.to> writes: > Here's a patch for the aggregate function outlined by Corey Huinker in > CADkLM=foA_oC_Ri23F9PbfLnfwXFbC3Lt8bBzRu3=CB77G9_qw@mail.gmail.com . I > called it "onlyvalue", which is a horrible name, but I have nothing > better to offer. (Corey called it "only", but that doesn't really work > since ONLY is a fully reserved keyword.) On the name front, maybe think "single" rather than "only"? This might lead to "single()" or "single_value()", or "singleton()" if you want to sound highbrow. On the semantics front, I'm not sure that I like excluding nulls from the input domain. I'd rather that it acted like IS NOT DISTINCT, ie, nulls are fine as long as all the input values are nulls. The implementation would need some work for that. regards, tom lane
2015-10-28 18:38 GMT+01:00 Tom Lane <tgl@sss.pgh.pa.us>:
Marko Tiikkaja <marko@joh.to> writes:
> Here's a patch for the aggregate function outlined by Corey Huinker in
> CADkLM=foA_oC_Ri23F9PbfLnfwXFbC3Lt8bBzRu3=CB77G9_qw@mail.gmail.com . I
> called it "onlyvalue", which is a horrible name, but I have nothing
> better to offer. (Corey called it "only", but that doesn't really work
> since ONLY is a fully reserved keyword.)
On the name front, maybe think "single" rather than "only"? This might
lead to "single()" or "single_value()", or "singleton()" if you want to
sound highbrow.
this function should to have some distinguish name than other aggregates because important work of this func is not some calculation but some constraint check.
On the semantics front, I'm not sure that I like excluding nulls from the
input domain. I'd rather that it acted like IS NOT DISTINCT, ie, nulls
are fine as long as all the input values are nulls. The implementation
would need some work for that.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Marko Tiikkaja <marko@joh.to> writes:
> Here's a patch for the aggregate function outlined by Corey Huinker in
> CADkLM=foA_oC_Ri23F9PbfLnfwXFbC3Lt8bBzRu3=CB77G9_qw@mail.gmail.com . I
> called it "onlyvalue", which is a horrible name, but I have nothing
> better to offer. (Corey called it "only", but that doesn't really work
> since ONLY is a fully reserved keyword.)
On the name front, maybe think "single" rather than "only"? This might
lead to "single()" or "single_value()", or "singleton()" if you want to
sound highbrow.
On the semantics front, I'm not sure that I like excluding nulls from the
input domain. I'd rather that it acted like IS NOT DISTINCT, ie, nulls
are fine as long as all the input values are nulls. The implementation
would need some work for that.
homogeneous() ?
It is basically an assertion function since in most cases where you would use this you needn't use a function at all but instead simply place the column in the GROUP BY clause.
I have at various times desired having various assertion functions that return the input value if the assertion is met but causes a query error if it is not. That said functions can be both scalar and aggregate oriented makes sense. Adding just this one function in seems like it adds a partial feature to -core. I'd rather it bake more in PGXN before considering putting it into core. There doesn't seem to be any hard need or benefit to doing so at this time.
I would probably stick to the concept of assertion and call it something like
"assert_nongrouping()" to denote that the input does not cause multiple groups to be created due to their being multiple values.
David J.
On Wed, Oct 28, 2015 at 5:50 PM, Marko Tiikkaja <marko@joh.to> wrote: > Here's a patch for the aggregate function outlined by Corey Huinker in > CADkLM=foA_oC_Ri23F9PbfLnfwXFbC3Lt8bBzRu3=CB77G9_qw@mail.gmail.com . I > called it "onlyvalue", which is a horrible name, but I have nothing better > to offer. (Corey called it "only", but that doesn't really work since ONLY > is a fully reserved keyword.) I've written an aggregate that does something like this a few times. I think one time I called it "the", which is probably too clever, but then you could query for the(project_manager) and similar. I've usually written it to not error-check and just return the first non-NULL value it runs across, which suggests a name like any_old() or whatever(). I actually think by comparison with those ideas, onlyvalue() - or maybe only_value() - is not bad. > I'll add this to September's commit fest, November, probably. > but if you want to bash me or the > patch in the meanwhile, go ahead. What if we want to say nice things? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 28 October 2015 at 16:50, Marko Tiikkaja <marko@joh.to> wrote: > Hi, > > Here's a patch for the aggregate function outlined by Corey Huinker in > CADkLM=foA_oC_Ri23F9PbfLnfwXFbC3Lt8bBzRu3=CB77G9_qw@mail.gmail.com . +1. I've wanted something like this a few times. Of the names suggested so far, I think I prefer "single_value", and yes I think it should work with NULLs. Regards, Dean
On 2 November 2015 at 09:32, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
--
On 28 October 2015 at 16:50, Marko Tiikkaja <marko@joh.to> wrote:
> Hi,
>
> Here's a patch for the aggregate function outlined by Corey Huinker in
> CADkLM=foA_oC_Ri23F9PbfLnfwXFbC3Lt8bBzRu3=CB77G9_qw@mail.gmail.com .
+1. I've wanted something like this a few times. Of the names
suggested so far, I think I prefer "single_value", and yes I think it
should work with NULLs.
+1
I think we should avoid using ONLY or VALUE within the names since those already have other meanings in Postgres.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 11/2/15 9:32 AM, Dean Rasheed wrote: > On 28 October 2015 at 16:50, Marko Tiikkaja <marko@joh.to> wrote: >> Here's a patch for the aggregate function outlined by Corey Huinker in >> CADkLM=foA_oC_Ri23F9PbfLnfwXFbC3Lt8bBzRu3=CB77G9_qw@mail.gmail.com . > > +1. I've wanted something like this a few times. Of the names > suggested so far, I think I prefer "single_value", and yes I think it > should work with NULLs. This was actually a last-minute design change I made before submitting the patch. The two times I've previously written this aggregate both accepted NULLs and only enforced that there must not be more than one non-NULL value, but that's only because I was thinking about the "poor man's FILTER" case, which is obsolete since version 9.4. The reason I changed in this version is that accepting NULLs can also hide bugs, and it's (usually) easy to filter them out with FILTER. Did you have some specific use case in mind where accepting NULLs would be beneficial? .m
On 2 November 2015 at 09:10, Simon Riggs <simon@2ndquadrant.com> wrote: > I think we should avoid using ONLY or VALUE within the names since those > already have other meanings in Postgres. > We already have window functions called first_value, last_value and nth_value, so I think use of "value" in the name is OK (and consistent). Actually thinking some more, I think the name "unique_value" is better because it's more suggestive of the fact that uniqueness is being enforced. Regards, Dean
On 2 November 2015 at 10:59, Marko Tiikkaja <marko@joh.to> wrote: > On 11/2/15 9:32 AM, Dean Rasheed wrote: >> >> On 28 October 2015 at 16:50, Marko Tiikkaja <marko@joh.to> wrote: >>> >>> Here's a patch for the aggregate function outlined by Corey Huinker in >>> CADkLM=foA_oC_Ri23F9PbfLnfwXFbC3Lt8bBzRu3=CB77G9_qw@mail.gmail.com . >> >> >> +1. I've wanted something like this a few times. Of the names >> suggested so far, I think I prefer "single_value", and yes I think it >> should work with NULLs. > > > This was actually a last-minute design change I made before submitting the > patch. The two times I've previously written this aggregate both accepted > NULLs and only enforced that there must not be more than one non-NULL value, > but that's only because I was thinking about the "poor man's FILTER" case, > which is obsolete since version 9.4. The reason I changed in this version > is that accepting NULLs can also hide bugs, and it's (usually) easy to > filter them out with FILTER. > > Did you have some specific use case in mind where accepting NULLs would be > beneficial? > I'm not sure what you mean when you say accepting NULLs can hide bugs. I think that if the input values to the aggregate were 1,1,1,NULL,1,1,1 then it should raise an error. ITSM that that is more likely to reveal problems with your underlying data or the query. If you want to ignore NULLs, you can always add a FILTER(WHERE val IS NOT NULL) clause. Regards, Dean
On 11/2/15 12:40 PM, Dean Rasheed wrote: > I'm not sure what you mean when you say accepting NULLs can hide bugs. > I think that if the input values to the aggregate were > 1,1,1,NULL,1,1,1 then it should raise an error. ITSM that that is more > likely to reveal problems with your underlying data or the query. If > you want to ignore NULLs, you can always add a FILTER(WHERE val IS NOT > NULL) clause. Ah, I see. So you're arguing that the aggregate should accept NULLs as input, but consider them distinct from any non-NULL values. I thought you meant accepting NULLs and *not* considering them distinct, which could easily hide problems. In that case, I don't oppose to changing the behavior. I'll make the necessary changes. .m
Hi Dean, Here's v2 of the patch. How's this look? .m
Attachment
<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Fri, Nov 20, 2015 at 10:54 PM, Marko Tiikkaja <span dir="ltr"><<ahref="mailto:marko@joh.to" target="_blank">marko@joh.to</a>></span> wrote:<br /><blockquote class="gmail_quote"style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Dean,<br /><br /> Here's v2 ofthe patch. How's this look?<span class="HOEnZb"><font color="#888888"><br /><br /><br /> .m<br /></font></span><br /><br/> --<br /> Sent via pgsql-hackers mailing list (<a href="mailto:pgsql-hackers@postgresql.org">pgsql-hackers@postgresql.org</a>)<br/> To make changes to your subscription:<br/><a href="http://www.postgresql.org/mailpref/pgsql-hackers" rel="noreferrer" target="_blank">http://www.postgresql.org/mailpref/pgsql-hackers</a><br/><br /></blockquote></div><br /></div><div class="gmail_extra">SorryI didn't see this thread sooner.</div><div class="gmail_extra"><br /></div><div class="gmail_extra">Ihaven't applied the patch, but the test cases illustrate that the function does everything I want, andhas inspired me to think a bit about some other assert-ish functions. Thanks!</div><div class="gmail_extra"><br /></div><divclass="gmail_extra"><br /></div></div>
On 21 November 2015 at 03:54, Marko Tiikkaja <marko@joh.to> wrote: > Here's v2 of the patch. How's this look? > Here are some initial review comments: * My first thought on reading this patch is that it is somewhat under-commented. For example, I would expect at least a block comment at the top of the new code added by this patch. Also the fields of the new structure could use some comments -- it might be obvious what datum and isnull are for, but fcinfo is less obvious until you read the code that uses it. Likewise the transfn is quite long, with almost no explanatory comments. * There's a clear bug in the null handling in the second branch of the transfn -- when the new value is null and the previous value wasn't. For example: SELECT single_value(x) FROM (VALUES ('x'), (null)) t(x); server closed the connection unexpectedly * In the finalfn, I think calling AggCheckCallContext() should be the first thing it does. Compare for example array_agg_array_finalfn(). * There's another less obvious bug in the way these functions handle complex types. For example: SELECT single_value(t) FROM (VALUES (1,'One'), (1,'One')) t(x,y); ERROR: cache lookup failed for type 2139062143 You might want to look at how array_agg() handles that. Also the code behind array_position() has some elements in common with this patch. * Consider collation handling, as illustrated by array_position(). So I'm afraid there's still some work to do, but there are plenty of examples in existing code to borrow from. Regards, Dean
On Wed, Oct 28, 2015 at 5:03 PM, Marko Tiikkaja <marko@joh.to> wrote: > SELECT a, sum(amount), onlyvalue(rolling_count) > FROM > ( > SELECT a, amount, count(*) OVER (ORDER BY a) AS rolling_count > FROM tbl > ) ss > GROUP BY a; The same thing would happen even in the more common case of having functionally dependent columns if they happen to be buried in a subquery. That might well be convenient if you have some expression you want to use in multiple aggregates such as: SELECT pk, acol, avg(x), min(x), max(x) FROM ( SELECT a,pk, a,acol, b.c+b.d+b.e AS x FROM a JOIN b ON (a.pk= b.fk) ) GROUP BY pk Postgres would happily accept that if you collapsed the subquery and ran the group by directly on the join but the subquery in between is actually enough to hide the functional dependency information so it complains that acol is not functionally dependent on the group by column. -- greg
On Mon, Nov 23, 2015 at 6:29 PM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote: > On 21 November 2015 at 03:54, Marko Tiikkaja <marko@joh.to> wrote: >> Here's v2 of the patch. How's this look? >> > > Here are some initial review comments: > > * My first thought on reading this patch is that it is somewhat > under-commented. For example, I would expect at least a block comment > at the top of the new code added by this patch. Also the fields of the > new structure could use some comments -- it might be obvious what > datum and isnull are for, but fcinfo is less obvious until you read > the code that uses it. Likewise the transfn is quite long, with almost > no explanatory comments. > > * There's a clear bug in the null handling in the second branch of the > transfn -- when the new value is null and the previous value wasn't. > For example: > > SELECT single_value(x) FROM (VALUES ('x'), (null)) t(x); > server closed the connection unexpectedly > > * In the finalfn, I think calling AggCheckCallContext() should be the > first thing it does. Compare for example array_agg_array_finalfn(). > > * There's another less obvious bug in the way these functions handle > complex types. For example: > > SELECT single_value(t) FROM (VALUES (1,'One'), (1,'One')) t(x,y); > ERROR: cache lookup failed for type 2139062143 > > You might want to look at how array_agg() handles that. Also the code > behind array_position() has some elements in common with this patch. > > * Consider collation handling, as illustrated by array_position(). > > So I'm afraid there's still some work to do, but there are plenty of > examples in existing code to borrow from. There is a review, but no input from the author for more than 1 month, hence patch has been marked as "returned with feedback". Feel free to move it to next CF if you want to post a new version. -- Michael