Thread: Requesting advanced Group By support
Hi,
From MySQL 5.7, It supports SQL standard 99 and implements the feature such functional dependent on the GROUP By columns, i.e., it detects the non-aggregate columns which are functionally dependent on the GROUP BY columns (not included in GROUP BY) and then executes the query without error.
For example,
SELECT a.sno,b.sno,a.name,b.location FROM Name AS a JOIN Location AS b ON a.sno=b.sno GROUP BY a.sno,b.location
In this case, a.sno is a primary key so no need to include a.name in GROUP By as it would be identified by the primary key and then for b.sno which is again equated with a.sno (primary key) so no need to add this as well but for b.location, we need to add it in GROUP BY or we should use any aggregate function over this column to avoid error. For more info, please check on the below link https://dev.mysql.com/doc/refman/5.7/en/group-by-handling.html
Is there any plans on supporting this in Postgres in future versions ?
Thank You,
Arun Kumar
Hi, On 10/09/2018 03:10 PM, Arun Kumar wrote: > Hi, > From MySQL 5.7, It supports SQL standard 99 and implements the feature > such functional dependent on the GROUP By columns, i.e., it detects the > non-aggregate columns which are functionally dependent on the GROUP BY > columns (not included in GROUP BY) and then executes the query without > error. > For example, > > *SELECT a.sno,b.sno,a.name,b.location FROM Name AS a JOIN Location AS b > ON a.sno=b.sno GROUP BY a.sno,b.location * > > In this case, a.sno is a primary key so no need to include a.name in > GROUP By as it would be identified by the primary key and then for b.sno > which is again equated with a.sno (primary key) so no need to add this > as well but for b.location, we need to add it in GROUP BY or we should > use any aggregate function over this column to avoid error. For more > info, please check on the below link > https://dev.mysql.com/doc/refman/5.7/en/group-by-handling.html > Is there any plans on supporting this in Postgres in future versions ? > So, which part of this supposedly does not work in PostgreSQL? Consider this: test2=# create table t (id int primary key, b int, c int, d int); CREATE TABLE test2=# explain select * from t group by id, b, c; QUERY PLAN ------------------------------------------------------------ HashAggregate (cost=33.12..51.62 rows=1850 width=16) Group Key: id -> Seq Scan on t (cost=0.00..28.50 rows=1850 width=16) (3 rows) test2=# explain select id, count(*) from t group by id, b, c; QUERY PLAN ------------------------------------------------------------ HashAggregate (cost=37.75..56.25 rows=1850 width=20) Group Key: id -> Seq Scan on t (cost=0.00..28.50 rows=1850 width=12) (3 rows) So clearly we've already eliminated the functionally-dependent columns from the aggregation. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: > On 10/09/2018 03:10 PM, Arun Kumar wrote: >> *SELECT a.sno,b.sno,a.name,b.location FROM Name AS a JOIN Location AS b >> ON a.sno=b.sno GROUP BY a.sno,b.location * >> >> In this case, a.sno is a primary key so no need to include a.name in >> GROUP By as it would be identified by the primary key and then for b.sno >> which is again equated with a.sno (primary key) so no need to add this >> as well but for b.location, we need to add it in GROUP BY or we should >> use any aggregate function over this column to avoid error. > So, which part of this supposedly does not work in PostgreSQL? The part where it infers that b.sno is unique based solely on it having been equated to a.sno. I'm not sure whether the SQL spec's definition of functional dependencies includes such a proof rule, but I'm not very excited about adding one to PG. It's likely of limited use, seeing that this is the first time I can recall anyone asking for it; and it'd create dependency problems that we don't have today, because validity of the query would depend on the existence of a btree operator class from which we could infer that uniqueness of a.sno implies uniqueness of b.sno. We have enough problems arising from the existing case of validity of the query depending on the existence of a primary key. Also, a primary key is at least a well-defined dependency (there can be only one); but since an equality operator could belong to multiple opclasses, it's not very clear which one the query would get marked as depending on. In short: the cost/benefit ratio of this optimization looks pretty bad. regards, tom lane
I wrote: > Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: >> So, which part of this supposedly does not work in PostgreSQL? > The part where it infers that b.sno is unique based solely on it having > been equated to a.sno. Oh, wait a second: such an inference is actually *wrong* in the general case, or at least underdetermined. It fails in cases where the data type considers distinguishable values to be "equal", as for example zero vs. minus zero in IEEE floats, or numeric values with varying numbers of trailing zeroes, or citext, etc. So for example if the sno columns are type citext, we can be sure that a.sno does not contain both 'X' and 'x', because the pkey would forbid it. But if it contains 'X', while b.sno contains both 'X' and 'x', then (if we allowed this case) it'd be indeterminate which b.sno value is returned by the GROUP BY. One might or might not consider that OK for a particular application, but I don't think the parser should just assume for you that it is. regards, tom lane
På onsdag 10. oktober 2018 kl. 18:46:15, skrev Tomas Vondra <tomas.vondra@2ndquadrant.com>:
Hi,
On 10/09/2018 03:10 PM, Arun Kumar wrote:
> Hi,
> From MySQL 5.7, It supports SQL standard 99 and implements the feature
> such functional dependent on the GROUP By columns, i.e., it detects the
> non-aggregate columns which are functionally dependent on the GROUP BY
> columns (not included in GROUP BY) and then executes the query without
> error.
> For example,
>
> *SELECT a.sno,b.sno,a.name,b.location FROM Name AS a JOIN Location AS b
> ON a.sno=b.sno GROUP BY a.sno,b.location *
>
> In this case, a.sno is a primary key so no need to include a.name in
> GROUP By as it would be identified by the primary key and then for b.sno
> which is again equated with a.sno (primary key) so no need to add this
> as well but for b.location, we need to add it in GROUP BY or we should
> use any aggregate function over this column to avoid error. For more
> info, please check on the below link
> https://dev.mysql.com/doc/refman/5.7/en/group-by-handling.html
> Is there any plans on supporting this in Postgres in future versions ?
>
So, which part of this supposedly does not work in PostgreSQL?
Consider this:
test2=# create table t (id int primary key, b int, c int, d int);
CREATE TABLE
test2=# explain select * from t group by id, b, c;
QUERY PLAN
------------------------------------------------------------
HashAggregate (cost=33.12..51.62 rows=1850 width=16)
Group Key: id
-> Seq Scan on t (cost=0.00..28.50 rows=1850 width=16)
(3 rows)
test2=# explain select id, count(*) from t group by id, b, c;
QUERY PLAN
------------------------------------------------------------
HashAggregate (cost=37.75..56.25 rows=1850 width=20)
Group Key: id
-> Seq Scan on t (cost=0.00..28.50 rows=1850 width=12)
(3 rows)
So clearly we've already eliminated the functionally-dependent columns
from the aggregation.
regards
Too bad this doesn't:
create table t (id int NOT NULL UNIQUE, b int, c int, d int);
explain select * from t group by id, b, c;
ERROR: column "t.d" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: explain select * from t group by id, b, c;
ERROR: column "t.d" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: explain select * from t group by id, b, c;
--
Andreas Joseph Krogh
Hi, On October 10, 2018 10:37:40 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote: >Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: >> On 10/09/2018 03:10 PM, Arun Kumar wrote: >>> *SELECT a.sno,b.sno,a.name,b.location FROM Name AS a JOIN Location >AS b >>> ON a.sno=b.sno GROUP BY a.sno,b.location * >>> >>> In this case, a.sno is a primary key so no need to include a.name in > >>> GROUP By as it would be identified by the primary key and then for >b.sno >>> which is again equated with a.sno (primary key) so no need to add >this >>> as well but for b.location, we need to add it in GROUP BY or we >should >>> use any aggregate function over this column to avoid error. > >> So, which part of this supposedly does not work in PostgreSQL? > >The part where it infers that b.sno is unique based solely on it having >been equated to a.sno. Isn't the spec compliant thing that's missing dealing with unique not null? Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Andres Freund <andres@anarazel.de> writes: > On October 10, 2018 10:37:40 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: >>> So, which part of this supposedly does not work in PostgreSQL? >> The part where it infers that b.sno is unique based solely on it having >> been equated to a.sno. > Isn't the spec compliant thing that's missing dealing with unique not null? IIRC, the spec has a whole bunch of "functional dependency" proof rules, of which the only one we implement at the moment is the one about the other columns of a table all being functionally dependent on its pkey. I don't know if any of the spec's rules are at all close to this one. regards, tom lane
On Wed, Oct 10, 2018 at 1:50 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > It fails in cases where the data type > considers distinguishable values to be "equal", as for example zero vs. > minus zero in IEEE floats, or numeric values with varying numbers of > trailing zeroes, or citext, etc. So for example if the sno columns are > type citext, we can be sure that a.sno does not contain both 'X' and 'x', > because the pkey would forbid it. But if it contains 'X', while b.sno > contains both 'X' and 'x', then (if we allowed this case) it'd be > indeterminate which b.sno value is returned by the GROUP BY. One might or > might not consider that OK for a particular application, but I don't think > the parser should just assume for you that it is. Since this is approximately the 437,253rd time this problem has come up, and since even reasonably experienced hackers often get confused about it or (ahem) momentarily forget about the problem, it is really well paste time to find some way of labeling operator classes or families or individual operators to indicate whether or not they are testing precisely the exactly-the-same property. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 10/10/2018 19:59, Andres Freund wrote: > Isn't the spec compliant thing that's missing dealing with unique not null? I don't think that's what the original poster complained about, but you are right that it's a known missing part. The reason it's missing is that we can't record dependencies on not-null constraints. Some work on that has been in progress for many years. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services