Thread: Ordering with GROUPs
I'm struggling to find an appropriate efficient query for an aggregate-type problem and I'd appreciate suggestions. I have messages associated with a set of locations (zero or more messages per location). A cut down test is as follows: CREATE TABLE location (ident char(4), node point); INSERT INTO location values ('AAAA', '(1,1)'); INSERT INTO location values ('BBBB', '(1,2)'); INSERT INTO location values ('CCCC', '(2,1)'); INSERT INTO location values ('DDDD', '(2,2)'); CREATE TABLE message (ident char(4), message_text text); INSERT INTO message values ('AAAA', 'foo'); INSERT INTO message values ('AAAA', 'bar'); INSERT INTO message values ('AAAA', 'baz'); INSERT INTO message values ('BBBB', 'abel'); INSERT INTO message values ('BBBB', 'baker'); INSERT INTO message values ('CCCC', 'charlie'); so each message is associated with a location as follows -- here's the natural join SELECT location.ident, node, message_text from location, message WHERE location.ident = message.ident;ident | node | message_text -------+-------+--------------AAAA | (1,1) | fooAAAA | (1,1) | barAAAA | (1,1) | bazBBBB | (1,2) | abelBBBB | (1,2)| bakerCCCC | (2,1) | charlie (6 rows) What I want is to know how many messages are available for each location. It's easy if I just want the ident: SELECT location.ident, count(*) from location, message WHERE location.ident = message.ident GROUP BY location.ident;ident | count -------+-------AAAA | 3BBBB | 2CCCC | 1 (3 rows) But I'd like to return the "node" in the same query. If I try: SELECT location.ident, node, count(*) from location, message WHERE location.ident = message.ident GROUP BY location.ident; ERROR: Attribute location.node must be GROUPed or used in an aggregate function it obviously fails. If node were an integer I could just use an aggregate like max() or similar, but it's not, and there's no suitable aggregate for point. I can create a trivial one as a work around, but I hope I don't have to. But if I do it properly, it requires an ordering operator : SELECT location.ident, node, count(*) from location, message WHERE location.ident = message.ident GROUP BY location.ident, location.node; ERROR: Unable to identify an ordering operator '<' for type 'point' Use an explicit ordering operator or modify thequery So how do I specify the operator for GROUP BY? If I compare with ORDER BY (>> operator is "is right of" for type point): SELECT location.ident, node from location, message WHERE location.ident = message.ident ORDER BY location.node USING >>, location.ident;ident | node -------+-------CCCC | (2,1)AAAA | (1,1)AAAA | (1,1)AAAA | (1,1)BBBB | (1,2)BBBB | (1,2) (6 rows) ...and use the same construction, I get a parser error... SELECT location.ident, node from location, message WHERE location.ident = message.ident GROUP BY location.node USING >>, location.ident; ERROR: parser: parse error at or near "," I can't find anything in the doc. One alternative is to use a nested query: SELECT location.ident, node, ( select count(*) from message WHERE location.ident = message.ident ) as count FROM location;ident | node | count -------+-------+-------AAAA | (1,1) | 3BBBB | (1,2) | 2CCCC | (2,1) | 1DDDD | (2,2) | 0 (4 rows) For the test that works fine, but for my real life situation, the nested query seems to be very inefficient, taking vastly longer than the first query illustrated above. Since the information required is clearly contained in the result of that query, it seems unfortunate to say the least that I can't work the GROUP BY mechanism on it to give me what I want. Any tips please? Thanks Julian Scarfe
On Sun, Aug 18, 2002 at 12:49:18 +0100, Julian Scarfe <julian.scarfe@ntlworld.com> wrote: > I'm struggling to find an appropriate efficient query for an aggregate-type > problem and I'd appreciate suggestions. > > I have messages associated with a set of locations (zero or more messages > per location). A cut down test is as follows: > > CREATE TABLE location (ident char(4), node point); > INSERT INTO location values ('AAAA', '(1,1)'); > INSERT INTO location values ('BBBB', '(1,2)'); > INSERT INTO location values ('CCCC', '(2,1)'); > INSERT INTO location values ('DDDD', '(2,2)'); > > CREATE TABLE message (ident char(4), message_text text); > INSERT INTO message values ('AAAA', 'foo'); > INSERT INTO message values ('AAAA', 'bar'); > INSERT INTO message values ('AAAA', 'baz'); > INSERT INTO message values ('BBBB', 'abel'); > INSERT INTO message values ('BBBB', 'baker'); > INSERT INTO message values ('CCCC', 'charlie'); > > I can't find anything in the doc. > > One alternative is to use a nested query: > Instead of a nested query, how about doing a join to a group'd version of message? While you didn't use a unique constraint on the location table it seems likely that there should only be one location for each ident. So you can do the join after counting the number of messages. While this is similar to your example, it might run faster, especially in you don't have the message table indexed by ident, so that you can do one sort instead of a bunch of sequential scans. For example: area=> select location.ident, location.node, howmany from location area-> natural join (select ident, count(*) as howmany from message area(> group by ident) as messgroup;ident | node | howmany -------+-------+---------AAAA | (1,1) | 3BBBB | (1,2) | 2CCCC | (2,1) | 1 (3 rows)
"Julian Scarfe" <julian.scarfe@ntlworld.com> writes: > SELECT location.ident, node, count(*) from location, message > WHERE location.ident = message.ident > GROUP BY location.ident; > ERROR: Attribute location.node must be GROUPed or used in an aggregate > function > it obviously fails. If node were an integer I could just use an aggregate > like max() or similar, but it's not, and there's no suitable aggregate for > point. I can create a trivial one as a work around, but I hope I don't have > to. ISTM the problem here is the lack of any ordering operator for POINT, which defeats GROUP BY, *plus* the lack of any aggregate you might use for an aggregate-based solution. This is not really a language failing but a problem with an impoverished datatype. So, if you don't like Bruno's subselect-based workaround, the dummy aggregate seems the way to go. SQL99 contains a whole bunch of verbiage whose intent seems to be that if you GROUP BY a unique or primary-key column, you can reference the other columns of that table without aggregation (essentially, the system treats them as implicitly GROUP BY'd). Sooner or later we'll probably get around to implementing that, and that would solve your problem as long as you declare location.ident properly. regards, tom lane
From: "Tom Lane" <tgl@sss.pgh.pa.us> > ISTM the problem here is the lack of any ordering operator for POINT, > which defeats GROUP BY, *plus* the lack of any aggregate you might use > for an aggregate-based solution. This is not really a language failing > but a problem with an impoverished datatype. Yes, I agree completely. If I were doing this again from scratch I'd be using PostGIS, but I've got a lot of data that depends on POINT. > So, if you don't like > Bruno's subselect-based workaround, the dummy aggregate seems the way > to go. I've actually implemented the dummy aggregate now, and it works fine. So does Bruno's subselect (thank you Bruno), and the efficiency seems to be similar in each case. > SQL99 contains a whole bunch of verbiage whose intent seems to be that > if you GROUP BY a unique or primary-key column, you can reference the > other columns of that table without aggregation (essentially, the > system treats them as implicitly GROUP BY'd). Sooner or later we'll > probably get around to implementing that, and that would solve your > problem as long as you declare location.ident properly. That makes a lot of sense, though I imagine there are higher priorities. Thanks for your help. Julian Scarfe