Thread: Aggregate query for multiple records
Hello, I am new to the list, my apology if this question is beyond the scope or charter of this list. My questions is: What is the best method to perform an aggregate query to calculate sum() values for each distinct wid as in the example below, but except for all wid's (not just WHERE wid='01/1-6-1-30w1/0'). Also, performance wise, would it be better to build a function for this query. The table has 9 million records and these aggregate queries take hours. SELECT SUM(oil) as sumoil, SUM(hours) as sumhours, FROM (SELECT oil, hours prd_data WHERE wid='01/1-6-1-30w1/0' ORDER BY date LIMIT 6) subtable ; Table description: Table "prd_data" Column | Type | Modifiers --------+-----------------------+----------- date | integer | hours | real | oil | real | gas | real | water | real | pwid | integer |wid | character varying(20) | year | smallint | Indexes: wid_index6 Actual table (prd_data), 9 million records: date | hours | oil | gas | water | pwid | wid | year --------+-------+-------+------+-------+------+-----------------+------ 196507 | 360 | 159.4 | 11.3 | 40.9 | 413 | 01/1-1-1-31w1/0| 1965 196508 | 744 | 280 | 20 | 27.2 | 413 | 01/1-1-1-31w1/0 | 1965 196509 | 360 | 171.1 | 11.4| 50.4 | 413 | 01/1-1-1-31w1/0 | 1965 196510 | 744 | 202.1 | 25 | 89.8 | 413 | 01/1-1-1-31w1/0 | 1965 196512| 744 | 201.3 | 23.8 | 71.9 | 413 | 01/1-1-1-31w1/0 | 1965 196511 | 720 | 184 | 17.6 | 78.9 | 413 | 01/1-1-1-31w1/0| 1965 196610 | 744 | 99.8 | 15.4 | 53.7 | 413 | 01/1-1-1-31w1/0 | 1966 196612 | 744 | 86 | 12.8| 36.1 | 413 | 01/1-1-1-31w1/0 | 1966 196611 | 720 | 86 | 12.6 | 41.7 | 413 | 01/1-1-1-31w1/0 | 1966 196601| 744 | 191.6 | 22.6 | 50.7 | 413 | 01/1-1-1-31w1/0 | 1966 200301 | 461 | 68.8 | 0 | 186.3 | 47899 | 9G/6-1-50-24w3/0| 2003 200310 | 740 | 446.3 | 0 | 563.1 | 47899 | 9G/6-1-50-24w3/0 | 2003 200306 | 667 | 92.1 | 0 | 968.8 | 47899 | 9G/6-1-50-24w3/0 | 2003 200304 | 0 | 0 | 0 | 0 | 47899 | 9G/6-1-50-24w3/0 | 2003 200308| 457 | 100.7 | 0 | 82.8 | 47899 | 9G/6-1-50-24w3/0 | 2003 200307 | 574 | 78 | 0 | 752 | 47899 | 9G/6-1-50-24w3/0| 2003 200312 | 582 | 360.9 | 0 | 569 | 47899 | 9G/6-1-50-24w3/0 | 2003 200311 | 681 | 260.8 | 0 | 563.9 | 47899 | 9G/6-1-50-24w3/0 | 2003 200305 | 452 | 0 | 0 | 0 | 47899 | 9G/6-1-50-24w3/0 | 2003 200309| 637 | 244.6 | 0 | 193.8 | 47899 | 9G/6-1-50-24w3/0 | 2003 (20 rows) Thanks, -- Scott A. Gerhardt, P.Geo. Gerhardt Information Technologies
Scott, > Hello, I am new to the list, my apology if this question is beyond the > scope or charter of this list. We have a charter? Why didn't anyone tell me? > My questions is: > What is the best method to perform an aggregate query to calculate > sum() values for each distinct wid as in the example below, but except > for all wid's (not just WHERE wid='01/1-6-1-30w1/0'). > SELECT > SUM(oil) as sumoil, SUM(hours) as sumhours, > FROM > (SELECT oil, hours prd_data WHERE wid='01/1-6-1-30w1/0' > ORDER BY date LIMIT 6) subtable Your problem here is the "limit 6" There isn't any good+fast way to take only 6 of each thing and total them. Also the above query is missing a FROM clause, so I;ve had to guess at your table name below. Oh, and the word "date" is a reserved word, better to quote it. This is valid, but it won't be fast: SELECT wid, (SELECT SUM(oil) FROM prd_data pd2 WHERE pd2.wid = prd_data.wid ORDER BY "date" LIMIT 6) as tot_oil,(SELECTSUM(hours) FROM prd_data pd2 WHERE pd2.wid = prd_data.wid ORDER BY "date" LIMIT 6) as tot_hours FROM prd_data ORDER BY wid; However, that will be querying the prd_data table about (distinct wid)*2+1 times. Don't hold your breath. > Also, performance wise, would it be better to build a function for this > query. The table has 9 million records and these aggregate queries > take hours. It might. Not for the summaries themselves (which are fastest as aggregates), but to build a table that has only 6 records per WID. -- Josh Berkus Aglio Database Solutions San Francisco
shouldn't take that long, I would think. You have indexes on wid and date? Troy > > Hello, I am new to the list, my apology if this question is beyond the > scope or charter of this list. > > My questions is: > What is the best method to perform an aggregate query to calculate > sum() values for each distinct wid as in the example below, but except > for all wid's (not just WHERE wid='01/1-6-1-30w1/0'). > > Also, performance wise, would it be better to build a function for this > query. The table has 9 million records and these aggregate queries > take hours. > > > SELECT > SUM(oil) as sumoil, SUM(hours) as sumhours, > FROM > (SELECT oil, hours prd_data WHERE wid='01/1-6-1-30w1/0' > ORDER BY date LIMIT 6) subtable > ; > > > Table description: > Table "prd_data" > Column | Type | Modifiers > --------+-----------------------+----------- > date | integer | > hours | real | > oil | real | > gas | real | > water | real | > pwid | integer | > wid | character varying(20) | > year | smallint | > Indexes: wid_index6 > > > Actual table (prd_data), 9 million records: > > date | hours | oil | gas | water | pwid | wid | year > --------+-------+-------+------+-------+------+-----------------+------ > 196507 | 360 | 159.4 | 11.3 | 40.9 | 413 | 01/1-1-1-31w1/0 | 1965 > 196508 | 744 | 280 | 20 | 27.2 | 413 | 01/1-1-1-31w1/0 | 1965 > 196509 | 360 | 171.1 | 11.4 | 50.4 | 413 | 01/1-1-1-31w1/0 | 1965 > 196510 | 744 | 202.1 | 25 | 89.8 | 413 | 01/1-1-1-31w1/0 | 1965 > 196512 | 744 | 201.3 | 23.8 | 71.9 | 413 | 01/1-1-1-31w1/0 | 1965 > 196511 | 720 | 184 | 17.6 | 78.9 | 413 | 01/1-1-1-31w1/0 | 1965 > 196610 | 744 | 99.8 | 15.4 | 53.7 | 413 | 01/1-1-1-31w1/0 | 1966 > 196612 | 744 | 86 | 12.8 | 36.1 | 413 | 01/1-1-1-31w1/0 | 1966 > 196611 | 720 | 86 | 12.6 | 41.7 | 413 | 01/1-1-1-31w1/0 | 1966 > 196601 | 744 | 191.6 | 22.6 | 50.7 | 413 | 01/1-1-1-31w1/0 | 1966 > 200301 | 461 | 68.8 | 0 | 186.3 | 47899 | 9G/6-1-50-24w3/0 | 2003 > 200310 | 740 | 446.3 | 0 | 563.1 | 47899 | 9G/6-1-50-24w3/0 | 2003 > 200306 | 667 | 92.1 | 0 | 968.8 | 47899 | 9G/6-1-50-24w3/0 | 2003 > 200304 | 0 | 0 | 0 | 0 | 47899 | 9G/6-1-50-24w3/0 | 2003 > 200308 | 457 | 100.7 | 0 | 82.8 | 47899 | 9G/6-1-50-24w3/0 | 2003 > 200307 | 574 | 78 | 0 | 752 | 47899 | 9G/6-1-50-24w3/0 | 2003 > 200312 | 582 | 360.9 | 0 | 569 | 47899 | 9G/6-1-50-24w3/0 | 2003 > 200311 | 681 | 260.8 | 0 | 563.9 | 47899 | 9G/6-1-50-24w3/0 | 2003 > 200305 | 452 | 0 | 0 | 0 | 47899 | 9G/6-1-50-24w3/0 | 2003 > 200309 | 637 | 244.6 | 0 | 193.8 | 47899 | 9G/6-1-50-24w3/0 | 2003 > (20 rows) > > > > Thanks, > > -- > Scott A. Gerhardt, P.Geo. > Gerhardt Information Technologies > > > ---------------------------(end of broadcast)--------------------------- > TIP 7: don't forget to increase your free space map settings >
Scott, > > SELECT wid, > > (SELECT SUM(oil) FROM prd_data pd2 > > WHERE pd2.wid = prd_data.wid ORDER BY "date" LIMIT 6) as tot_oil, > > (SELECT SUM(hours) FROM prd_data pd2 > > WHERE pd2.wid = prd_data.wid ORDER BY "date" LIMIT 6) as tot_hours > > FROM prd_data > > ORDER BY wid; > > There is only one table being queried (prd_data). What is the purpose > of "pd2"? It's a table alias. There may be only one table, but to do this in a single query you need to reference 3 "instances" of that same table. Hmmm ... let me make the distinction more clear: SELECT wid, (SELECT SUM(oil) FROM prd_data prd2 WHERE prd2.wid = prd1.wid ORDER BY "date" LIMIT 6) as tot_oil, (SELECT SUM(hours) FROM prd3 WHERE prd3.wid = prd1.wid ORDER BY "date" LIMIT 6) as tot_hours FROM prd_data prd1 ORDER BY wid; -- Josh Berkus Aglio Database Solutions San Francisco
Scott, > >> I tried your query but it doesn't seem to work. The interpreter > >> expects prd2.date to be grouped but this doesn't make sence, see > >> below: > > > > Oh, yeah, darn it. > > > > Hmmm ... maybe you could explain the purpose of selecting just 6? > > This query > > is going to get a *lot* uglier if there isn't another way to > > accomplish it. > > The purpose of selecting the first 6 is we need sum values for just the > first 3, 6 and 12 months oil/water production for every oil well in the > database to perform statistical calculations. Optionally, we also need > the _last_ 6 months of oil and water production. > The prd_data table has total oil for each month but some wells have 500 > months of data, we just want the first 6. Well, here's my suggestion, to make this query both easier and faster: 1) Add a column called "month_prod", integer, to the table. This is the "number of months in production". 2) Populate this query with a calculated difference on your "date" column against the first appearance of that WID (I'm assuming that each increment of "date" = 1 month) UPDATE prd_data SET months_prod = prd_data."date" - prd2."date" + 1 FROM prd_data prd2 WHERE prd_data.wid = prd2.wid AND prd2."date" = ( SELECT "date" FROM prd_data prd3WHERE prd3.wid = prd2.widORDER BY "date" LIMIT 1 ); (warning: the above will take a long time and floor your server. Make sure to do VACUUM FULL ANALYZE prd_data afterwards.) (if you can't get the above to complete, which is possible depending on your hardware, you could do it procedurally in a programmming language) 3) Then running your aggregate becomes very easy/fast: SELECT wid, sum(oil) as tot_oil, sum(hours) as tot_hours FROM prd_data WHERE months_prod < 7 GROUP BY wid ORDER BY wid; -- Josh Berkus Aglio Database Solutions San Francisco
Scott Gerhardt <scott@g-it.ca> writes: > Hello, I am new to the list, my apology if this question is beyond the scope or > charter of this list. > > My questions is: > What is the best method to perform an aggregate query to calculate sum() values > for each distinct wid as in the example below, but except for all wid's (not > just WHERE wid='01/1-6-1-30w1/0'). > > Also, performance wise, would it be better to build a function for this query. > The table has 9 million records and these aggregate queries take hours. The "top n" type query (or in this case "first n" or "last n" but it's the same thing) is actually very tricky to do in standard SQL. The best solution seen here for postgres is to use arrays and custom aggregate functions. The following is based on a previous answer from Tom Lane to a similar question. (I thought I already posted this for you on pgsql-general but the list archives are down and you don't seem to have seen it, so I'm resending it) It allows you to do the whole query with a single sort for the grouping and the ordering by date together. You would have to use it with something like: SELECT sum_first_6(oil) FROM (SELECT oil from prd_data ORDER BY wid, "date") GROUP BY wid If you pump up sort_mem enough -- you can do it within the session for the single connection using "SET sort_mem" it should be pretty fast. I think it's the best you're going to get. If you're absolutely sure the data is physically stored in chronological order -- which I would only feel comfortable with if you've never done any updates or deletes, only inserts and perhaps occasional truncates, then you might be able to get by without ordering and convince it to do a hash aggregate. That would be the optimal result, no sorts at all. But it would be hard to make sure it would always work. test=> create or replace function first_6_accum (integer[], integer) returns integer[] language sql immutable as 'selectcase when array_upper($1,1)>=6 then $1 else $1||$2 end'; CREATE FUNCTION test=> create function sum_6(integer[]) returns integer immutable language sql as 'select $1[1]+$1[2]+$1[3]+$1[4]+$1[5]+$1[6]'; CREATE FUNCTION test=> create aggregate sum_first_6 (basetype=integer, sfunc=first_6_accum, stype=integer[],initcond='{}',finalfunc=sum_6); CREATE AGGREGATE test=> select sum_first_6(i) from (select i from (select 1 as i union select 2 union select 3 union select 4 union select5 union select 6 union select 7 union select 8) as x order by i desc) as x;sum_first_6 ------------- 33 (1 row) test=> select sum_first_6(i) from (select i from (select 1 as i union select 2 union select 3 union select 4 union select5 union select 6 union select 7 union select 8) as x order by i asc) as x;sum_first_6 ------------- 21 (1 row) This can easily be switched around to make it "last_6" and you can write functions to handle 6 records or 9 records. And all of these could be combined in a single query, so you only have to do the sort once. Unfortunately you cannot make aggregate functions that take multiple parameters, nor can you pass extra parameters to the state function. So you'll have to create a separate set of functions for each variant. Also, you'll have to change it to use reals. -- greg
Greg Stark <gsstark@mit.edu> writes: > [ nice example snipped ] > ... Also, you'll have to change it to use reals. That part, at least, can be worked around as of 7.4: use polymorphic functions. You can declare the functions and aggregate as working on anyelement/anyarray, and then they will automatically work on any datatype that has a + operator. regression=# create or replace function first_6_accum (anyarray,anyelement) returns anyarray regression-# language sql immutable as 'select case when array_upper($1,1)>=6 then $1 else $1||$2 end'; CREATE FUNCTION regression=# create function sum_6(anyarray) returns anyelement immutable language sql as 'select $1[1]+$1[2]+$1[3]+$1[4]+$1[5]+$1[6]'; CREATE FUNCTION regression=# create aggregate sum_first_6 (basetype=anyelement, sfunc=first_6_accum, stype=anyarray,initcond='{}',finalfunc=sum_6); CREATE AGGREGATE regression=# select sum_first_6(i) from (select i from (select 1 as i union select 2 union select 3 union select 4 unionselect 5 union select 6 union select 7 union select 8) as x order by i desc) as x;sum_first_6 ------------- 33 (1 row) regression=# select sum_first_6(i) from (select i from (select 1.1 as i union select 2 union select 3 union select 4 unionselect 5 union select 6 union select 7.7 union select 8) as x order by i desc) as x;sum_first_6 ------------- 33.7 (1 row) regression=# regards, tom lane
> Greg Stark <gsstark@mit.edu> writes: >> [ nice example snipped ] >> ... Also, you'll have to change it to use reals. > > That part, at least, can be worked around as of 7.4: use polymorphic > functions. You can declare the functions and aggregate as working on > anyelement/anyarray, and then they will automatically work on any > datatype that has a + operator. > > regression=# create or replace function first_6_accum > (anyarray,anyelement) returns anyarray > regression-# language sql immutable as 'select case when > array_upper($1,1)>=6 then $1 else $1||$2 end'; > CREATE FUNCTION > regression=# create function sum_6(anyarray) returns anyelement > immutable language sql as 'select > $1[1]+$1[2]+$1[3]+$1[4]+$1[5]+$1[6]'; > CREATE FUNCTION > regression=# create aggregate sum_first_6 (basetype=anyelement, > sfunc=first_6_accum, stype=anyarray,initcond='{}',finalfunc=sum_6); > CREATE AGGREGATE > regression=# select sum_first_6(i) from (select i from (select 1 as i > union select 2 union select 3 union select 4 union select 5 union > select 6 union select 7 union select 8) as x order by i desc) as x; > sum_first_6 > ------------- > 33 > (1 row) > > regression=# select sum_first_6(i) from (select i from (select 1.1 as > i union select 2 union select 3 union select 4 union select 5 union > select 6 union select 7.7 union select 8) as x order by i desc) as x; > sum_first_6 > ------------- > 33.7 > (1 row) > > regression=# > > regards, tom lane > An alternate solution I'm thinking is to add column to hold a "total_months" value that could be used to simplify queries and speed queries ( i.e. first month of oil productin = 1, second = 2 etc.) That way I can use select the first 6 months by using "where < 6", or any month interval for that matter. The following query, suggested by another list member (thanks Josh Berkus), to populate the "total_months" column sort of work but doesn't handle the year wrapping as it adds 88 when the year wraps (see output below). UPDATE prd_data_test SET months_prod = prd_data_test."date" - prd2."date" + 1 FROM prd_data_test prd2 WHERE prd_data_test.wid = prd2.wid AND prd2."date" = ( SELECT "date" FROM prd_data_test prd3WHERE prd3.wid = prd2.widORDER BY "date" LIMIT 1 ); The results are: SEM=# select * from prd_data_test order by wid, date limit 20; date | hours | oil | gas | water | pwid | wid | year | month_prd | months_prod --------+-------+-------+------+-------+------+-----------------+------ +-----------+------------- 196505 | 480 | 194.3 | 10.3 | 0 | 1 | 01/1-6-1-30w1/0 | 1965 | | 1 196506 | 600 | 279.4 | 13.1 | 0 | 1 | 01/1-6-1-30w1/0 | 1965 | | 2 196507 | 744 | 288.1 | 4.5 | 0 | 1 | 01/1-6-1-30w1/0 | 1965 | | 3 196508 | 720 | 234.6 | 9.4 | 2.9 | 1 | 01/1-6-1-30w1/0 | 1965 | | 4 196509 | 648 | 208.2 | 12.5 | 6 | 1 | 01/1-6-1-30w1/0 | 1965 | | 5 196510 | 744 | 209.8 | 15.3 | 0 | 1 | 01/1-6-1-30w1/0 | 1965 | | 6 196511 | 720 | 180.5 | 13.9 | 27.7 | 1 | 01/1-6-1-30w1/0 | 1965 | | 7 196512 | 744 | 227.4 | 22.8 | 5.2 | 1 | 01/1-6-1-30w1/0 | 1965 | | 8 196601 | 744 | 230.3 | 22.7 | 10 | 1 | 01/1-6-1-30w1/0 | 1966 | | 97 196602 | 672 | 173.2 | 16.5 | 17 | 1 | 01/1-6-1-30w1/0 | 1966 | | 98 196603 | 744 | 197.2 | 18.7 | 9.2 | 1 | 01/1-6-1-30w1/0 | 1966 | | 99 196604 | 720 | 168.1 | 14.1 | 3 | 1 | 01/1-6-1-30w1/0 | 1966 | | 100 Table description: Table "prd_data" Column | Type | Modifiers --------+-----------------------+----------- date | integer | hours | real | oil | real | gas | real | water | real | pwid | integer |wid | character varying(20) | year | smallint | Indexes: wid_index6 Actual table (prd_data), 9 million records: date | hours | oil | gas | water | pwid | wid | year --------+-------+-------+------+-------+------+-----------------+------ 196507 | 360 | 159.4 | 11.3 | 40.9 | 413 | 01/1-1-1-31w1/0| 1965 196508 | 744 | 280 | 20 | 27.2 | 413 | 01/1-1-1-31w1/0 | 1965 196509 | 360 | 171.1 | 11.4| 50.4 | 413 | 01/1-1-1-31w1/0 | 1965 196510 | 744 | 202.1 | 25 | 89.8 | 413 | 01/1-1-1-31w1/0 | 1965 196512| 744 | 201.3 | 23.8 | 71.9 | 413 | 01/1-1-1-31w1/0 | 1965 196511 | 720 | 184 | 17.6 | 78.9 | 413 | 01/1-1-1-31w1/0| 1965 196610 | 744 | 99.8 | 15.4 | 53.7 | 413 | 01/1-1-1-31w1/0 | 1966 196612 | 744 | 86 | 12.8| 36.1 | 413 | 01/1-1-1-31w1/0 | 1966 196611 | 720 | 86 | 12.6 | 41.7 | 413 | 01/1-1-1-31w1/0 | 1966 196601| 744 | 191.6 | 22.6 | 50.7 | 413 | 01/1-1-1-31w1/0 | 1966 200301 | 461 | 68.8 | 0 | 186.3 | 47899 | 9G/6-1-50-24w3/0| 2003 200310 | 740 | 446.3 | 0 | 563.1 | 47899 | 9G/6-1-50-24w3/0 | 2003 200306 | 667 | 92.1 | 0 | 968.8 | 47899 | 9G/6-1-50-24w3/0 | 2003 200304 | 0 | 0 | 0 | 0 | 47899 | 9G/6-1-50-24w3/0 | 2003 200308| 457 | 100.7 | 0 | 82.8 | 47899 | 9G/6-1-50-24w3/0 | 2003 200307 | 574 | 78 | 0 | 752 | 47899 | 9G/6-1-50-24w3/0| 2003 200312 | 582 | 360.9 | 0 | 569 | 47899 | 9G/6-1-50-24w3/0 | 2003 200311 | 681 | 260.8 | 0 | 563.9 | 47899 | 9G/6-1-50-24w3/0 | 2003 200305 | 452 | 0 | 0 | 0 | 47899 | 9G/6-1-50-24w3/0 | 2003 200309| 637 | 244.6 | 0 | 193.8 | 47899 | 9G/6-1-50-24w3/0 | 2003 Thanks, -- Scott A. Gerhardt, P.Geo. Gerhardt Information Technologies
Scott, > Unfortunately, your revised query works like a charm except for the > fact that prd_data."date" - prd2."date" + 1 give incorrect values when > the year wraps, see in the output below. Need to conditionally > subtract 88 from the date or use an incrementing count() function > instead of date math to get the correct values. Oh, that column is text, not a serial number. Well, it's your fault for using wierd custom data types; you figure it out. > I'm also wondering if using the date functions of Postgres would be > helpful since the date column does have the YYYY-year and MM-month > parts. Well, you'd want to convert the column to a timestamp, and then you could compute months. Or you could break it in seperate integer "year" and "month" columns and do the same thing. -- Josh Berkus Aglio Database Solutions San Francisco