Aggregate functions, fast! (long) - Mailing list pgsql-sql

From Ang Chin Han
Subject Aggregate functions, fast! (long)
Date
Msg-id 20000809145345.B5894@pintoo.com
Whole thread Raw
Responses Re: Aggregate functions, fast! (long)  (Ang Chin Han <angch@pintoo.com>)
List pgsql-sql
Apologies in advance for the length of this post, but this has
been bugging me for a week or so.

Consider a table with a primary key pk and a list of attributes a, b
and c:
     Table t   pk  a  b  c   -----------    1  1  1  1    2  1  2  3    :   etc  : 9998  1  1  1 9999  2  1  2

If the table is a large but the range of possible values for the
attributes small, aggregate queries such as the following with be
very slow:
 1.  select min(a) from t; 2.  select count(*) from t where a = 1 and b = 3; 3.  select sum(d) from t where a = 1 and c
>3; 4.  select avg(b) from t where c = 1; 
 
One way of speeding these type of queries is to have a table
summarizing that table t:
    Table t_sum   a  b  c    cnt   --------------   1  1  1     2  (This is from pk 1 and 9998 having the same a,b,c)
1 2  3     1   2  1  2     1   :  : etc    : with a primary key of (a, b, c) and an integer cnt counting the number of
timesa particular combination of (a, b, c) occuring in table t 
 
The queries making use of these might be rewritten as: 1.  select min(a) from t_sum; -- same as above,
             -- but we've less rows to scan 2.  select cnt from t_sum where a = 1 and b = 3; 3.  select sum(d * cnt) as
sumfrom t_sum where a = 1 and c > 3; 4.  select sum(b * cnt) / cnt as avg from t_sum where c = 1;
 
 (CAVEAT/BUG: 1. must return cnt = 0 if it doesn't exist in                 t_sum               2. rows with cnt = 0
willhave to be deleted                 immediately or select min(foo) might return the                 wrong result)  
 
Now, t_sum can be automatically updated by triggers/rules (I'll
get into this later)

It might seem a bit pointless but I've made the example very
generic to show that this _is_ a generic class of problem.
Specific examples might include:   select count(*) from books where category=x;   select count(*) from articles where
category=xand author=y;
 

My point is it'll be nice if there is an easier mechanism ala
CREATE VIEW as a shortcut (and more!) for
'select x from y where z = a;'

The syntax might look like: CREATE AGGREGATE INDEX t1_sum on t1 (a, b, c);
which would create the implicit triggers and table.

The hard part is for postgres to have a rule rewrite system
capable of converting the queries. (Perhaps we'll get this
when views-with-aggregate-columns bug is fixed?)

Of course, the performance gain can be achieved if you manually
rewrite your queries to take advantage of the summary table (or
aggregate index, similar to the normal index speeding up ranged
lookups).


And, oh, the rules/triggers: (I used these, and they work great
for some of the tests I did, but I haven't fully debugged these for
all cases, but they definately have bugs 1 & 2 described above.
)

---------- 8<- Cut here ---------
CREATE TABLE t_sum (   a INTEGER,   b INTEGER,   c INTEGER,   cnt INTEGER,   PRIMARY KEY (a, b, c)
);

CREATE RULE t_sum_add_rule AS ON INSERT TO t_sum WHERE   EXISTS (SELECT cnt FROM t_sum WHERE       a = new.a and b =
new.band c = new.c) DO INSTEAD   UPDATE t_sum set cnt = cnt + 1 WHERE       a = new.a and b = new.b and c = new.c;
 

CREATE RULE t_insert_rule AS ON INSERT TO t DO   INSERT INTO t_sum values (new.a, new.b, new.c, 1);

CREATE FUNCTION "t_sum_upd" ( ) RETURNS opaque AS '
begin   update t_sum set cnt = cnt - 1       where t_sum.a = old.a  and t_sum.b = old.b  and t_sum.c = old.c;   insert
intot_sum values (new.a, new.b, new.c);   return new;
 
end;' LANGUAGE 'plpgsql';

CREATE TRIGGER "t_upd" BEFORE UPDATE ON "t"   FOR EACH ROW EXECUTE PROCEDURE "t_sum_upd" ();

CREATE FUNCTION "t_sum_del" ( ) RETURNS opaque AS '
begin   update t_sum set cnt = cnt - 1       where t_sum.a = old.a  and t_sum.b = old.b  and t_sum.c = old.c;   return
old;
end;' LANGUAGE 'plpgsql';                      
CREATE TRIGGER "t_del" BEFORE DELETE ON "t"   FOR EACH ROW EXECUTE PROCEDURE "t_sum_del" ();
---------- 8<- Cut here ---------

P.S. This post is inspired when someone mentioned on the list that
a separate counter might be kept by postgres to speed up some
aggregate functions like select count(*) from t;

P.P.S. Curious how do the commercial RDBMS handle this:   select count(*) from people where gender='m';
when people contains one million rows and gender distribution is
NEARLY 50% male/female?



pgsql-sql by date:

Previous
From: Ang Chin Han
Date:
Subject: Re: Functions too slow, even with iscachable?
Next
From: Ang Chin Han
Date:
Subject: Re: Aggregate functions, fast! (long)