Accelerating aggregates - Mailing list pgsql-hackers

From: Steve Atkins
Subject: Accelerating aggregates
Date: 2004-06-11
Msg-id: 20040611075240.GA7860@gp.word-to-the-wise.com
List: pgsql-hackers
Stop me if you've heard this before.


I'm looking at fast calculation of aggregates (sum(), max(), count())
across large tables, or across fairly simply defined subsets of those
tables.

Let's say that, for a given aggregate function on a given table (with a
given where clause, perhaps), each postgres process maintains a state
variable (stype, in aggregate terms), and there's also a single state
variable available to all backends via shared memory.

Each time a transaction starts, the process initialises its local state
variable to the initcond of the aggregate. Each time a row is inserted
into the table, the local state variable is updated using the
aggregate's update function. Each time a row is removed, the local
state variable is either updated or invalidated, using a
"reverse-update" function, again specific to the aggregate.

If the transaction is rolled back, the local state variable is
thrown away. If the transaction is committed and the local state
variable has been invalidated, then the global state variable is
invalidated; otherwise the global state variable is updated using
a state-merge function, specific to the aggregate.
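
As a rough sketch of the write side (entirely hypothetical: none of
these names exist in PostgreSQL, and a pthread mutex stands in for
whatever shared-memory locking a real implementation would use), here
is roughly what this could look like in C for a cached count():

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct AggCacheState
    {
        int64_t value;          /* stype: for count(), a running total */
        bool    valid;
    } AggCacheState;

    /* Per-backend state, private to this process. */
    static AggCacheState local_state;

    /* Global state; in a real implementation this would live in
     * shared memory rather than being a plain static. */
    static AggCacheState   global_state = { 0, false };
    static pthread_mutex_t global_lock  = PTHREAD_MUTEX_INITIALIZER;

    /* Transaction start: initcond for count() is 0. */
    static void agg_cache_xact_start(void)
    {
        local_state.value = 0;
        local_state.valid = true;
    }

    /* update(): one call per row inserted into the table. */
    static void agg_cache_insert(void)
    {
        local_state.value += 1;
    }

    /* reverse_update(): one call per row deleted.  count()'s reverse
     * is exact; an aggregate like max() would sometimes have to set
     * local_state.valid = false here instead. */
    static void agg_cache_delete(void)
    {
        local_state.value -= 1;
    }

    /* merge() at commit: fold the local delta into the global state,
     * or propagate an invalidation. */
    static void agg_cache_commit(void)
    {
        pthread_mutex_lock(&global_lock);
        if (!local_state.valid)
            global_state.valid = false;
        else if (global_state.valid)
            global_state.value += local_state.value;
        pthread_mutex_unlock(&global_lock);
    }

    /* Rollback: just throw the local state away. */
    static void agg_cache_abort(void)
    {
        agg_cache_xact_start();
    }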

Then, at any point, the correct value of the aggregate function,
taking into account transactions that committed before the current
transaction started, can be found by merging the global state variable
and the local one.

If the global state variable isn't valid (because it's been explicitly
invalidated, or the system has just started up or the caching of the
value was just enabled) then it can be calculated by running the
real aggregate function against the table.
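
The read side would then merge the two states, falling back to a full
scan to rebuild the global state when it isn't valid. Continuing the
previous sketch (full_aggregate_scan() is a hypothetical stand-in for
running the real aggregate query against the table):

    /* Continues the definitions from the previous sketch. */
    extern int64_t full_aggregate_scan(void);  /* e.g. SELECT count(*) ... */

    static int64_t agg_cache_read(void)
    {
        int64_t result;

        pthread_mutex_lock(&global_lock);
        if (!global_state.valid)
        {
            /* Explicitly invalidated, fresh startup, or caching just
             * enabled: rebuild by running the real aggregate. */
            global_state.value = full_aggregate_scan();
            global_state.valid = true;
        }
        /* merge() of global and local: for count(), simple addition. */
        result = global_state.value + local_state.value;
        pthread_mutex_unlock(&global_lock);
        return result;
    }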

By defining the right functions, this approach could significantly
accelerate many aggregate queries, including count(), min(), max(),
avg(), and also simple 'count() where <something>' or 'sum() where
<something>' calculations. For count() and max(), the functions might
look like this:

count()
  update():           local_stype = local_stype + 1
  reverse_update():   local_stype = local_stype - 1
  merge():            global_stype = global_stype + local_stype

max()
  update(X):          local_stype = max(local_stype, X)
  reverse_update(X):  if X >= local_stype then invalidate, else nothing
  merge():            global_stype = max(global_stype, local_stype)
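
sum(), mentioned above, would presumably follow the same pattern, with
an exact reverse like count()'s:

sum()
  update(X):          local_stype = local_stype + X
  reverse_update(X):  local_stype = local_stype - X
  merge():            global_stype = global_stype + local_stype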
 

This would be fairly cheap on updates (no worse than a cheap trigger),
and either very cheap to read or, at worst, no more expensive than
calculating the aggregate fully.

It's a fairly nasty kludge, and probably a bad idea, but do you think
it's feasible (ignoring the whole nested transaction issue, for now)?

Cheers, Steve

