Home > mailing lists

Accelerating aggregates - Mailing list pgsql-hackers

From	Steve Atkins
Subject	Accelerating aggregates
Date	June 11, 2004 04:42:55
Msg-id	20040611075240.GA7860@gp.word-to-the-wise.com Whole thread Raw
Responses	Re: Accelerating aggregates Re: Accelerating aggregates
List	pgsql-hackers

Tree view

Stop me if you've heard this before.


I'm looking at fast calculation of aggregates (sum(), max(), count())
across large tables, or across fairly simply defined subsets of those
tables.

Lets say that, for a given aggregate function on a given table (with a
given where clause, perhaps), each postgres process maintains a state
variable (stype, in aggregate terms) and there's a also a single state
variable available to all backends via shared memory.

Each time a transaction starts the process initialises its local state
variable to the initcond of the aggregate. Each time a row is inserted
into the table the local state variable is updated, using the
aggregate update function. Each time a row is removed then the local
state variable is either updated, or invalidated, using a
"reverse-update" function, again specific to the aggregate.

If the transaction is rolled back, the local state variable is
thrown away. If the transaction is commited and the local state
variable has been invalidated then the global state variable is
invalidated, otherwise the global state variable is updated using
a a state merge function, specific to the aggregate.

Then, at any point, the correct value of the aggregate function,
taking into account transactions that commited before the current
transaction started can be found by merging the global state variable
and the local one.

If the global state variable isn't valid (because it's been explicitly
invalidated, or the system has just started up or the caching of the
value was just enabled) then it can be calculated by running the
real aggregate function against the table.

By defining the right functions this approach could significantly
accelerate many aggregate queries, including count(), min(), max(),
avg() and also simple 'count() where <something>' or 'sum() where
<something>' calculations.

count() update():          local_stype = local_stype+1 reverse_update():  local_stype = local_stype-1 merge():
global_stype = global_stype + local_stype
 

max() update(X):         local_stype = max(local_stype, X) reverse_update(X): if X >= local_stype then invalidate
                                else nothing merge()            global_stype = max(global_stype, local_stype)
 

This would be fairly cheap on updates, no worse than a fairly cheap
trigger, and either very cheap to read or at worst no more expensive
than calculating the aggregate fully.

It's a fairly nasty kludge, and probably a bad idea, but do you think
it's feasible (ignoring the whole nested transaction issue, for now)?

Cheers, Steve

pgsql-hackers by date:

From: Alvaro Herrera
Date: 11 June 2004, 04:08:30
Subject: Re: Nested transactions and tuple header info

From: Gaetano Mendola
Date: 11 June 2004, 05:02:29
Subject: Re: Improving postgresql.conf

Accelerating aggregates - Mailing list pgsql-hackers

Previous

Next