Accelerating aggregates - Mailing list pgsql-hackers
| From | Steve Atkins |
| --- | --- |
| Subject | Accelerating aggregates |
| Date | |
| Msg-id | 20040611075240.GA7860@gp.word-to-the-wise.com |
| Responses | Re: Accelerating aggregates, Re: Accelerating aggregates |
| List | pgsql-hackers |
Stop me if you've heard this before.

I'm looking at fast calculation of aggregates (sum(), max(), count()) across large tables, or across fairly simply defined subsets of those tables.

Let's say that, for a given aggregate function on a given table (with a given where clause, perhaps), each postgres process maintains a state variable (stype, in aggregate terms), and there's also a single state variable available to all backends via shared memory.

Each time a transaction starts, the process initialises its local state variable to the initcond of the aggregate. Each time a row is inserted into the table, the local state variable is updated using the aggregate update function. Each time a row is removed, the local state variable is either updated or invalidated, using a "reverse-update" function, again specific to the aggregate.

If the transaction is rolled back, the local state variable is thrown away. If the transaction is committed and the local state variable has been invalidated, then the global state variable is invalidated; otherwise the global state variable is updated using a state merge function, specific to the aggregate.

Then, at any point, the correct value of the aggregate function, taking into account transactions that committed before the current transaction started, can be found by merging the global state variable and the local one. If the global state variable isn't valid (because it's been explicitly invalidated, or the system has just started up, or the caching of the value was just enabled), then it can be calculated by running the real aggregate function against the table.

By defining the right functions, this approach could significantly accelerate many aggregate queries, including count(), min(), max(), avg() and also simple 'count() where <something>' or 'sum() where <something>' calculations.

count()
  update():          local_stype = local_stype + 1
  reverse_update():  local_stype = local_stype - 1
  merge():           global_stype = global_stype + local_stype

max()
  update(X):         local_stype = max(local_stype, X)
  reverse_update(X): if X >= local_stype then invalidate else nothing
  merge():           global_stype = max(global_stype, local_stype)

This would be fairly cheap on updates, no worse than a fairly cheap trigger, and either very cheap to read or at worst no more expensive than calculating the aggregate fully.

It's a fairly nasty kludge, and probably a bad idea, but do you think it's feasible (ignoring the whole nested transaction issue, for now)?

Cheers,
  Steve
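P.S. To make the bookkeeping concrete, here's a rough Python sketch of what I have in mind. Every name in it is made up purely for illustration (it isn't proposing any actual backend API), and the recompute callbacks are stand-ins for running the real aggregate over the table:

    # Illustrative sketch of the cached-aggregate bookkeeping described above.
    # All names are hypothetical; nothing corresponds to PostgreSQL internals.

    class CachedAggregate:
        """Per-aggregate definition plus the shared ("global") state variable."""

        def __init__(self, initcond, update, reverse_update, merge, recompute):
            self.initcond = initcond
            self.update = update                  # fold an inserted row into local state
            self.reverse_update = reverse_update  # fold a deleted row out, or None => invalidate
            self.merge = merge                    # combine global and local state
            self.recompute = recompute            # full-table fallback when the cache is invalid
            self.global_state = None              # None means "invalid" (startup, or invalidated)


    class TxnLocalState:
        """Per-transaction local state variable, initialised to initcond at txn start."""

        def __init__(self, agg):
            self.agg = agg
            self.state = agg.initcond
            self.valid = True

        def on_insert(self, value):
            if self.valid:
                self.state = self.agg.update(self.state, value)

        def on_delete(self, value):
            if self.valid:
                new_state = self.agg.reverse_update(self.state, value)
                if new_state is None:             # e.g. deleting the row holding the current max
                    self.valid = False
                else:
                    self.state = new_state

        def commit(self):
            # Rollback is simply throwing this object away; commit folds local into global.
            if not self.valid:
                self.agg.global_state = None      # invalidate the shared cache
            elif self.agg.global_state is not None:
                self.agg.global_state = self.agg.merge(self.agg.global_state, self.state)

        def read(self):
            # Cheap path: merge the cached global state with this transaction's local state.
            # Expensive path: the cache is unusable, so run the real aggregate.
            if not self.valid or self.agg.global_state is None:
                return self.agg.recompute()
            return self.agg.merge(self.agg.global_state, self.state)


    # count(): reverse_update always succeeds, merge is plain addition.
    count_agg = CachedAggregate(
        initcond=0,
        update=lambda s, v: s + 1,
        reverse_update=lambda s, v: s - 1,
        merge=lambda g, l: g + l,
        recompute=lambda: 0,                      # stand-in for a real SELECT count(*)
    )

    # max(): deleting a value that could be the maximum forces invalidation.
    max_agg = CachedAggregate(
        initcond=None,
        update=lambda s, v: v if s is None else max(s, v),
        reverse_update=lambda s, v: None if (s is None or v >= s) else s,
        merge=lambda g, l: g if l is None else max(g, l),
        recompute=lambda: None,                   # stand-in for a real SELECT max(col)
    )

The max() reverse_update in the sketch is deliberately conservative: with no local inserts it can't tell whether the deleted row held the maximum, so it invalidates and falls back to the full scan.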