Thread: transparent use of pre-aggregated value for very large datasets

transparent use of pre-aggregated value for very large datasets

From
chaudhar@umich.edu (shahbaz)
Date:
Fri, 28 Feb 2003 17:36:33 -0800
Hi all,

[PROBLEM / MOTIVATION... ignore this part if you want]

I have a data set of about 15 million rows and 26 columns (basically
read-only).  Four or five of the columns hold values that might be used
in a GROUP BY clause; the rest are ints/floats/doubles that need sum(),
avg(), etc.  I guess the GROUP BY values might be thought of as
dimensions and the rest as measures.  My computer has two gigs of RAM
and dual AMD CPUs.  Basically I am trying to manage a fairly large
amount of data on a machine that is not much more than a high-end
desktop.  Worse yet, this data is to be queried over the web, so
response time has to be within seconds.

I've split the 15 million rows into about 20 or so tables (about a
million rows are added each month) and built a few indexes on the
appropriate columns.  My response time is still slightly longer than it
should be.  Maintenance is a real pain, since I am not a DBA and would
rather spend my time developing the application.

[ROBUST SOLUTIONS]
I've been looking at various ways of speeding up queries in such a
situation.  I suppose someone with money could use an expensive OLAP
product (TM1, Query Objects, etc.) or a database with some sort of
bitmap index (Oracle, Sybase IQ, etc.).  It seems to me that PostgreSQL
should be able to solve this with just a little tweaking.  At least in
my application, pre-aggregating (sum(), avg(), etc.) will obviously
GREATLY speed up my queries.  The only thing is, I'd rather not have to
select the right table (raw data vs. pre-aggregates) in my application
logic.

Why can't we tell PostgreSQL to keep these values for us, so the app
developers never have to worry about it again?

[INDEX]
Perhaps something like a simple index.  Instead of having an index on
certain columns (or functions), is it possible to build an 'index' that
holds pre-calculated values?  I realize that keeping such an index up
to date would be difficult, so perhaps new values could be added to it
only when the table or database is reindexed.

[MATERIALIZED VIEW]
Perhaps some version of a materialized view that the query executor
recognizes as being derived from the original table.  Imagine a simple
table on which you run the following query:

    SELECT state, city, count(*), sum(profits)
    FROM rawdata
    GROUP BY state, city;

Assume this query scans the whole table and calculates the values for
you.  If this query is used a lot, and the table is HUGE, then it will
obviously take quite a bit of time.  What if you could do the
following:

    CREATE PRECALC state_city_rawdata AS
        SELECT state, city, count(*), sum(profits)
        FROM rawdata
        GROUP BY state, city;

The next time someone runs a query grouping by state, city, it will NOT
scan the rawdata table, but go to the state_city_rawdata table and get
the results much, much faster.
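
I can approximate this by hand today with an ordinary summary table; a
rough sketch using the rawdata columns from the example above (the
summary table name is made up, and it has to be rebuilt, or maintained
by triggers, whenever rawdata changes):

    -- Build the pre-aggregated table once, or on a nightly schedule.
    CREATE TABLE state_city_summary AS
        SELECT state, city,
               count(*)     AS row_count,
               sum(profits) AS profit_sum
        FROM rawdata
        GROUP BY state, city;

    -- An index makes lookups on the summary nearly instant.
    CREATE INDEX state_city_summary_idx
        ON state_city_summary (state, city);

    -- Queries then read the small summary instead of 15 million rows;
    -- avg(profits) can be recovered as profit_sum / row_count.
    SELECT state, city, profit_sum / row_count AS profit_avg
    FROM state_city_summary;

The missing piece is exactly the transparency I'm asking for: my
application still has to know to query the summary table rather than
rawdata.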

[POSTGRESQL RULE SYSTEM]
I figured that something like the [MATERIALIZED VIEW] idea might be
possible using the pgsql rule system, but it doesn't look like it.
Perhaps the rule system could be extended to do something like the
following:

    CREATE RULE "state_city_rule" AS
        ON SELECT TO rawdata WHERE <whatever>
        GROUP BY state, city
        GROUP RULES sum(profits), avg(column1), max(column4)
        DO INSTEAD SELECT * FROM preagg_data;

The GROUP RULES part is important because while one query might ask for
...sum(profits)..., another query might have the same GROUP BY clause
but ask for ...avg(profits)... .
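
As far as I can tell, the current rule system is more restrictive than
this: an ON SELECT rule must be an unconditional INSTEAD rule, and that
mechanism is exactly what implements views.  The closest I can get
today, assuming a hand-maintained preagg_data table with row_count and
profit_sum columns, is a view that at least gives my application one
stable name to query:

    CREATE VIEW state_city_profits AS
        SELECT state, city,
               row_count,
               profit_sum,
               profit_sum / row_count AS profit_avg
        FROM preagg_data;

That covers sum(), count(), and avg() over the same grouping, but it is
still not transparent to queries written against rawdata.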

[INHERITANCE]
It seems that there is already some mechanism for linking one table to
another (no, not joins).  Inheritance actually looks more complicated
than just keeping track of pre-aggregated values.  I haven't been able
to find much info on it beyond its simple use, though.
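
Inheritance won't maintain aggregates, but it might at least hide my
twenty-odd monthly tables behind a single name.  A rough sketch, with
made-up monthly child tables:

    -- The parent holds no rows itself; children inherit its columns.
    CREATE TABLE rawdata (
        state   text,
        city    text,
        profits numeric
        -- plus the remaining measure columns
    );

    CREATE TABLE rawdata_2003_01 () INHERITS (rawdata);
    CREATE TABLE rawdata_2003_02 () INHERITS (rawdata);

    -- New rows go into the current month's child table.
    INSERT INTO rawdata_2003_02 VALUES ('MI', 'Ann Arbor', 42.00);

    -- A query on the parent automatically scans all the children.
    SELECT state, city, sum(profits)
    FROM rawdata
    GROUP BY state, city;

That would at least get rid of the table-picking logic for the raw
data, even if it does nothing for pre-aggregation.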


Anyway, while I could tune the heck out of the DB or spend countless
thousands on RAM, wouldn't it be better to take advantage of cheap
hard-disk space?  Not just that, but even a non-database person could
use such mechanisms to improve the performance of their systems by
several orders of magnitude.  No doubt it will take a huge amount of
disk space, but spending a dollar per gig of disk is certainly worth it
for quick and RELIABLE access to as much data as your (or your
analysts') heart desires!

Any pointers?  Or preferably, any hackers willing to take this on?  Or,
even better, does this already exist and I just don't know about it :)?

Re: transparent use of pre-aggregated value for very large datasets

From
Bruno Wolff III
Date:
On Fri, Feb 28, 2003 at 17:36:33 -0800,
  shahbaz <chaudhar@umich.edu> wrote:
>
> Any pointers?  Or preferably, any hackers willing to take this on?
> Or, even better, does this already exist and I just don't know about
> it :)?

If the number of groups isn't too large, the hash aggregation support
in 7.4 might speed up your query enough.  By using hashes, the sort
step is avoided, which speeds things up considerably for large tables.

7.4 won't be out for a while, but the feature is already in CVS, so you
can check now whether it helps with your problem.
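
A quick way to check is to look at the plan for your grouping query (a
sketch against the example from your post; enable_hashagg is on by
default in 7.4):

    SET enable_hashagg = on;

    EXPLAIN
    SELECT state, city, count(*), sum(profits)
    FROM rawdata
    GROUP BY state, city;

    -- A "HashAggregate" node in the output, instead of a "Sort"
    -- feeding a "GroupAggregate", means the sort step was skipped.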

Re: transparent use of pre-aggregated value for very large datasets

From
Greg Stark
Date:
chaudhar@umich.edu (shahbaz) writes:

> Anyway, while I could tune the heck out of the DB or spend countless
> thousands on RAM, wouldn't it be better to take advantage of cheap
> hard-disk space?

The main objection people have had to such caching is that it would
conflict with the concurrency goals: the aggregate caches would become
a big point of contention.  However, there are a lot of applications
with far more selects than updates/inserts, and a lot of applications
where some information doesn't need to be updated continuously.

It's kind of sad to see the rule system not quite being up to what you
need.  Are you sure it can't be done?  It would be the cleanest way to
do this manually.

--
greg

Re: transparent use of pre-aggregated value for very large datasets

From
Oleg Samoylov
Date:
In my project I ran into the same problem.

I use separate tables for aggregated data per day and per month, and
triggers that update the aggregated data whenever raw data is inserted.
It is not quite proper by the relational paradigm, but it works fine
for me.
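
A rough sketch of the per-day case (names are made up; I assume the raw
table has a date column called sale_date, and I use update-then-insert
because there is no atomic upsert):

    -- Summary table: one row per day.
    CREATE TABLE daily_profit (
        day        date PRIMARY KEY,
        row_count  bigint  NOT NULL,
        profit_sum numeric NOT NULL
    );

    -- Trigger function that folds each new raw row into the summary.
    -- (Needs the plpgsql language; install with "createlang plpgsql".)
    CREATE FUNCTION maintain_daily_profit() RETURNS trigger AS '
    BEGIN
        UPDATE daily_profit
           SET row_count  = row_count + 1,
               profit_sum = profit_sum + NEW.profits
         WHERE day = NEW.sale_date;
        IF NOT FOUND THEN
            INSERT INTO daily_profit (day, row_count, profit_sum)
            VALUES (NEW.sale_date, 1, NEW.profits);
        END IF;
        RETURN NEW;
    END;
    ' LANGUAGE plpgsql;

    CREATE TRIGGER daily_profit_trig
        AFTER INSERT ON rawdata
        FOR EACH ROW EXECUTE PROCEDURE maintain_daily_profit();

As Greg says, the summary row can become a point of contention if many
sessions insert rows for the same day at once, but with mostly-read
data that has not been a problem for me.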

--
Olleg Samoylov