transparent use of pre-aggregated value for very large datasets - Mailing list pgsql-general

From chaudhar@umich.edu (shahbaz)
Subject transparent use of pre-aggregated value for very large datasets
Date
Msg-id 1caf1b8c.0302281736.69b93001@posting.google.com
Responses Re: transparent use of pre-aggregated value for very large datasets  (Bruno Wolff III <bruno@wolff.to>)
Re: transparent use of pre-aggregated value for very large datasets  (Greg Stark <gsstark@mit.edu>)
Re: transparent use of pre-aggregated value for very large datasets  (Oleg Samoylov <olleg@telecom.mipt.ru>)
List pgsql-general
Hi all,

[PROBLEM / MOTIVATION... ignore this part if you want]

I have a data set of about 15 million rows and 26 columns (basically
read-only).  Four or five of them are values that might be used in a
group by clause; the rest are ints/floats/doubles that need sum(),
avg(), etc.  I guess the 'group by' values might be thought of as
dimensions and the rest as measures.  My computer has two gigs of RAM
and dual AMD CPUs.  Basically I am trying to manage a fairly large
amount of data on a machine that is not much more than a high-end
desktop.  Worse yet, this data is to be queried over the web, so
response time has to be within seconds.

I've split the 15 million rows into about 20 tables (about a million
rows are added each month) and built a few indexes on the appropriate
columns.  My response time is still slightly longer than it should
be.  Maintenance is a real pain, since I am not a DBA and would
rather spend my time developing the application.

[ROBUST SOLUTIONS]
I've been looking at various ways of speeding up queries in such a
situation.  I suppose someone with money could use an expensive OLAP
product (TM1, Query Objects, etc.) or a database with some sort of
bitmap index (Oracle, Sybase IQ, etc.).  It seems to me that
PostgreSQL should be able to solve this with just a little tweaking.
At least in my application, pre-aggregating (sum(), avg(), etc.)
will, obviously, GREATLY speed up my queries.  The only thing is, I'd
rather not have to select the right table (raw data vs.
pre-aggregates) in my application logic.

Why can't we tell PostgreSQL to keep these values for us and never
let the app developers worry about it again?

[INDEX]
Perhaps something like a simple index.  Instead of an index on
certain columns (or functions), would it be possible to build an
'index' that holds pre-calculated aggregate values?  I realize that
keeping such an index up to date would be difficult, so it could
simply be recomputed whenever the table or database is reindexed.
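For contrast, the functional indexes that exist today only precompute
a per-row expression, never an aggregate over many rows; a quick
sketch (the index name and column are just examples):

    -- what exists today: an index over a per-row expression
    CREATE INDEX rawdata_state_lower ON rawdata (lower(state));

So the 'pre-calculated index' above would be something genuinely new.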

[Materialized View]
Perhaps some version of a materialized view that the query executor
recognizes as being derived from the original table.  Imagine a
simple table on which you run the following query:
"select state, city, count(*), sum(profits) from rawdata group by
state, city;"
Assume this query scans the whole table and calculates the values for
you.  But if this query is used a lot, and the table is HUGE, then it
will obviously take quite a bit of time.  What if you could do the
following:
"CREATE PRECALC state_city_rawdata
    select state, city, count(*), sum(profits) from rawdata group by
state, city;"
The next time someone runs a query asking for data grouped by state,
city, it will NOT scan the rawdata table, but go to the
state_city_rawdata table and get the results much, much faster.
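As a stopgap, something close to this can be emulated by hand today
with an ordinary summary table that is rebuilt after each load; a
minimal sketch, assuming the rawdata table above (the summary name
and refresh strategy are just illustrations):

    -- build the summary once
    CREATE TABLE state_city_rawdata AS
        SELECT state, city, count(*) AS n, sum(profits) AS profits_sum
        FROM rawdata
        GROUP BY state, city;

    -- refresh after each monthly load (full rebuild)
    TRUNCATE state_city_rawdata;
    INSERT INTO state_city_rawdata
        SELECT state, city, count(*), sum(profits)
        FROM rawdata
        GROUP BY state, city;

    -- storing count and sum also answers avg() queries
    -- (cast first if profits is an integer column)
    SELECT state, city, profits_sum / n AS avg_profits
    FROM state_city_rawdata;

The missing piece is exactly what is being asked for here: the
planner noticing, on its own, that a query against rawdata can be
answered from the summary.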

[POSTGRESQL RULE SYSTEM]
I figured that something like [MATERIALIZED VIEW] might be possible
using the pgsql rule system, but it doesn't look like it.  Perhaps
extend the rule system so it does something like the following:
CREATE RULE "state_city_rule" AS ON SELECT TO rawdata WHERE <whatever>
GROUP BY state, city GROUP RULES sum(profits), avg(column1),
max(column4) DO INSTEAD select * from preagg_data.
The GROUP RULES is important because while one query might as for
...sum(profits)..., another quey might have the same group by clause
but do ... avg(profits)... .
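For what it's worth, the rule system as it exists today can at least
keep such a summary table current as new rows arrive, even though it
cannot reroute SELECTs.  A minimal sketch, assuming the
state_city_rawdata summary table sketched above, and assuming a
summary row already exists for every (state, city) pair (handling
brand-new pairs would need a trigger instead):

    -- without INSTEAD, the rule action runs in addition to the insert
    CREATE RULE state_city_maint AS
        ON INSERT TO rawdata
        DO UPDATE state_city_rawdata
           SET n = n + 1,
               profits_sum = profits_sum + NEW.profits
           WHERE state = NEW.state AND city = NEW.city;

For a bulk monthly load this fires once per inserted row, so a full
rebuild after the load may well be cheaper.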

[INHERITANCE]
It seems that there already is some mechanism for linking one table to
another (no, not joins).  The inheritance looks to be actually more
complicated than just keeping track of pre-aggregated values.  I
haven't been able to find much info on it beyond its simple use
though.
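Incidentally, inheritance would at least take the maintenance pain
out of the 20-table monthly split, since a query against the parent
table automatically scans all of its children; a minimal sketch
(column list abbreviated, table names hypothetical):

    -- parent table holds the common schema
    CREATE TABLE rawdata (state text, city text, profits numeric);

    -- each monthly load goes into its own child table
    CREATE TABLE rawdata_2003_02 () INHERITS (rawdata);

    -- a query on the parent includes every child's rows
    SELECT state, city, count(*), sum(profits)
    FROM rawdata
    GROUP BY state, city;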


Anyway, while I could tune the heck out of the DB or spend countless
thousands on RAM, wouldn't it be better to take advantage of cheap
hard-disk space?  Not just that, but even a non-database person could
use such mechanisms to improve the performance of their systems by
several orders of magnitude.  No doubt it would take a huge amount of
disk space, but at a dollar a gigabyte, disk is certainly worth
spending on for quick and RELIABLE access to as much data as your (or
your analysts') heart desires!

Any pointers?  Or preferably, any hackers willing to take this on?
Or, even better, does this already exist and I just don't know about
it :)?
