Thread: transparent use of pre-aggregated values for very large datasets
From: chaudhar@umich.edu (shahbaz)
Date: Fri, 28 Feb 2003 17:36:33 -0800
Hi all,

[PROBLEM / MOTIVATION... ignore this part if you want]
I have a data set of about 15 million rows and 26 columns (basically read-only). Four or five of the columns are values that might be used in a GROUP BY clause; the rest are ints/floats/doubles that need sum(), avg(), etc. I guess the 'group by' values might be thought of as dimensions and the rest as measures. My computer has two gigs of RAM and dual AMD CPUs, so basically I am trying to manage a fairly large amount of data on a machine that is not much more than a high-end desktop. Worse yet, this data is to be queried over the web, so response time has to be within seconds. I've split the 15 million rows into about 20 or so tables (about a million rows are added each month) and built a few indexes on the appropriate columns. My response time is still slightly longer than it should be, and maintenance is a real pain, since I am not a DBA and would rather spend more time developing the application.

[ROBUST SOLUTIONS]
I've been looking at various ways of speeding up queries in such a situation. I suppose someone with money could use an expensive OLAP product (TM1, Query Objects, etc.) or a database with some sort of bitmap index (Oracle, Sybase IQ, etc.). It seems to me that PostgreSQL should be able to solve this with just a little tweaking. At least in my application, pre-aggregating (sum(), avg(), etc.) will, obviously, GREATLY speed up my queries. The only thing is, I'd rather not have to select the right table (raw data vs. pre-aggregates) in my application logic. Why can't we tell PostgreSQL to keep these values for us and never let the app developers worry about it again?

[INDEX]
Perhaps something like a simple index. Instead of having an index on certain columns (or functions), is it possible to build an 'index' that holds pre-calculated values? I realize that updating such an index would be difficult, so make it so new values are added to the index whenever a table or a DB is reindexed.

[MATERIALIZED VIEW]
Perhaps some version of a materialized view that the query executor recognizes as being part of the original table. Imagine a simple table on which you run the following query:

  select state, city, count(*), sum(profits) from rawdata group by state, city;

Assume this query will scan the whole table and calculate the values for you. But if this query is used a lot, and the table is HUGE, then it will obviously take quite a bit of time. What if you could do the following:

  CREATE PRECALC state_city_rawdata
    select state, city, count(*), sum(profits) from rawdata group by state, city;

The next time someone runs a query asking for data grouped by state and city, it will NOT query the rawdata table, but go to the state_city_rawdata table and get the results much, much faster.

[POSTGRESQL RULE SYSTEM]
I figured that something like [MATERIALIZED VIEW] might be possible using the pgsql rule system, but it doesn't look like it. Perhaps extend the rule system so it does something like the following:

  CREATE RULE "state_city_rule" AS ON SELECT TO rawdata
    WHERE <whatever>
    GROUP BY state, city
    GROUP RULES sum(profits), avg(column1), max(column4)
    DO INSTEAD select * from preagg_data;

The GROUP RULES part is important because while one query might ask for ...sum(profits)..., another query with the same GROUP BY clause might do ...avg(profits)... .

[INHERITANCE]
It seems that there already is some mechanism for linking one table to another (no, not joins). Inheritance actually looks to be more complicated than just keeping track of pre-aggregated values.
I haven't been able to find much info on it beyond its simple use, though.

Anyway, while I could tune the heck out of the DB or spend countless thousands on RAM, wouldn't it be better to take advantage of cheap hard-disk space? Not just that, but even a non-database person could use such mechanisms to improve the performance of their systems by several orders of magnitude. No doubt it will take a huge amount of disk space, but at a dollar per GIG of disk, it is certainly worth it to have quick and RELIABLE access to as much data as your (or your analysts') heart desires!

Any pointers? Or preferably, any hackers willing to take this on? Or, even better, does this already exist and I just don't know about it :)?
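As an illustration of the [MATERIALIZED VIEW] / CREATE PRECALC idea above, here is a rough sketch of how it can be approximated by hand today: a summary table that is periodically refreshed from the raw data. The rawdata/state/city/profits names come from the example in the post; the summary table layout and the full-refresh strategy are assumptions for illustration, not anything PostgreSQL does transparently.

  -- Summary table holding the pre-aggregated values.
  CREATE TABLE state_city_rawdata (
      state    text,
      city     text,
      n_rows   bigint,
      profits  numeric
  );

  -- Periodic full refresh (e.g. after each monthly load).
  -- The aggregates are recomputed from scratch, trading staleness for speed.
  BEGIN;
  DELETE FROM state_city_rawdata;
  INSERT INTO state_city_rawdata
      SELECT state, city, count(*), sum(profits)
      FROM rawdata
      GROUP BY state, city;
  COMMIT;

  -- Queries then hit the small table instead of the 15-million-row one:
  SELECT state, sum(profits) FROM state_city_rawdata GROUP BY state;

The obvious drawbacks are exactly the ones raised in the replies below: the summary is stale between refreshes, and the application (or a view placed in front of it) still has to know which table to hit.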
On Fri, Feb 28, 2003 at 17:36:33 -0800, shahbaz <chaudhar@umich.edu> wrote:
> Any pointers? Or preferably, any hackers willing to take this on? Or,
> even better, does this already exist and I just don't know about it
> :)?

If the number of groups isn't too large, the hash aggregate system in 7.4 might speed up your query enough. By using hashes, the sort step is avoided, which will speed things up for large tables. 7.4 won't be out for a while, but the feature is already in CVS, so you can see if it will help with your problem now.
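For anyone wanting to try that out, a quick sketch of how to check whether the 7.4 planner actually picks the hash-based path for a query like the one in the original post (enable_hashagg is the planner setting that governs it; the rawdata table and columns are the hypothetical ones from the post):

  -- Hash aggregation is on by default in 7.4; this just makes it explicit.
  SET enable_hashagg = on;

  -- If the plan shows a HashAggregate node instead of Sort + GroupAggregate,
  -- the sort step has been avoided.
  EXPLAIN
  SELECT state, city, count(*), sum(profits)
  FROM rawdata
  GROUP BY state, city;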
chaudhar@umich.edu (shahbaz) writes:
> Anyway, while I could tune the heck out of the DB or spend countless
> thousands on RAM, wouldn't it be better to take advantage of cheap
> hard-disk space?

The main objection people have had to such caching is that it would conflict with the concurrency goals: the aggregate caches would become a big point of contention. There are a lot of applications where there are far more selects than updates/inserts, however, and a lot of applications where some information doesn't need to be updated continuously.

It's kind of sad to see the rule system not quite being up to what you need. Are you sure it can't be done? It would be the cleanest way to do it manually.

-- greg
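On "the cleanest way to do it manually": since an ordinary view in PostgreSQL is itself implemented as an ON SELECT ... DO INSTEAD rule, one manual option is to put a view in front of the pre-aggregated table, so applications always query a single name and never choose between raw and summary tables themselves. A minimal sketch, reusing the hypothetical state_city_rawdata table from the earlier example:

  -- Applications query the view; how the data behind it is maintained
  -- (full refresh, triggers, etc.) can change without touching the app.
  CREATE VIEW profits_by_state_city AS
      SELECT state, city, n_rows, profits
      FROM state_city_rawdata;

  SELECT state, sum(profits) AS profits
  FROM profits_by_state_city
  GROUP BY state;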
In my project I encountered the same problem. I use separate tables for aggregated data per day and per month, and use triggers to update the aggregated data when raw data is inserted. It is incorrect for the relational database paradigm, but it works fine for me.

-- Olleg Samoylov
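A minimal sketch of that trigger approach, for a per-day table (all names here are made up for illustration; a real version would also need DELETE/UPDATE triggers and some care about concurrent inserts, since the UPDATE-then-INSERT pattern below can race):

  -- Per-day aggregate table maintained incrementally by a trigger.
  CREATE TABLE daily_agg (
      day      date PRIMARY KEY,
      n_rows   bigint  NOT NULL DEFAULT 0,
      profits  numeric NOT NULL DEFAULT 0
  );

  CREATE FUNCTION add_to_daily_agg() RETURNS trigger AS $$
  BEGIN
      -- Accumulate into the existing row for the day, or create it.
      UPDATE daily_agg
         SET n_rows  = n_rows + 1,
             profits = profits + NEW.profits
       WHERE day = NEW.sale_date;
      IF NOT FOUND THEN
          INSERT INTO daily_agg (day, n_rows, profits)
          VALUES (NEW.sale_date, 1, NEW.profits);
      END IF;
      RETURN NEW;
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER rawdata_daily_agg
      AFTER INSERT ON rawdata
      FOR EACH ROW EXECUTE PROCEDURE add_to_daily_agg();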