Re: Many, many materialised views - Performance? - Mailing list pgsql-general
From | Toby Corkindale
Subject | Re: Many, many materialised views - Performance?
Date |
Msg-id | 5256301F.20502@strategicdata.com.au
In response to | Re: Many, many materialised views - Performance? (Alban Hertroys &lt;haramrae@gmail.com&gt;)
List | pgsql-general
On 09/10/13 21:05, Alban Hertroys wrote:
> On Oct 9, 2013, at 4:08, Kevin Grittner <kgrittn@ymail.com> wrote:
>
>> Toby Corkindale <toby.corkindale@strategicdata.com.au> wrote:
>>
>>> In this instance, we have a lot of queries that build certain
>>> aggregate results, which are very slow. The queries were
>>> initially all implemented as views, but then we started doing a
>>> type of materialising of our own, turning them into tables with
>>> CREATE TABLE AS SELECT .... This does make the results very fast
>>> to access now, but the side effect is a vast number of (very
>>> small) tables.
>>
>> If you have multiple tables with identical layout but different
>> subsets of the data, you will probably get better performance by
>> putting them into a single table with indexes which allow you to
>> quickly search the smaller sets within the table.
>
> I was thinking just that while reading Toby's message. For example,
> you could put the results of several related aggregations into a
> single materialized view, if they share the same key columns (year,
> month, factory or something similar). I'm not sure the new built-in
> materialized views can be updated like that though, unless you manage
> to combine those aggregations into a single monster-query, but that
> will probably not perform well... What we tend to do at work (no
> PostgreSQL, unfortunately) is to use external tools to combine those
> aggregated results and store that back into the database (which we
> often need to do anyway, as we deal with several databases on several
> servers).

Thanks for the suggestions, all. As I noted in an earlier email -- we're
aware that the schema could be better designed, but making large changes
is tricky in production systems.

Many of the tables are actually unique, but only in the sense that you
have various (common) identifier fields, and then a few (unique)
aggregate-result columns per table.
eg:

    int id_key_1, int id_key_2, .., float FooBarXResult

I suspect the correct way to handle this would actually be a table that
looked like:

    int id_key_1, int id_key_2, .., text result_name, float result_value

Although that would in turn make other queries more verbose; for
example, currently one can do:

    select * from FooResult join BarResult using (id_key_1, id_key_2)
    where FooResultX > 0.9 and BarResultY < 0.1;

I guess that turns into something like this:

    select id_key_1, id_key_2,
           a.result_value as FooResultX,
           b.result_value as BarResultY
    from AllResults a join AllResults b using (id_key_1, id_key_2)
    where a.result_name = 'FooResultX' and a.result_value > 0.9
      and b.result_name = 'BarResultY' and b.result_value < 0.1;

So it's all do-able, but it does look nicer to separate things into
their own tables with named columns.

> Additionally, if you have that many tables, it sounds like you
> partitioned your data. With aggregated results, the need for
> partitioning is much less (or perhaps it isn't even needed at all).
> And perhaps you don't even need the data from all partitions; say if
> you have monthly partitions of data, do you really need aggregated
> results from 5 years ago?

You're correct, we do have partitioned tables due to the amount of data
in the system, but that's just for the non-aggregated data. Those tables
perform just fine! It's the hundreds of thousands of views and tables
with just a few rows in them that worry me.. :)

> That said, users excel in finding data to request that you thought
> they wouldn't need.
>
> Which brings me to another question: Do your users really need the
> data from all those views or do they only think they need that?

Ah, indeed, users have not individually requested each of these many
thousands of tables and views. They are part of a large application, and
the results from all of those are required by it.
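For what it's worth, a quick way to convince yourself the "long" single-table rewrite returns the same rows as the per-table join is a small throwaway script. This is just a sketch with made-up sample data (the table names mirror the example above, not our real schema), and it uses SQLite purely because it's easy to run standalone:

```python
import sqlite3

# Toy stand-ins for the real tables; the data here is invented.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE FooResult (id_key_1 INT, id_key_2 INT, FooResultX FLOAT);
    CREATE TABLE BarResult (id_key_1 INT, id_key_2 INT, BarResultY FLOAT);
    INSERT INTO FooResult VALUES (1, 1, 0.95), (1, 2, 0.50);
    INSERT INTO BarResult VALUES (1, 1, 0.05), (1, 2, 0.05);

    -- Consolidated "long" table: one row per named aggregate result.
    CREATE TABLE AllResults (id_key_1 INT, id_key_2 INT,
                             result_name TEXT, result_value FLOAT);
    INSERT INTO AllResults
        SELECT id_key_1, id_key_2, 'FooResultX', FooResultX FROM FooResult
        UNION ALL
        SELECT id_key_1, id_key_2, 'BarResultY', BarResultY FROM BarResult;
""")

# Original style: join the two per-result tables directly.
wide_rows = con.execute("""
    SELECT id_key_1, id_key_2
    FROM FooResult JOIN BarResult USING (id_key_1, id_key_2)
    WHERE FooResultX > 0.9 AND BarResultY < 0.1
""").fetchall()

# Rewritten style: self-join the consolidated table on the key columns.
long_rows = con.execute("""
    SELECT a.id_key_1, a.id_key_2
    FROM AllResults a JOIN AllResults b
      ON a.id_key_1 = b.id_key_1 AND a.id_key_2 = b.id_key_2
    WHERE a.result_name = 'FooResultX' AND a.result_value > 0.9
      AND b.result_name = 'BarResultY' AND b.result_value < 0.1
""").fetchall()

print(wide_rows, long_rows)  # both return only the (1, 1) key pair
```

With an index on (id_key_1, id_key_2, result_name), the self-join version should stay fast even as the consolidated table grows, which is essentially Kevin's point about one table plus indexes beating hundreds of thousands of tiny tables.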
If I rewrote the application today, I'd be looking at doing things very differently, knowing how it would eventually scale.