Thread: Re: Re: Data Warehouse Reevaluation - MySQL vs Postgres
Joe Conway <mail@joeconway.com> wrote on 15.09.2004, 06:30:24:
> Chris Browne wrote:
> > Might we set up the view as:
> >
> >   create view combination_of_logs as
> >   select * from table_1 where txn_date between 'this' and 'that'
> >   union all
> >   select * from table_2 where txn_date between 'this2' and 'that2'
> >   union all
> >   select * from table_3 where txn_date between 'this3' and 'that3'
> >   union all
> >   select * from table_4 where txn_date between 'this4' and 'that4'
> >   union all
> >   ... ad infinitum
> >   union all
> >   select * from table_n where txn_date > 'start_of_partition_n';
> >
> > and expect that to help, as long as the query that hooks up to this
> > has date constraints?
> >
> > We'd have to regenerate the view with new fixed constants each time we
> > set up the tables, but that sounds like it could work...
>
> That's exactly what we're doing, but using inherited tables instead of a
> union view. With inheritance, there is no need to rebuild the view each
> time a table is added or removed. Basically, in our application, tables
> are partitioned by either month or week, depending on the type of data
> involved, and queries are normally date qualified.
>
> We're not completely done with our data conversion (from a commercial
> RDBMSi), but so far the results have been excellent. Similar to what
> others have said in this thread, the conversion involved restructuring
> the data to better suit Postgres and the application (data
> analysis/mining vs. the source system, which is operational). As a result
> we've compressed a > 1TB database down to ~0.4TB, and seen at least one
> typical query reduced from ~9 minutes down to ~40 seconds.

Sounds interesting.

Does the performance gain come from partition elimination of the inherited
tables under the root?

I take it the compression comes from use of arrays, avoiding the need for
additional rows and key overhead?

Best Regards,
Simon Riggs
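The inheritance-based setup Joe describes might be sketched as below. All table and column names here are hypothetical ("logs" stands in for the application's transaction data); note also that planner-level skipping of non-matching children (constraint exclusion) arrived only in later Postgres releases, so at the time the benefit came mainly from each child being small and separately indexed.

```sql
-- Minimal sketch of partitioning via inherited tables (hypothetical names).
CREATE TABLE logs (
    txn_date  date  NOT NULL,
    detail    text
);

-- One child table per month, inheriting the parent's columns.
-- The CHECK constraint documents (and enforces) each child's date range.
CREATE TABLE logs_2004_08 (
    CHECK (txn_date >= '2004-08-01' AND txn_date < '2004-09-01')
) INHERITS (logs);

CREATE TABLE logs_2004_09 (
    CHECK (txn_date >= '2004-09-01' AND txn_date < '2004-10-01')
) INHERITS (logs);

-- Queries against the parent automatically see all children, so -- unlike
-- the union view -- nothing needs rebuilding when a table is added or dropped:
SELECT count(*) FROM logs WHERE txn_date >= '2004-09-01';
```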
simon@2ndquadrant.com wrote:
> Joe Conway <mail@joeconway.com> wrote on 15.09.2004, 06:30:24:
>> We're not completely done with our data conversion (from a commercial
>> RDBMSi), but so far the results have been excellent. Similar to what
>> others have said in this thread, the conversion involved restructuring
>> the data to better suit Postgres, and the application (data
>> analysis/mining vs. the source system which is operational). As a result
>> we've compressed a > 1TB database down to ~0.4TB, and seen at least one
>> typical query reduced from ~9 minutes down to ~40 seconds.
>
> Sounds interesting.
>
> The performance gain comes from partition elimination of the inherited
> tables under the root?
>
> I take it the compression comes from use of arrays, avoiding the need
> for additional rows and key overhead?

Sorry, in trying to be concise I was not very clear. I'm using the term
"compression" very generally here. I'll try to give a bit more background.

The original data source is a database schema designed for use by an
operational application that my company sells to provide enhanced management
of equipment that we also sell. The application needs to be very flexible in
exactly what data it stores in order to be useful across a wide variety of
equipment models and versions. To achieve that, there is a very large central
"transaction" table that stores name->value pairs in varchar columns. The
name->value pairs come from parsed output of the equipment, and as such there
is a fair amount of redundancy and unneeded data that ends up getting stored.
At each installation in the field this table can get very large (> 1 billion
rows). Additionally, the application prematerializes a variety of summaries
for use by the operators using the GUI.

We collect the data exported from each of the systems in the field and
accumulate it in a single central database for data mining and analysis. This
is the database that is actually being converted.
By "compression" I really mean that unneeded and redundant data is being
stripped out, and data known to be of a certain datatype is stored in that
type instead of varchar (e.g. values known to be int are stored as int).
Also, the summaries are not being converted (although we do some
post-processing to create new materialized summaries).

My points in telling this were:
- the use of inherited tables to partition this huge number of rows and yet
  allow simple query access to it seems to work well, at least in early
  validation tests
- had we simply taken the original database and "slammed" it into Postgres
  with no further thought, we would not have seen the big improvements, and
  thus the project might have been seen as a failure (even though it saves
  substantial $)

Hope that's a bit more clear. I'm hoping to write up a more detailed case
study once we've cut the Postgres system into production and the dust
settles a bit.

Joe
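The restructuring Joe describes could be illustrated roughly as follows. The table and column names are invented for illustration and are not from the thread; the point is only the contrast between the generic name->value layout and a typed layout.

```sql
-- Hypothetical source layout: everything stored as varchar name->value pairs.
-- Every attribute costs a full row, a repeated name string, and the value as text.
CREATE TABLE txn_generic (
    txn_id  integer,
    name    varchar(64),    -- e.g. 'error_count'
    value   varchar(255)    -- e.g. '42', stored as a string
);

-- Hypothetical converted layout: fields with known meaning and type become
-- real columns, so an integer costs 4 bytes instead of a text value plus the
-- repeated attribute name and per-row overhead of the pair table.
CREATE TABLE txn_typed (
    txn_id       integer,
    error_count  integer,
    recorded_at  timestamp
);
```

Stripping the redundant pairs and storing values in their natural types is the kind of change that plausibly accounts for much of the >1TB to ~0.4TB reduction Joe reports, alongside dropping the prematerialized summaries.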
Joe,

> - the use of inherited tables to partition this huge number of rows and
> yet allow simple query access to it seems to work well, at least in
> early validation tests
> - had we simply taken the original database and "slammed" it into
> Postgres with no further thought, we would not have seen the big
> improvements, and thus the project might have been seen as a failure
> (even though it saves substantial $)

Any further thoughts on developing this into true table partitioning?

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
Josh Berkus wrote:
>> - the use of inherited tables to partition this huge number of rows and
>> yet allow simple query access to it seems to work well, at least in
>> early validation tests
>> - had we simply taken the original database and "slammed" it into
>> Postgres with no further thought, we would not have seen the big
>> improvements, and thus the project might have been seen as a failure
>> (even though it saves substantial $)
>
> Any further thoughts on developing this into true table partitioning?

Just that I'd love to see it happen ;-)

Maybe someday I'll be able to find the time to work on it myself, but for
the moment I'm satisfied with the workarounds we've made.

Joe
Joe,

Your application is very interesting. I've just read your OSCON paper, and
I'd like to talk more about that. Very similar to Kalido.

...but back to partitioning momentarily: does the performance gain come from
partition elimination of the inherited tables under the root?

Best Regards,
Simon Riggs
Simon Riggs wrote:
> Joe,
>
> Your application is very interesting. I've just read your OSCON paper. I'd
> like to talk more about that. Very similar to Kalido.
>
> ...but back to partitioning momentarily: Does the performance gain come from
> partition elimination of the inherited tables under the root?

I think the major part of the performance gain comes from the fact that the
source database has different needs in terms of partitioning criteria because
of its different purpose. The data is basically partitioned by customer
installation instead of by date. Our converted schema partitions by date,
which is in line with the analytical queries run at the corporate office.
Again, this is an argument in favor of not simply porting what you're handed.

We might get similar query performance with a single large table and multiple
partial indexes (e.g. one per month), but there would be one tradeoff and one
disadvantage to that:

1) The indexes would need to be generated periodically -- this is a tradeoff,
since we currently need to create inherited tables at the same periodicity.

2) It would be much more difficult to "roll off" a month's worth of data when
needed. The general idea is that each month we create a new monthly table,
then archive and drop the oldest monthly table. If all the data were in one
big table, we would have to delete many millions of rows from a (possibly)
multibillion-row table, and then vacuum that table -- no thanks ;-)

Joe
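The monthly roll-off Joe describes might look like the sketch below, assuming a parent table `logs` with monthly children named by date (all names hypothetical, not from the thread). Note that `COPY ... TO` writes a server-side file and needs appropriate privileges; `pg_dump -t` on the child table is an equivalent archiving route.

```sql
-- Sketch of the monthly "roll off" (hypothetical names).

-- Start a new month: create the next child partition.
CREATE TABLE logs_2004_10 (
    CHECK (txn_date >= '2004-10-01' AND txn_date < '2004-11-01')
) INHERITS (logs);

-- Retire the oldest month: archive it, then drop it.
-- Dropping a child table is cheap and leaves no dead rows behind.
COPY logs_2003_10 TO '/archive/logs_2003_10.copy';
DROP TABLE logs_2003_10;

-- Contrast with the single-big-table alternative, which must delete
-- millions of rows and then vacuum a multibillion-row table:
-- DELETE FROM logs WHERE txn_date < '2003-11-01';
-- VACUUM logs;
```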