Thread: Re: Re: Data Warehouse Reevaluation - MySQL vs Postgres

From: Simon Riggs
Joe Conway <mail@joeconway.com> wrote on 15.09.2004, 06:30:24:
> Chris Browne wrote:
> > Might we set up the view as:
> >
> > create view combination_of_logs as
> >   select * from table_1 where txn_date between 'this' and 'that'
> >    union all
> >   select * from table_2 where txn_date between 'this2' and 'that2'
> >    union all
> >   select * from table_3 where txn_date between 'this3' and 'that3'
> >    union all
> >   select * from table_4 where txn_date between 'this4' and 'that4'
> >    union all
> >    ... ad infinitum
> >    union all
> >   select * from table_n where txn_date > 'start_of_partition_n';
> >
> > and expect that to help, as long as the query that hooks up to this
> > has date constraints?
> >
> > We'd have to regenerate the view with new fixed constants each time we
> > set up the tables, but that sounds like it could work...
>
> That's exactly what we're doing, but using inherited tables instead of a
> union view. With inheritance, there is no need to rebuild the view each
> time a table is added or removed. Basically, in our application, tables
> are partitioned by either month or week, depending on the type of data
> involved, and queries are normally date qualified.
>
> We're not completely done with our data conversion (from a commercial
> RDBMSi), but so far the results have been excellent. Similar to what
> others have said in this thread, the conversion involved restructuring
> the data to better suit Postgres, and the application (data
> analysis/mining vs. the source system which is operational). As a result
> we've compressed a > 1TB database down to ~0.4TB, and seen at least one
> typical query reduced from ~9 minutes down to ~40 seconds.

Sounds interesting.

The performance gain comes from partition elimination of the inherited
tables under the root?

I take it the compression comes from use of arrays, avoiding the need
for additional rows and key overhead?

Best Regards, Simon Riggs

Re: Data Warehouse Reevaluation - MySQL vs Postgres

From: Joe Conway
simon@2ndquadrant.com wrote:
> Joe Conway <mail@joeconway.com> wrote on 15.09.2004, 06:30:24:
>>We're not completely done with our data conversion (from a commercial
>>RDBMSi), but so far the results have been excellent. Similar to what
>>others have said in this thread, the conversion involved restructuring
>>the data to better suit Postgres, and the application (data
>>analysis/mining vs. the source system which is operational). As a result
>>we've compressed a > 1TB database down to ~0.4TB, and seen at least one
>>typical query reduced from ~9 minutes down to ~40 seconds.
>
> Sounds interesting.
>
> The performance gain comes from partition elimination of the inherited
> tables under the root?
>
> I take it the compression comes from use of arrays, avoiding the need
> for additional rows and key overhead?

Sorry, in trying to be concise I was not very clear. I'm using the term
compression very loosely here. I'll try to give a bit more background.

The original data source is a database schema designed for use by an
operational application that my company sells to provide enhanced
management of equipment that we also sell. The application needs to be
very flexible in exactly what data it stores in order to be useful
across a wide variety of equipment models and versions. To that end,
there is a very large central "transaction" table that stores
name->value pairs in varchar columns. The name->value pairs come from
parsed output of the equipment, and as such there is a fair amount of
redundancy and unneeded data that ends up getting stored. At each
installation in the field this table can get very large (> billion
rows). Additionally, the application pre-materializes a variety of
summaries for use by the operators through the GUI.

We collect the data exported from each of the systems in the field and
accumulate it in a single central database for data mining and analysis.
This is the database that is actually being converted. By compression I
really mean that unneeded and redundant data is being stripped out, and
data known to be of a certain datatype is stored in that type instead of
varchar (e.g. values known to be int are stored as int). Also the
summaries are not being converted (although we do some post processing
to create new materialized summaries).
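
To make the "compression" concrete, the kind of restructuring I mean
looks roughly like this (table and column names are invented for
illustration; the real schema is of course more involved):

```sql
-- Source: everything as untyped name->value pairs
--   txn_raw(txn_id integer, name varchar, value varchar)
-- Target: one properly typed column per attribute we actually need
CREATE TABLE txn_typed AS
SELECT txn_id,
       max(CASE WHEN name = 'err_count' THEN value END)::integer AS err_count,
       max(CASE WHEN name = 'model'     THEN value END)          AS model
  FROM txn_raw
 WHERE name IN ('err_count', 'model')  -- redundant/unneeded pairs dropped
 GROUP BY txn_id;
```

One typed row replaces several varchar rows, which is where much of the
space savings comes from.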

My points in telling this were:
  - the use of inherited tables to partition this huge number of rows and
    yet allow simple query access to it seems to work well, at least in
    early validation tests
  - had we simply taken the original database and "slammed" it into
    Postgres with no further thought, we would not have seen the big
    improvements, and thus the project might have been seen as a failure
    (even though it saves substantial $)
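
For concreteness, the inheritance scheme can be sketched roughly as
follows (names here are made up for illustration, not our actual
schema):

```sql
-- Root table: queried directly, holds no data itself
CREATE TABLE txn_log (
    txn_date  date NOT NULL,
    system_id integer NOT NULL,
    err_count integer
);

-- One child per month; each child inherits the root's columns, and a
-- CHECK constraint documents the date range it holds
CREATE TABLE txn_log_2004_09 (
    CHECK (txn_date >= '2004-09-01' AND txn_date < '2004-10-01')
) INHERITS (txn_log);

-- Loads target the appropriate child directly...
INSERT INTO txn_log_2004_09 VALUES ('2004-09-15', 1, 0);

-- ...while queries go through the root and see all children
SELECT count(*) FROM txn_log WHERE txn_date >= '2004-09-01';
```

Adding next month's partition is just another CREATE TABLE ... INHERITS;
nothing else has to be rebuilt, which is the advantage over the union
view.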

Hope that's a bit more clear. I'm hoping to write up a more detailed
case study once we've cut the Postgres system into production and the
dust settles a bit.

Joe

Re: Data Warehouse Reevaluation - MySQL vs Postgres

From: Josh Berkus
Joe,

>   - the use of inherited tables to partition this huge number of rows and
>     yet allow simple query access to it seems to work well, at least in
>     early validation tests
>   - had we simply taken the original database and "slammed" it into
>     Postgres with no further thought, we would not have seen the big
>     improvements, and thus the project might have been seen as a failure
>     (even though it saves substantial $)

Any further thoughts on developing this into true table partitioning?

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

Re: Data Warehouse Reevaluation - MySQL vs Postgres

From: Joe Conway
Josh Berkus wrote:
>>  - the use of inherited tables to partition this huge number of rows and
>>    yet allow simple query access to it seems to work well, at least in
>>    early validation tests
>>  - had we simply taken the original database and "slammed" it into
>>    Postgres with no further thought, we would not have seen the big
>>    improvements, and thus the project might have been seen as a failure
>>    (even though it saves substantial $)
>
>
> Any further thoughts on developing this into true table partitioning?
>

Just that I'd love to see it happen ;-)

Maybe someday I'll be able to find the time to work on it myself, but
for the moment I'm satisfied with the workarounds we've made.

Joe

Re: Data Warehouse Reevaluation - MySQL vs Postgres

From: Simon Riggs
Joe,

Your application is very interesting. I've just read your OSCON paper. I'd
like to talk more about that. Very similar to Kalido.

...but back to partitioning momentarily: Does the performance gain come from
partition elimination of the inherited tables under the root?

Best Regards, Simon Riggs

Re: Data Warehouse Reevaluation - MySQL vs Postgres

From: Joe Conway
Simon Riggs wrote:
> Joe,
>
> Your application is very interesting. I've just read your OSCON paper. I'd
> like to talk more about that. Very similar to Kalido.
>
> ...but back to partitioning momentarily: Does the performance gain come from
> partition elimination of the inherited tables under the root?

I think the major part of the performance gain comes from the fact that
the source database has different partitioning needs because of its
different purpose. The data there is basically partitioned by customer
installation instead of by date. Our converted schema partitions by
date, which is in line with the analytical queries run at the corporate
office. Again, this is an argument in favor of not simply porting what
you're handed.

We might get similar query performance with a single large table and
multiple partial indexes (e.g. one per month), but there would be one
tradeoff and one disadvantage to that:
1) The indexes would need to be generated periodically -- this is a
tradeoff since we currently need to create inherited tables at the same
periodicity
2) It would be much more difficult to "roll off" a month's worth of data
when needed. The general idea is that each month we create a new monthly
table, then archive and drop the oldest monthly table. If all the data
were in one big table, we would have to delete many millions of rows
from a (possibly) multi-billion row table and then vacuum it -- no
thanks ;-)
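
To illustrate the difference (dates and table names invented):

```sql
-- Single-table alternative: a partial index per month
CREATE INDEX txn_big_2004_09_idx ON txn_big (txn_date)
    WHERE txn_date >= '2004-09-01' AND txn_date < '2004-10-01';

-- Roll-off with inherited monthly tables: archive, then one cheap DROP
DROP TABLE txn_log_2003_09;

-- Roll-off with one big table: a massive DELETE followed by VACUUM
DELETE FROM txn_big
 WHERE txn_date >= '2003-09-01' AND txn_date < '2003-10-01';
VACUUM txn_big;
```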

Joe