From: Mischa Sandberg
Subject: Re: Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables
Msg-id: pB21d.193816$X12.122954@edtnps84
In response to: Data Warehouse Reevaluation - MySQL vs Postgres (Mark Cotner <mcotner@yahoo.com>)
List: pgsql-performance

Mark Cotner wrote:
> Hi all,
> I had a difficult time deciding which list to post
> this to, so please forgive me if this list doesn't
> perfectly match my questions.  My decision will not
> solely be based on performance, but it is the primary
> concern.  I would be very appreciative if you all
> could comment on my test plan.  Real world examples of
> a Postgres implementation of >=600G with a web
> front-end would be great, or any data warehouse with
> some size to it.

I'm only in the 30GB database range, in case that's a consideration
for the comments that follow.

At this time, I'm working out the best ROLAP options for our PG
transaction store. The transaction store is highly volatile -- the
longest a txn stays in it is 15 days -- so you can imagine the need for
historic summaries :-)

I've also implemented multiple data servers, including
a federated server that had to make the best of existing engines
(MySQL, PG, and everything from MS Jet to RedBrick in the commercial
world).

> The time has come to reevaluate/rearchitect an
> application which I built about 3 years ago.  There
> are no performance concerns with MySQL, but it would
> benefit greatly from stored procedures, views, etc.

If your company is currently happy with MySQL, there are probably other
(nontechnical) reasons to stick with it. I'm impressed that you're
willing to reconsider PG.

> Some of the mining that we do could benefit from
> stored procedures as well.  MySQL may have these in
> the works, but we won't be able to move to a version
> of MySQL that supports stored procs for another year
> or two.

And PG lets you write back-end functions in some powerful pattern- and
aggregate-handling languages, like Perl. That was a definite plus for
data mining of web traffic, for example. The power of server-side
extensibility to bail you out of a design dead-end is not
inconsequential.
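
For instance, something like this (a sketch only -- the web_hits table
and the patterns are made up) pushes user-agent classification into the
server:

    CREATE FUNCTION browser_family(text) RETURNS text AS '
        my ($ua) = @_;
        # crude classification of a user-agent string
        return ''MSIE''  if $ua =~ /MSIE/;
        return ''Gecko'' if $ua =~ /Gecko/;
        return ''other'';
    ' LANGUAGE plperl;

    SELECT browser_family(user_agent) AS family, count(*)
    FROM   web_hits
    GROUP  BY family;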

PG doesn't have a PIVOT operator (cf. Oracle and MSSQL), but it
otherwise makes the translation from rows to columns fairly painless.
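
The usual workaround is an aggregate over CASE expressions. For example
(a sketch; the traffic_summary table and its columns are made up):

    SELECT page,
           SUM(CASE WHEN method = 'GET'  THEN hits ELSE 0 END) AS get_hits,
           SUM(CASE WHEN method = 'POST' THEN hits ELSE 0 END) AS post_hits
    FROM   traffic_summary
    GROUP  BY page;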

> Requirements:
> Merge table definition equivalent.  We use these
> extensively.

I looked all over mysql.com etc., and AFAICS a MERGE table is indeed
exactly a view of a UNION ALL. Is that right?

PG supports views, of course, as well (now) as tablespaces, allowing you
to split tables/tablesets across multiple disk systems.
PG is also pretty efficient in its query plans on such views, where
(say) you make one column a constant (an identifier, of sorts) per
input table.
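
For example (a sketch; the txn_* monthly tables are made-up stand-ins
for yours):

    -- one physical table per month, glued together by a view
    CREATE VIEW txn_all AS
          SELECT 200408 AS period, t.* FROM txn_200408 t
    UNION ALL
          SELECT 200409 AS period, t.* FROM txn_200409 t;

    -- the constant column lets the planner throw away branches
    -- that can't match (a 'One-Time Filter: false' in EXPLAIN)
    SELECT * FROM txn_all WHERE period = 200409 AND acct_id = 42;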

> Merge table equivalent with all tables containing over
> 100M rows (and about 40 columns, some quite wide) will
> need to do index scans in at most 5 seconds (MySQL
> currently does 2, but we can live with 5) and return
> ~200 rows.

PG has TOAST for handling REALLY BIG columns, and the generic TEXT type
is as efficient as any size-specific VARCHAR(n) type ... that should
make things easier for you.

> Um, gonna sound silly, but the web interface has to
> remain "snappy" under load.  I don't see this as a
> major concern since you don't require table locking.

Agreed. It's more a matter of your warehouse design and intelligent
bounding of queries. I'd say PG's query planner is a few years ahead of
MySQL's for large and complex queries.

> If business logic is moved to the database (likely with
> Postgres) performance for inserting with light logic
> on each insert has to keep up with the 4.5M inserts
> per 2 hours (which MySQL completes in ~35min
> currently).  Acceptable numbers for this aggregation
> would be 45-55min using stored procedures.

Again, it's a matter of pipeline design. The tools for building an
efficient pipeline are at least as good in PG as in MySQL.

If you try to insert and postprocess information one row at a time,
procedures or no, there's no offhand way to guarantee your performance
without a test/prototype.

On the other hand, if you do warehouse-style loading (INSERT, or PG
COPY, into a temp table, then 'upsert' into the permanent table), I can
guarantee that 2500 inserts/sec is no problem.
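
For example (a sketch only; the facts/stage tables and the id column
are made up, and a single loader process is assumed):

    -- stage the batch; COPY is by far the fastest way in
    -- (the file path here is server-side and hypothetical)
    CREATE TEMP TABLE stage AS SELECT * FROM facts LIMIT 0;
    COPY stage FROM '/tmp/batch.dat';

    -- 'upsert', step 1: update the rows that already exist ...
    UPDATE facts
    SET    hits = facts.hits + stage.hits
    FROM   stage
    WHERE  facts.id = stage.id;

    -- ... step 2: insert the rest
    INSERT INTO facts
    SELECT stage.*
    FROM   stage
    WHERE  NOT EXISTS (SELECT 1 FROM facts WHERE facts.id = stage.id);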

> Here's our case study if you're interested . . .
> http://www.mysql.com/customers/customer.php?id=16
