Re: Postgres for a "data warehouse", 5-10 TB - Mailing list pgsql-performance

From Stephen Frost
Subject Re: Postgres for a "data warehouse", 5-10 TB
Date
Msg-id 20110911230135.GN12765@tamriel.snowman.net
Whole thread Raw
In response to Postgres for a "data warehouse", 5-10 TB  (Igor Chudov <ichudov@gmail.com>)
Responses Re: Postgres for a "data warehouse", 5-10 TB  (Igor Chudov <ichudov@gmail.com>)
List pgsql-performance
* Igor Chudov (ichudov@gmail.com) wrote:
> Right now I have a personal (one user) project to create a 5-10
> Terabyte data warehouse. The largest table will consume the most space
> and will take, perhaps, 200,000,000 rows.

I run data-warehouse databases on that order (current largest single
instance is ~4TB running under 9.0.4).  If the largest table is only
200M rows, PG should handle that quite well.  Our data is partitioned by
month and each month is about 200M records and simple queries can run in
15-20 minutes (with a single thread), with complex windowing queries
(split up and run in parallel) finishing in a couple of hours.

> However, while an hour is fine, two weeks per query is NOT fine.

What's really, really, really useful are two things: EXPLAIN, and this
mailing list. :)  Seriously, run EXPLAIN on your queries before you run
them and see if how the query is going to be executed makes sense.
Here's a real easy hint: if it says "External Sort" and has big numbers,
come talk to us here- that's about one of the worst things you can
possibly do.  Of course, PG's going to avoid doing that, but you may
have written a query (unintentionally) which forces PG to do a sort, or
something else.

> I have a server with about 18 TB of storage and 48 GB of RAM, and 12
> CPU cores.

If you partition up your data and don't mind things running in different
transactions, you can definitely get a speed boost with PG by running
things in parallel.  PG will handle that very well, in fact, if two
queries are running against the same table, PG will actually combine
them and only actually read the data from disk once.

> I cannot shell out $47,000 per CPU for Oracle for this project.

The above data warehouse was migrated from an Oracle-based system. :)

> To be more specific, the batch queries that I would do, I hope,
> would either use small JOINS of a small dataset to a large dataset, or
> just SELECTS from one big table.

Make sure that you set your 'work_mem' correctly- PG will use that to
figure out if it can hash the small table (you want that to happen,
trust me..).  If you do end up having sorts, it'll also use the work_mem
value to figure out how much memory to use for sorting.

> So... Can Postgres support a 5-10 TB database with the use pattern
> stated above?

Yes, certainly.

    Thanks,

        Stephen

Attachment

pgsql-performance by date:

Previous
From: Stephen Frost
Date:
Subject: Re: RAID Controller (HP P400) beat by SW-RAID?
Next
From: Igor Chudov
Date:
Subject: Re: Postgres for a "data warehouse", 5-10 TB