Scott Marlowe wrote:
> On Sun, Oct 5, 2008 at 7:48 PM, Sean Davis <sdavis2@mail.nih.gov> wrote:
>> I am looking at the prospect of building a data warehouse of genomic
>> sequence data. The machine that produces the data adds about
>> 300 million rows per month in a central fact table and we will
>> generally want the data to be "online". We don't need instantaneous
>> queries, but we would be using the data for data mining purposes and
>> running some "real-time" queries for reporting and research purposes.
>> I have had the pleasure of working on a Netezza box where this type
>> of thing is quite standard, but we don't have that access anymore, so
>> I'm looking for hints on using postgres in a data warehousing/mining
>> environment. Any suggestions on how DDL, loading, backup, indexing,
>> or (to a certain extent) hardware?
>
> I assume you're familiar with stuff like star schemas.
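As a sketch of what a star schema for this workload might look like (table and column names here are hypothetical, assuming a per-run fact table with date and sample dimensions; at ~300M rows/month, partitioning the fact table by month via table inheritance is also worth a look):

```sql
-- Hypothetical dimension tables
CREATE TABLE dim_date (
    date_key    integer PRIMARY KEY,
    full_date   date NOT NULL
);

CREATE TABLE dim_sample (
    sample_key  serial PRIMARY KEY,
    sample_name text NOT NULL
);

-- Central fact table: narrow rows, foreign keys into the dimensions
CREATE TABLE fact_sequence (
    date_key    integer NOT NULL REFERENCES dim_date,
    sample_key  integer NOT NULL REFERENCES dim_sample,
    read_count  bigint  NOT NULL
);
```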
>
> For loading you might want to look at things like pg_bulkload or COPY.
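For example, a COPY-based load (file path and table/column names are just placeholders) avoids the per-row overhead of individual INSERTs:

```sql
-- Bulk-load a tab-delimited file server-side
COPY fact_sequence (date_key, sample_key, read_count)
    FROM '/data/run_2008_10.tsv';

-- Or from the client side with psql:
-- \copy fact_sequence FROM 'run_2008_10.tsv'
```

Dropping or deferring indexes during the load and rebuilding afterwards usually speeds things up further.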
>
> For indexing remember that you have partial and functional indexes in
> postgresql and they can come in quite handy.
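To illustrate both (table and column names are hypothetical):

```sql
-- Partial index: only index the slice of rows the reporting
-- queries actually touch, keeping the index small
CREATE INDEX idx_fact_recent ON fact_sequence (sample_key)
    WHERE date_key >= 20081001;

-- Functional (expression) index: matches queries that filter
-- on a computed value such as lower(sample_name)
CREATE INDEX idx_sample_lower ON dim_sample (lower(sample_name));
```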
>
> For backup of large changing databases look into PITR.
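PITR boils down to WAL archiving plus periodic base backups; a minimal postgresql.conf sketch (the archive destination is a placeholder):

```
# postgresql.conf -- ship each completed WAL segment to an archive area
archive_mode = on
archive_command = 'cp %p /mnt/archive/%f'
```

Recovery then replays archived WAL from the last base backup via restore_command in recovery.conf, optionally stopping at a recovery_target_time.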
>
> As for hardware, you need enough CPU horsepower and memory to handle
> however many users you're gonna have running simultaneous queries, but
> more important is usually the drive subsystem. Throwing more drives, a
> battery-backed cache, and a good RAID controller at the problem can
> make a big difference. Usually RAID-10 is preferred, as writes are
> much faster. If
> you're really squeezed for space and money then you can use RAID-5 but
> it has some seriously negative performance implications for parallel
> load handling and write speed.
>
You may also be interested in looking at BioPostgres
http://phenomics.cs.ucla.edu/
--
Shane Ambler
pgSQL (at) Sheeky (dot) Biz
Get Sheeky @ http://Sheeky.Biz