Thread: General data warehousing questions

General data warehousing questions

From

"Sean Davis"

Date:

05 October 2008, 22:48:21

I am looking at the prospect of building a data warehouse of genomic
sequence data.  The machine that produces the data adds about
300million rows per month in a central fact table and we will
generally want the data to be "online".  We don't need instantaneous
queries, but we would be using the data for data mining purposes and
running some "real-time" queries for reporting and research purposes.
I have had the pleasure of working on an Netezza box where this type
of thing is quite standard, but we don't have that access anymore, so
I'm looking for hints on using postgres in a data warehousing/mining
environment.  Any suggestions on how DDL, loading, backup, indexing,
or (to a certain extent) hardware?

Thanks,
Sean

Re: General data warehousing questions

From

"Scott Marlowe"

Date:

05 October 2008, 23:07:07

On Sun, Oct 5, 2008 at 7:48 PM, Sean Davis <sdavis2@mail.nih.gov> wrote:
> I am looking at the prospect of building a data warehouse of genomic
> sequence data.  The machine that produces the data adds about
> 300million rows per month in a central fact table and we will
> generally want the data to be "online".  We don't need instantaneous
> queries, but we would be using the data for data mining purposes and
> running some "real-time" queries for reporting and research purposes.
> I have had the pleasure of working on an Netezza box where this type
> of thing is quite standard, but we don't have that access anymore, so
> I'm looking for hints on using postgres in a data warehousing/mining
> environment.  Any suggestions on how DDL, loading, backup, indexing,
> or (to a certain extent) hardware?

I assume you're familiar with stuff like star schemas.

For loading you might want to look at things like pg_bulkloader, copy,

For indexing remember that you have partial and fuctional indexes in
postgresql and they can come in quite handy.

For backup of large changing databases look into PITR.

As for hardware, you need enough CPU horsepower and memory to handle
however many users you're gonna have running simultaneous queries, but
more important is usually the drive subsystem.  Throwing drives,
battery backed cache and a good RAID controller can make a big
difference.  Usual RAID-10 is preferred, as writes are much faster. If
you're really squeezed for space and money then you can use RAID-5 but
it has some seriously negative performance implications for parallel
load handling and write speed.

Re: General data warehousing questions

From

Shane Ambler

Date:

06 October 2008, 12:40:15

Scott Marlowe wrote:
> On Sun, Oct 5, 2008 at 7:48 PM, Sean Davis <sdavis2@mail.nih.gov> wrote:
>> I am looking at the prospect of building a data warehouse of genomic
>> sequence data.  The machine that produces the data adds about
>> 300million rows per month in a central fact table and we will
>> generally want the data to be "online".  We don't need instantaneous
>> queries, but we would be using the data for data mining purposes and
>> running some "real-time" queries for reporting and research purposes.
>> I have had the pleasure of working on an Netezza box where this type
>> of thing is quite standard, but we don't have that access anymore, so
>> I'm looking for hints on using postgres in a data warehousing/mining
>> environment.  Any suggestions on how DDL, loading, backup, indexing,
>> or (to a certain extent) hardware?
>
> I assume you're familiar with stuff like star schemas.
>
> For loading you might want to look at things like pg_bulkloader, copy,
>
> For indexing remember that you have partial and fuctional indexes in
> postgresql and they can come in quite handy.
>
> For backup of large changing databases look into PITR.
>
> As for hardware, you need enough CPU horsepower and memory to handle
> however many users you're gonna have running simultaneous queries, but
> more important is usually the drive subsystem.  Throwing drives,
> battery backed cache and a good RAID controller can make a big
> difference.  Usual RAID-10 is preferred, as writes are much faster. If
> you're really squeezed for space and money then you can use RAID-5 but
> it has some seriously negative performance implications for parallel
> load handling and write speed.
>

You may also be interested in looking at BioPostgres
http://phenomics.cs.ucla.edu/



--

Shane Ambler
pgSQL (at) Sheeky (dot) Biz

Get Sheeky @ http://Sheeky.Biz