Re: How to best use 32 15k.7 300GB drives? - Mailing list pgsql-performance

From Stephen Frost
Subject Re: How to best use 32 15k.7 300GB drives?
Date
Msg-id 20110128131410.GC30352@tamriel.snowman.net
In response to Re: How to best use 32 15k.7 300GB drives?  (Robert Schnabel <schnabelr@missouri.edu>)
Responses Re: How to best use 32 15k.7 300GB drives?  (Robert Schnabel <schnabelr@missouri.edu>)
List pgsql-performance
Robert,

* Robert Schnabel (schnabelr@missouri.edu) wrote:
> Once the bulk data is inserted into the tables I generally
> do some updates on columns to set values which characterize the
> data.

Please tell me you're not running actual full-table UPDATE statements...
You would be *much* better off either:
a) munging the data on the way in (if possible/reasonable)
b) loading the data into temp tables first, and then using INSERT
   statements to move the data into the 'final' tables WITH the new
   columns/info you want (a rough sketch of this is below)
c) considering if you can normalize the data into multiple tables and/or
   to cut down the columns to only what you need as you go through the
   above, too

A full-table UPDATE means you're basically making the table twice as big
as it needs to be (every updated row leaves a dead copy of the old
version behind), and it'll never get smaller without a VACUUM FULL or
CLUSTER..
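
For example, a rough sketch of (b) might look like the following.  The
table and column names here are made up, and the CASE is just a
stand-in for whatever characterization you're actually computing:

    BEGIN;

    -- staging table lives only for this transaction
    CREATE TEMPORARY TABLE stage_genotypes (
        sample_id   integer,
        snp_number  integer,
        raw_call    text
    ) ON COMMIT DROP;

    -- use \copy from psql instead if the file lives on the client side
    COPY stage_genotypes FROM '/path/to/chunk.csv' WITH CSV;

    -- derive the extra columns on the way into the real table,
    -- instead of UPDATEing them in afterwards
    INSERT INTO genotypes (sample_id, snp_number, raw_call, call_class)
    SELECT sample_id,
           snp_number,
           raw_call,
           CASE WHEN raw_call IN ('AA','BB') THEN 'hom' ELSE 'het' END
      FROM stage_genotypes;

    COMMIT;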

> These columns then get indexed.  Basically once the initial
> manipulation is done the table is then static and what I'm looking
> for is query speed.

Sadly, these are the same kind of DW needs that I've got (though with
telecom data and phone calls, not genetic stuff ;), and PG ends up
being limited by the fact that it can only use one core/thread to go
through the data.

You might consider investing some time trying to figure out how to
parallelize your queries.  My approach to this has been to partition
the data (probably something you're doing already) into multiple
tables, have shell/perl scripts run a given query against all of the
tables in parallel, dump the results of that aggregation/analysis into
other tables, and then run a final 'merge' query.
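
As a hedged illustration (the table names are invented), each worker
would run something like the first statement against its own child
table, with a driver script kicking off one psql per child:

    -- each parallel worker runs this against a different child table
    INSERT INTO results_parts (sample_id, n_hets)
    SELECT sample_id, count(*)
      FROM genotypes_part01        -- worker-specific child table
     WHERE call_class = 'het'
     GROUP BY sample_id;

    -- final 'merge' query once all the workers have finished
    SELECT sample_id, sum(n_hets) AS n_hets
      FROM results_parts
     GROUP BY sample_id;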

> The data is sorted by snp_number, sample_id.  So if I want the data
> for a given sample_id it would be a block of ~58k rows.  The size of
> the table depends on how many sample_id's there are.  My largest has
> ~30k sample_id by 58k snp_number per sample.  The other big table
> (with children) is "mutations" and is set up similarly so that I can
> access individual tables (samples) based on constraints.  Each of
> these children have between 5-60M records.

Understand that indexes are typically only going to be used/useful if
the number of records being returned is small relative to the size of
the table (e.g. 5%).
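
As a quick illustration (again with made-up names), EXPLAIN will
usually show an index scan for a selective predicate and a sequential
scan when most of the table qualifies:

    -- one sample's ~58k rows out of hundreds of millions:
    -- likely an index/bitmap scan
    EXPLAIN SELECT * FROM genotypes WHERE sample_id = 12345;

    -- most of the table qualifies: the planner will prefer a seq scan,
    -- and an index wouldn't help even if it were used
    EXPLAIN SELECT * FROM genotypes WHERE snp_number > 0;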

> This is all direct attach storage via SAS2 so I'm guessing it's
> probably limited to the single port link between the controller and
> the expander.  Again, geneticist here not computer scientist. ;-)

That link certainly isn't going to help things..  You might consider how
or if you can improve that.

> All of the data could be reloaded.  Basically, once I get the data
> into the database and I'm done manipulating it I create a backup
> copy/dump which then gets stored at a couple different locations.

You might consider turning fsync off while you're doing these massive
data loads..  Also make sure that you issue your 'CREATE TABLE' and
your 'COPY' statements in the same transaction (which lets the COPY
skip WAL logging when archiving is off), and again, I suggest loading
into temporary (CREATE TEMPORARY TABLE) tables first, then doing the
CREATE TABLE/INSERT statement for the 'real' table.  Make sure that
you create *both* your constraints *and* your indexes *after* the
table is populated.

If you turn fsync off, make sure you turn it back on. :)
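
A rough version of that load recipe, with invented table and index
names (and the WAL-skipping trick for COPY only applies when WAL
archiving is off):

    -- in postgresql.conf, just for the load, then revert:
    --   fsync = off

    BEGIN;
    CREATE TABLE mutations_s123 (LIKE mutations INCLUDING DEFAULTS);
    COPY mutations_s123 FROM '/path/to/mutations_s123.csv' WITH CSV;
    COMMIT;

    -- constraints and indexes only once the data is in
    ALTER TABLE mutations_s123
        ADD CONSTRAINT mutations_s123_pkey
        PRIMARY KEY (snp_number, sample_id);
    CREATE INDEX mutations_s123_sample_id_idx
        ON mutations_s123 (sample_id);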

> My goal is to 1) have a fairly robust system so that I don't have to
> spend my time rebuilding things and 2) be able to query the data
> quickly.  Most of what I do are ad hoc queries.  I have an idea...
> "how many X have Y in this set of Z samples" and write the query to
> get the answer.  I can wait a couple minutes to get an answer but
> waiting an hour is becoming tiresome.

Have you done any analysis to see what the bottleneck actually is?
When you run top, is your PG process constantly in 'D' state (waiting
on I/O), or is it in 'R' state (on the CPU)?  Might help figure some
of that out.  Note that parallelizing the query will help regardless
of whether it's disk-bound or CPU-bound, when you're running on the
kind of hardware you're talking about (lots of spindles, multiple
CPUs, etc).
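
If you'd rather look from inside the database, something like this
(column names as in the 8.4/9.0-era pg_stat_activity; they've been
renamed in later releases) shows what each backend is running and
whether it's stuck waiting on a lock:

    SELECT procpid, waiting, now() - query_start AS runtime,
           current_query
      FROM pg_stat_activity
     WHERE current_query <> '<IDLE>';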

    Thanks,

        Stephen
