Re: How to best use 32 15k.7 300GB drives? - Mailing list pgsql-performance

From david@lang.hm
Subject Re: How to best use 32 15k.7 300GB drives?
Date
Msg-id alpine.DEB.2.00.1101271639560.17579@asgard.lang.hm
Whole thread Raw
In response to Re: How to best use 32 15k.7 300GB drives?  (david@lang.hm)
List pgsql-performance
sorry for not replying properly to your response, I managed to delete the
mail.

as I understand your data access pattern it's the following:

for the main table space:

bulk loads every couple of weeks. if the data is lost you can just reload
it.

searches tend to be extracting large sequential chunks of data, either to
external files or into different tables spaces.

for this table space, you are basically only inserting every couple of
weeks, and it sounds as if you do not really care how long it takes to
load the data.


first the disclaimer, I'm not a postgres expert, but I do have good
experiance with large amounts of data on linux systems (and especially
running into the limitations when doing it on the cheap ;-)


with this data pattern your WAL is meaningless (as it's only relavent for
isertes), and you may as well use raid6 as raid10 (both allow you to
utalize all drives for reads, but raid6 gives you 2 drives worth of
reducnancy while the wrong two drives on raid10 could kill the entire
array). You may even want to disable fsync on imports. It will save you a
lot of time, and if the system crashes during the load you can just
reinitialize and reload the data.

however, since you are going to be large sequential data transfers, you
want to be utalizing multiple SAS links, preferrably as evenly as
possible, so rather than putting all your data drives on one port, you may
want to spread them between ports so that your aggragate bandwidth to the
drives is higher (with this many high speed drives, this is a significant
limitation)


the usual reason for keeping the index drives separate is to avoid having
writes interact with index reads. Since you are not going to be doing both
at the same time, I don't know if it helps to separate your indexes.


now, if you pull the data from this main table into a smaller table for
analysis, you may want to do more interesting things with the drives that
you use for this smaller table as you are going to be loading data into
them more frequently.

David Lang


On Thu, 27 Jan 2011, david@lang.hm wrote:

> Date: Thu, 27 Jan 2011 15:19:32 -0800 (PST)
> From: david@lang.hm
> To: Robert Schnabel <schnabelr@missouri.edu>
> Cc: pgsql-performance <pgsql-performance@postgresql.org>
> Subject: Re: [PERFORM] How to best use 32 15k.7 300GB drives?
>
> On Thu, 27 Jan 2011, Robert Schnabel wrote:
>
>> HI,
>>
>> I use PostgreSQL basically as a data warehouse to store all the genetic
>> data that our lab generates.  The only person that accesses the database is
>> myself and therefore I've had it housed on my workstation in my office up
>> till now. However, it's getting time to move it to bigger hardware.  I
>> currently have a server that is basically only storing backup images of all
>> our other workstations so I'm going to move my database onto it.  The
>> server looks like this: Windows Server Enterprise 2008 R2 64-bit, AMD 2350
>> quad-core x2, 32GB RAM.  For my purposes the CPUs and RAM are fine.  I
>> currently have an Adaptec 52445+BBU controller that has the OS (4 drive
>> RAID5), FTP (4 drive RAID5) and two backup arrays (8 drive each RAID0).
>> The backup arrays are in a 16 drive external enclosure through an expander
>> so I actually have 16 ports free on the 52445 card.  I plan to remove 3 of
>> the drives from my backup arrays to make room for 3 - 73GB 15k.5 drives
>> (re-purposed from my workstation).  Two 16 drive enclosures with SAS2
>> expanders just arrived as well as 36 Seagate 15k.7 300GB drives
>> (ST3300657SS).  I also intend on getting an Adaptec 6445 controller with
>> the flash module when it becomes available in about a month or two.  I
>> already have several Adaptec cards so I'd prefer to stick with them.
>>
>> Here's the way I was planning using the new hardware:
>> xlog & wal: 3 - 73G 15k.5 RAID1+hot spare in enclosure A on 52445
>> controller
>> data: 22 - 300G 15k.7 RAID10 enclosure B&C on 6445 controller
>> indexes: 8 - 300G 15k.7 RAID10 enclosure C on 6445 controller
>> 2 - 300G 15k.7 as hot spares enclosure C
>> 4 spare 15k.7 for on the shelf
>>
>> With this configuration I figure I'll have ~3TB for my main data tables and
>> 1TB for indexes.  Right now my database is 500GB total.  The 3:1 split
>> reflects my current table structure and what I foresee coming down the road
>> in terms of new data.
>>
>> So my questions are 1) am I'm crazy for doing this, 2) would you change
>> anything and 3) is it acceptable to put the xlog & wal (and perhaps tmp
>> filespace) on a different controller than everything else?  Please keep in
>> mind I'm a geneticist who happens to know a little bit about bioinformatics
>> and not the reverse. :-)
>
> a number of questions spring to mind
>
> how much of the time are you expecting to spend inserting data into this
> system vs querying data from the system?
>
> is data arriving continuously, or is it a matter of receiving a bunch of
> data, inserting it, then querying it?
>
> which do you need to optimize for, insert speed or query speed?
>
> do you expect your queries to be searching for a subset of the data scattered
> randomly throughlut the input data, or do you expect it to be 'grab this
> (relativly) sequential chunk of input data and manipulate it to generate a
> report' type of thing
>
> what is your connectvity to the raid enclosures? (does putting 22 drives on
> one cable mean that you will be limited due to the bandwidth of this cable
> rather than the performance of the drives)
>
> can you do other forms of raid on these drives or only raid 10?
>
> how critical is the data in this database? if it were to die would it just be
> a matter of recreating it and reloading the data? or would you loose
> irreplaceable data?
>
> David Lang
>

pgsql-performance by date:

Previous
From: Robert Schnabel
Date:
Subject: Re: How to best use 32 15k.7 300GB drives?
Next
From: "Anne Rosset"
Date:
Subject: Re: FW: Queries becoming slow under heavy load