Thread: 12 disks raid setup

12 disks raid setup

From
Franck Routier
Date:
Hi,

I am in the process of setting up a postgresql server with 12 SAS disks.

I am considering two options:

1) set up a 12 disks raid 10 array to get maximum raw performance from
the system and put everything on it (it the whole pg cluster, including
WAL, and every tablespcace)

2) set up 3 raid 10 arrays of 4 disks, and dispatch my data on these
disks via tablespaces :

data1 = pg cluster + references data (dimensions) tablespace
data2 = fact data tablespace
data3 = indices tablespace

Typical workload is either massive insert/update via ETL or complex
queries on big (10 millions tuples) tables with several joins (including
Mondrian ROLAP).

Does anyone have an opinion of what could give best results ?

Thanks,
Franck



Re: 12 disks raid setup

From
"Scott Marlowe"
Date:
On Fri, Feb 29, 2008 at 5:51 AM, Franck Routier
<franck.routier@axege.com> wrote:
> Hi,
>
>  I am in the process of setting up a postgresql server with 12 SAS disks.
>
>  I am considering two options:
>
>  1) set up a 12 disks raid 10 array to get maximum raw performance from
>  the system and put everything on it (it the whole pg cluster, including
>  WAL, and every tablespcace)
>
>  2) set up 3 raid 10 arrays of 4 disks, and dispatch my data on these
>  disks via tablespaces :

How you set it up depends on your RAID controller as much as anything.
 Good battery backed RAID controllers seem to work better with one big
RAID-10 array.  But as with anything, unless you benchmark it, you're
really just guessing which is best.

Re: 12 disks raid setup

From
Franck Routier
Date:
Hi,

my Raid controller is an Adaptec 31205 SAS/RAID controller. The battery
was an option, but I didn't know it at purchase time. So I have no
battery, but the whole system is on an UPS.

I have done quite a few tests using bonnie++, focusing on 'random seek'
results, and found out that:

1) linux md raid 10 performs better than Adaptec hardware raid in this
field (random seek) by 15%, quite consistently
2) hardware raid is better on sequential output
3) md outperforms it again when coming to sequential read, especially
with far layout option.

So in fact I think I will use md raid, but still don't know with which
layout (3x4 or 1x12).
What would you suggest as a benchmarking method ? Simply issue a few big
queries that I expect to be usual and see how long it last, or is there
a more convinient and or "scientific" method ?

Thanks,
Franck



Re: 12 disks raid setup

From
Greg Smith
Date:
On Fri, 29 Feb 2008, Franck Routier wrote:

> my Raid controller is an Adaptec 31205 SAS/RAID controller. The battery
> was an option, but I didn't know it at purchase time. So I have no
> battery, but the whole system is on an UPS.

The UPS is of no help here.  The problem is that PostgreSQL forces the
disk controller to commit WAL writes to disk after every transaction.  If
you have a controller with a battery-backed cache, you can use that cache
to buffer those writes and dramatically increase write performance.  The
USP doesn't give you the same write guarantees.  Let's say someone trips
over the server power cord (simplest example of a whole class of
failures).  With the BBC controller, the cached writes will get committed
when you plug the server back in.  If all you've got is a UPS, writes that
didn't make it to disk before the outage are lost.  That means you can't
buffer those writes without risking database corruption.

The general guideline here is that if you don't have a battery-backed
cache on your controller, based on disk rotation speed you'll be limited
to around 100 (7200 RPM) to 200 (15K RPM) commits/second per single
client, with each commit facing around a 2-4ms delay.  That rises to
perhaps 500/s total with lots of clients.  BBC configurations can easily
clear 3000/s total and individual commits don't have that couple of ms
delay.

> So in fact I think I will use md raid, but still don't know with which
> layout (3x4 or 1x12).

The only real downside of md RAID is that if you lose the boot device it
can be tricky to get the system to start again; hardware RAID hides that
little detail from the BIOS.  Make sure you simulate a failure of the
primary boot drive and are comfortable with recovering from that situation
before you go into production with md.

The only way to know which layout will work better is to have a lot of
knowledge of this application and how it bottlenecks under load.  If you
know, for example, that there's a particular set of tables/indexes that
are critical to real-time users, whereas others are only used by batch
operations, things like that can be used to figure out how to optimize
disk layout.  If you don't know your database to that level, put
everything into one big array and forget about it; you won't do any better
than that.

> What would you suggest as a benchmarking method ? Simply issue a few big
> queries that I expect to be usual and see how long it last, or is there
> a more convinient and or "scientific" method ?

Benchmarking is hard and you have to use a whole array of tests if you
want to quantify the many aspects of performance.  You're doing the right
thing using bonnie++ to quantify disk speed.  If you've got some typical
queries, using those to fine-tune postgresql.conf parameters is a good
idea; just make sure to set shared_buffers, estimated_cache_size, and run
ANALYZE on your tables.  Be careful to note performance differences when
the cache is already filled with data from previous runs.  Measuring
write/commit performance is probably easiest using pgbench.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: 12 disks raid setup

From
"Jignesh K. Shah"
Date:

Franck Routier wrote:
> Hi,
>
> I am in the process of setting up a postgresql server with 12 SAS disks.
>
> I am considering two options:
>
> 1) set up a 12 disks raid 10 array to get maximum raw performance from
> the system and put everything on it (it the whole pg cluster, including
> WAL, and every tablespcace)
>
> 2) set up 3 raid 10 arrays of 4 disks, and dispatch my data on these
> disks via tablespaces :
>
> data1 = pg cluster + references data (dimensions) tablespace
> data2 = fact data tablespace
> data3 = indices tablespace
>
>
>

Option 2: Infact I would also say within one of the RAID1 use another
softpartition and separate out pg_xlog also.

My 2 cents based on my benchmarks.

-Jignesh


Re: 12 disks raid setup

From
"Joshua D. Drake"
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Fri, 29 Feb 2008 12:17:29 -0500
"Jignesh K. Shah" <J.K.Shah@sun.com> wrote:

> 
> 
> Franck Routier wrote:
> > Hi,
> >
> > I am in the process of setting up a postgresql server with 12 SAS
> > disks.
> >
> > I am considering two options:
> >
> > 1) set up a 12 disks raid 10 array to get maximum raw performance
> > from the system and put everything on it (it the whole pg cluster,
> > including WAL, and every tablespcace)

I would do this (assuming you have other spindles for the OS):

/data1 - RAID 10  journalled filesystem + 1 (so 9 disks)
/xlogs - RAID 1 non journalled filesystem + 1 (so 3 disks)


You can create any number of tablespaces for further growth you see fit
and move them as more IO becomes available.

Sincerely,

Joshua D. Drake


- -- 
The PostgreSQL Company since 1997: http://www.commandprompt.com/ 
PostgreSQL Community Conference: http://www.postgresqlconference.org/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL SPI Liaison | SPI Director |  PostgreSQL political pundit

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFHyEX6ATb/zqfZUUQRAu1XAKCpszYwF4dbI0hidg71JhmcrPqbmACcDhdc
E0qVOtKrUBpEEerGUjTMF9I=
=LzZS
-----END PGP SIGNATURE-----

Re: 12 disks raid setup

From
Matthew
Date:
On Fri, 29 Feb 2008, Joshua D. Drake wrote:
> /data1 - RAID 10  journalled filesystem + 1 (so 9 disks)
> /xlogs - RAID 1 non journalled filesystem + 1 (so 3 disks)

Sounds good. Can't they share the hot spare, rather than having two?

However, I would recommend splashing out on the battery for the cache, and
then just putting then all in one RAID 10 lump.

Matthew

--
To be or not to be           -- Shakespeare
To do is to be               -- Nietzsche
To be is to do               -- Sartre
Do be do be do               -- Sinatra

Re: 12 disks raid setup

From
Vivek Khera
Date:
On Feb 29, 2008, at 9:51 AM, Franck Routier wrote:

> my Raid controller is an Adaptec 31205 SAS/RAID controller. The
> battery
> was an option, but I didn't know it at purchase time. So I have no
> battery, but the whole system is on an UPS.

Go find one on ebay or google search, and plug it in.  Adaptec
batteries just snap in and sometimes have a bracket to clip them in
place.

Your performance will be awful without one since you can't safely
write cache.  Also, if your card has upgradable RAM (but I've never
seen an adaptec card which could) max it out.


Re: 12 disks raid setup

From
Mark Kirkwood
Date:
Greg Smith wrote:
>
> The only real downside of md RAID is that if you lose the boot device
> it can be tricky to get the system to start again; hardware RAID hides
> that little detail from the BIOS.  Make sure you simulate a failure of
> the primary boot drive and are comfortable with recovering from that
> situation before you go into production with md.

+1

I usually ensure there is a separate /boot that is setup RAID1 (with md
using all the disks for the RAID1 - so the I can keep the partition map
the same for all the disks, otherwise it is fiddly!)

Cheers

Mark

Re: 12 disks raid setup

From
Shane Ambler
Date:
Greg Smith wrote:
> On Fri, 29 Feb 2008, Franck Routier wrote:
>
>> my Raid controller is an Adaptec 31205 SAS/RAID controller. The battery
>> was an option, but I didn't know it at purchase time. So I have no
>> battery, but the whole system is on an UPS.
>
> The UPS is of no help here.  The problem is that PostgreSQL forces the
> disk controller to commit WAL writes to disk after every transaction.
> If you have a controller with a battery-backed cache, you can use that
> cache to buffer those writes and dramatically increase write
> performance.  The USP doesn't give you the same write guarantees.  Let's
> say someone trips over the server power cord (simplest example of a
> whole class of failures).  With the BBC controller, the cached writes
> will get committed when you plug the server back in.  If all you've got
> is a UPS, writes that didn't make it to disk before the outage are
> lost.  That means you can't buffer those writes without risking database
> corruption.
>
> The general guideline here is that if you don't have a battery-backed
> cache on your controller, based on disk rotation speed you'll be limited
> to around 100 (7200 RPM) to 200 (15K RPM) commits/second per single
> client, with each commit facing around a 2-4ms delay.  That rises to
> perhaps 500/s total with lots of clients.  BBC configurations can easily
> clear 3000/s total and individual commits don't have that couple of ms
> delay.
>

It may be the way you have worded this but it makes it sound like the
cache and the battery backup are as one (or that the cache doesn't work
unless you have the battery) The cache may be optional (or plug-in) in
some cards, even of varied size. The battery is normally optional. You
can normally add/remove the battery without changing the cache options.

If the raid card has the cache without the battery you would get the
performance figures you mentioned, you just wouldn't have the
reliability of finishing writes after a power off situation.


correct me if I am wrong here.


--

Shane Ambler
pgSQL (at) Sheeky (dot) Biz

Get Sheeky @ http://Sheeky.Biz

Re: 12 disks raid setup

From
Greg Smith
Date:
On Sat, 1 Mar 2008, Shane Ambler wrote:

> It may be the way you have worded this but it makes it sound like the
> cache and the battery backup are as one (or that the cache doesn't work
> unless you have the battery)...If the raid card has the cache without
> the battery you would get the performance figures you mentioned, you
> just wouldn't have the reliability of finishing writes after a power off
> situation.

Wording is intentional--if you don't have a battery for it, the cache has
to be turned off (or set to write-through so it's only being used on
reads) in order for the database to be reliable.  If you can't finish
writes after a power off, you can't cache writes and expect your database
to survive for too long.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: 12 disks raid setup

From
Franck Routier
Date:
Hi,

Le vendredi 29 février 2008 à 23:56 -0500, Greg Smith a écrit :
> Wording is intentional--if you don't have a battery for it, the cache has
> to be turned off (or set to write-through so it's only being used on
> reads) in order for the database to be reliable.  If you can't finish
> writes after a power off, you can't cache writes and expect your database
> to survive for too long.

Well, am I just wrong, or the file system might also heavily rely on
cache, especially as I use XFS ?

So anyway Postgresql has no way to know if the data is really on the
disk, and in case of a brutal outage, the system may definitely lose
data, wether there is another level of caching (Raid controller) or
not...

Right ?



Re: 12 disks raid setup

From
"Scott Marlowe"
Date:
On Sat, Mar 1, 2008 at 4:27 AM, Franck Routier <franck.routier@axege.com> wrote:
> Hi,
>
>  Le vendredi 29 février 2008 à 23:56 -0500, Greg Smith a écrit :
>  > Wording is intentional--if you don't have a battery for it, the cache has
>  > to be turned off (or set to write-through so it's only being used on
>  > reads) in order for the database to be reliable.  If you can't finish
>  > writes after a power off, you can't cache writes and expect your database
>  > to survive for too long.
>
>  Well, am I just wrong, or the file system might also heavily rely on
>  cache, especially as I use XFS ?
>
>  So anyway Postgresql has no way to know if the data is really on the
>  disk, and in case of a brutal outage, the system may definitely lose
>  data, wether there is another level of caching (Raid controller) or
>  not...
>
>  Right ?

nope.  assuming your disk subsystem doesn't lie about write
completion, then postgresql can recover from complete and sudden loss
of power without any data loss.

Re: 12 disks raid setup

From
Tom Lane
Date:
"Scott Marlowe" <scott.marlowe@gmail.com> writes:
> On Sat, Mar 1, 2008 at 4:27 AM, Franck Routier <franck.routier@axege.com> wrote:
>> Well, am I just wrong, or the file system might also heavily rely on
>> cache, especially as I use XFS ?
>>
>> So anyway Postgresql has no way to know if the data is really on the
>> disk, and in case of a brutal outage, the system may definitely lose
>> data, wether there is another level of caching (Raid controller) or
>> not...

> nope.  assuming your disk subsystem doesn't lie about write
> completion, then postgresql can recover from complete and sudden loss
> of power without any data loss.

Franck does have a point here: we are expecting the filesystem to tend
to its own knitting.  If a power failure corrupts the filesystem so
badly that we can't find the WAL files, or their contents are badly
scrambled, then we're screwed.  Most modern filesystems defend
themselves against that using journaling, which is exactly the same
idea as WAL but applied to filesystem metadata.

We do expect that when we fsync a file, by the time the OS reports that
that's done both the file contents and its metadata are safely on disk.
This is part of the specification for fsync, so the OS is clearly broken
if it doesn't get that right.  Whether the OS *can* guarantee it if the
disk drive lies about write completion is something you'd have to ask
the filesystem hackers about.

            regards, tom lane

How to allocate 8 disks

From
Craig James
Date:
We're upgrading to a medium-sized server, a Dell PowerEdge 2950, dual-quad CPU's and 8 GB memory.  This box can hold at
most8 disks (10K SCSI 2.5" 146 GB drives) and has Dell's Perc 6/i RAID controller. 

I'm thinking of this:

  6 disks  RAID 1+0  Postgres data
  1 disk   WAL
  1 disk   Linux

I've often seen RAID 1 recommended for the WAL.  Is that strictly for reliability, or is there a performance advantage
toRAID 1 for the WAL? 

It seems to me separating the OS and WAL on two disks is better than making a single RAID 1 and sharing it, from a
performancepoint of view. 

Thanks,
Craig

Re: How to allocate 8 disks

From
"Scott Marlowe"
Date:
On Sat, Mar 1, 2008 at 12:06 PM, Craig James <craig_james@emolecules.com> wrote:
> We're upgrading to a medium-sized server, a Dell PowerEdge 2950, dual-quad CPU's and 8 GB memory.  This box can hold
atmost 8 disks (10K SCSI 2.5" 146 GB drives) and has Dell's Perc 6/i RAID controller. 
>
>  I'm thinking of this:
>
>   6 disks  RAID 1+0  Postgres data
>   1 disk   WAL
>   1 disk   Linux
>
>  I've often seen RAID 1 recommended for the WAL.  Is that strictly for reliability, or is there a performance
advantageto RAID 1 for the WAL? 
>
>  It seems to me separating the OS and WAL on two disks is better than making a single RAID 1 and sharing it, from a
performancepoint of view. 

It's a trade off.  Remember that if the single disk hold xlog fails
you've just quite possubly lost your database.  I'd be inclined to
either using a RAID-1 of two disks for the OS and xlog, and having
pgsql log to the 6 disk RAID-10 instead of the OS / xlog disk set.

More important, do you have battery backed cache on the controller?  A
good controller with a battery backed cache can usually outrun a
larger array with no write cache when it comes to transactions /
writing to the disks.

Re: How to allocate 8 disks

From
"Joshua D. Drake"
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Sat, 01 Mar 2008 10:06:54 -0800
Craig James <craig_james@emolecules.com> wrote:

> We're upgrading to a medium-sized server, a Dell PowerEdge 2950,
> dual-quad CPU's and 8 GB memory.  This box can hold at most 8 disks
> (10K SCSI 2.5" 146 GB drives) and has Dell's Perc 6/i RAID controller.
> 
> I'm thinking of this:
> 
>   6 disks  RAID 1+0  Postgres data
>   1 disk   WAL
>   1 disk   Linux
> 
> I've often seen RAID 1 recommended for the WAL.  Is that strictly for
> reliability, or is there a performance advantage to RAID 1 for the
> WAL?
> 
> It seems to me separating the OS and WAL on two disks is better than
> making a single RAID 1 and sharing it, from a performance point of
> view.

This scares me... You lose WAL you are a goner. Combine your OS and
WAL into a RAID 1.

Sincerely,

Joshua D. Drake


- -- 
The PostgreSQL Company since 1997: http://www.commandprompt.com/ 
PostgreSQL Community Conference: http://www.postgresqlconference.org/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL SPI Liaison | SPI Director |  PostgreSQL political pundit

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFHycSzATb/zqfZUUQRAs14AJ9pm3huW+z1j7jUIY7FbIZMzz2IxgCgnOhD
yWiDabTYAG+x12JEqrf4q8E=
=gBPs
-----END PGP SIGNATURE-----

Re: How to allocate 8 disks

From
Craig James
Date:
Joshua D. Drake wrote:
> On Sat, 01 Mar 2008 10:06:54 -0800
> Craig James <craig_james@emolecules.com> wrote:
>
>> We're upgrading to a medium-sized server, a Dell PowerEdge 2950,
>> dual-quad CPU's and 8 GB memory.  This box can hold at most 8 disks
>> (10K SCSI 2.5" 146 GB drives) and has Dell's Perc 6/i RAID controller.
>>
>> I'm thinking of this:
>>
>>   6 disks  RAID 1+0  Postgres data
>>   1 disk   WAL
>>   1 disk   Linux
>>
>> I've often seen RAID 1 recommended for the WAL.  Is that strictly for
>> reliability, or is there a performance advantage to RAID 1 for the
>> WAL?
>>
>> It seems to me separating the OS and WAL on two disks is better than
>> making a single RAID 1 and sharing it, from a performance point of
>> view.
>
> This scares me... You lose WAL you are a goner. Combine your OS and
> WAL into a RAID 1.

Right, I do understand that, but reliability is not a top priority in this system.  The database will be replicated,
andcan be reproduced from the raw data.  It's not an accounting system, it finds scientific results.  That's not to say
I*won't* take your advice, we may in fact combine the OS and WAL on one disk.  Reliability is a good thing, but I need
toknow all of the tradeoffs, so that I can weigh performance, reliability, and cost and make the right choice. 

So my question still stands: From a strictly performance point of view, would it be better to separate the OS and the
WALonto two disks?  Is there any performance advantage to RAID 1?  My understanding is that RAID 1 can give 2x seek
performanceduring read, but no advantage during write.  For the WAL, it seems to me that RAID 1 has no performance
benefits,so separating the WAL and OS seems like a peformance advantage. 

Another option would be:

  4 disks   RAID 1+0  Postgres data
  2 disks   RAID 1    WAL
  1 disk    Linux
  1 disk    spare

This would give us reliability, but I think the performance would be considerably worse, since the primary Postgres
datawould come from 4 disks instead of six. 

I guess we could also consider:

  4 disks   RAID 1+0  Postgres data
  4 disks   RAID 1+0  WAL and Linux

Or even

  8 disks   RAID 1+0  Everything

This is a dedicated system and does nothing but Apache/Postgres, so the OS should get very little traffic.  But if
that'sthe case, I guess you could argue that your suggestion of combining OS and WAL on a 2-disk RAID 1 would be the
wayto go, since the OS activity wouldn't affect the WAL very much. 

I suppose the thing to do is get the system, and run bonnie on various configurations.  I've never run bonnie before --
canI get some useful results without a huge learning curve? 

Thanks,
Craig

Re: How to allocate 8 disks

From
"Scott Marlowe"
Date:
On Sat, Mar 1, 2008 at 3:53 PM, Craig James <craig_james@emolecules.com> wrote:
> Joshua D. Drake wrote:
>  > On Sat, 01 Mar 2008 10:06:54 -0800
>  > Craig James <craig_james@emolecules.com> wrote:
>  >
>  >> We're upgrading to a medium-sized server, a Dell PowerEdge 2950,
>  >> dual-quad CPU's and 8 GB memory.  This box can hold at most 8 disks
>  >> (10K SCSI 2.5" 146 GB drives) and has Dell's Perc 6/i RAID controller.
>  >>
>  >> I'm thinking of this:
>  >>
>  >>   6 disks  RAID 1+0  Postgres data
>  >>   1 disk   WAL
>  >>   1 disk   Linux
>  >>
>  >> I've often seen RAID 1 recommended for the WAL.  Is that strictly for
>  >> reliability, or is there a performance advantage to RAID 1 for the
>  >> WAL?
>  >>
>  >> It seems to me separating the OS and WAL on two disks is better than
>  >> making a single RAID 1 and sharing it, from a performance point of
>  >> view.
>  >
>  > This scares me... You lose WAL you are a goner. Combine your OS and
>  > WAL into a RAID 1.
>
>  Right, I do understand that, but reliability is not a top priority in this system.  The database will be replicated,
andcan be reproduced from the raw data.  It's not an accounting system, it finds scientific results.  That's not to say
I*won't* take your advice, we may in fact combine the OS and WAL on one disk.  Reliability is a good thing, but I need
toknow all of the tradeoffs, so that I can weigh performance, reliability, and cost and make the right choice. 

In that case you could always make the data partition a 6 disk RAID-0.

>  So my question still stands: From a strictly performance point of view, would it be better to separate the OS and
theWAL onto two disks?  Is there any performance advantage to RAID 1?  My understanding is that RAID 1 can give 2x seek
performanceduring read, but no advantage during write.  For the WAL, it seems to me that RAID 1 has no performance
benefits,so separating the WAL and OS seems like a peformance advantage. 

Yes, Only on Reads.  Correct.

>  Another option would be:
>
>
>   4 disks   RAID 1+0  Postgres data
>   2 disks   RAID 1    WAL
>   1 disk    Linux
>   1 disk    spare
>
>  This would give us reliability, but I think the performance would be considerably worse, since the primary Postgres
datawould come from 4 disks instead of six. 

Performance-wise, RAID-10 with n disks is about the same as RAID-0
with n/2 disks.  So, you're losing abot 1/3 of your peak performance,
assuming 100% efficient controllers and you aren't bottlenecking I/O
with > 4 disks.

>  I guess we could also consider:
>
>
>   4 disks   RAID 1+0  Postgres data
>   4 disks   RAID 1+0  WAL and Linux
>
>  Or even
>
>   8 disks   RAID 1+0  Everything

It really depends on the controller.  Battery backed write cache?
Then the one big everything is often faster than any other method.  No
BB cache?  Then splitting them up will help.

>  I suppose the thing to do is get the system, and run bonnie on various configurations.  I've never run bonnie before
--can I get some useful results without a huge learning curve? 

Yes, it's fairly easy to drive.  It'll tell you more about your
controller than anything else, which is very useful information.  The
way a different controllers behaves with different configurations can
be very very different from one controller to the next.

Re: 12 disks raid setup

From
Greg Smith
Date:
On Sat, 1 Mar 2008, Franck Routier wrote:

> Well, am I just wrong, or the file system might also heavily rely on
> cache, especially as I use XFS ? So anyway Postgresql has no way to know
> if the data is really on the disk, and in case of a brutal outage, the
> system may definitely lose data, wether there is another level of
> caching (Raid controller) or not...

After PostgreSQL writes to the WAL, it calls fsync.  If your filesystem
doesn't then force a real write to disk at that point and clear whatever
cache it might have, it's broken and unsuitable for database use.  XFS is
smart enough to understand that.

The only thing people typically run into that will hear fsync and lie
about the data actually being written to disk are a) caching controllers
with the write cache turned on and b) cheap hard drives.  In case (a),
having a battery backup for the cache is sufficient to survive most
classes of outage without damage--if the system is without power for
longer than the battery lasts you're in trouble, otherwise is shouldn't be
a problem.  In case (b), you have to turn the disk cache off to get
reliable database operation.

I've put all the interesting trivia on this topic I've ever come across at
http://www.westnet.com/~gsmith/content/postgresql/TuningPGWAL.htm if
you're looking for some really exciting reading.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: How to allocate 8 disks

From
Greg Smith
Date:
On Sat, 1 Mar 2008, Craig James wrote:

> So my question still stands: From a strictly performance point of view, would
> it be better to separate the OS and the WAL onto two disks?

You're not getting a more useful answer here because you haven't mentioned
yet a) what the disk controller is or b) how much writing activity is
going on here.  If you can cache writes, most of the advantages to having
a seperate WAL disk aren't important unless you've got an extremely high
write throughput (higher you can likely sustain with only 8 disks) so you
can put the WAL data just about anywhere.

> This is a dedicated system and does nothing but Apache/Postgres, so the OS
> should get very little traffic.  But if that's the case, I guess you could
> argue that your suggestion of combining OS and WAL on a 2-disk RAID 1 would
> be the way to go, since the OS activity wouldn't affect the WAL very much.

The main thing to watch out for if the OS and WAL are on the same disk is
that some random process spewing logs files could fill the disk and now
the database is stalled.

I think there are two configurations that make sense for your situation:

>   8 disks   RAID 1+0  Everything

This maximizes potential sequential and seek throughput for the database,
which is probably going to be your bottleneck unless you're writing lots
of simple data, while still allowing survival of any one disk.  The crazy
log situation I mentioned above is less likely to be a problem because
having so much more disk space available to everything means it's more
likely you'll notice it before the disk actually fills.

     6 disks   RAID 0  Postgres data+WAL
     2 disks   RAID 1  Linux

This puts some redundancy on the base OS, so no single disk loss can
actually take down the system altogether.  You get maximum throughput on
the database.  If you lose a database disk, you replace it and rebuild the
whole database at that point.

> I suppose the thing to do is get the system, and run bonnie on various
> configurations.  I've never run bonnie before -- can I get some useful
> results without a huge learning curve?

I've collected some bonnie++ examples at
http://www.westnet.com/~gsmith/content/postgresql/pg-disktesting.htm you
may find useful.  With only 8 disks you should be able to get useful
results without a learning curve; with significantly more it can be
necessary to run more than one bonnie at once to really saturate the disks
and that's trickier.

I don't think you're going to learn anything useful from that though
(other than figuring out if your disk+controller combination is
fundamentally fast or not).  As you put more disks into the array,
sequential throughput and seeks/second will go up.  This doesn't tell you
anything useful about whether the WAL is going to get enough traffic to be
a bottleneck such that it needs to be on a seperate disk.  To figure that
out, you need to run some simulations of the real database and its
application, and doing that fairly is a more serious benchmarking project.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: How to allocate 8 disks

From
Matthew
Date:
On Sat, 1 Mar 2008, Craig James wrote:
> Right, I do understand that, but reliability is not a top priority in this
> system.  The database will be replicated, and can be reproduced from the raw
> data.

So what you're saying is:

1. Reliability is not important.
2. There's zero write traffic once the database is set up.

If this is true, then RAID-0 is the way to go. I think Greg's options are
good. Either:

2 discs RAID 1: OS
6 discs RAID 0: database + WAL

which is what we're using here (except with more discs), or:

8 discs RAID 10: everything

However, if reliability *really* isn't an issue, and you can accept
reinstalling the system if you lose a disc, then there's a third option:

8 discs RAID 0: Everything

Matthew

--
Heat is work, and work's a curse. All the heat in the universe, it's
going to cool down, because it can't increase, then there'll be no
more work, and there'll be perfect peace.      -- Michael Flanders

Re: How to allocate 8 disks

From
Mark Mielke
Date:
Matthew wrote:
> On Sat, 1 Mar 2008, Craig James wrote:
>> Right, I do understand that, but reliability is not a top priority in
>> this system.  The database will be replicated, and can be reproduced
>> from the raw data.
>
> So what you're saying is:
>
> 1. Reliability is not important.
> 2. There's zero write traffic once the database is set up.
>
> If this is true, then RAID-0 is the way to go. I think Greg's options
> are good. Either:
>
> 2 discs RAID 1: OS
> 6 discs RAID 0: database + WAL
>
> which is what we're using here (except with more discs), or:
>
> 8 discs RAID 10: everything

Has anybody been able to prove to themselves that RAID 0 vs RAID 1+0 is
faster for these sorts of loads? My understanding is that RAID 1+0 *can*
reduce latency for reads, but that it relies on random access, whereas
RAID 0 performs best for sequential scans? Does PostgreSQL ever do
enough random access to make RAID 1+0 shine?

Curious.

Thanks,
mark

--
Mark Mielke <mark@mielke.cc>


Re: How to allocate 8 disks

From
Craig James
Date:
Matthew wrote:
> On Sat, 1 Mar 2008, Craig James wrote:
>> Right, I do understand that, but reliability is not a top priority in
>> this system.  The database will be replicated, and can be reproduced
>> from the raw data.
>
> So what you're saying is:
>
> 1. Reliability is not important.
> 2. There's zero write traffic once the database is set up.

Well, I actually didn't say either of those things, but I appreciate the feedback.  RAID 0 is an interesting
suggestion,but given our constraints, it's not an option.  Reliability is important, but not as important as, say, a
bankingsystem. 

And as far as zero write traffic, I don't know where that came from.  It's a "hitlist" based system, where complex
searchresults are saved for the user in tables, and the write traffic can be quite high. 

> If this is true, then RAID-0 is the way to go. I think Greg's options
> are good. Either:
>
> 2 discs RAID 1: OS
> 6 discs RAID 0: database + WAL
>
> which is what we're using here (except with more discs), or:
>
> 8 discs RAID 10: everything

Right now, an 8-disk RAID 10 is looking like the best choice.  The Dell Perc 6i has configurations that include a
battery-backedcache, so performance should be quite good. 

> However, if reliability *really* isn't an issue, and you can accept
> reinstalling the system if you lose a disc, then there's a third option:
>
> 8 discs RAID 0: Everything

I imagine the MTBF on a system like this would be < 1 year, which is out of the question, even with a backup system
thatcan take over.  A failure completely wipes the system, OS and everything, so you're guaranteed that once or twice a
year,you have to rebuild your system from the ground up.  I'd rather spend that time at the beach! 

Craig

Re: How to allocate 8 disks

From
Matthew
Date:
On Mon, 3 Mar 2008, Mark Mielke wrote:
> Has anybody been able to prove to themselves that RAID 0 vs RAID 1+0 is
> faster for these sorts of loads? My understanding is that RAID 1+0 *can*
> reduce latency for reads, but that it relies on random access, whereas RAID 0
> performs best for sequential scans? Does PostgreSQL ever do enough random
> access to make RAID 1+0 shine?

Theoretically the performance of RAID 0 and RAID 10 should be identical
for reads, both seeks and throughput, assuming you have a sensible
readahead and a good controller. For writes, RAID 10 needs to write to
multiple drives, so is slower. Whether this is true in reality is another
matter, as all sorts of factors come in, not least how good your
controller is at managing the arrangement.

Matthew

--
The only secure computer is one that's unplugged, locked in a safe,
and buried 20 feet under the ground in a secret location...and i'm not
even too sure about that one.                         --Dennis Huges, FBI

Re: How to allocate 8 disks

From
Mark Mielke
Date:
Matthew wrote:
> On Mon, 3 Mar 2008, Mark Mielke wrote:
>> Has anybody been able to prove to themselves that RAID 0 vs RAID 1+0
>> is faster for these sorts of loads? My understanding is that RAID 1+0
>> *can* reduce latency for reads, but that it relies on random access,
>> whereas RAID 0 performs best for sequential scans? Does PostgreSQL
>> ever do enough random access to make RAID 1+0 shine?
> Theoretically the performance of RAID 0 and RAID 10 should be
> identical for reads, both seeks and throughput, assuming you have a
> sensible readahead and a good controller. For writes, RAID 10 needs to
> write to multiple drives, so is slower. Whether this is true in
> reality is another matter, as all sorts of factors come in, not least
> how good your controller is at managing the arrangement.

I don't think your statement that they should be identical is true -
RAID 1+0 can satisfy and given read from at least two drives. A good
controller can satisfy half the reads from one side of the array, and
half the reads from the other side of the array, where the first set
does not have to wait for the second set, before continuing. To
contrast, sequential reads of a RAID 1+0 system is almost always HALF of
the speed of sequential reads of a RAID 0 system. The hardware
read-ahead on the RAID 1+0 system is being wasted as even if you did
leap from one side of the array to the other, each side ends up
"skipping" the data served by the other side, making any caching
ineffective.

The question I have is not whether RAID 1+0 vs RAID 0 show different
characteristics. I know they do based upon my own analysis. My question
is whether PostgreSQL disk access patterns for certain loads ever
benefit from RAID 1+0, or whether RAID 1+0 is always a bad choice for
performance-only (completely ignore reliability) loads.

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>


Re: How to allocate 8 disks

From
"Scott Marlowe"
Date:
On Mon, Mar 3, 2008 at 8:48 AM, Mark Mielke <mark@mark.mielke.cc> wrote:
> Matthew wrote:
>  > On Sat, 1 Mar 2008, Craig James wrote:
>  >> Right, I do understand that, but reliability is not a top priority in
>  >> this system.  The database will be replicated, and can be reproduced
>  >> from the raw data.
>  >
>  > So what you're saying is:
>  >
>  > 1. Reliability is not important.
>  > 2. There's zero write traffic once the database is set up.
>  >
>  > If this is true, then RAID-0 is the way to go. I think Greg's options
>  > are good. Either:
>  >
>  > 2 discs RAID 1: OS
>  > 6 discs RAID 0: database + WAL
>  >
>  > which is what we're using here (except with more discs), or:
>  >
>  > 8 discs RAID 10: everything
>
>  Has anybody been able to prove to themselves that RAID 0 vs RAID 1+0 is
>  faster for these sorts of loads? My understanding is that RAID 1+0 *can*
>  reduce latency for reads, but that it relies on random access, whereas
>  RAID 0 performs best for sequential scans? Does PostgreSQL ever do
>  enough random access to make RAID 1+0 shine?

RAID 1+0 has certain theoretical advantages in parallel access
scenarios that straight RAID-0 wouldn't have.  I.e. if you used n>2
disks in a mirror and built a RAID-0 out of those types of mirrors,
then you could theoretically have n users reading data on the same
"drive"  (the raid-1 underneath the raid-0) at the same time where
RAID-0 would only have the one disk to read from.  The effects of this
advantage are dulled by caching, depending on how much of the data set
you can cache.  With a system that can cache it's whole data set in
memory (not uncommon for transactional systems) or at least a large
percentage, the n>2 RAID-1 sets aren't that big of an advantage.

RAID-0 of n drives should behave pretty similarly to RAID-10 with 2n
drives for most types of access.  I.e. no better or worse for
sequential or random access, if the number of drives is equivalent.

Re: How to allocate 8 disks

From
Ivan Voras
Date:
Joshua D. Drake wrote:

> This scares me... You lose WAL you are a goner. Combine your OS and
> WAL into a RAID 1.

Can someone elaborate on this? From the WAL concept and documentation at
http://www.postgresql.org/docs/8.3/interactive/wal-intro.html I'd say
the only data that should be lost are the transactions currently in the
log but not yet transferred to permanent storage (database files proper).



Re: How to allocate 8 disks

From
Shane Ambler
Date:
Ivan Voras wrote:
> Joshua D. Drake wrote:
>
>> This scares me... You lose WAL you are a goner. Combine your OS and
>> WAL into a RAID 1.
>
> Can someone elaborate on this? From the WAL concept and documentation at
> http://www.postgresql.org/docs/8.3/interactive/wal-intro.html I'd say
> the only data that should be lost are the transactions currently in the
> log but not yet transferred to permanent storage (database files proper).
>

The log records what changes are made to your data files before the data
files are changed. (and gets flushed to disk before the data files are
changed)

In the event of power loss right in the middle of the data files being
updated for a transaction, when power is restored, how do we know what
changes were made to which data files and which changes are incomplete?

Without the log files there is no way to be sure your data files are not
full of "half done transactions"



Chances are that 90% of the time everything is fine but without the log
files how do you check that your data files are as they should be.
(or do you expect to restore from backup after any power outs?)


Keeping them on a raid 1 gives you a level of redundancy to get you past
hardware failures that happen at the wrong time. (as they all do)




--

Shane Ambler
pgSQL (at) Sheeky (dot) Biz

Get Sheeky @ http://Sheeky.Biz