Thread: With 4 disks should I go for RAID 5 or RAID 10

With 4 disks should I go for RAID 5 or RAID 10

From
"Fernando Hevia"
Date:

Hi list,

I am building kind of a poor man's database server:

Pentium D 945 (2 x 3 Ghz cores)

4 GB RAM

4 x 160 GB SATA II 7200 rpm (Intel server motherboard has only 4 SATA ports)

Database will be about 30 GB in size initially and growing 10 GB per year. Data is inserted overnight in two big tables and during the day mostly read-only queries are run. Parallelism is rare.

I have read about different RAID levels with Postgres, but the advice I found seems to apply to systems with 8+ disks. With only four disks and performance in mind, should I build a RAID 10 or a RAID 5 array? RAID 0 is ruled out since redundancy is needed.

I am going to use software RAID with Linux (Ubuntu Server 6.06).

Thanks for any insight.

Regards,

Fernando.

Re: With 4 disks should I go for RAID 5 or RAID 10

From
Bill Moran
Date:
RAID 10.

I snipped the rest of your message because none of it matters.  Never use
RAID 5 on a database system.  Ever.  There is absolutely NO reason to
ever put yourself through that much suffering.  If you hate yourself
that much just commit suicide, it's less drastic.

--
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/

wmoran@collaborativefusion.com
Phone: 412-422-3463x4023

Re: With 4 disks should I go for RAID 5 or RAID 10

From
Mark Mielke
Date:
Fernando Hevia wrote:

> Database will be about 30 GB in size initially and growing 10 GB per year. Data is inserted overnight in two big tables and during the day mostly read-only queries are run. Parallelism is rare.
>
> I have read about different raid levels with Postgres but the advice found seems to apply on 8+ disks systems. With only four disks and performance in mind should I build a RAID 10 or RAID 5 array? Raid 0 is overruled since redundancy is needed.
>
> I am going to use software Raid with Linux (Ubuntu Server 6.06).


In my experience, software RAID 5 is horrible. Write performance can decrease below the speed of one disk on its own, and read performance will not be significantly more than RAID 1+0, as the number of stripes has only increased from 2 to 3; and if reading while writing, you will not get 3X, as a RAID 5 write requires at least two disks to be involved. I believe hardware RAID 5 is also horrible, but since the hardware hides it from the application, a hardware RAID 5 user might not care.

Software RAID 1+0 works fine on Linux with 4 disks. This is the setup I use for my personal server.

Cheers,
mark

-- 
Mark Mielke <mark@mielke.cc>

Re: With 4 disks should I go for RAID 5 or RAID 10

From
Greg Smith
Date:
On Wed, 26 Dec 2007, Mark Mielke wrote:

> I believe hardware RAID 5 is also horrible, but since the hardware hides
> it from the application, a hardware RAID 5 user might not care.

Typically anything doing hardware RAID 5 also has a reasonably sized write
cache on the controller, which softens the problem a bit.  As soon as you
exceed what it can buffer, you're back to suffering again.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: With 4 disks should I go for RAID 5 or RAID 10

From
"Fernando Hevia"
Date:
> Bill Moran wrote:
>
> RAID 10.
>
> I snipped the rest of your message because none of it matters.  Never use
> RAID 5 on a database system.  Ever.  There is absolutely NO reason to
> every put yourself through that much suffering.  If you hate yourself
> that much just commit suicide, it's less drastic.
>

Well, that's a pretty strong argument. No suicide in my plans, gonna stick
to RAID 10. :)
Thanks.


Re: With 4 disks should I go for RAID 5 or RAID 10

From
"Fernando Hevia"
Date:
Mark Mielke Wrote:

>In my experience, software RAID 5 is horrible. Write performance can
>decrease below the speed of one disk on its own, and read performance will
>not be significantly more than RAID 1+0 as the number of stripes has only
>increased from 2 to 3, and if reading while writing, you will not get 3X as
>RAID 5 write requires at least two disks to be involved. I believe hardware
>RAID 5 is also horrible, but since the hardware hides it from the
>application, a hardware RAID 5 user might not care.

>Software RAID 1+0 works fine on Linux with 4 disks. This is the setup I use
>for my personal server.

I will use software RAID, so RAID 1+0 seems to be the obvious choice.
Thanks for the advice!



Re: With 4 disks should I go for RAID 5 or RAID 10

From
david@lang.hm
Date:
On Wed, 26 Dec 2007, Fernando Hevia wrote:

> Mark Mielke Wrote:
>
>> In my experience, software RAID 5 is horrible. Write performance can
>> decrease below the speed of one disk on its own, and read performance will
>> not be significantly more than RAID 1+0 as the number of stripes has only
>> increased from 2 to 3, and if reading while writing, you will not get 3X as
>> RAID 5 write requires at least two disks to be involved. I believe hardware
>> RAID 5 is also horrible, but since the hardware hides it from the
>> application, a hardware RAID 5 user might not care.
>
>> Software RAID 1+0 works fine on Linux with 4 disks. This is the setup I use
>> for my personal server.
>
> I will use software RAID so RAID 1+0 seems to be the obvious choice.
> Thanks for the advice!

To clarify things a bit more:

With only four drives the space difference between raid 1+0 and raid 5
isn't that much, but when you do a write you must write to two drives (the
drive holding the data you are changing, and the drive that holds the
parity data for that stripe, possibly needing to read the old parity data
first, resulting in stalling for seek/read/calculate/seek/write since
the drive moves on after the read). When you read, you must read _all_
drives in the set to check the data integrity.

For seek-heavy workloads (which almost every database application is), the
extra seeks involved can be murder on your performance. If your workload
is large sequential reads/writes, and you can let the OS buffer things for
you, the performance of raid 5 is much better.

On the other hand, doing raid 6 (instead of raid 5) gives you extra data
protection in exchange for the performance hit, but with only 4 drives
this probably isn't what you are looking for.

Linux software raid can do more than two disks in a mirror, so you may be
able to get the added protection with raid 1 sets (again, probably not
relevant to four drives), although there were bugs in this within the last
six months or so, so you need to be sure your kernel is new enough to have
the fix.

Now, if you can afford solid-state drives, which don't have noticeable seek
times, things are completely different ;-)
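
As a rough back-of-envelope illustration of this trade-off, here is a short Python sketch. It uses the 160 GB drives from the original post and the usual textbook I/O counts (2 writes per small random write on raid 1+0, a read-modify-write of 2 reads plus 2 writes on raid 5); these are assumptions for illustration, not measurements.

    # Rough capacity / small-random-write cost comparison for a 4-disk array.
    DISKS = 4
    DISK_GB = 160  # drive size from the original post

    def raid10(disks, size_gb):
        usable = disks // 2 * size_gb      # half the raw space goes to mirror copies
        ios_per_write = 2                  # write the block to both halves of a mirror pair
        return usable, ios_per_write

    def raid5(disks, size_gb):
        usable = (disks - 1) * size_gb     # one disk's worth of space goes to parity
        ios_per_write = 4                  # read old data + old parity, write new data + new parity
        return usable, ios_per_write

    for name, layout in (("raid 1+0", raid10), ("raid 5", raid5)):
        usable, ios = layout(DISKS, DISK_GB)
        print(f"{name}: {usable} GB usable, ~{ios} disk I/Os per small random write")

With four 160 GB drives, raid 5 buys roughly 160 GB of extra space at about double the disk I/O for every small random write.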

David Lang

Re: With 4 disks should I go for RAID 5 or RAID 10

From
"Fernando Hevia"
Date:

> David Lang Wrote:
>
> with only four drives the space difference between raid 1+0 and raid 5
> isn't that much, but when you do a write you must write to two drives (the
> drive holding the data you are changing, and the drive that holds the
> parity data for that stripe, possibly needing to read the old parity data
> first, resulting in stalling for seek/read/calculate/seek/write since
> the drive moves on after the read), when you read you must read _all_
> drives in the set to check the data integrity.

Thanks for the explanation, David. It's good to know not only what but also
why. Still, I wonder why reads hit all drives. Shouldn't only 2 disks be
read: the one with the data and the parity disk?

>
> for seek heavy workloads (which almost every database application is) the
> extra seeks involved can be murder on your performance. if your workload
> is large sequential reads/writes, and you can let the OS buffer things for
> you, the performance of raid 5 is much better.

Well, actually most of my application involves large sequential
reads/writes. The memory available for buffering (4GB) isn't bad either, at
least for my scenario. On the other hand, I have read such strongly worded
posts against RAID 5 that I hesitate to even consider it.

>
> Linux software raid can do more then two disks in a mirror, so you may be
> able to get the added protection with raid 1 sets (again, probably not
> relavent to four drives), although there were bugs in this within the last
> six months or so, so you need to be sure your kernel is new enough to have
> the fix.
>

Well, here arises another doubt: should I go for a single RAID 1+0 storing OS
+ Data + WAL files, or will I be better off with two RAID 1 arrays separating
data from OS + WAL files?

> now, if you can afford solid-state drives which don't have noticable seek
> times, things are completely different ;-)

Ha, sadly budget is very tight. :)

Regards,
Fernando.


Re: With 4 disks should I go for RAID 5 or RAID 10

From
Florian Weimer
Date:
> seek/read/calculate/seek/write since the drive moves on after the
> read), when you read you must read _all_ drives in the set to check
> the data integrity.

I don't know of any RAID implementation that performs consistency
checking on each read operation. 8-(

Re: With 4 disks should I go for RAID 5 or RAID 10

From
Bill Moran
Date:
In response to "Fernando Hevia" <fhevia@ip-tel.com.ar>:
>
> > David Lang Wrote:
> >
> > with only four drives the space difference between raid 1+0 and raid 5
> > isn't that much, but when you do a write you must write to two drives (the
> > drive holding the data you are changing, and the drive that holds the
> > parity data for that stripe, possibly needing to read the old parity data
> > first, resulting in stalling for seek/read/calculate/seek/write since
> > the drive moves on after the read), when you read you must read _all_
> > drives in the set to check the data integrity.
>
> Thanks for the explanation David. It's good to know not only what but also
> why. Still I wonder why reads do hit all drives. Shouldn't only 2 disks be
> read: the one with the data and the parity disk?

In order to recalculate the parity, it has to have data from all disks. Thus,
if you have 4 disks, it has to read 2 (the unknown data blocks included in
the parity calculation), then write 2 (the new data block and the new
parity data).  Caching can help some, but if your data ends up being any
size at all, the cache misses become more frequent than the hits.  Even
when caching helps, your max speed is still only the speed of a single
disk.

> > for seek heavy workloads (which almost every database application is) the
> > extra seeks involved can be murder on your performance. if your workload
> > is large sequential reads/writes, and you can let the OS buffer things for
> > you, the performance of raid 5 is much better.
>
> Well, actually most of my application involves large sequential
> reads/writes.

Will it?  Will you be deleting or updating data?  If so, you'll generate
dead tuples, which vacuum will have to clean up, which means seeks, and
means your new data isn't liable to be sequentially written.

The chance that you actually have a workload that will result in
consistently sequential writes at the disk level is very slim, in my
experience.  When vacuum is taking hours and hours, you'll understand
the pain.

> The memory available for buffering (4GB) isn't bad either, at
> least for my scenario. On the other hand I have got such strong posts
> against RAID 5 that I doubt to even consider it.

If 4G is enough to buffer all your data, then why do you need the extra
space of RAID 5?  If you need the extra space of the RAID 5, then 4G
isn't enough to buffer all your data, and that buffer will be of limited
usefulness.

In any event, even if you've got 300G of RAM to buffer data in, sooner or
later you've got to write it to disk, and no matter how much RAM you have,
your write speed will be limited by how fast your disks can commit.

If you had a database multiple petabytes in size, you could worry about
needing the extra space that RAID 5 gives you, but then you'd realize
that the speed problems associated with RAID 5 will make a petabyte sized
database completely unmanageable.

There's just no scenario where RAID 5 is a win for database work.  Period.
Rationalize all you want.  For those trying to defend RAID 5, I invite you
to try it.  When you're on the verge of suicide because you can't get
any work done, don't say I didn't say so.

> Well, here rises another doubt. Should I go for a single RAID 1+0 storing OS
> + Data + WAL files or will I be better off with two RAID 1 separating data
> from OS + Wal files?

Generally speaking, if you want the absolute best performance, it's
recommended to keep the WAL logs on one partition/controller
and the remaining database files on a second one.  However, with only
4 disks, you might get just as much out of a RAID 1+0.

> > now, if you can afford solid-state drives which don't have noticable seek
> > times, things are completely different ;-)
>
> Ha, sadly budget is very tight. :)

Budget is always tight.  That's why you don't want a RAID 5.  Do a RAID 5
now thinking you'll save a few bucks, and you'll be spending twice that
much later trying to fix your mistake.  It's called tripping over a dime
to pick up a nickel.

--
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/

wmoran@collaborativefusion.com
Phone: 412-422-3463x4023

Re: With 4 disks should I go for RAID 5 or RAID 10

From
david@lang.hm
Date:
On Wed, 26 Dec 2007, Fernando Hevia wrote:

>> David Lang Wrote:
>>
>> with only four drives the space difference between raid 1+0 and raid 5
>> isn't that much, but when you do a write you must write to two drives (the
>> drive holding the data you are changing, and the drive that holds the
>> parity data for that stripe, possibly needing to read the old parity data
>> first, resulting in stalling for seek/read/calculate/seek/write since
>> the drive moves on after the read), when you read you must read _all_
>> drives in the set to check the data integrity.
>
> Thanks for the explanation David. It's good to know not only what but also
> why. Still I wonder why reads do hit all drives. Shouldn't only 2 disks be
> read: the one with the data and the parity disk?

No, because the parity is of the sort (A+B+C+P) mod X = 0.

So if X=10 (which means in practice that only the last decimal digit of
anything matters, very convenient for examples):

A=1, B=2, C=3, A+B+C=6, P=4, A+B+C+P=10=0

If you read B and get 3 and P and get 4, you don't know if this is right or
not unless you also read A and C (at which point you would get
A+B+C+P=11=1=error).
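
For what it's worth, here is a toy Python sketch of the additive checksum analogy above; real RAID 5 parity is XOR rather than a mod-10 sum (as comes up later in the thread), but the point that you need every block before an error can be noticed is the same.

    X = 10  # only the last decimal digit matters, as in the example above

    def parity(blocks):
        """Pick P so that (sum of blocks + P) mod X == 0."""
        return (-sum(blocks)) % X

    def check(blocks, p):
        """Consistency check: needs *every* data block plus the parity."""
        return (sum(blocks) + p) % X == 0

    A, B, C = 1, 2, 3
    P = parity([A, B, C])          # P == 4, and (1+2+3+4) % 10 == 0
    print(check([A, B, C], P))     # True

    # Reading only B and P tells you nothing; if B came back as 3 instead of 2,
    # the error is only visible once A and C have been read as well:
    print(check([A, 3, C], P))     # False -> error detected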

>> for seek heavy workloads (which almost every database application is) the
>> extra seeks involved can be murder on your performance. if your workload
>> is large sequential reads/writes, and you can let the OS buffer things for
>> you, the performance of raid 5 is much better.
>
> Well, actually most of my application involves large sequential
> reads/writes. The memory available for buffering (4GB) isn't bad either, at
> least for my scenario. On the other hand I have got such strong posts
> against RAID 5 that I doubt to even consider it.

In theory a system could get the same performance with a large sequential
read/write on raid5/6 as on a raid0 array of equivalent size (i.e. same
number of data disks, ignoring the parity disks) because the OS could read
the entire stripe in at once, do the calculation once, and use all the
data (or when writing, don't write anything until you are ready to write
the entire stripe, calculate the parity and write everything once).

Unfortunately, in practice filesystems don't support this: they don't do
enough readahead to want to keep the entire stripe (so after they read it
all in they throw some of it away), they (mostly) don't know where a
stripe starts (and so intermingle different types of data on one stripe
and spread data across multiple stripes unnecessarily), and they tend to do
writes in small, scattered chunks (rather than flushing an entire stripe's
worth of data at once).

Those who have been around long enough to remember the days of MFM/RLL
(when you could still find the real layout of the drives) may remember
optimizing things to work a track at a time instead of a sector at a time.
This is the exact same logic, just needing to be applied to drive stripes
instead of sectors and tracks on a single drive.

The issue has been raised with the kernel developers, but there's a lot of
work to be done (especially in figuring out how to get all the layers the
info they need in a reasonable way).

>> Linux software raid can do more then two disks in a mirror, so you may be
>> able to get the added protection with raid 1 sets (again, probably not
>> relavent to four drives), although there were bugs in this within the last
>> six months or so, so you need to be sure your kernel is new enough to have
>> the fix.
>>
>
> Well, here rises another doubt. Should I go for a single RAID 1+0 storing OS
> + Data + WAL files or will I be better off with two RAID 1 separating data
> from OS + Wal files?

If you can afford the space, you are almost certainly better off separating
the WAL from the data. (I think I've seen debates about which is better,
OS+data/WAL or data/OS+WAL, but very little disagreement that either is
better than combining them all.)

David Lang

>> now, if you can afford solid-state drives which don't have noticable seek
>> times, things are completely different ;-)
>
> Ha, sadly budget is very tight. :)
>
> Regards,
> Fernando.
>

Re: With 4 disks should I go for RAID 5 or RAID 10

From
david@lang.hm
Date:
On Wed, 26 Dec 2007, Florian Weimer wrote:

>> seek/read/calculate/seek/write since the drive moves on after the
>> read), when you read you must read _all_ drives in the set to check
>> the data integrity.
>
> I don't know of any RAID implementation that performs consistency
> checking on each read operation. 8-(

I could see a raid 1 array not doing consistency checking (after all, it
has no way of knowing what's right if it finds an error), but since raid
5/6 can repair the data I would expect them to do the checking each time.

David Lang

Re: With 4 disks should I go for RAID 5 or RAID 10

From
Mark Mielke
Date:
Florian Weimer wrote:
>> seek/read/calculate/seek/write since the drive moves on after the
>> read), when you read you must read _all_ drives in the set to check
>> the data integrity.
>
> I don't know of any RAID implementation that performs consistency
> checking on each read operation. 8-(

Dave had too much egg nog... :-)

Yep - checking consistency on read would eliminate the performance benefits of RAID under any redundant configuration.

Cheers,
mark

-- 
Mark Mielke <mark@mielke.cc>

Re: With 4 disks should I go for RAID 5 or RAID 10

From
david@lang.hm
Date:
On Wed, 26 Dec 2007, Mark Mielke wrote:

> Florian Weimer wrote:
>>> seek/read/calculate/seek/write since the drive moves on after the
>>> read), when you read you must read _all_ drives in the set to check
>>> the data integrity.
>>>
>> I don't know of any RAID implementation that performs consistency
>> checking on each read operation. 8-(
>>
>
> Dave had too much egg nog... :-)
>
> Yep - checking consistency on read would eliminate the performance benefits
> of RAID under any redundant configuration.

Except for raid0, raid is primarily a reliability benefit; any performance
benefit is incidental, not the primary purpose.

That said, I have heard of raid1 setups where it only reads off of one of
the drives, but I have not heard of higher raid levels doing so.

David Lang

Re: With 4 disks should I go for RAID 5 or RAID 10

From
Mark Mielke
Date:
david@lang.hm wrote:
>> Thanks for the explanation David. It's good to know not only what but
>> also
>> why. Still I wonder why reads do hit all drives. Shouldn't only 2
>> disks be
>> read: the one with the data and the parity disk?
> no, becouse the parity is of the sort (A+B+C+P) mod X = 0
> so if X=10 (which means in practice that only the last decimal digit
> of anything matters, very convienient for examples)
> A=1, B=2, C=3, A+B+C=6, P=4, A+B+C+P=10=0
> if you read B and get 3 and P and get 4 you don't know if this is
> right or not unless you also read A and C (at which point you would
> get A+B+C+P=11=1=error)
I don't think this is correct. RAID 5 parity is XOR. The
property of XOR is such that it doesn't matter what the other drives
are. You can write any block given either: 1) the block you are
overwriting and the parity, or 2) all other blocks except for the block
you are writing and the parity. Now, it might be possible that option 2)
is taken more than option 1) for some complicated reasons, but it is NOT
to check consistency. The array is assumed consistent until proven
otherwise.
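
A small Python sketch of the XOR property described here, showing that option 1 (read the old block and the old parity) and option 2 (read all the other data blocks) produce the same new parity block. The block size and contents are made up purely for illustration.

    import os
    from functools import reduce

    def xor(*blocks):
        """Byte-wise XOR of equally sized blocks."""
        return bytes(reduce(lambda a, b: a ^ b, byts) for byts in zip(*blocks))

    BLOCK = 16                                     # toy block size
    data = [os.urandom(BLOCK) for _ in range(3)]   # 3 data disks of a 4-disk RAID 5 stripe
    parity = xor(*data)                            # the parity disk's block

    new_block = os.urandom(BLOCK)                  # we want to overwrite data[1]

    # Option 1: read the old data block and the old parity, XOR both with the new block.
    parity_rmw = xor(parity, data[1], new_block)

    # Option 2: read all the *other* data blocks and XOR them with the new block.
    parity_reconstruct = xor(data[0], data[2], new_block)

    assert parity_rmw == parity_reconstruct        # both give the same new parity

    # A full-stripe write needs no reads at all: compute parity over the whole
    # new stripe and write every block once (the ideal sequential-write case).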

> in theory a system could get the same performance with a large
> sequential read/write on raid5/6 as on a raid0 array of equivilent
> size (i.e. same number of data disks, ignoring the parity disks)
> becouse the OS could read the entire stripe in at once, do the
> calculation once, and use all the data (or when writing, don't write
> anything until you are ready to write the entire stripe, calculate the
> parity and write everything once).
For the same number of drives, this cannot be possible. With 10 disks,
on raid5, 9 disks hold data, and 1 holds parity. The theoretical maximum
performance is only 9/10 of the 10/10 performance possible with RAID 0.

> Unfortunantly in practice filesystems don't support this, they don't
> do enough readahead to want to keep the entire stripe (so after they
> read it all in they throw some of it away), they (mostly) don't know
> where a stripe starts (and so intermingle different types of data on
> one stripe and spread data across multiple stripes unessasarily), and
> they tend to do writes in small, scattered chunks (rather then
> flushing an entire stripes worth of data at once)
In my experience, this theoretical maximum is not attainable without
significant write cache, and an intelligent controller, neither of which
Linux software RAID seems to have by default. My situation was a bit
worse in that I used applications that fsync() or journalled metadata
that is ordered, which forces the Linux software RAID to flush far more
than it should - but the same system works very well with RAID 1+0.

>>> Linux software raid can do more then two disks in a mirror, so you
>>> may be
>>> able to get the added protection with raid 1 sets (again, probably not
>>> relavent to four drives), although there were bugs in this within
>>> the last
>>> six months or so, so you need to be sure your kernel is new enough
>>> to have
>>> the fix.
>>>
>> Well, here rises another doubt. Should I go for a single RAID 1+0
>> storing OS
>> + Data + WAL files or will I be better off with two RAID 1 separating
>> data
>> from OS + Wal files?
> if you can afford the space, you are almost certinly better seperating
> the WAL from the data (I think I've seen debates about which is better
> OS+data/Wal or date/OS+Wal, but very little disagreement that either
> is better than combining them all)
I don't think there is a good answer for this question. If you can
afford more drives, you could also afford to make your RAID 1+0 bigger.
Splitting OS/DATA/WAL is only "absolute best" if you can arrange your 3
arrays such that their sizes are proportional to their access patterns. For
example, in an overly simplified case, if OS sees 1/4 the traffic of DATA, and
WAL 1/2 the traffic of DATA, then perhaps "best" is to have a two-disk RAID 1 for
OS, a four-disk RAID 1+0 for WAL, and an eight-disk RAID 1+0 for DATA.
This gives a total of 14 disks. :-)

In practice, if you have four drives and you try to split them into two plus
two, you're going to find that two of the drives are going to be more
idle than the other two.

I have a fun setup - I use RAID 1 across all four drives for the OS,
RAID 1+0 for the database, wal, and other parts, and RAID 0 for a
"build" partition. :-)

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>

Re: With 4 disks should I go for RAID 5 or RAID 10

From
Mark Mielke
Date:
david@lang.hm wrote:
> On Wed, 26 Dec 2007, Mark Mielke wrote:
>
>> Florian Weimer wrote:
>>>> seek/read/calculate/seek/write since the drive moves on after the
>>>> read), when you read you must read _all_ drives in the set to check
>>>> the data integrity.
>>> I don't know of any RAID implementation that performs consistency
>>> checking on each read operation. 8-(
>> Dave had too much egg nog... :-)
>> Yep - checking consistency on read would eliminate the performance
>> benefits of RAID under any redundant configuration.
> except for raid0, raid is primarily a reliability benifit, any
> performance benifit is incidental, not the primary purpose.
> that said, I have heard of raid1 setups where it only reads off of one
> of the drives, but I have not heard of higher raid levels doing so.
What do you mean "heard of"? Which raid system do you know of that reads
all drives for RAID 1?

Linux dmraid reads off ONLY the first. Linux mdadm reads off the "best"
one. Neither read from both. Why should it need to read from both? What
will it do if the consistency check fails? It's not like it can tell
which disk is the right one. It only knows that the whole array is
inconsistent. Until it gets an actual hardware failure (read error,
write error), it doesn't know which disk is wrong.

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>

Re: With 4 disks should I go for RAID 5 or RAID 10

From
Mark Mielke
Date:
david@lang.hm wrote:
> I could see a raid 1 array not doing consistancy checking (after all,
> it has no way of knowing what's right if it finds an error), but since
> raid 5/6 can repair the data I would expect them to do the checking
> each time.
Your messages are spread across the thread. :-)

RAID 5 cannot repair the data. I don't know much about RAID 6, but I
expect it cannot necessarily repair the data either. It still doesn't
know which drive is wrong. In any case, there is no implementation I am
aware of that performs mandatory consistency checks on read. This would
be silliness.

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>

Re: With 4 disks should I go for RAID 5 or RAID 10

From
Mark Mielke
Date:
Bill Moran wrote:
> In order to recalculate the parity, it has to have data from all disks. Thus,
> if you have 4 disks, it has to read 2 (the unknown data blocks included in
> the parity calculation) then write 2 (the new data block and the new
> parity data)  Caching can help some, but if your data ends up being any
> size at all, the cache misses become more frequent than the hits.  Even
> when caching helps, you max speed is still only the speed of a single
> disk.
>
If you have 4 disks, it can do either:

    1) Read the old block, read the parity block, XOR the old block with
the parity block and the new block resulting in the new parity block,
write both the new parity block and the new block.
    2) Read the two unknown blocks, XOR with the new block resulting in
the new parity block, write both the new parity block and the new block.

You are emphasizing 2 - but the scenario is also overly simplistic.
Imagine you had 10 drives on RAID 5. Would it make more sense to read 8
blocks and then write two (option 2, and the one you describe), or read
two blocks and then write two (option 1). Obviously, if option 1 or
option 2 can be satisfied from cache, it is better to not read at all.

I note that you also disagree with Dave, in that you are not claiming it
performs consistency checks on read. No system does this as performance
would go to the crapper.

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>

Re: With 4 disks should I go for RAID 5 or RAID 10

From
Bill Moran
Date:
In response to Mark Mielke <mark@mark.mielke.cc>:

> david@lang.hm wrote:
> > On Wed, 26 Dec 2007, Mark Mielke wrote:
> >
> >> Florian Weimer wrote:
> >>>> seek/read/calculate/seek/write since the drive moves on after the
> >>>> read), when you read you must read _all_ drives in the set to check
> >>>> the data integrity.
> >>> I don't know of any RAID implementation that performs consistency
> >>> checking on each read operation. 8-(
> >> Dave had too much egg nog... :-)
> >> Yep - checking consistency on read would eliminate the performance
> >> benefits of RAID under any redundant configuration.
> > except for raid0, raid is primarily a reliability benifit, any
> > performance benifit is incidental, not the primary purpose.
> > that said, I have heard of raid1 setups where it only reads off of one
> > of the drives, but I have not heard of higher raid levels doing so.
> What do you mean "heard of"? Which raid system do you know of that reads
> all drives for RAID 1?

I'm fairly sure that FreeBSD's GEOM does.  Of course, it couldn't be doing
consistency checking at that point.

--
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/

wmoran@collaborativefusion.com
Phone: 412-422-3463x4023

Re: With 4 disks should I go for RAID 5 or RAID 10

From
Bill Moran
Date:
In response to Mark Mielke <mark@mark.mielke.cc>:

> Bill Moran wrote:
> > In order to recalculate the parity, it has to have data from all disks. Thus,
> > if you have 4 disks, it has to read 2 (the unknown data blocks included in
> > the parity calculation) then write 2 (the new data block and the new
> > parity data)  Caching can help some, but if your data ends up being any
> > size at all, the cache misses become more frequent than the hits.  Even
> > when caching helps, you max speed is still only the speed of a single
> > disk.
> >
> If you have 4 disks, it can do either:
>
>     1) Read the old block, read the parity block, XOR the old block with
> the parity block and the new block resulting in the new parity block,
> write both the new parity block and the new block.
>     2) Read the two unknown blocks, XOR with the new block resulting in
> the new parity block, write both the new parity block and the new block.
>
> You are emphasizing 2 - but the scenario is also overly simplistic.
> Imagine you had 10 drives on RAID 5. Would it make more sense to read 8
> blocks and then write two (option 2, and the one you describe), or read
> two blocks and then write two (option 1). Obviously, if option 1 or
> option 2 can be satisfied from cache, it is better to not read at all.

Good point that I wasn't aware of.

> I note that you also disagree with Dave, in that you are not claiming it
> performs consistency checks on read. No system does this as performance
> would go to the crapper.

I call straw man :)

I don't disagree.  I simply don't know.  There's no reason why it _couldn't_
do consistency checking as it ran ... of course, performance would suck.

Generally what you expect out of RAID 5|6 is that it can rebuild a drive
in the event of a failure, so I doubt if anyone does consistency checking
by default, and I wouldn't be surprised if a lot of systems don't have
the option to do it at all.

--
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/

wmoran@collaborativefusion.com
Phone: 412-422-3463x4023

Re: With 4 disks should I go for RAID 5 or RAID 10

From
david@lang.hm
Date:
On Wed, 26 Dec 2007, Mark Mielke wrote:

> david@lang.hm wrote:
>>> Thanks for the explanation David. It's good to know not only what but also
>>> why. Still I wonder why reads do hit all drives. Shouldn't only 2 disks be
>>> read: the one with the data and the parity disk?
>> no, becouse the parity is of the sort (A+B+C+P) mod X = 0
>> so if X=10 (which means in practice that only the last decimal digit of
>> anything matters, very convienient for examples)
>> A=1, B=2, C=3, A+B+C=6, P=4, A+B+C+P=10=0
>> if you read B and get 3 and P and get 4 you don't know if this is right or
>> not unless you also read A and C (at which point you would get
>> A+B+C+P=11=1=error)
> I don't think this is correct. RAID 5 is parity which is XOR. The property of
> XOR is such that it doesn't matter what the other drives are. You can write
> any block given either: 1) The block you are overwriting and the parity, or
> 2) all other blocks except for the block we are writing and the parity. Now,
> it might be possible that option 2) is taken more than option 1) for some
> complicated reasons, but it is NOT to check consistency. The array is assumed
> consistent until proven otherwise.

I was being sloppy in explaining the reason. You are correct that for
writes you don't need to read all the data; you just need the current
parity block, the old data you are going to replace, and the new data to
be able to calculate the new parity block (and note that even with my
checksum example this would be the case).

However, I was addressing the point that for reads you can't do any
checking until you have read in all the blocks.

If you never check the consistency, how will it ever be proven otherwise?

>> in theory a system could get the same performance with a large sequential
>> read/write on raid5/6 as on a raid0 array of equivilent size (i.e. same
>> number of data disks, ignoring the parity disks) becouse the OS could read
>> the entire stripe in at once, do the calculation once, and use all the data
>> (or when writing, don't write anything until you are ready to write the
>> entire stripe, calculate the parity and write everything once).
> For the same number of drives, this cannot be possible. With 10 disks, on
> raid5, 9 disks hold data, and 1 holds parity. The theoretical maximum
> performance is only 9/10 of the 10/10 performance possible with RAID 0.

I was saying that a 10 drive raid0 could be the same performance as a 10+1
drive raid 5 or a 10+2 drive raid 6 array.

This is why I said 'same number of data disks, ignoring the parity disks'.

In practice you would probably not do quite this well anyway (you have the
parity calculation to make and the extra drive or two's worth of data
passing over your busses), but it could be a lot closer than any
implementation currently is.

>> Unfortunantly in practice filesystems don't support this, they don't do
>> enough readahead to want to keep the entire stripe (so after they read it
>> all in they throw some of it away), they (mostly) don't know where a stripe
>> starts (and so intermingle different types of data on one stripe and spread
>> data across multiple stripes unessasarily), and they tend to do writes in
>> small, scattered chunks (rather then flushing an entire stripes worth of
>> data at once)
> In my experience, this theoretical maximum is not attainable without
> significant write cache, and an intelligent controller, neither of which
> Linux software RAID seems to have by default. My situation was a bit worse in
> that I used applications that fsync() or journalled metadata that is ordered,
> which forces the Linux software RAID to flush far more than it should - but
> the same system works very well with RAID 1+0.

My statements above apply to any type of raid implementation, hardware or
software.

The thing that saves the hardware implementation is that the data is
written to a battery-backed cache and the controller lies to the system,
telling it that the write is complete, and then it does the write later.

On a journaling filesystem you could get very similar results if you put
the journal on a solid-state drive.

But for your application, the fact that you are doing lots of fsyncs is
what's killing you, because the fsync forces a lot of data to be written
out, swamping the caches involved, and requiring that you wait for seeks.
Nothing other than a battery-backed disk cache of some sort (either on the
controller, or a solid-state drive holding the journal of a journaled
filesystem) would help.
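
If you want to see what fsync() costs on a given array, a quick-and-dirty Python sketch along these lines makes the difference between cached and forced writes very visible. The /tmp path, block size and block count are arbitrary choices for illustration.

    import os, time

    def timed_writes(path, blocks=500, block_size=8192, sync_each=True):
        """Write a series of blocks, optionally fsync()ing after each one."""
        buf = b"x" * block_size
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
        start = time.time()
        try:
            for _ in range(blocks):
                os.write(fd, buf)
                if sync_each:
                    os.fsync(fd)   # force the data out to the platters before continuing
        finally:
            os.close(fd)
        return time.time() - start

    if __name__ == "__main__":
        print("fsync after every write: %.2f s" % timed_writes("/tmp/fsync_test", sync_each=True))
        print("no fsync (OS caches):    %.2f s" % timed_writes("/tmp/fsync_test", sync_each=False))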

David Lang


Re: With 4 disks should I go for RAID 5 or RAID 10

From
david@lang.hm
Date:
On Wed, 26 Dec 2007, Mark Mielke wrote:

> david@lang.hm wrote:
>> On Wed, 26 Dec 2007, Mark Mielke wrote:
>>
>>> Florian Weimer wrote:
>>>>> seek/read/calculate/seek/write since the drive moves on after the
>>>>> read), when you read you must read _all_ drives in the set to check
>>>>> the data integrity.
>>>> I don't know of any RAID implementation that performs consistency
>>>> checking on each read operation. 8-(
>>> Dave had too much egg nog... :-)
>>> Yep - checking consistency on read would eliminate the performance
>>> benefits of RAID under any redundant configuration.
>> except for raid0, raid is primarily a reliability benifit, any performance
>> benifit is incidental, not the primary purpose.
>> that said, I have heard of raid1 setups where it only reads off of one of
>> the drives, but I have not heard of higher raid levels doing so.
> What do you mean "heard of"? Which raid system do you know of that reads all
> drives for RAID 1?
>
> Linux dmraid reads off ONLY the first. Linux mdadm reads off the "best" one.
> Neither read from both. Why should it need to read from both? What will it do
> if the consistency check fails? It's not like it can tell which disk is the
> right one. It only knows that the whole array is inconsistent. Until it gets
> an actual hardware failure (read error, write error), it doesn't know which
> disk is wrong.

Yes, the two Linux software implementations only read from one disk, but I
have seen hardware implementations where it reads from both drives, and if
they disagree it returns a read error rather than possibly invalid data
(it's up to the admin to figure out which drive is bad at that point).

No, I don't remember which card this was. I've been playing around with
things in this space for quite a while.

David Lang

Re: With 4 disks should I go for RAID 5 or RAID 10

From
Mark Mielke
Date:
Bill Moran wrote:

>> What do you mean "heard of"? Which raid system do you know of that reads
>> all drives for RAID 1?
>
> I'm fairly sure that FreeBSD's GEOM does.  Of course, it couldn't be doing
> consistency checking at that point.
According to this:

http://www.freebsd.org/cgi/man.cgi?query=gmirror&apropos=0&sektion=8&manpath=FreeBSD+6-current&format=html

There is a -b (balance) option, and it seems pretty clear that it does not read from all drives if it does not have to:

     Create a mirror.  The order of components is important, because a
     component's priority is based on its position (starting from 0).  The
     component with the biggest priority is used by the prefer balance
     algorithm and is also used as a master component when resynchronization
     is needed, e.g. after a power failure when the device was open for
     writing.

     Additional options include:

         -b balance   Specifies balance algorithm to use, one of:

                      load         Read from the component with the lowest
                                   load.
                      prefer       Read from the component with the biggest
                                   priority.
                      round-robin  Use round-robin algorithm when choosing
                                   component to read.
                      split        Split read requests, which are bigger than
                                   or equal to slice size, on N pieces, where
                                   N is the number of active components.
                                   This is the default balance algorithm.


Cheers,
mark

-- 
Mark Mielke <mark@mielke.cc>

Re: With 4 disks should I go for RAID 5 or RAID 10

From
david@lang.hm
Date:
On Wed, 26 Dec 2007, Mark Mielke wrote:

> david@lang.hm wrote:
>> I could see a raid 1 array not doing consistancy checking (after all, it
>> has no way of knowing what's right if it finds an error), but since raid
>> 5/6 can repair the data I would expect them to do the checking each time.
> Your messages are spread across the thread. :-)
>
> RAID 5 cannot repair the data. I don't know much about RAID 6, but I expect
> it cannot necessarily repair the data either. It still doesn't know which
> drive is wrong. In any case, there is no implementation I am aware of that
> performs mandatory consistency checks on read. This would be silliness.

Sorry, raid 5 can repair data if it knows which chunk is bad (the same way
it can rebuild a drive). Raid 6 does something slightly different for its
parity; I know it can recover from two drives going bad, but I haven't
looked into the question of it detecting bad data.

David Lang

Re: With 4 disks should I go for RAID 5 or RAID 10

From
Mark Mielke
Date:
david@lang.hm wrote:
> however I was addressing the point that for reads you can't do any
> checking until you have read in all the blocks.
> if you never check the consistency, how will it ever be proven otherwise.
A scheme often used is to mark the disk/slice as "clean" during clean
system shutdown (or RAID device shutdown). When it comes back up, it is
assumed clean. Why wouldn't it be clean?

However, if it comes up "unclean", this does indeed require an EXPENSIVE
resynchronization process. Note, however, that resynchronization usually
reads or writes all disks, whether RAID 1, RAID 5, RAID 6, or RAID 1+0.
My RAID 1+0 does a full resynchronization if shut down uncleanly. There
is nothing specific about RAID 5 here.

Now, technically - none of these RAID levels requires a full
resynchronization, even though it is almost always recommended and
performed by default. There is an option in Linux software RAID (mdadm)
to "skip" the resynchronization process. The danger here is that you
could read one of the blocks this minute and get one block, and read the
same block a different minute, and get a different block. This would
occur in RAID 1 if it did round-robin, or chose the disk with the nearest
head to the desired block, or whatever, and it made a different decision before
and after the minute. What is the worst that can happen though? Any
system that does careful journalling / synchronization should usually be
fine. The "risk" is similar to write caching without battery backing, in
that if the drive tells the system "write complete", and the system goes
on to perform other work, but the write is not complete, then corruption
becomes a possibility.

Anyways - point is again that RAID 5 is not special here.

> but for your application, the fact that you are doing lots of fsyncs
> is what's killing you, becouse the fsync forces a lot of data to be
> written out, swamping the caches involved, and requiring that you wait
> for seeks. nothing other then a battery backed disk cache of some sort
> (either on the controller or a solid-state drive on a journaled
> filesystem would work)
Yep. :-)

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>

Re: With 4 disks should I go for RAID 5 or RAID 10

From
Greg Smith
Date:
On Wed, 26 Dec 2007, david@lang.hm wrote:

> yes, the two linux software implementations only read from one disk, but I
> have seen hardware implementations where it reads from both drives, and if
> they disagree it returns a read error rather then possibly invalid data (it's
> up to the admin to figure out which drive is bad at that point).

Right, many of the old implementations did that; even the Wikipedia
article on this subject mentions it in the "RAID 1 performance" section:
http://en.wikipedia.org/wiki/Standard_RAID_levels

The thing that changed is on modern drives, the internal error detection
and correction is good enough that if you lose a sector, the drive will
normally figure that out at the firmware level and return a read error
rather than bad data.  That lowers of the odds of one drive becoming
corrupted and returning a bad sector as a result enough that the overhead
of reading from both drives isn't considered as important.  I'm not aware
of a current card that does that but I wouldn't be surprised to discover
one existed.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: With 4 disks should I go for RAID 5 or RAID 10

From
Shane Ambler
Date:
Fernando Hevia wrote:

I'll start a little ways back first -

> Well, here rises another doubt. Should I go for a single RAID 1+0 storing OS
> + Data + WAL files or will I be better off with two RAID 1 separating data
> from OS + Wal files?

earlier you wrote -
> Database will be about 30 GB in size initially and growing 10 GB per year.
> Data is inserted overnight in two big tables and during the day mostly
> read-only queries are run. Parallelism is rare.

Now, if the data is added overnight while no-one is using the server, then
reading is where you want performance, provided any degradation in
writing doesn't slow the overnight data load down so much that it can't
finish before people start using the server again.

So in theory the only time you will see an advantage from having WAL on a
separate disk from the data is at night when the data load runs (I
am assuming this is an automated step).
But *some*? gains can be made from having the OS separate from the data.




(This is a theoretical discussion challenging the info/rumors that
abound about RAID setups, not an attempt to start a bitch fight or flame war.)


So for the guys who know the intricacies of RAID implementation -

I don't have any real world performance measures here.

For a setup that is only reading from disk (Santa sprinkles the data
down the air vent while we are all snug in our beds):

It has been mentioned that raid drivers/controllers can balance the
workload across the different disks - as Mark mentioned from the FreeBSD
6 man pages - the balance option can be set to
load|prefer|round-robin|split

So in theory a modern RAID 1 setup can be configured to get read speeds
similar to RAID 0, but it would still drop to single-disk speeds (or similar)
when writing, whereas RAID 0 can get the faster write performance.

So in a perfect setup (probably 1+0) 4x 300MB/s SATA drives could
deliver 1200MB/s of data to RAM, which is also assuming that all 4
channels have their own data path to RAM and aren't sharing.
(anyone know how segregated the on board controllers such as these are?)
(do some pci controllers offer better throughput?)

We all know that doesn't happen in the real world ;-) Let's say we are
restricted to 80% - roughly 1000MB/s - and some of that (10%) gets used by the
system - so we end up with 900MB/s delivered off disk to postgres - that
would still be more than the perfect rate that 2x 300MB/s drives can
deliver.

So in this situation - if configured correctly with a good controller
(or driver, for software RAID, etc.) - a single 4 disk RAID 1+0 could outperform
two 2 disk RAID 1 setups with data/OS+WAL split between the two.

Are real world speeds so different that this theory is pure fantasy,
or has hardware reached a point, performance-wise, where this is close to
fact?



--

Shane Ambler
pgSQL (at) Sheeky (dot) Biz

Get Sheeky @ http://Sheeky.Biz

Re: With 4 disks should I go for RAID 5 or RAID 10

From
Greg Smith
Date:
On Thu, 27 Dec 2007, Shane Ambler wrote:

> So in theory a modern RAID 1 setup can be configured to get similar read
> speeds as RAID 0 but would still drop to single disk speeds (or similar) when
> writing, but RAID 0 can get the faster write performance.

The trick is, you need a perfect controller that scatters individual reads
evenly across the two disks as sequential reads move along the disk to
pull this off, bouncing between a RAID 1 pair to use all the bandwidth
available.  There are caches inside the disk, read-ahead strategies as
well, and that all has to line up just right for a single client to get
all the bandwidth.  Real-world disks and controllers don't quite behave
well enough for that to predictably deliver what you might expect from
theory.  With RAID 0, getting the full read speed of 2Xsingle drive is
much more likely to actually happen than in RAID 1.

> So in a perfect setup (probably 1+0) 4x 300MB/s SATA drives could
> deliver 1200MB/s of data to RAM, which is also assuming that all 4
> channels have their own data path to RAM and aren't sharing. (anyone
> know how segregated the on board controllers such as these are?) (do
> some pci controllers offer better throughput?)

OK, first off, beyond the occasional trivial burst you'll be hard pressed
to ever sustain over 60MB/s out of any single SATA drive.  So the
theoretical max 4-channel speed is closer to 240MB/s.

A regular PCI bus tops out at a theoretical 133MB/s, and you sure can
saturate one with 4 disks and a good controller.  This is why server
configurations have controller cards that use PCI-X (1024MB/s) or lately
PCI-e aka PCI/Express (250MB/s for each channel with up to 16 being
common).  If your SATA cards are on a motherboard, that's probably using
some integrated controller via the Southbridge AKA the ICH.  That's
probably got 250MB/s or more and in current products can easily outrun
most sets of disks you'll ever connect.  Even on motherboards that support
8 SATA channels it will be difficult for anything else on the system to go
higher than 250MB/s even if the drives could potentially do more, especially
once you're dealing with real-world workloads.

If you have multiple SATA controllers each with their own set of disk,
then you're back to having to worry about the bus limits.  So, yes, there
are bus throughput considerations here, but unless you're building a giant
array or using some older bus technology you're unlikely to hit them with
spinning SATA disks.
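
Plugging the rough figures above into a quick Python calculation shows approximately how many drives it takes to saturate each bus. These are the approximate numbers quoted in this thread, not benchmark results.

    DRIVE_MBPS = 60                 # realistic sustained rate for one SATA drive
    BUSES = {
        "PCI":             133,     # shared 32-bit/33MHz PCI bus
        "PCI-X":           1024,
        "PCI-e x1":        250,
        "PCI-e x16":       250 * 16,
        "Southbridge/ICH": 250,     # typical integrated SATA controller uplink
    }

    for bus, mbps in BUSES.items():
        drives = mbps // DRIVE_MBPS
        print(f"{bus:>15}: ~{mbps} MB/s, saturated by roughly {drives} drives doing sequential I/O")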

> We all know that doesn't happen in the real world ;-) Let's say we are
> restricted to 80% - 1000MB/s

Yeah, as mentioned above it's actually closer to 20%.

While your numbers are off by a bunch, the reality for database use means
these computations don't matter much anyway.  The seek related behavior
drives a lot of this more than sequential throughput, and decisions like
whether to split out the OS or WAL or whatever need to factor all that,
rather than just the theoretical I/O.

For example, one reason it's popular to split the WAL onto another disk is
that under normal operation the disk never does a seek.  So if there's a
dedicated disk for that, the disk just writes but never moves much.
Where if the WAL is shared, the disk has to jump between writing that data
and whatever else is going on, and peak possible WAL throughput is waaaay
slower because of those seeks.  (Note that unless you have a bunch of
disks, your WAL is unlikely to be a limiter anyway so you still may not
want to make it separate).
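
A toy model of that effect: a dedicated WAL disk pays no seeks, while a shared disk pays a seek every time it switches between WAL writes and other I/O. The seek and write times below are invented round numbers, only meant to show the shape of the difference.

    SEEK_MS = 8.0     # invented: average seek plus rotational latency
    WRITE_MS = 0.2    # invented: time to write one 8 kB WAL record once the head is in place

    def wal_records_per_second(seeks_per_record):
        ms_per_record = WRITE_MS + seeks_per_record * SEEK_MS
        return 1000.0 / ms_per_record

    print("dedicated WAL disk (no seeks):     %6.0f records/s" % wal_records_per_second(0))
    print("shared disk (seek per WAL record): %6.0f records/s" % wal_records_per_second(1))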

(This topic so badly needs a PostgreSQL specific FAQ)

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: With 4 disks should I go for RAID 5 or RAID 10

From
Mark Mielke
Date:
Shane Ambler wrote:
> So in theory a modern RAID 1 setup can be configured to get similar
> read speeds as RAID 0 but would still drop to single disk speeds (or
> similar) when writing, but RAID 0 can get the faster write performance.

Unfortunately, it's a bit more complicated than that. RAID 1 has a
sequential read problem, as read-ahead is wasted, and you may as well
read from one disk and ignore the others. RAID 1 does, however, allow
for much greater concurrency. 4 processes on a 4 disk RAID 1 system can,
theoretically, each do whatever they want, without impacting each other.
Database loads involving a single active read user will see greater
performance with RAID 0. Database loads involving multiple concurrent
active read users will see greater performance with RAID 1. All of these
assume writes are not being performed to any great significance. Even
single writes cause all disks in a RAID 1 system to synchronize,
temporarily eliminating the read benefit. RAID 0 allows some degree of
concurrent reads and writes occurring at the same time (assuming even
distribution of the data across the devices). Of course, RAID 0 systems
have an expected life that decreases as the number of disks in the
system increases.

So, this is where we get to RAID 1+0. Redundancy, good read performance,
good write performance, relatively simple implementation. For the mere
cost of doubling the disk storage,
you can get around the problems of RAID 1 and the problems of RAID 0. :-)
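
One way to see the concurrency argument with a toy calculation, assuming random, uniformly distributed reads and an ideal balancer (purely illustrative, not a benchmark):

    # 4 clients each issue one random read at a time on a 4-disk array.  With a
    # 4-way mirror (raid1) an ideal balancer can give each client its own spindle.
    # With raid0 each block lives on one specific disk, so at any instant some
    # clients will collide on the same spindle.  Expected number of distinct busy
    # spindles under raid0 is N * (1 - (1 - 1/N)**C) for N disks and C clients.

    N_DISKS = 4
    CLIENTS = 4

    raid1_busy = min(CLIENTS, N_DISKS)                        # one mirror per client
    raid0_busy = N_DISKS * (1 - (1 - 1 / N_DISKS) ** CLIENTS)

    print("raid1 (4-way mirror): ~%.1f spindles working in parallel" % raid1_busy)
    print("raid0 (4-way stripe): ~%.1f spindles working in parallel" % raid0_busy)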

> So in a perfect setup (probably 1+0) 4x 300MB/s SATA drives could
> deliver 1200MB/s of data to RAM, which is also assuming that all 4
> channels have their own data path to RAM and aren't sharing.
> (anyone know how segregated the on board controllers such as these are?)
> (do some pci controllers offer better throughput?)
> We all know that doesn't happen in the real world ;-) Let's say we are
> restricted to 80% - 1000MB/s - and some of that (10%) gets used by the
> system - so we end up with 900MB/s delivered off disk to postgres -
> that would still be more than the perfect rate at which 2x 300MB/s
> drives can deliver.

I expect you would have to have good hardware, and a well tuned system
to see 80%+ theoretical for common work loads. But then, this isn't
unique to RAID. Even in a single disk system, one has trouble achieving
80%+ theoretical. :-)

I achieve something closer to +20% - +60% over the theoretical
performance of a single disk with my four disk RAID 1+0 partitions. Lots
of compromises in my system though that I won't get into. For me, I
value the redundancy, allowing for a single disk to fail and giving me
time to easily recover, but for the cost of two more disks, I am able to
counter the performance cost of redundancy, and actually see a positive
performance effect instead.

> So in this situation - if configured correctly with a good controller
> (driver for software RAID etc) a single 4 disk RAID 1+0 could
> outperform two 2 disk RAID 1 setups with data/OS+WAL split between the
> two.
> Is the real world speeds so different that this theory is real fantasy
> or has hardware reached a point performance wise where this is close
> to fact??
I think it depends on the balance. If every second operation requires a
WAL write, having them separate might make sense. However, if the balance is
less than even, one would end up with one of the 2 disk RAID 1 setups
being more idle than the other. It's not an exact science when it comes
to the various compromises being made. :-)

If you can only put 4 disks into the system (either for cost, or because of
the system size), I would suggest RAID 1+0 on all four as a sensible
compromise. If you can put more in, start to consider breaking it up. :-)

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>

Re: With 4 disks should I go for RAID 5 or RAID 10

From
Shane Ambler
Date:
Mark Mielke wrote:
> Shane Ambler wrote:
>> So in a perfect setup (probably 1+0) 4x 300MB/s SATA drives could
>> deliver 1200MB/s of data to RAM, which is also assuming that all 4
>> channels have their own data path to RAM and aren't sharing.
>> (anyone know how segregated the on board controllers such as these
>> are?)
 >> (do some pci controllers offer better throughput?)
 >> We all know that doesn't happen in the real world ;-) Let's say we
 >> are restricted to 80% - 1000MB/s - and some of that (10%) gets used
 >> by the system - so we end up with 900MB/s delivered off disk to
>> postgres - that would still be more than the perfect rate at which
>> 2x 300MB/s drives can deliver.
>
> I achieve something closer to +20% - +60% over the theoretical
> performance of a single disk with my four disk RAID 1+0 partitions.

If a good 4 disk SATA RAID 1+0 can achieve 60% more throughput than a
single SATA disk, what sort of percentage can be achieved from a good
SCSI controller with 4 disks in RAID 1+0?

Are we still hitting the bus limits at this point or can a SCSI RAID
still outperform in raw data throughput?

I would think that SCSI still provides the better reliability
it always has, but performance-wise is it still in front of SATA?



--

Shane Ambler
pgSQL (at) Sheeky (dot) Biz

Get Sheeky @ http://Sheeky.Biz

Re: With 4 disks should I go for RAID 5 or RAID 10

From
Shane Ambler
Date:
Greg Smith wrote:
> On Thu, 27 Dec 2007, Shane Ambler wrote:
>
>> So in theory a modern RAID 1 setup can be configured to get similar
>> read speeds as RAID 0 but would still drop to single disk speeds (or
>> similar) when writing, but RAID 0 can get the faster write performance.
>
> The trick is, you need a perfect controller that scatters individual
> reads evenly across the two disks as sequential reads move along the
> disk to pull this off, bouncing between a RAID 1 pair to use all the
> bandwidth available.  There are caches inside the disk, read-ahead
> strategies as well, and that all has to line up just right for a single
> client to get all the bandwidth.  Real-world disks and controllers don't
> quite behave well enough for that to predictably deliver what you might
> expect from theory.  With RAID 0, getting the full read speed of
> 2Xsingle drive is much more likely to actually happen than in RAID 1.

Kind of makes the point for using 1+0.

>> So in a perfect setup (probably 1+0) 4x 300MB/s SATA drives could
>> deliver 1200MB/s of data to RAM, which is also assuming that all 4
>> channels have their own data path to RAM and aren't sharing.

> OK, first off, beyond the occasional trivial burst you'll be hard
> pressed to ever sustain over 60MB/s out of any single SATA drive.  So
> the theoretical max 4-channel speed is closer to 240MB/s.
>
> A regular PCI bus tops out at a theoretical 133MB/s, and you sure can
> saturate one with 4 disks and a good controller.  This is why server
> configurations have controller cards that use PCI-X (1024MB/s) or lately
> PCI-e aka PCI/Express (250MB/s for each channel with up to 16 being
> common).  If your SATA cards are on a motherboard, that's probably using

So I guess as far as performance goes your motherboard will determine
how far you can take it.

(talking from a db only server view on things)

A PCI system will have little benefit from more than 2 disks but would
need 4 to get both reliability and performance.

PCI-X can benefit from up to 17 disks

PCI-e (with 16 channels) can benefit from 66 disks

The trick there will be dividing your db over a large number of disk
sets to balance the load among them (I don't see 66 disks being set up in
one array), so this would be of limited use to anyone but the most
dedicated DBAs.

For most servers these days, disks are added to reach a performance level,
not to meet a storage requirement.
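
To show where those disk counts come from, a quick arithmetic sketch (the
~60 MB/s sustained figure is Greg's estimate above; the bus numbers are
theoretical maxima, so treat the results as rough upper bounds):

# Rough upper bound on how many ~60 MB/s SATA drives each bus type
# could feed before the bus itself becomes the bottleneck.
# Bus figures are theoretical maxima, not measured throughput.

SUSTAINED_MB_S = 60            # per-drive sustained rate (Greg's estimate)

buses = {
    "PCI (32-bit/33MHz)": 133,
    "PCI-X": 1024,
    "PCI-e x16": 16 * 250,     # 250 MB/s per lane, 16 lanes
}

for name, bandwidth in buses.items():
    print(f"{name}: ~{bandwidth // SUSTAINED_MB_S} drives before saturation")

That prints roughly 2, 17 and 66 drives, which is where the figures above
come from.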

> While your numbers are off by a bunch, the reality for database use
> means these computations don't matter much anyway.  The seek related
> behavior drives a lot of this more than sequential throughput, and
> decisions like whether to split out the OS or WAL or whatever need to
> factor all that, rather than just the theoretical I/O.
>

So this is where solid state disks come in - the (basically nonexistent)
seek times mean they can saturate your bus limits.
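
To put the seek point in numbers, a toy comparison (the IOPS and MB/s
figures are rough assumptions for a single 7200 rpm SATA disk, not
measurements):

# Toy comparison of seek-bound vs sequential-bound throughput for one
# 7200 rpm SATA disk.  IOPS and MB/s figures are rough assumptions.

RANDOM_IOPS = 100          # assumed random reads/sec per disk
SEQ_MB_S = 60              # assumed sustained sequential MB/s per disk
PAGE_KB = 8                # PostgreSQL block size

random_mb_s = RANDOM_IOPS * PAGE_KB / 1024
print(f"random 8KB reads : ~{random_mb_s:.1f} MB/s")
print(f"sequential reads : ~{SEQ_MB_S} MB/s "
      f"(~{SEQ_MB_S / random_mb_s:.0f}x the random rate)")

Under those assumptions a seek-bound workload gets well under 1 MB/s per
spindle, which is why seek behaviour, not sequential throughput, tends to
drive these layout decisions.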


--

Shane Ambler
pgSQL (at) Sheeky (dot) Biz

Get Sheeky @ http://Sheeky.Biz

Re: With 4 disks should I go for RAID 5 or RAID 10

From
Bill Moran
Date:
In response to Mark Mielke <mark@mark.mielke.cc>:

> Bill Moran wrote:
> >
> >> What do you mean "heard of"? Which raid system do you know of that reads
> >> all drives for RAID 1?
> >>
> >
> > I'm fairly sure that FreeBSD's GEOM does.  Of course, it couldn't be doing
> > consistency checking at that point.
> >
> According to this:
>
> http://www.freebsd.org/cgi/man.cgi?query=gmirror&apropos=0&sektion=8&manpath=FreeBSD+6-current&format=html
>
> There is a -b (balance) option that seems pretty clear that it does not
> read from all drives if it does not have to:

From where did you draw that conclusion?  Note that the "split" algorithm
(which is the default) divides requests up among multiple drives.  I'm
unclear as to how you reached a conclusion opposite of what the man page
says -- did you test and find it not to work?

>
>     Create a mirror.
>         The order of components is important, because a component's
>         priority is based on its position (starting from 0).  The
>         component with the biggest priority is used by the prefer
>         balance algorithm and is also used as a master component when
>         resynchronization is needed, e.g. after a power failure when
>         the device was open for writing.
>
>     Additional options include:
>
>         -b balance   Specifies balance algorithm to use, one of:
>
>                      load         Read from the component with the
>                                   lowest load.
>
>                      prefer       Read from the component with the
>                                   biggest priority.
>
>                      round-robin  Use round-robin algorithm when
>                                   choosing component to read.
>
>                      split        Split read requests, which are
>                                   bigger than or equal to slice size,
>                                   on N pieces, where N is the number
>                                   of active components.  This is the
>                                   default balance algorithm.
>
>
>
> Cheers,
> mark
>
> --
> Mark Mielke <mark@mielke.cc>
>
>
>
>
>
>
>
>


--
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/

wmoran@collaborativefusion.com
Phone: 412-422-3463x4023


Re: With 4 disks should I go for RAID 5 or RAID 10

From
Mark Mielke
Date:
Bill Moran wrote:
> In response to Mark Mielke <mark@mark.mielke.cc>:
>
>> Bill Moran wrote:
>>> I'm fairly sure that FreeBSD's GEOM does.  Of course, it couldn't be doing
>>> consistency checking at that point.
>> According to this:
>>
>> http://www.freebsd.org/cgi/man.cgi?query=gmirror&apropos=0&sektion=8&manpath=FreeBSD+6-current&format=html
>>
>> There is a -b (balance) option that seems pretty clear that it does not
>> read from all drives if it does not have to:
>
> From where did you draw that conclusion?  Note that the "split" algorithm
> (which is the default) divides requests up among multiple drives.  I'm
> unclear as to how you reached a conclusion opposite of what the man page
> says -- did you test and find it not to work?
Perhaps you and I are speaking slightly different languages? :-) When I say "does not read from all drives", I mean "it will happily read from any of the drives to satisfy the request, and allows some level of configuration as to which drive it will select. It does not need to read all of the drives to satisfy the request."
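
As an illustration of what that read balancing looks like, a toy Python
sketch (the slice size and component names are made up; this is only the
behaviour the man page describes, not gmirror's actual code):

# Toy model of gmirror-style read balancing across mirror components.
# COMPONENTS and SLICE_SIZE are made up for illustration.

import itertools

COMPONENTS = ["ad0", "ad1"]        # hypothetical mirror members
SLICE_SIZE = 4096                  # hypothetical slice size in bytes

_rr = itertools.cycle(COMPONENTS)  # state for round-robin balancing

def round_robin(offset, length):
    # every read is served whole, by the next component in turn
    return [(next(_rr), offset, length)]

def split(offset, length):
    # large reads are split into N pieces, one per active component;
    # small reads fall back to a single component
    if length < SLICE_SIZE:
        return round_robin(offset, length)
    n = len(COMPONENTS)
    piece = length // n
    return [(c, offset + i * piece, piece) for i, c in enumerate(COMPONENTS)]

print(split(0, 512))       # small read -> one component
print(split(0, 65536))     # big read  -> one piece per component

Either way, no read ever has to touch every drive just to return correct
data, which is the point being made above.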

Cheers,
mark

-- 
Mark Mielke <mark@mielke.cc>

Re: With 4 disks should I go for RAID 5 or RAID 10

From
Jean-David Beyer
Date:
Shane Ambler wrote:

>> I achieve something closer to +20% - +60% over the theoretical
>> performance of a single disk with my four disk RAID 1+0 partitions.
>
> If a good 4 disk SATA RAID 1+0 can achieve 60% more throughput than a
> single SATA disk, what sort of percentage can be achieved from a good
> SCSI controller with 4 disks in RAID 1+0?
>
> Are we still hitting the bus limits at this point or can a SCSI RAID
> still outperform in raw data throughput?
>
> I would still think that SCSI would still provide the better reliability
> that it always has, but performance wise is it still in front of SATA?
>
I have a SuperMicro X5DP8-G2 motherboard with two hyperthreaded
microprocessors on it. This motherboard has 5 PCI-X busses (not merely 5
sockets: in fact it has 6 sockets), plus a dual Ultra/320 SCSI controller
chip and a dual gigabit ethernet chip.

So I hook up my 4 10,000 rpm database hard drives on one SCSI controller and
the two other 10,000 rpm hard drives on the other. Nothing else is on that
SCSI controller or its PCI-X bus to main memory except the other SCSI
controller. These PCI-X busses run at 133 MHz and the memory at 266 MHz, but
the FSB runs at 533 MHz because the memory modules are run in parallel;
i.e., there are 8 modules and they run two at a time.

Nothing else is on the other SCSI controller. Of the two hard drives on the
second controller, one has the WAL on it, but when my database is running
something (it is up all the time, but frequently idle) nothing else uses
that drive much.

So in theory, I should be able to get about 320 megabytes/second through
each SCSI controller, though I have never seen that. I do get over 60
megabytes/second for brief (several second) periods though. I do not run RAID.
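
For anyone who wants to check numbers like that themselves, a crude
sequential-read timing sketch (the path is a placeholder; use a file much
larger than RAM, or drop the page cache first, or the figure will mostly
reflect cached data rather than the disks):

# Crude sequential read throughput check.  TEST_FILE is a placeholder;
# point it at a file much larger than RAM (or a raw device, as root) so
# the result reflects the disks rather than the page cache.

import os, time

TEST_FILE = "/path/to/big/testfile"   # placeholder path
CHUNK = 1024 * 1024                   # read in 1 MB chunks

start = time.time()
total = 0
with open(TEST_FILE, "rb", buffering=0) as f:
    while True:
        data = f.read(CHUNK)
        if not data:
            break
        total += len(data)
elapsed = time.time() - start
print(f"{total / (1024 * 1024) / elapsed:.1f} MB/s over {elapsed:.1f} s")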

I think it is probably very difficult to generalize how things go without a
good knowledge of how the motherboard is organized, the amounts and types of
caching that take place (both software and hardware), the speeds of the
various devices and their controllers, the bandwidths of the various
communication paths, and so on.


--
  .~.  Jean-David Beyer          Registered Linux User 85642.
  /V\  PGP-Key: 9A2FC99A         Registered Machine   241939.
 /( )\ Shrewsbury, New Jersey    http://counter.li.org
 ^^-^^ 11:00:01 up 10 days, 11:30, 2 users, load average: 4.20, 4.20, 4.25

Re: With 4 disks should I go for RAID 5 or RAID 10

From
Vivek Khera
Date:
On Dec 26, 2007, at 10:21 AM, Bill Moran wrote:

> I snipped the rest of your message because none of it matters.
> Never use
> RAID 5 on a database system.  Ever.  There is absolutely NO reason to
> every put yourself through that much suffering.  If you hate yourself
> that much just commit suicide, it's less drastic.

Once you hit 14 or more spindles, the difference between RAID10 and
RAID5 (or preferably RAID6) is minimal.

In your 4 disk scenario, I'd vote RAID10.
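
For a rough feel of what each level costs, a small comparison sketch (the
write-penalty figures are the standard textbook I/O counts for one small
random write; a controller with a decent write cache will blur these):

# Usable capacity and textbook small-random-write penalty per RAID level.
# Penalty = disk I/Os needed per logical small write (write cache ignored).

def usable_disks(level, n):
    return {"RAID10": n // 2, "RAID5": n - 1, "RAID6": n - 2}[level]

WRITE_PENALTY = {"RAID10": 2, "RAID5": 4, "RAID6": 6}

for n in (4, 14):
    for level in ("RAID10", "RAID5", "RAID6"):
        print(f"{n} disks {level}: {usable_disks(level, n)} disks usable, "
              f"write penalty {WRITE_PENALTY[level]} I/Os")
    print()

At 4 disks RAID10 gives up only one disk of capacity relative to RAID5
while keeping the lower write penalty, which is why it keeps winning the
vote here.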


Re: With 4 disks should I go for RAID 5 or RAID 10

From
Vivek Khera
Date:
On Dec 26, 2007, at 4:28 PM, david@lang.hm wrote:

> now, if you can afford solid-state drives which don't have noticable
> seek times, things are completely different ;-)

Who makes one with "infinite" lifetime?  The only ones I know of are
built using RAM with disk drive backup and internal monitoring, and they
are *really* expensive.

I've pondered building a RAID enclosure using these new SATA flash
drives, but that would be an expensive brick after a short period as
one of my DB servers...


Re: With 4 disks should I go for RAID 5 or RAID 10

From
Bill Moran
Date:
In response to Mark Mielke <mark@mark.mielke.cc>:

> Bill Moran wrote:
> > In response to Mark Mielke <mark@mark.mielke.cc>:
> >
> >
> >> Bill Moran wrote:
> >>
> >>> I'm fairly sure that FreeBSD's GEOM does.  Of course, it couldn't be doing
> >>> consistency checking at that point.
> >>>
> >> According to this:
> >>
> >> http://www.freebsd.org/cgi/man.cgi?query=gmirror&apropos=0&sektion=8&manpath=FreeBSD+6-current&format=html
> >>
> >> There is a -b (balance) option that seems pretty clear that it does not
> >> read from all drives if it does not have to:
> >>
> >
> > From where did you draw that conclusion?  Note that the "split" algorithm
> > (which is the default) divides requests up among multiple drives.  I'm
> > unclear as to how you reached a conclusion opposite of what the man page
> > says -- did you test and find it not to work?
> >
> Perhaps you and I are speaking slightly different languages? :-) When I
> say "does not read from all drives", I mean "it will happily read from
> any of the drives to satisfy the request, and allows some level of
> configuration as to which drive it will select. It does not need to read
> all of the drives to satisfy the request."

Ahh ... I did misunderstand you.

--
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/

wmoran@collaborativefusion.com
Phone: 412-422-3463x4023