Thread: With 4 disks should I go for RAID 5 or RAID 10
Hi list,
I am building kind of a poor man’s database server:
Pentium D 945 (2 x 3 GHz cores)
4 GB RAM
4 x 160 GB SATA II 7200 rpm (Intel server motherboard has only 4 SATA ports)
Database will be about 30 GB in size initially and growing 10 GB per year. Data is inserted overnight in two big tables and during the day mostly read-only queries are run. Parallelism is rare.
I have read about different RAID levels with Postgres, but the advice I found seems to apply to systems with 8+ disks. With only four disks and performance in mind, should I build a RAID 10 or a RAID 5 array? RAID 0 is ruled out since redundancy is needed.
I am going to use software RAID with Linux (Ubuntu Server 6.06).
Thanks for any insight.
Regards,
Fernando.
RAID 10. I snipped the rest of your message because none of it matters. Never use RAID 5 on a database system. Ever. There is absolutely NO reason to ever put yourself through that much suffering. If you hate yourself that much just commit suicide, it's less drastic. -- Bill Moran Collaborative Fusion Inc. http://people.collaborativefusion.com/~wmoran/ wmoran@collaborativefusion.com Phone: 412-422-3463x4023
> Database will be about 30 GB in size initially and growing 10 GB per year. Data is inserted overnight in two big tables and during the day mostly read-only queries are run. Parallelism is rare.
> I have read about different RAID levels with Postgres, but the advice I found seems to apply to systems with 8+ disks. With only four disks and performance in mind, should I build a RAID 10 or a RAID 5 array? RAID 0 is ruled out since redundancy is needed.
> I am going to use software RAID with Linux (Ubuntu Server 6.06).
In my experience, software RAID 5 is horrible. Write performance can decrease below the speed of one disk on its own, and read performance will not be significantly more than RAID 1+0 as the number of stripes has only increased from 2 to 3, and if reading while writing, you will not get 3X as RAID 5 write requires at least two disks to be involved. I believe hardware RAID 5 is also horrible, but since the hardware hides it from the application, a hardware RAID 5 user might not care.
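To put rough numbers on that write behaviour, here is a back-of-the-envelope sketch in Python. The ~75 random IOPS per 7200 rpm drive and the textbook penalty factors (2 physical I/Os per logical write for RAID 1+0, 4 for RAID 5) are illustrative assumptions, not measurements of this hardware:

```python
# Rough estimate of random-write capacity for a 4-disk array.
# Assumes ~75 random IOPS per 7200 rpm SATA drive and the textbook write
# penalties: RAID 1+0 = 2 physical I/Os per write (two mirrored copies),
# RAID 5 = 4 (read old data, read old parity, write new data, write new parity).

DISKS = 4
IOPS_PER_DISK = 75                           # assumed, not measured
WRITE_PENALTY = {"RAID 1+0": 2, "RAID 5": 4}

for level, penalty in WRITE_PENALTY.items():
    effective = DISKS * IOPS_PER_DISK / penalty
    print(f"{level}: ~{effective:.0f} random writes/sec")
# RAID 1+0: ~150 random writes/sec
# RAID 5:   ~75 random writes/sec (no better than a single disk)
```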
Software RAID 1+0 works fine on Linux with 4 disks. This is the setup I use for my personal server.
Cheers,
mark
-- Mark Mielke <mark@mielke.cc>
On Wed, 26 Dec 2007, Mark Mielke wrote: > I believe hardware RAID 5 is also horrible, but since the hardware hides > it from the application, a hardware RAID 5 user might not care. Typically anything doing hardware RAID 5 also has a reasonably sized write cache on the controller, which softens the problem a bit. As soon as you exceed what it can buffer you're back to suffering again. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
> Bill Moran wrote: > > RAID 10. > > I snipped the rest of your message because none of it matters. Never use > RAID 5 on a database system. Ever. There is absolutely NO reason to > every put yourself through that much suffering. If you hate yourself > that much just commit suicide, it's less drastic. > Well, that's a pretty strong argument. No suicide in my plans, gonna stick to RAID 10. :) Thanks.
Mark Mielke Wrote: >In my experience, software RAID 5 is horrible. Write performance can >decrease below the speed of one disk on its own, and read performance will >not be significantly more than RAID 1+0 as the number of stripes has only >increased from 2 to 3, and if reading while writing, you will not get 3X as >RAID 5 write requires at least two disks to be involved. I believe hardware >RAID 5 is also horrible, but since the hardware hides it from the >application, a hardware RAID 5 user might not care. >Software RAID 1+0 works fine on Linux with 4 disks. This is the setup I use >for my personal server. I will use software RAID so RAID 1+0 seems to be the obvious choice. Thanks for the advice!
On Wed, 26 Dec 2007, Fernando Hevia wrote: > Mark Mielke Wrote: > >> In my experience, software RAID 5 is horrible. Write performance can >> decrease below the speed of one disk on its own, and read performance will >> not be significantly more than RAID 1+0 as the number of stripes has only >> increased from 2 to 3, and if reading while writing, you will not get 3X as >> RAID 5 write requires at least two disks to be involved. I believe hardware >> RAID 5 is also horrible, but since the hardware hides it from the >> application, a hardware RAID 5 user might not care. > >> Software RAID 1+0 works fine on Linux with 4 disks. This is the setup I use >> for my personal server. > > I will use software RAID so RAID 1+0 seems to be the obvious choice. > Thanks for the advice! To clarify things a bit more: with only four drives the space difference between RAID 1+0 and RAID 5 isn't that much, but when you do a write you must write to two drives (the drive holding the data you are changing, and the drive that holds the parity data for that stripe, possibly needing to read the old parity data first, resulting in stalling for seek/read/calculate/seek/write since the drive moves on after the read). When you read, you must read _all_ drives in the set to check the data integrity. For seek-heavy workloads (which almost every database application is) the extra seeks involved can be murder on your performance. If your workload is large sequential reads/writes, and you can let the OS buffer things for you, the performance of RAID 5 is much better. On the other hand, doing RAID 6 (instead of RAID 5) gives you extra data protection in exchange for the performance hit, but with only 4 drives this probably isn't what you are looking for. Linux software RAID can do more than two disks in a mirror, so you may be able to get the added protection with RAID 1 sets (again, probably not relevant to four drives), although there were bugs in this within the last six months or so, so you need to be sure your kernel is new enough to have the fix. Now, if you can afford solid-state drives, which don't have noticeable seek times, things are completely different ;-) David Lang
> David Lang Wrote: > > with only four drives the space difference between raid 1+0 and raid 5 > isn't that much, but when you do a write you must write to two drives (the > drive holding the data you are changing, and the drive that holds the > parity data for that stripe, possibly needing to read the old parity data > first, resulting in stalling for seek/read/calculate/seek/write since > the drive moves on after the read), when you read you must read _all_ > drives in the set to check the data integrity. Thanks for the explanation, David. It's good to know not only what but also why. Still, I wonder why reads hit all drives. Shouldn't only 2 disks be read: the one with the data and the parity disk? > > for seek heavy workloads (which almost every database application is) the > extra seeks involved can be murder on your performance. if your workload > is large sequential reads/writes, and you can let the OS buffer things for > you, the performance of raid 5 is much better. Well, actually most of my application involves large sequential reads/writes. The memory available for buffering (4GB) isn't bad either, at least for my scenario. On the other hand, I have read such strong posts against RAID 5 that I hesitate to even consider it. > > Linux software raid can do more then two disks in a mirror, so you may be > able to get the added protection with raid 1 sets (again, probably not > relavent to four drives), although there were bugs in this within the last > six months or so, so you need to be sure your kernel is new enough to have > the fix. > Well, here arises another doubt. Should I go for a single RAID 1+0 storing OS + Data + WAL files, or will I be better off with two RAID 1 arrays separating data from OS + WAL files? > now, if you can afford solid-state drives which don't have noticable seek > times, things are completely different ;-) Ha, sadly budget is very tight. :) Regards, Fernando.
> seek/read/calculate/seek/write since the drive moves on after the > read), when you read you must read _all_ drives in the set to check > the data integrity. I don't know of any RAID implementation that performs consistency checking on each read operation. 8-(
In response to "Fernando Hevia" <fhevia@ip-tel.com.ar>: > > > David Lang Wrote: > > > > with only four drives the space difference between raid 1+0 and raid 5 > > isn't that much, but when you do a write you must write to two drives (the > > drive holding the data you are changing, and the drive that holds the > > parity data for that stripe, possibly needing to read the old parity data > > first, resulting in stalling for seek/read/calculate/seek/write since > > the drive moves on after the read), when you read you must read _all_ > > drives in the set to check the data integrity. > > Thanks for the explanation David. It's good to know not only what but also > why. Still I wonder why reads do hit all drives. Shouldn't only 2 disks be > read: the one with the data and the parity disk? In order to recalculate the parity, it has to have data from all disks. Thus, if you have 4 disks, it has to read 2 (the unknown data blocks included in the parity calculation) then write 2 (the new data block and the new parity data) Caching can help some, but if your data ends up being any size at all, the cache misses become more frequent than the hits. Even when caching helps, you max speed is still only the speed of a single disk. > > for seek heavy workloads (which almost every database application is) the > > extra seeks involved can be murder on your performance. if your workload > > is large sequential reads/writes, and you can let the OS buffer things for > > you, the performance of raid 5 is much better. > > Well, actually most of my application involves large sequential > reads/writes. Will it? Will you be deleting or updating data? If so, you'll generate dead tuples, which vacuum will have to clean up, which means seeks, and means you new data isn't liable to be sequentially written. The chance that you actually have a workload that will result in consistently sequential writes at the disk level is very slim, in my experience. When vacuum is taking hours and hours, you'll understand the pain. > The memory available for buffering (4GB) isn't bad either, at > least for my scenario. On the other hand I have got such strong posts > against RAID 5 that I doubt to even consider it. If 4G is enough to buffer all your data, then why do you need the extra space of RAID 5? If you need the extra space of the RAID 5, then 4G isn't enough to buffer all your data, and that buffer will be of limited usefulness. In any event, even if you've got 300G of RAM to buffer data in, sooner or later you've got to write it to disk, and no matter how much RAM you have, your write speed will be limited by how fast your disks can commit. If you had a database multiple petabytes in size, you could worry about needing the extra space that RAID 5 gives you, but then you'd realize that the speed problems associated with RAID 5 will make a petabyte sized database completely unmanageable. There's just no scenario where RAID 5 is a win for database work. Period. Rationalize all you want. For those trying to defend RAID 5, I invite you to try it. When you're on the verge of suicide because you can't get any work done, don't say I didn't say so. > Well, here rises another doubt. Should I go for a single RAID 1+0 storing OS > + Data + WAL files or will I be better off with two RAID 1 separating data > from OS + Wal files? Generally speaking, if you want the absolute best performance, it's generally recommended to keep the WAL logs on one partition/controller and the remaining database files on a second one. 
However, with only 4 disks, you might get just as much out of a RAID 1+0. > > now, if you can afford solid-state drives which don't have noticable seek > > times, things are completely different ;-) > > Ha, sadly budget is very tight. :) Budget is always tight. That's why you don't want a RAID 5. Do a RAID 5 now thinking you'll save a few bucks, and you'll be spending twice that much later trying to fix your mistake. It's called tripping over a dime to pick up a nickel. -- Bill Moran Collaborative Fusion Inc. http://people.collaborativefusion.com/~wmoran/ wmoran@collaborativefusion.com Phone: 412-422-3463x4023
On Wed, 26 Dec 2007, Fernando Hevia wrote: >> David Lang Wrote: >> >> with only four drives the space difference between raid 1+0 and raid 5 >> isn't that much, but when you do a write you must write to two drives (the >> drive holding the data you are changing, and the drive that holds the >> parity data for that stripe, possibly needing to read the old parity data >> first, resulting in stalling for seek/read/calculate/seek/write since >> the drive moves on after the read), when you read you must read _all_ >> drives in the set to check the data integrity. > > Thanks for the explanation David. It's good to know not only what but also > why. Still I wonder why reads do hit all drives. Shouldn't only 2 disks be > read: the one with the data and the parity disk? No, because the parity is of the sort (A+B+C+P) mod X = 0. So if X=10 (which means in practice that only the last decimal digit of anything matters, very convenient for examples): A=1, B=2, C=3, A+B+C=6, P=4, A+B+C+P=10=0. If you read B and get 3 and P and get 4, you don't know if this is right or not unless you also read A and C (at which point you would get A+B+C+P=11=1=error). >> for seek heavy workloads (which almost every database application is) the >> extra seeks involved can be murder on your performance. if your workload >> is large sequential reads/writes, and you can let the OS buffer things for >> you, the performance of raid 5 is much better. > > Well, actually most of my application involves large sequential > reads/writes. The memory available for buffering (4GB) isn't bad either, at > least for my scenario. On the other hand I have got such strong posts > against RAID 5 that I doubt to even consider it. In theory a system could get the same performance with a large sequential read/write on RAID 5/6 as on a RAID 0 array of equivalent size (i.e. same number of data disks, ignoring the parity disks), because the OS could read the entire stripe in at once, do the calculation once, and use all the data (or, when writing, not write anything until it is ready to write the entire stripe, calculate the parity and write everything once). Unfortunately, in practice filesystems don't support this: they don't do enough readahead to want to keep the entire stripe (so after they read it all in they throw some of it away), they (mostly) don't know where a stripe starts (and so intermingle different types of data on one stripe and spread data across multiple stripes unnecessarily), and they tend to do writes in small, scattered chunks (rather than flushing an entire stripe's worth of data at once). Those who have been around long enough to remember the days of MFM/RLL (when you could still find the real layout of the drives) may remember optimizing things to work a track at a time instead of a sector at a time. This is the exact same logic, just needing to be applied to drive stripes instead of sectors and tracks on a single drive. The issue has been raised with the kernel developers, but there's a lot of work to be done (especially in figuring out how to get all the layers the info they need in a reasonable way). >> Linux software raid can do more then two disks in a mirror, so you may be >> able to get the added protection with raid 1 sets (again, probably not >> relavent to four drives), although there were bugs in this within the last >> six months or so, so you need to be sure your kernel is new enough to have >> the fix. >> > > Well, here rises another doubt. Should I go for a single RAID 1+0 storing OS > + Data + WAL files or will I be better off with two RAID 1 separating data > from OS + Wal files? If you can afford the space, you are almost certainly better off separating the WAL from the data (I think I've seen debates about which is better, OS+data/WAL or data/OS+WAL, but very little disagreement that either is better than combining them all). David Lang >> now, if you can afford solid-state drives which don't have noticable seek >> times, things are completely different ;-) > > Ha, sadly budget is very tight. :) > > Regards, > Fernando.
On Wed, 26 Dec 2007, Florian Weimer wrote: >> seek/read/calculate/seek/write since the drive moves on after the >> read), when you read you must read _all_ drives in the set to check >> the data integrity. > I don't know of any RAID implementation that performs consistency > checking on each read operation. 8-( I could see a RAID 1 array not doing consistency checking (after all, it has no way of knowing what's right if it finds an error), but since RAID 5/6 can repair the data I would expect them to do the checking each time. David Lang
>> seek/read/calculate/seek/write since the drive moves on after the read), when you read you must read _all_ drives in the set to check the data integrity.
> I don't know of any RAID implementation that performs consistency checking on each read operation. 8-(
Dave had too much egg nog... :-)
Yep - checking consistency on read would eliminate the performance benefits of RAID under any redundant configuration.
Cheers,
mark
-- Mark Mielke <mark@mielke.cc>
On Wed, 26 Dec 2007, Mark Mielke wrote: > Florian Weimer wrote: >>> seek/read/calculate/seek/write since the drive moves on after the >>> read), when you read you must read _all_ drives in the set to check >>> the data integrity. >>> >> I don't know of any RAID implementation that performs consistency >> checking on each read operation. 8-( >> > > Dave had too much egg nog... :-) > > Yep - checking consistency on read would eliminate the performance benefits > of RAID under any redundant configuration. Except for RAID 0, RAID is primarily a reliability benefit; any performance benefit is incidental, not the primary purpose. That said, I have heard of RAID 1 setups where it only reads off of one of the drives, but I have not heard of higher RAID levels doing so. David Lang
david@lang.hm wrote: >> Thanks for the explanation David. It's good to know not only what but >> also >> why. Still I wonder why reads do hit all drives. Shouldn't only 2 >> disks be >> read: the one with the data and the parity disk? > no, becouse the parity is of the sort (A+B+C+P) mod X = 0 > so if X=10 (which means in practice that only the last decimal digit > of anything matters, very convienient for examples) > A=1, B=2, C=3, A+B+C=6, P=4, A+B+C+P=10=0 > if you read B and get 3 and P and get 4 you don't know if this is > right or not unless you also read A and C (at which point you would > get A+B+C+P=11=1=error) I don't think this is correct. RAID 5 is parity which is XOR. The property of XOR is such that it doesn't matter what the other drives are. You can write any block given either: 1) The block you are overwriting and the parity, or 2) all other blocks except for the block we are writing and the parity. Now, it might be possible that option 2) is taken more than option 1) for some complicated reasons, but it is NOT to check consistency. The array is assumed consistent until proven otherwise. > in theory a system could get the same performance with a large > sequential read/write on raid5/6 as on a raid0 array of equivilent > size (i.e. same number of data disks, ignoring the parity disks) > becouse the OS could read the entire stripe in at once, do the > calculation once, and use all the data (or when writing, don't write > anything until you are ready to write the entire stripe, calculate the > parity and write everything once). For the same number of drives, this cannot be possible. With 10 disks, on raid5, 9 disks hold data, and 1 holds parity. The theoretical maximum performance is only 9/10 of the 10/10 performance possible with RAID 0. > Unfortunantly in practice filesystems don't support this, they don't > do enough readahead to want to keep the entire stripe (so after they > read it all in they throw some of it away), they (mostly) don't know > where a stripe starts (and so intermingle different types of data on > one stripe and spread data across multiple stripes unessasarily), and > they tend to do writes in small, scattered chunks (rather then > flushing an entire stripes worth of data at once) In my experience, this theoretical maximum is not attainable without significant write cache, and an intelligent controller, neither of which Linux software RAID seems to have by default. My situation was a bit worse in that I used applications that fsync() or journalled metadata that is ordered, which forces the Linux software RAID to flush far more than it should - but the same system works very well with RAID 1+0. >>> Linux software raid can do more then two disks in a mirror, so you >>> may be >>> able to get the added protection with raid 1 sets (again, probably not >>> relavent to four drives), although there were bugs in this within >>> the last >>> six months or so, so you need to be sure your kernel is new enough >>> to have >>> the fix. >>> >> Well, here rises another doubt. Should I go for a single RAID 1+0 >> storing OS >> + Data + WAL files or will I be better off with two RAID 1 separating >> data >> from OS + Wal files? > if you can afford the space, you are almost certinly better seperating > the WAL from the data (I think I've seen debates about which is better > OS+data/Wal or date/OS+Wal, but very little disagreement that either > is better than combining them all) I don't think there is a good answer for this question. 
If you can afford more drives, you could also afford to make your RAID 1+0 bigger. Splitting OS/DATA/WAL is only "absolute best" if you can arrange your 3 arrays such that their sizes are proportional to their access patterns. For example, in an overly simplified case, if the OS sees 1/4 the traffic of DATA, and WAL 1/2 the traffic of DATA, then perhaps "best" is to have a two-disk RAID 1 for OS, a four-disk RAID 1+0 for WAL, and an eight-disk RAID 1+0 for DATA. This gives a total of 14 disks. :-) In practice, if you have four drives, and you try to split them into two plus two, you're going to find that two of the drives are going to be more idle than the other two. I have a fun setup - I use RAID 1 across all four drives for the OS, RAID 1+0 for the database, WAL, and other parts, and RAID 0 for a "build" partition. :-) Cheers, mark -- Mark Mielke <mark@mielke.cc>
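For readers following along, a small sketch of the XOR property discussed above, using toy byte strings; this is purely illustrative and says nothing about how any particular RAID driver actually lays out its stripes:

```python
# Toy illustration of RAID 5 parity: P = A ^ B ^ C (bytewise XOR).
# Any single missing block can be rebuilt by XOR-ing all the remaining ones.
import os

def xor(*blocks: bytes) -> bytes:
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

A, B, C = (os.urandom(8) for _ in range(3))  # three data blocks in one stripe
P = xor(A, B, C)                             # the parity block

# "Lose" disk B and rebuild its block from the survivors plus parity:
rebuilt_B = xor(A, C, P)
assert rebuilt_B == B
print("block B rebuilt from A, C and parity:", rebuilt_B == B)
```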
david@lang.hm wrote: > On Wed, 26 Dec 2007, Mark Mielke wrote: > >> Florian Weimer wrote: >>>> seek/read/calculate/seek/write since the drive moves on after the >>>> read), when you read you must read _all_ drives in the set to check >>>> the data integrity. >>> I don't know of any RAID implementation that performs consistency >>> checking on each read operation. 8-( >> Dave had too much egg nog... :-) >> Yep - checking consistency on read would eliminate the performance >> benefits of RAID under any redundant configuration. > except for raid0, raid is primarily a reliability benifit, any > performance benifit is incidental, not the primary purpose. > that said, I have heard of raid1 setups where it only reads off of one > of the drives, but I have not heard of higher raid levels doing so. What do you mean "heard of"? Which raid system do you know of that reads all drives for RAID 1? Linux dmraid reads off ONLY the first. Linux mdadm reads off the "best" one. Neither read from both. Why should it need to read from both? What will it do if the consistency check fails? It's not like it can tell which disk is the right one. It only knows that the whole array is inconsistent. Until it gets an actual hardware failure (read error, write error), it doesn't know which disk is wrong. Cheers, mark -- Mark Mielke <mark@mielke.cc>
david@lang.hm wrote: > I could see a raid 1 array not doing consistancy checking (after all, > it has no way of knowing what's right if it finds an error), but since > raid 5/6 can repair the data I would expect them to do the checking > each time. Your messages are spread across the thread. :-) RAID 5 cannot repair the data. I don't know much about RAID 6, but I expect it cannot necessarily repair the data either. It still doesn't know which drive is wrong. In any case, there is no implementation I am aware of that performs mandatory consistency checks on read. This would be silliness. Cheers, mark -- Mark Mielke <mark@mielke.cc>
Bill Moran wrote: > In order to recalculate the parity, it has to have data from all disks. Thus, > if you have 4 disks, it has to read 2 (the unknown data blocks included in > the parity calculation) then write 2 (the new data block and the new > parity data) Caching can help some, but if your data ends up being any > size at all, the cache misses become more frequent than the hits. Even > when caching helps, you max speed is still only the speed of a single > disk. > If you have 4 disks, it can do either: 1) Read the old block, read the parity block, XOR the old block with the parity block and the new block resulting in the new parity block, write both the new parity block and the new block. 2) Read the two unknown blocks, XOR with the new block resulting in the new parity block, write both the new parity block and the new block. You are emphasizing 2 - but the scenario is also overly simplistic. Imagine you had 10 drives on RAID 5. Would it make more sense to read 8 blocks and then write two (option 2, and the one you describe), or read two blocks and then write two (option 1). Obviously, if option 1 or option 2 can be satisfied from cache, it is better to not read at all. I note that you also disagree with Dave, in that you are not claiming it performs consistency checks on read. No system does this as performance would go to the crapper. Cheers, mark -- Mark Mielke <mark@mielke.cc>
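A quick sketch of the bookkeeping behind those two options, counting the reads each path needs as the stripe gets wider (a simplified block-level model that ignores caching entirely):

```python
# For a logical write to one block in an N-disk RAID 5 stripe (N-1 data + 1 parity):
#   option 1 ("read-modify-write"): read the old block and the old parity -> 2 reads
#   option 2 ("reconstruct-write"): read all *other* data blocks          -> N-2 reads
# Both paths then write the new data block and the new parity block (2 writes).

def reads_needed(n_disks: int) -> dict:
    return {"read-modify-write": 2, "reconstruct-write": n_disks - 2}

for n in (4, 10):
    print(f"{n} disks: {reads_needed(n)}")
# 4 disks:  {'read-modify-write': 2, 'reconstruct-write': 2}  (a wash)
# 10 disks: {'read-modify-write': 2, 'reconstruct-write': 8}  (option 1 wins)
```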
In response to Mark Mielke <mark@mark.mielke.cc>: > david@lang.hm wrote: > > On Wed, 26 Dec 2007, Mark Mielke wrote: > > > >> Florian Weimer wrote: > >>>> seek/read/calculate/seek/write since the drive moves on after the > >>>> read), when you read you must read _all_ drives in the set to check > >>>> the data integrity. > >>> I don't know of any RAID implementation that performs consistency > >>> checking on each read operation. 8-( > >> Dave had too much egg nog... :-) > >> Yep - checking consistency on read would eliminate the performance > >> benefits of RAID under any redundant configuration. > > except for raid0, raid is primarily a reliability benifit, any > > performance benifit is incidental, not the primary purpose. > > that said, I have heard of raid1 setups where it only reads off of one > > of the drives, but I have not heard of higher raid levels doing so. > What do you mean "heard of"? Which raid system do you know of that reads > all drives for RAID 1? I'm fairly sure that FreeBSD's GEOM does. Of course, it couldn't be doing consistency checking at that point. -- Bill Moran Collaborative Fusion Inc. http://people.collaborativefusion.com/~wmoran/ wmoran@collaborativefusion.com Phone: 412-422-3463x4023
In response to Mark Mielke <mark@mark.mielke.cc>: > Bill Moran wrote: > > In order to recalculate the parity, it has to have data from all disks. Thus, > > if you have 4 disks, it has to read 2 (the unknown data blocks included in > > the parity calculation) then write 2 (the new data block and the new > > parity data) Caching can help some, but if your data ends up being any > > size at all, the cache misses become more frequent than the hits. Even > > when caching helps, you max speed is still only the speed of a single > > disk. > > > If you have 4 disks, it can do either: > > 1) Read the old block, read the parity block, XOR the old block with > the parity block and the new block resulting in the new parity block, > write both the new parity block and the new block. > 2) Read the two unknown blocks, XOR with the new block resulting in > the new parity block, write both the new parity block and the new block. > > You are emphasizing 2 - but the scenario is also overly simplistic. > Imagine you had 10 drives on RAID 5. Would it make more sense to read 8 > blocks and then write two (option 2, and the one you describe), or read > two blocks and then write two (option 1). Obviously, if option 1 or > option 2 can be satisfied from cache, it is better to not read at all. Good point that I wasn't aware of. > I note that you also disagree with Dave, in that you are not claiming it > performs consistency checks on read. No system does this as performance > would go to the crapper. I call straw man :) I don't disagree. I simply don't know. There's no reason why it _couldn't_ do consistency checking as it ran ... of course, performance would suck. Generally what you expect out of RAID 5|6 is that it can rebuild a drive in the event of a failure, so I doubt if anyone does consistency checking by default, and I wouldn't be surprised if a lot of systems don't have the option to do it at all. -- Bill Moran Collaborative Fusion Inc. http://people.collaborativefusion.com/~wmoran/ wmoran@collaborativefusion.com Phone: 412-422-3463x4023
On Wed, 26 Dec 2007, Mark Mielke wrote: > david@lang.hm wrote: >>> Thanks for the explanation David. It's good to know not only what but also >>> why. Still I wonder why reads do hit all drives. Shouldn't only 2 disks be >>> read: the one with the data and the parity disk? >> no, becouse the parity is of the sort (A+B+C+P) mod X = 0 >> so if X=10 (which means in practice that only the last decimal digit of >> anything matters, very convienient for examples) >> A=1, B=2, C=3, A+B+C=6, P=4, A+B+C+P=10=0 >> if you read B and get 3 and P and get 4 you don't know if this is right or >> not unless you also read A and C (at which point you would get >> A+B+C+P=11=1=error) > I don't think this is correct. RAID 5 is parity which is XOR. The property of > XOR is such that it doesn't matter what the other drives are. You can write > any block given either: 1) The block you are overwriting and the parity, or > 2) all other blocks except for the block we are writing and the parity. Now, > it might be possible that option 2) is taken more than option 1) for some > complicated reasons, but it is NOT to check consistency. The array is assumed > consistent until proven otherwise. I was being sloppy in explaining the reason. You are correct that for writes you don't need to read all the data; you just need the current parity block, the old data you are going to replace, and the new data to be able to calculate the new parity block (and note that even with my checksum example this would be the case). However, I was addressing the point that for reads you can't do any checking until you have read in all the blocks. If you never check the consistency, how will it ever be proven otherwise? >> in theory a system could get the same performance with a large sequential >> read/write on raid5/6 as on a raid0 array of equivilent size (i.e. same >> number of data disks, ignoring the parity disks) becouse the OS could read >> the entire stripe in at once, do the calculation once, and use all the data >> (or when writing, don't write anything until you are ready to write the >> entire stripe, calculate the parity and write everything once). > For the same number of drives, this cannot be possible. With 10 disks, on > raid5, 9 disks hold data, and 1 holds parity. The theoretical maximum > performance is only 9/10 of the 10/10 performance possible with RAID 0. I was saying that a 10-drive RAID 0 could give the same performance as a 10+1-drive RAID 5 or a 10+2-drive RAID 6 array. This is why I said 'same number of data disks, ignoring the parity disks'. In practice you would probably not do quite this well anyway (you have the parity calculation to make and the extra drive or two's worth of data passing over your busses), but it could be a lot closer than any implementation currently is. >> Unfortunantly in practice filesystems don't support this, they don't do >> enough readahead to want to keep the entire stripe (so after they read it >> all in they throw some of it away), they (mostly) don't know where a stripe >> starts (and so intermingle different types of data on one stripe and spread >> data across multiple stripes unessasarily), and they tend to do writes in >> small, scattered chunks (rather then flushing an entire stripes worth of >> data at once) > In my experience, this theoretical maximum is not attainable without > significant write cache, and an intelligent controller, neither of which > Linux software RAID seems to have by default. My situation was a bit worse in > that I used applications that fsync() or journalled metadata that is ordered, > which forces the Linux software RAID to flush far more than it should - but > the same system works very well with RAID 1+0. My statements above apply to any type of RAID implementation, hardware or software. The thing that saves the hardware implementation is that the data is written to a battery-backed cache and the controller lies to the system, telling it that the write is complete, and then it does the write later. On a journaling filesystem you could get very similar results if you put the journal on a solid-state drive. But for your application, the fact that you are doing lots of fsyncs is what's killing you, because each fsync forces a lot of data to be written out, swamping the caches involved, and requiring that you wait for seeks. Nothing other than a battery-backed disk cache of some sort (either on the controller, or a solid-state drive holding the journal on a journaled filesystem) would help. David Lang
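A minimal way to see why those fsyncs hurt is to time the same writes with and without fsync(). The sketch below is a toy probe, not a benchmark; the numbers will vary wildly with hardware, filesystem and caching:

```python
# Toy probe: compare buffered writes against writes that fsync() after each block.
import os
import tempfile
import time

def write_blocks(path: str, blocks: int, sync: bool) -> float:
    start = time.time()
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(b"x" * 8192)         # one 8 KB block, roughly a PostgreSQL page
            if sync:
                f.flush()
                os.fsync(f.fileno())     # force the block (and metadata) out to disk
    return time.time() - start

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "probe.dat")
    print("buffered:", round(write_blocks(path, 500, sync=False), 3), "s")
    print("fsync'd: ", round(write_blocks(path, 500, sync=True), 3), "s")
```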
On Wed, 26 Dec 2007, Mark Mielke wrote: > david@lang.hm wrote: >> On Wed, 26 Dec 2007, Mark Mielke wrote: >> >>> Florian Weimer wrote: >>>>> seek/read/calculate/seek/write since the drive moves on after the >>>>> read), when you read you must read _all_ drives in the set to check >>>>> the data integrity. >>>> I don't know of any RAID implementation that performs consistency >>>> checking on each read operation. 8-( >>> Dave had too much egg nog... :-) >>> Yep - checking consistency on read would eliminate the performance >>> benefits of RAID under any redundant configuration. >> except for raid0, raid is primarily a reliability benifit, any performance >> benifit is incidental, not the primary purpose. >> that said, I have heard of raid1 setups where it only reads off of one of >> the drives, but I have not heard of higher raid levels doing so. > What do you mean "heard of"? Which raid system do you know of that reads all > drives for RAID 1? > > Linux dmraid reads off ONLY the first. Linux mdadm reads off the "best" one. > Neither read from both. Why should it need to read from both? What will it do > if the consistency check fails? It's not like it can tell which disk is the > right one. It only knows that the whole array is inconsistent. Until it gets > an actual hardware failure (read error, write error), it doesn't know which > disk is wrong. Yes, the two Linux software implementations only read from one disk, but I have seen hardware implementations that read from both drives and, if they disagree, return a read error rather than possibly invalid data (it's up to the admin to figure out which drive is bad at that point). No, I don't remember which card this was; I've been playing around with things in this space for quite a while. David Lang
>> What do you mean "heard of"? Which raid system do you know of that reads all drives for RAID 1?
> I'm fairly sure that FreeBSD's GEOM does. Of course, it couldn't be doing consistency checking at that point.
According to this:
http://www.freebsd.org/cgi/man.cgi?query=gmirror&apropos=0&sektion=8&manpath=FreeBSD+6-current&format=html
There is a -b (balance) option that seems pretty clear that it does not read from all drives if it does not have to:
Create a mirror. The order of components is important, because a component's priority is based on its position (starting from 0). The component with the biggest priority is used by the prefer balance algorithm and is also used as a master component when resynchronization is needed, e.g. after a power failure when the device was open for writing.
Additional options include:
-b balance   Specifies balance algorithm to use, one of:
    load          Read from the component with the lowest load.
    prefer        Read from the component with the biggest priority.
    round-robin   Use round-robin algorithm when choosing component to read.
    split         Split read requests, which are bigger than or equal to slice size, on N pieces, where N is the number of active components. This is the default balance algorithm.
Cheers,
mark
-- Mark Mielke <mark@mielke.cc>
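Purely as an illustration of the 'split' description quoted above (a toy model, not GEOM's actual code), a large read might be divided across the mirror components roughly like this:

```python
# Toy model of a "split" balance policy: a read at least one slice long is divided
# into N pieces, one per active component; smaller reads go to a single component.
def split_read(offset: int, length: int, slice_size: int, components: int):
    if length < slice_size:
        return [(0, offset, length)]                # one component serves the whole request
    piece = length // components
    plan = []
    for c in range(components):
        size = piece if c < components - 1 else length - piece * (components - 1)
        plan.append((c, offset + c * piece, size))  # (component, offset, bytes to read)
    return plan

print(split_read(offset=0, length=4096, slice_size=4096, components=2))
# [(0, 0, 2048), (1, 2048, 2048)] -> each mirror half serves half of the request
```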
On Wed, 26 Dec 2007, Mark Mielke wrote: > david@lang.hm wrote: >> I could see a raid 1 array not doing consistancy checking (after all, it >> has no way of knowing what's right if it finds an error), but since raid >> 5/6 can repair the data I would expect them to do the checking each time. > Your messages are spread across the thread. :-) > > RAID 5 cannot repair the data. I don't know much about RAID 6, but I expect > it cannot necessarily repair the data either. It still doesn't know which > drive is wrong. In any case, there is no implementation I am aware of that > performs mandatory consistency checks on read. This would be silliness. Sorry, RAID 5 can repair data if it knows which chunk is bad (the same way it can rebuild a drive). RAID 6 does something slightly different for its parity; I know it can recover from two drives going bad, but I haven't looked into the question of whether it can detect bad data. David Lang
david@lang.hm wrote: > however I was addressing the point that for reads you can't do any > checking until you have read in all the blocks. > if you never check the consistency, how will it ever be proven otherwise. A scheme often used is to mark the disk/slice as "clean" during clean system shutdown (or RAID device shutdown). When it comes back up, it is assumed clean. Why wouldn't it be clean? However, if it comes up "unclean", this does indeed require an EXPENSIVE resynchronization process. Note, however, that resynchronization usually reads or writes all disks, whether RAID 1, RAID 5, RAID 6, or RAID 1+0. My RAID 1+0 does a full resynchronization if shut down uncleanly. There is nothing specific about RAID 5 here. Now, technically - none of these RAID levels requires a full resynchronization, even though it is almost always recommended and performed by default. There is an option in Linux software RAID (mdadm) to "skip" the resynchronization process. The danger here is that you could read one of the blocks this minute and get one block, and read the same block a different minute, and get a different block. This would occur in RAID 1 if it did round-robin or disk with the nearest head to the desired block, or whatever, and it made a different decision before and after the minute. What is the worst that can happen though? Any system that does careful journalling / synchronization should usually be fine. The "risk" is similar to write caching without battery backing, in that if the drive tells the system "write complete", and the system goes on to perform other work, but the write is not complete, then corruption becomes a possibility. Anyways - point is again that RAID 5 is not special here. > but for your application, the fact that you are doing lots of fsyncs > is what's killing you, becouse the fsync forces a lot of data to be > written out, swamping the caches involved, and requiring that you wait > for seeks. nothing other then a battery backed disk cache of some sort > (either on the controller or a solid-state drive on a journaled > filesystem would work) Yep. :-) Cheers, mark -- Mark Mielke <mark@mielke.cc>
On Wed, 26 Dec 2007, david@lang.hm wrote: > yes, the two linux software implementations only read from one disk, but I > have seen hardware implementations where it reads from both drives, and if > they disagree it returns a read error rather then possibly invalid data (it's > up to the admin to figure out which drive is bad at that point). Right, many of the old implementations did that; even the Wikipedia article on this subject mentions it in the "RAID 1 performance" section: http://en.wikipedia.org/wiki/Standard_RAID_levels The thing that changed is that on modern drives, the internal error detection and correction is good enough that if you lose a sector, the drive will normally figure that out at the firmware level and return a read error rather than bad data. That lowers the odds of one drive becoming corrupted and silently returning a bad sector enough that the overhead of reading from both drives isn't considered as important. I'm not aware of a current card that does that, but I wouldn't be surprised to discover one exists. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Fernando Hevia wrote: I'll start a little ways back first - > Well, here rises another doubt. Should I go for a single RAID 1+0 storing OS > + Data + WAL files or will I be better off with two RAID 1 separating data > from OS + Wal files? earlier you wrote - > Database will be about 30 GB in size initially and growing 10 GB per year. > Data is inserted overnight in two big tables and during the day mostly > read-only queries are run. Parallelism is rare. Now, if the data is added overnight while no-one is using the server, then reading is where you want performance, provided any degradation in writing doesn't slow down the overnight data loading enough that it can't finish while no-one else is using the server. So in theory the only time you will see an advantage from having WAL on a separate disk from the data is at night while the data load runs (I am assuming this is an automated step). But *some* gains can be made from having the OS separate from the data. (This is a theoretical discussion challenging the info/rumors that abound about RAID setups, not an attempt to start a fight or flame war.) So, for the guys who know the intricacies of RAID implementation - I don't have any real-world performance measures here - consider a setup that is only reading from disk (Santa sprinkles the data down the air vent while we are all snug in our beds). It has been mentioned that RAID drivers/controllers can balance the workload across the different disks - as Mark showed from the FreeBSD 6 man pages, the balance option can be set to load|prefer|round-robin|split. So in theory a modern RAID 1 setup can be configured to get read speeds similar to RAID 0, but it would still drop to single-disk speeds (or similar) when writing, whereas RAID 0 also gets the faster write performance. So in a perfect setup (probably 1+0) 4x 300MB/s SATA drives could deliver 1200MB/s of data to RAM, which also assumes that all 4 channels have their own data path to RAM and aren't sharing. (Anyone know how segregated on-board controllers such as these are? Do some PCI controllers offer better throughput?) We all know that doesn't happen in the real world ;-) Let's say we are restricted to 80% - 1000MB/s - and some of that (10%) gets used by the system - so we end up with 900MB/s delivered off disk to postgres - that would still be more than the perfect rate at which 2x 300MB/s drives can deliver. So in this situation - if configured correctly with a good controller (or driver for software RAID etc.) - a single 4-disk RAID 1+0 could outperform two 2-disk RAID 1 setups with data/OS+WAL split between the two. Are real-world speeds so different that this theory is pure fantasy, or has hardware reached a point, performance-wise, where this is close to fact? -- Shane Ambler pgSQL (at) Sheeky (dot) Biz Get Sheeky @ http://Sheeky.Biz
On Thu, 27 Dec 2007, Shane Ambler wrote: > So in theory a modern RAID 1 setup can be configured to get similar read > speeds as RAID 0 but would still drop to single disk speeds (or similar) when > writing, but RAID 0 can get the faster write performance. The trick is, you need a perfect controller that scatters individual reads evenly across the two disks as sequential reads move along the disk to pull this off, bouncing between a RAID 1 pair to use all the bandwidth available. There are caches inside the disk, read-ahead strategies as well, and that all has to line up just right for a single client to get all the bandwidth. Real-world disks and controllers don't quite behave well enough for that to predictably deliver what you might expect from theory. With RAID 0, getting the full read speed of 2Xsingle drive is much more likely to actually happen than in RAID 1. > So in a perfect setup (probably 1+0) 4x 300MB/s SATA drives could > deliver 1200MB/s of data to RAM, which is also assuming that all 4 > channels have their own data path to RAM and aren't sharing. (anyone > know how segregated the on board controllers such as these are?) (do > some pci controllers offer better throughput?) OK, first off, beyond the occasional trivial burst you'll be hard pressed to ever sustain over 60MB/s out of any single SATA drive. So the theoretical max 4-channel speed is closer to 240MB/s. A regular PCI bus tops out at a theoretical 133MB/s, and you sure can saturate one with 4 disks and a good controller. This is why server configurations have controller cards that use PCI-X (1024MB/s) or lately PCI-e aka PCI/Express (250MB/s for each channel with up to 16 being common). If your SATA cards are on a motherboard, that's probably using some integrated controller via the Southbridge AKA the ICH. That's probably got 250MB/s or more and in current products can easily outrun most sets of disks you'll ever connect. Even on motherboards that support 8 SATA channels it will be difficult for anything else on the system to go higher than 250MB/s even if the drives could potentially do more, and once you're dealing with real-world workloads. If you have multiple SATA controllers each with their own set of disk, then you're back to having to worry about the bus limits. So, yes, there are bus throughput considerations here, but unless you're building a giant array or using some older bus technology you're unlikely to hit them with spinning SATA disks. > We all know that doesn't happen in the real world ;-) Let's say we are > restricted to 80% - 1000MB/s Yeah, as mentioned above it's actually closer to 20%. While your numbers are off by a bunch, the reality for database use means these computations don't matter much anyway. The seek related behavior drives a lot of this more than sequential throughput, and decisions like whether to split out the OS or WAL or whatever need to factor all that, rather than just the theoretical I/O. For example, one reason it's popular to split the WAL onto another disk is that under normal operation the disk never does a seek. So if there's a dedicated disk for that, the disk just writes but never moves much. Where if the WAL is shared, the disk has to jump between writing that data and whatever else is going on, and peak possible WAL throughput is waaaay slower because of those seeks. (Note that unless you have a bunch of disks, your WAL is unlikely to be a limiter anyway so you still may not want to make it separate). 
(This topic so badly needs a PostgreSQL specific FAQ) -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
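To make those numbers concrete for the original four-disk box, a quick arithmetic sketch; the ~60 MB/s sustained figure and the bus ceilings are the ballpark values quoted above, not measurements:

```python
# Ballpark sequential-read ceilings for a 4-disk SATA array on different buses.
DISKS = 4
SUSTAINED_MB_S = 60                                   # realistic per-drive sustained rate
BUS_LIMIT_MB_S = {"plain PCI": 133, "Southbridge/ICH": 250, "PCI-X": 1024}

aggregate = DISKS * SUSTAINED_MB_S                    # 240 MB/s if nothing else interferes
for bus, limit in BUS_LIMIT_MB_S.items():
    print(f"{bus}: array tops out around {min(aggregate, limit)} MB/s (bus cap {limit} MB/s)")
# On plain PCI the bus, not the disks, is the bottleneck; on the ICH or PCI-X the
# four drives' ~240 MB/s aggregate fits comfortably.
```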
Shane Ambler wrote: > So in theory a modern RAID 1 setup can be configured to get similar > read speeds as RAID 0 but would still drop to single disk speeds (or > similar) when writing, but RAID 0 can get the faster write performance. Unfortunately, it's a bit more complicated than that. RAID 1 has a sequential read problem, as read-ahead is wasted, and you may as well read from one disk and ignore the others. RAID 1 does, however, allow for much greater concurrency. 4 processes on a 4-disk RAID 1 system can, theoretically, each do whatever they want, without impacting each other. Database loads involving a single active read user will see greater performance with RAID 0. Database loads involving multiple concurrent active read users will see greater performance with RAID 1. All of these assume writes are not being performed to any significant degree. Even single writes cause all disks in a RAID 1 system to synchronize, temporarily eliminating the read benefit. RAID 0 allows some degree of concurrent reads and writes occurring at the same time (assuming even distribution of the data across the devices). Of course, RAID 0 systems have an expected life that decreases as the number of disks in the system increases. So, this is where we get to RAID 1+0. Redundancy, good read performance, good write performance, relatively simple implementation. For the mere cost of doubling the number of disks, you can get around the problems of RAID 1 and the problems of RAID 0. :-) > So in a perfect setup (probably 1+0) 4x 300MB/s SATA drives could > deliver 1200MB/s of data to RAM, which is also assuming that all 4 > channels have their own data path to RAM and aren't sharing. > (anyone know how segregated the on board controllers such as these are?) > (do some pci controllers offer better throughput?) > We all know that doesn't happen in the real world ;-) Let's say we are > restricted to 80% - 1000MB/s - and some of that (10%) gets used by the > system - so we end up with 900MB/s delivered off disk to postgres - > that would still be more than the perfect rate at which 2x 300MB/s > drives can deliver. I expect you would have to have good hardware and a well-tuned system to see 80%+ of theoretical for common workloads. But then, this isn't unique to RAID. Even in a single-disk system, one has trouble achieving 80%+ of theoretical. :-) I achieve something closer to +20% - +60% over the theoretical performance of a single disk with my four-disk RAID 1+0 partitions. Lots of compromises in my system though that I won't get into. For me, I value the redundancy, allowing for a single disk to fail and giving me time to easily recover, but for the cost of two more disks, I am able to counter the performance cost of redundancy, and actually see a positive performance effect instead. > So in this situation - if configured correctly with a good controller > (driver for software RAID etc) a single 4 disk RAID 1+0 could > outperform two 2 disk RAID 1 setups with data/OS+WAL split between the > two. > Is the real world speeds so different that this theory is real fantasy > or has hardware reached a point performance wise where this is close > to fact?? I think it depends on the balance. If every second operation requires a WAL write, having separate arrays might make sense. However, if the balance is less than even, one would end up with one of the 2-disk RAID 1 setups being more idle than the other. It's not an exact science when it comes to the various compromises being made. :-) If you can only put 4 disks into the system (whether because of cost or because of the system size), I would suggest RAID 1+0 on all four as a sensible compromise. If you can put more in, start to consider breaking it up. :-) Cheers, mark -- Mark Mielke <mark@mielke.cc>
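A crude way to model the single-reader versus many-readers trade-off described above, again assuming ~60 MB/s per spindle, perfect balancing and no seek costs, so treat the numbers as illustrative only:

```python
# Crude model: 4 spindles at ~60 MB/s each.
# RAID 0 stripes one stream across all spindles; a 4-way mirror (the RAID 1 case
# described above) can instead hand each concurrent reader its own spindle.
SPINDLES, MB_S = 4, 60

def per_client_mb_s(level: str, clients: int) -> float:
    if level == "RAID 0":                 # every client shares the striped set
        return SPINDLES * MB_S / clients
    if level == "RAID 1":                 # each client pinned to one mirror copy
        return MB_S * min(1.0, SPINDLES / clients)
    raise ValueError(level)

for clients in (1, 4):
    rates = {lvl: per_client_mb_s(lvl, clients) for lvl in ("RAID 0", "RAID 1")}
    print(clients, "client(s):", rates)
# 1 client : RAID 0 ~240 MB/s, RAID 1 ~60 MB/s
# 4 clients: ~60 MB/s each either way, but the mirror serves them without head contention
```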
Mark Mielke wrote: > Shane Ambler wrote: >> So in a perfect setup (probably 1+0) 4x 300MB/s SATA drives could >> deliver 1200MB/s of data to RAM, which is also assuming that all 4 >> channels have their own data path to RAM and aren't sharing. >> (anyone know how segregated the on board controllers such as these >> are?) >> (do some pci controllers offer better throughput?) >> We all know that doesn't happen in the real world ;-) Let's say we >> are restricted to 80% - 1000MB/s - and some of that (10%) gets used >> by the system - so we end up with 900MB/s delivered off disk to >> postgres - that would still be more than the perfect rate at which >> 2x 300MB/s drives can deliver. > > I achieve something closer to +20% - +60% over the theoretical > performance of a single disk with my four disk RAID 1+0 partitions. If a good 4-disk SATA RAID 1+0 can achieve 60% more throughput than a single SATA disk, what sort of percentage can be achieved from a good SCSI controller with 4 disks in RAID 1+0? Are we still hitting the bus limits at this point, or can a SCSI RAID still outperform in raw data throughput? I would think that SCSI still provides the better reliability it always has, but performance-wise is it still in front of SATA? -- Shane Ambler pgSQL (at) Sheeky (dot) Biz Get Sheeky @ http://Sheeky.Biz
Greg Smith wrote: > On Thu, 27 Dec 2007, Shane Ambler wrote: > >> So in theory a modern RAID 1 setup can be configured to get similar >> read speeds as RAID 0 but would still drop to single disk speeds (or >> similar) when writing, but RAID 0 can get the faster write performance. > > The trick is, you need a perfect controller that scatters individual > reads evenly across the two disks as sequential reads move along the > disk to pull this off, bouncing between a RAID 1 pair to use all the > bandwidth available. There are caches inside the disk, read-ahead > strategies as well, and that all has to line up just right for a single > client to get all the bandwidth. Real-world disks and controllers don't > quite behave well enough for that to predictably deliver what you might > expect from theory. With RAID 0, getting the full read speed of > 2Xsingle drive is much more likely to actually happen than in RAID 1. Kind of makes the point for using 1+0. >> So in a perfect setup (probably 1+0) 4x 300MB/s SATA drives could >> deliver 1200MB/s of data to RAM, which is also assuming that all 4 >> channels have their own data path to RAM and aren't sharing. > OK, first off, beyond the occasional trivial burst you'll be hard > pressed to ever sustain over 60MB/s out of any single SATA drive. So > the theoretical max 4-channel speed is closer to 240MB/s. > > A regular PCI bus tops out at a theoretical 133MB/s, and you sure can > saturate one with 4 disks and a good controller. This is why server > configurations have controller cards that use PCI-X (1024MB/s) or lately > PCI-e aka PCI/Express (250MB/s for each channel with up to 16 being > common). If your SATA cards are on a motherboard, that's probably using So I guess, as far as performance goes, your motherboard will determine how far you can take it (talking from a DB-only server point of view). A PCI system will have little benefit from more than 2 disks but would need 4 to get both reliability and performance. PCI-X can benefit from up to 17 disks; PCI-e (with 16 channels) can benefit from 66 disks. The trick there will be dividing your DB over a large number of disk sets to balance the load among them (I don't see 66 disks being set up in one array), so this would be of limited use to anyone but the most dedicated DBAs. For most servers these days, disks are added to reach a performance level, not a storage requirement. > While your numbers are off by a bunch, the reality for database use > means these computations don't matter much anyway. The seek related > behavior drives a lot of this more than sequential throughput, and > decisions like whether to split out the OS or WAL or whatever need to > factor all that, rather than just the theoretical I/O. > So this is where solid-state disks come in - the (basically) nonexistent seek times mean they can saturate your bus limits. -- Shane Ambler pgSQL (at) Sheeky (dot) Biz Get Sheeky @ http://Sheeky.Biz
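The disk counts above fall roughly out of dividing each bus's bandwidth by the ~60 MB/s sustained per-drive figure Greg quoted; a one-off sketch of that arithmetic (same caveats about real-world overhead apply):

```python
# Roughly how many ~60 MB/s spindles it takes to saturate each bus.
SUSTAINED_MB_S = 60
BUS_MB_S = {"plain PCI": 133, "PCI-X": 1024, "PCI-e x16 (16 x 250)": 16 * 250}

for bus, bandwidth in BUS_MB_S.items():
    print(f"{bus}: ~{bandwidth // SUSTAINED_MB_S} drives to saturate ({bandwidth} MB/s)")
# plain PCI: ~2 drives, PCI-X: ~17 drives, PCI-e x16: ~66 drives
```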
In response to Mark Mielke <mark@mark.mielke.cc>: > Bill Moran wrote: > > > >> What do you mean "heard of"? Which raid system do you know of that reads > >> all drives for RAID 1? > >> > > > > I'm fairly sure that FreeBSD's GEOM does. Of course, it couldn't be doing > > consistency checking at that point. > > > According to this: > > http://www.freebsd.org/cgi/man.cgi?query=gmirror&apropos=0&sektion=8&manpath=FreeBSD+6-current&format=html > > There is a -b (balance) option that seems pretty clear that it does not > read from all drives if it does not have to: From where did you draw that conclusion? Note that the "split" algorithm (which is the default) divides requests up among multiple drives. I'm unclear as to how you reached a conclusion opposite of what the man page says -- did you test and find it not to work? > > Create a mirror. > The order of components is important, > because a component's priority is based on its position > (starting from 0). The component with the biggest priority > is used by the prefer balance algorithm and is also used as a > master component when resynchronization is needed, e.g. after > a power failure when the device was open for writing. > > Additional options include: > > *-b* /balance/ Specifies balance algorithm to use, one of: > > *load* Read from the component with the > lowest load. > > *prefer* Read from the component with the > biggest priority. > > *round-robin* Use round-robin algorithm when > choosing component to read. > > *split* Split read requests, which are big- > ger than or equal to slice size on N > pieces, where N is the number of > active components. This is the > default balance algorithm. > > Cheers, > mark > > -- > Mark Mielke <mark@mielke.cc> -- Bill Moran Collaborative Fusion Inc. http://people.collaborativefusion.com/~wmoran/ wmoran@collaborativefusion.com Phone: 412-422-3463x4023
Perhaps you and I are speaking slightly different languages? :-) When I say "does not read from all drives", I mean "it will happily read from any of the drives to satisfy the request, and allows some level of configuration as to which drive it will select. It does not need to read all of the drives to satisfy the request."

> In response to Mark Mielke <mark@mark.mielke.cc>:
>> Bill Moran wrote:
>>> I'm fairly sure that FreeBSD's GEOM does. Of course, it couldn't be doing consistency checking at that point.
>> According to this:
>> http://www.freebsd.org/cgi/man.cgi?query=gmirror&apropos=0&sektion=8&manpath=FreeBSD+6-current&format=html
>> There is a -b (balance) option that seems pretty clear that it does not read from all drives if it does not have to:
> From where did you draw that conclusion? Note that the "split" algorithm (which is the default) divides requests up among multiple drives. I'm unclear as to how you reached a conclusion opposite of what the man page says -- did you test and find it not to work?
Cheers,
mark
-- Mark Mielke <mark@mielke.cc>
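As a side note to the gmirror exchange above: the difference between those balance algorithms is easier to see with a toy model. The Python sketch below is purely illustrative and is not gmirror's actual code; it just shows that each algorithm reads any given piece of data from only one mirror component, and that the algorithms differ only in how that component is chosen:

    # Toy model of two gmirror-style balance algorithms on a 2-way mirror.
    # Not gmirror's implementation -- only an illustration that a mirror
    # read is satisfied from ONE component per piece of data.
    COMPONENTS = ["disk0", "disk1"]
    SLICE_SIZE = 4096  # "split" only divides requests at least this big

    def round_robin(requests):
        """Alternate whole requests between the mirror components."""
        return [(COMPONENTS[i % len(COMPONENTS)], off, length)
                for i, (off, length) in enumerate(requests)]

    def split(requests):
        """Cut large requests into pieces, one piece per component."""
        plan = []
        for off, length in requests:
            if length < SLICE_SIZE:
                plan.append((COMPONENTS[0], off, length))
                continue
            piece = length // len(COMPONENTS)
            for i, disk in enumerate(COMPONENTS):
                plan.append((disk, off + i * piece, piece))
        return plan

    reads = [(0, 8192), (8192, 512), (16384, 8192)]  # (offset, length) pairs
    print(round_robin(reads))  # each request handled by a single, alternating disk
    print(split(reads))        # big requests divided so both disks work in parallel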
Shane Ambler wrote:
>> I achieve something closer to +20% - +60% over the theoretical performance of a single disk with my four disk RAID 1+0 partitions.
>
> If a good 4 disk SATA RAID 1+0 can achieve 60% more throughput than a single SATA disk, what sort of percentage can be achieved from a good SCSI controller with 4 disks in RAID 1+0?
>
> Are we still hitting the bus limits at this point or can a SCSI RAID still outperform in raw data throughput?
>
> I would still think that SCSI would still provide the better reliability that it always has, but performance wise is it still in front of SATA?

I have a SuperMicro X5DP8-G2 motherboard with two hyperthreaded microprocessors on it. This motherboard has 5 PCI-X busses (not merely 5 sockets: in fact it has 6 sockets, but also a dual Ultra/320 SCSI controller chip and a dual gigabit ethernet chip). So I hook up my 4 10,000 rpm database hard drives on one SCSI controller and the two other 10,000 rpm hard drives on the other. Nothing else is on the SCSI controller or its PCI-X bus that goes to the main memory except the other SCSI controller. These PCI-X busses are 133 MHz, and the memory is 266 MHz but the FSB runs at 533 MHz as the memory modules are run in parallel; i.e., there are 8 modules and they run two at a time.

Nothing else is on the other SCSI controller. Of the two hard drives on the second controller, one has the WAL on it, but when my database is running something (it is up all the time, but frequently idle) nothing else uses that drive much.

So in theory, I should be able to get about 320 megabytes/second through each SCSI controller, though I have never seen that. I do get over 60 megabytes/second for brief (several second) periods though. I do not run RAID.

I think it is probably very difficult to generalize how things go without a good knowledge of how the motherboard is organized, the amounts and types of caching that take place (both software and hardware), the speeds of the various devices and their controllers, the bandwidths of the various communication paths, and so on.

-- .~. Jean-David Beyer Registered Linux User 85642. /V\ PGP-Key: 9A2FC99A Registered Machine 241939. /( )\ Shrewsbury, New Jersey http://counter.li.org ^^-^^ 11:00:01 up 10 days, 11:30, 2 users, load average: 4.20, 4.20, 4.25
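For anyone wanting to see what a drive or array actually sustains, rather than reasoning from controller and bus specs, a crude sequential-read timing gives a ballpark figure. The Python sketch below is not from the thread; the file path and sizes are placeholders, and it makes no attempt to defeat the OS page cache, so point it at a file much larger than RAM for honest numbers:

    # Rough sequential-read throughput check (illustrative sketch only).
    # PATH is a placeholder: use any large file living on the array under test.
    import time

    PATH = "/var/tmp/bigfile"       # placeholder path
    BLOCK = 1024 * 1024             # read in 1 MB chunks
    LIMIT = 1024 * 1024 * 1024      # stop after 1 GB

    total = 0
    start = time.time()
    with open(PATH, "rb", buffering=0) as f:
        while total < LIMIT:
            chunk = f.read(BLOCK)
            if not chunk:
                break
            total += len(chunk)
    elapsed = max(time.time() - start, 1e-6)
    print(f"read {total / 1e6:.0f} MB in {elapsed:.1f}s "
          f"-> {total / 1e6 / elapsed:.1f} MB/s")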
On Dec 26, 2007, at 10:21 AM, Bill Moran wrote:
> I snipped the rest of your message because none of it matters. Never use RAID 5 on a database system. Ever. There is absolutely NO reason to every put yourself through that much suffering. If you hate yourself that much just commit suicide, it's less drastic.

Once you hit 14 or more spindles, the difference between RAID10 and RAID5 (or preferably RAID6) is minimal. In your 4 disk scenario, I'd vote RAID10.
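Performance aside, the capacity and worst-case redundancy trade-off between those levels is easy to tabulate. The short Python sketch below is only a summary of the standard definitions, and it deliberately ignores performance, which is the harder part of the comparison:

    # Usable capacity and worst-case failure tolerance for the RAID levels
    # discussed in this thread (performance is not modelled here).
    def raid10(n):   # striped mirror pairs, n even
        return n // 2, 1      # usable disks, failures survived in the worst case
    def raid5(n):
        return n - 1, 1
    def raid6(n):
        return n - 2, 2

    for n in (4, 8, 14):
        for name, fn in (("RAID10", raid10), ("RAID5", raid5), ("RAID6", raid6)):
            usable, survives = fn(n)
            print(f"{n:2d} disks {name}: {usable} usable, "
                  f"survives {survives} failure(s) worst case")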
On Dec 26, 2007, at 4:28 PM, david@lang.hm wrote:
> now, if you can afford solid-state drives which don't have noticeable seek times, things are completely different ;-)

Who makes one with "infinite" lifetime? The only ones I know of are built using RAM, have disk drive backup with internal monitoring, and are *really* expensive. I've pondered building a RAID enclosure using these new SATA flash drives, but that would be an expensive brick after a short period as one of my DB servers...
In response to Mark Mielke <mark@mark.mielke.cc>:
> Bill Moran wrote:
>> In response to Mark Mielke <mark@mark.mielke.cc>:
>>> Bill Moran wrote:
>>>> I'm fairly sure that FreeBSD's GEOM does. Of course, it couldn't be doing consistency checking at that point.
>>> According to this:
>>> http://www.freebsd.org/cgi/man.cgi?query=gmirror&apropos=0&sektion=8&manpath=FreeBSD+6-current&format=html
>>> There is a -b (balance) option that seems pretty clear that it does not read from all drives if it does not have to:
>> From where did you draw that conclusion? Note that the "split" algorithm (which is the default) divides requests up among multiple drives. I'm unclear as to how you reached a conclusion opposite of what the man page says -- did you test and find it not to work?
> Perhaps you and I are speaking slightly different languages? :-) When I say "does not read from all drives", I mean "it will happily read from any of the drives to satisfy the request, and allows some level of configuration as to which drive it will select. It does not need to read all of the drives to satisfy the request."

Ahh ... I did misunderstand you.

-- Bill Moran Collaborative Fusion Inc. http://people.collaborativefusion.com/~wmoran/ wmoran@collaborativefusion.com Phone: 412-422-3463x4023