Thread: choosing RAID level for xlogs
Hi,
One simple question: for 125 or more checkpoint segments (checkpoint_timeout is 600 seconds, shared_buffers is at 21760, or 170MB) on a very busy database, which is more suitable, a separate 6-disk RAID5 volume or a RAID10 volume? Databases will be on separate spindles. Disks are 36GB 15K RPM, 2Gb Fibre Channel. Performance is paramount, but I don't want to use RAID0.
PG7.4.7 on RHAS 4.0
I can provide more info if needed.
Appreciate some recommendations!
Thanks,
Anjan
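
A quick back-of-envelope sketch of what those settings translate to on disk, assuming 16MB xlog segments and the 2*checkpoint_segments + 1 ceiling described in the 7.4 documentation; these are estimates, not measurements:

# Back-of-envelope WAL sizing; the 16MB segment size and the
# 2*checkpoint_segments + 1 ceiling are assumptions based on stock 7.4.
SEGMENT_MB = 16
checkpoint_segments = 125
shared_buffers_pages = 21760        # 8KB pages

max_wal_segments = 2 * checkpoint_segments + 1
print("max pg_xlog size: ~%d MB" % (max_wal_segments * SEGMENT_MB))
print("shared_buffers:   ~%d MB" % (shared_buffers_pages * 8 // 1024))

Roughly 4GB of xlog at the worst case, which easily fits even a 4-drive set of 36GB disks.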
Quoting Anjan Dave <adave@vantage.com>:

> One simple question. For 125 or more checkpoint segments
> (checkpoint_timeout is 600 seconds, shared_buffers are at 21760 or 170MB)
> on a very busy database, what is more suitable, a separate 6 disk RAID5
> volume, or a RAID10 volume? [...] Performance is paramount, but I don't
> want to use RAID0.

RAID10 -- no question. xlog activity is overwhelmingly sequential 8KB writes.
In order for RAID5 to perform a write, the host (or controller) needs to
perform extra calculations for parity. This turns into latency. RAID10 does
not perform those extra calculations.
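
A minimal sketch of the I/O arithmetic behind that argument -- the textbook 4-I/O read-modify-write penalty for a sub-stripe RAID5 write versus two mirrored writes for RAID10. This is the simple model, not a measurement of any particular controller:

# Textbook I/O counts for one small (sub-stripe) write; real controllers
# with write cache can hide or batch some of these operations.
def raid5_small_write_ios():
    # read old data + read old parity, then write new data + write new parity
    return 2 + 2

def raid10_small_write_ios():
    # write the block to both halves of the mirror
    return 2

print("RAID5  I/Os per 8KB write:", raid5_small_write_ios())
print("RAID10 I/Os per 8KB write:", raid10_small_write_ios())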
Yes, that's true, though I am a bit confused, because the Clariion array
document I am reading talks about how the write cache can eliminate the RAID5
write penalty for sequential and large I/Os, resulting in better sequential
write performance than RAID10.

anjan
Anjan Dave wrote:
> Yes, that's true, though, I am a bit confused because the Clariion array
> document I am reading talks about how the write cache can eliminate the
> RAID5 Write Penalty for sequential and large IOs...resulting in better
> sequential write performance than RAID10.
>
> anjan

Well, say your stripe size is 128k and you have N disks in the RAID (N must
be even and >= 4 for RAID10). With RAID5 you have a stripe across N-1 disks
plus 1 parity entry. With RAID10 you have a stripe across N/2 disks,
replicated on the second set.

So if the average write size is > 128k*N/2, you will generally be using all
of the disks during a write, and you can expect a maximum scale-up of about
N/2 for RAID10. If your average write size is > 128k*(N-1), then you can
again write an entire stripe at a time, including the parity; since you
already know all of the information, you don't have to do any reading, so you
can get a maximum speed-up of N-1.

If you are doing infrequent smallish writes, they can be buffered by the
write cache and aren't disk-limited at all; the controller can write them out
when it feels like it, so it should be able to do more buffered all-at-once
writes. If you are writing a little more often (such that the cache fills
up), then depending on your write pattern it is possible that all of the
stripes are already in the cache, so again there is little penalty for the
parity stripe.

I suppose the worst case is writing lots of very small chunks all over the
disk in random order, in which case each write incurs a 2x read penalty for a
smart controller, or an Nx read penalty if you are going for more safety than
speed. (You can read the original value and the parity and recompute the
parity with the new value -- a 2x read penalty -- but then any corruption
would go undetected, so you might instead read all of the stripes in the
block and recompute the parity from the new data -- an Nx read penalty.)

I think the issue for Postgres is that it writes 8k pages, which is quite
small relative to the stripe size, so you don't tend to build up big buffers
to write out an entire stripe at once. Still, if you aren't filling up your
write buffer, RAID5 can do quite well with bulk loads.

I also don't know about the penalties for a read followed immediately by a
write. Since you will be writing to the same location, you have to wait for
the disk to spin back around to it: at 10k RPM that is a 6ms wait, and for
7200 RPM disks it is 8.3ms. So there are some specific extra penalties when
you read a location that you are about to write right away. A really smart
controller with lots of data to write could read the whole track and then
write out the entire track with no rotational delay -- but it would have to
know the track size, which depends on which block you are on, the head
arrangement, and everything else. And although hard drives also have small
caches of their own, those hide only a little of the rotational delay, since
the head has to stay put until the write is done; the current command may
finish quickly, but the next one can't start until the first actually
finishes.

Writing large buffers hides all of these seek/rotation-based latencies, so
you can get really good throughput. But a lot of DB activity is small buffers
randomly distributed, so you really do need low seek time, and there RAID10
is probably better than RAID5.

John
=:->
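
A small sketch of the scale-up arithmetic John describes, assuming full-stripe writes and a 128k stripe unit; it just restates the model above, it is not a benchmark:

# Ideal full-stripe scale-up per the discussion above; assumes a 128KB
# stripe unit and writes large enough to span an entire stripe. Small
# random writes behave very differently.
STRIPE_KB = 128

def full_stripe(n_disks, level):
    if level == "raid5":
        return STRIPE_KB * (n_disks - 1), n_disks - 1    # data per stripe, scale-up
    if level == "raid10":
        return STRIPE_KB * (n_disks // 2), n_disks // 2
    raise ValueError(level)

for n in (4, 6):
    for level in ("raid5", "raid10"):
        size, scale = full_stripe(n, level)
        print("%d disks %-6s  full stripe %4d KB  ideal scale-up x%d"
              % (n, level, size, scale))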
On Aug 16, 2005, at 2:37 PM, Anjan Dave wrote:

> Yes, that's true, though, I am a bit confused because the Clariion array
> document I am reading talks about how the write cache can eliminate the
> RAID5 Write Penalty for sequential and large IOs...resulting in better
> sequential write performance than RAID10.

Well, then run your own tests and find out :-)

If I were using LSI MegaRAID controllers, I'd probably go RAID10, but I don't
see why you need 6 disks for this... perhaps just 4 would be enough? Or are
your logs really that big?

Vivek Khera, Ph.D.
+1-301-869-4449 x806
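
In that spirit, a rough sketch of the kind of test that mimics xlog traffic -- sequential 8KB synchronous writes against a scratch file on the candidate volume. The path and counts are placeholders, and dd or pgbench would serve just as well:

# Rough xlog-style write test: sequential 8KB O_SYNC writes.
# /mnt/candidate/xlogtest is a placeholder path on the volume under test.
import os, time

PATH = "/mnt/candidate/xlogtest"
BLOCK = b"\0" * 8192
COUNT = 2000

fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o600)
start = time.time()
for _ in range(COUNT):
    os.write(fd, BLOCK)
elapsed = time.time() - start
os.close(fd)
os.remove(PATH)

print("%d x 8KB sync writes in %.2fs  (%.1f writes/sec, %.2f MB/sec)"
      % (COUNT, elapsed, COUNT / elapsed, COUNT * 8 / 1024.0 / elapsed))

Run it once against a RAID5 test LUN and once against a RAID10 test LUN and compare the writes/sec figures directly.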
Anjan Dave wrote:
> Yes, that's true, though, I am a bit confused because the Clariion array
> document I am reading talks about how the write cache can eliminate the
> RAID5 Write Penalty for sequential and large IOs...resulting in better
> sequential write performance than RAID10.
>
> anjan

To give a shorter statement after my long one...

If you have enough cache that the controller can write out big chunks to the
disk at a time, you can get very good sequential RAID5 performance, because
the stripe size is large (so it can do a parallel write to all disks). But
for small-chunk writes you suffer the penalty of the read before write, and
possibly a multi-disk read (depending on what is in cache).

RAID10 generally handles small writes better, and I would guess that 4 disks
would perform almost identically to 6 disks, since you aren't usually writing
enough data to span multiple stripes.

If your battery-backed cache is big enough that you don't fill it, the two
probably perform about the same (superfast), since the cache hides the
latency of the disks. If you start filling up your cache, RAID5 probably can
do better because of the parallelization. But small writes followed by an
fsync do favor RAID10 over RAID5.

John
=:->
Theoretically, RAID 5 can perform better than RAID 10 over the same number of
drives (more members form the stripe in RAID 5 than in RAID 10). All you have
to do is calculate parity faster than the drives can write. That doesn't seem
like a hard task, really, yet most RAID controllers seem incapable of doing
so; it is possible that the Clariion might be able to achieve it.

The other factor is that for partial-block writes, the array has to first
read the original block in order to recalculate the parity, so small random
writes are very slow. If you are writing chunks that are larger than
stripe size * (n-1), then in theory the controller doesn't have to re-read a
block and can just overwrite the parity with the new info.

Consider just four drives: in RAID 10 it is a stripe of two mirrors, forming
two independent units to write to; in RAID 5 it is a 3-drive stripe with
parity, giving three independent units to write to. Theoretically the RAID 5
should be faster, but I've yet to benchmark a controller where this holds
true.

Of course, if you ever do have a drive failure, your array grinds to a halt,
because rebuilding a RAID 5 requires reading (n-1) blocks to rebuild just one
block, where n is the number of drives in the array, whereas a mirror only
requires reading from a single spindle.

I would suggest running some benchmarks at RAID 5 and RAID 10 to see what the
_real_ performance actually is; that's the only way to really tell.

Alex Turner
NetEconomist

On 8/16/05, Anjan Dave <adave@vantage.com> wrote:
> Yes, that's true, though, I am a bit confused because the Clariion array
> document I am reading talks about how the write cache can eliminate the
> RAID5 Write Penalty for sequential and large IOs [...]
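
A short sketch of the two numbers Alex calls out -- independent write units on a four-drive array, and the reads needed to rebuild one block after a failure -- using the simple model, not anything controller-specific:

# Simple model of the two points above for an n-drive array.
def write_units(n, level):
    # RAID10: stripe of n/2 mirrors; RAID5: stripe across n-1 data members.
    return n // 2 if level == "raid10" else n - 1

def rebuild_reads_per_block(n, level):
    # RAID5 must read the surviving n-1 members; a mirror reads its one twin.
    return n - 1 if level == "raid5" else 1

n = 4
for level in ("raid10", "raid5"):
    print("%-6s  write units: %d   rebuild reads/block: %d"
          % (level, write_units(n, level), rebuild_reads_per_block(n, level)))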
Don't forget that controllers often don't obey fsyncs the way a plain drive
does -- that's the point of having a BBU ;)

Alex Turner
NetEconomist

On 8/16/05, John A Meinel <john@arbash-meinel.com> wrote:
> If your battery-backed cache is big enough that you don't fill it, they
> probably perform about the same (superfast) since the cache hides the
> latency of the disks.
> [...]
> But small writes followed by an fsync do favor RAID10 over RAID5.
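
One hedged way to check whether fsync is really being honored is to time it: at 15K RPM a platter rotation takes about 4ms, so sustained sub-millisecond fsyncs suggest a write cache (hopefully battery-backed) is acknowledging the writes. A rough sketch, with a placeholder path:

# Crude fsync-latency probe; /mnt/candidate/fsynctest is a placeholder.
# At 15K RPM one rotation is ~4ms, so sustained sub-millisecond fsyncs
# usually mean a write cache is sitting in the path.
import os, time

PATH = "/mnt/candidate/fsynctest"
N = 500

fd = os.open(PATH, os.O_WRONLY | os.O_CREAT, 0o600)
start = time.time()
for _ in range(N):
    os.write(fd, b"\0" * 8192)
    os.fsync(fd)
elapsed = time.time() - start
os.close(fd)
os.remove(PATH)

print("avg fsync latency: %.3f ms" % (elapsed / N * 1000))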
I would be very cautious about ever using RAID5, despite manufacturers'
claims to the contrary. The link below is authored by a very knowledgeable
fellow whose posts I know (and trust) from Informix land.

<http://www.miracleas.com/BAARF/RAID5_versus_RAID10.txt>

Greg Williamson
DBA
GlobeXplorer LLC
Thanks, everyone. I got some excellent replies, including some long
explanations. Appreciate the time you guys took out for the responses.

The gist of it, I take, is to use RAID10. I have 400MB+ of write cache on the
controller(s) that the RAID5 LUN(s) could benefit from by filling it up and
writing out the complete stripe, but come to think of it, it's shared among
the two Storage Processors and all the LUNs, not just the ones holding the
pg_xlog directory. The other thing (with Clariion) is the write cache
mirroring. A write isn't signalled complete to the host until the cache
content is mirrored across the other SP (and vice versa), which is a good
thing, but this operation could potentially become a bottleneck with very
high load on the SPs.

Also, one would have to fully trust the controller/manufacturer's claim on
signalling the write completion. And performance is a priority over the drive
space lost in RAID10 for me.

I can use 4 drives instead of 6.

Thanks,
Anjan
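
For a feel of how long a shared write cache holds out under sustained load, a quick sketch; the 400MB figure is from the post above, while the incoming and drain rates are illustrative guesses, not Clariion specifications:

# How long until the shared write cache fills, if writes arrive faster
# than the back-end disks can drain them. The 50 MB/sec incoming rate
# and 30 MB/sec drain rate are made-up examples.
cache_mb = 400
incoming_mb_per_sec = 50.0
drain_mb_per_sec = 30.0

net_fill = incoming_mb_per_sec - drain_mb_per_sec
if net_fill > 0:
    print("cache fills in ~%.0f seconds" % (cache_mb / net_fill))
else:
    print("cache never fills at this load")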
The other point that is well made is that with enough drives you will max out
the PCI bus before you max out the drives. 64-bit 66MHz PCI can do about
400MB/sec, which can be achieved by two 3-drive stripes (6 drives in RAID
10). A true PCI-X card can do better, but can your controller? Remember, U320
is only 320MB per channel...

Alex Turner
NetEconomist

On 8/16/05, Anjan Dave <adave@vantage.com> wrote:
> Thanks, everyone. I got some excellent replies, including some long
> explanations. Appreciate the time you guys took out for the responses.
>
> The gist of it, I take, is to use RAID10. [...] I can use 4 drives instead
> of 6.
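
The arithmetic behind the bus point, as a sketch: 64-bit/66MHz PCI is roughly 528MB/sec theoretical and around 400MB/sec in practice, and the per-drive sequential rate used below is an assumed ballpark for 15K drives of that era, not a measured figure:

# Rough bus-saturation check; 70 MB/sec per 15K drive is an assumed
# ballpark, not a measured number.
bus_bits, bus_mhz = 64, 66
theoretical_mb = bus_bits / 8 * bus_mhz      # ~528 MB/sec
practical_mb = 400                           # figure quoted in the thread
per_drive_mb = 70

print("64-bit/66MHz PCI: ~%d MB/sec theoretical, ~%d usable"
      % (theoretical_mb, practical_mb))
print("drives to saturate it: ~%d" % (practical_mb // per_drive_mb))

In other words, somewhere around half a dozen fast spindles doing sequential I/O is enough to run into the bus, which matches Alex's two-stripes observation.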