Re: choosing RAID level for xlogs - Mailing list pgsql-performance
From:           John A Meinel
Subject:        Re: choosing RAID level for xlogs
Date:
Msg-id:         430238A5.5000002@arbash-meinel.com
In response to: Re: choosing RAID level for xlogs ("Anjan Dave" <adave@vantage.com>)
List:           pgsql-performance
Anjan Dave wrote:
> Yes, that's true, though I am a bit confused, because the Clariion array
> document I am reading talks about how the write cache can eliminate the
> RAID5 write penalty for sequential and large IOs... resulting in better
> sequential write performance than RAID10.
>
> anjan

Well, suppose your stripe size is 128k and you have N disks in the array
(N must be even and >= 4 for RAID10). With RAID5 you have a stripe across
N-1 disks plus 1 parity entry. With RAID10 you have a stripe across N/2
disks, replicated on the second set.

So if the average write size is >128k*N/2, you will generally be using all
of the disks during a write, and you can expect a maximum scale-up of
about N/2 for RAID10. If your average write size is >128k*(N-1), then with
RAID5 you can again write an entire stripe at a time, parity included;
since you already know all of the information, you don't have to do any
reading. So you can get a maximum speed-up of N-1.

If you are doing infrequent smallish writes, they can be buffered by the
write cache and aren't disk-limited at all; the controller can write them
out when it feels like it, so it should be able to do more buffered
all-at-once writes. If you are writing a little more often (such that the
cache fills up), then depending on your write pattern it is possible that
all of the stripes are already in the cache, so again there is little
penalty for the parity stripe.

I suppose the worst case is writing lots of very small chunks, all over
the disk in random order. In that case each write incurs a 2x read penalty
with a smart controller, or an Nx read penalty if you are going for more
safety than speed. (You can read the original value and the parity, and
recompute the parity with the new value (a 2x read penalty); but then
corruption would go undetected, so you might instead read all of the other
chunks in the stripe and recompute the parity from the new data (an Nx
read penalty).)

I think the issue for Postgres is that it writes 8k pages, which is quite
small relative to the stripe size, so you don't tend to build up big
buffers that write out an entire stripe at once. Still, if you aren't
filling up your write buffer, RAID5 can do quite well with bulk loads.

I also don't know about the penalties for a read followed immediately by a
write. Since you will be writing to the same location, you have to wait
for the disk to spin back around to it. At 10k rpm that is a 6ms wait;
for 7200rpm disks it is 8.3ms. Just to say that there are specific extra
penalties when you read a location that you are about to write.

Now a really smart controller with lots of data to write could read the
whole track on the disk and then write out the entire track with no spin
delay. But you would have to know the size of the track, which depends on
which block you are on, the head arrangement, and everything else. Since
hard drives also have small caches in them, you could hide some of the
spin delay, but not a lot: you have to leave the head there until you are
done writing, so while the current command would finish quickly, the next
command couldn't start until the first actually finished.

Writing large buffers hides all of these seek/spin-based latencies, so you
can get really good throughput. But a lot of DB activity is small buffers
randomly distributed, so you really do need low seek time, and there
RAID10 is probably better than RAID5.

John
=:->
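P.S. To make the stripe-width arithmetic concrete, here is a quick
back-of-the-envelope sketch in Python. The disk count and stripe size are
made-up example numbers, not anything from the Clariion docs:

    # Full-stripe write width for RAID10 vs RAID5, per the math above.
    # Example numbers only: 8 disks, 128k stripe unit per disk.
    N = 8                        # total disks in the array
    stripe_kb = 128              # stripe unit per disk, in KB

    raid10_data_disks = N // 2   # mirrored: half the spindles carry unique data
    raid5_data_disks = N - 1     # one disk's worth of capacity holds parity

    # Smallest write that touches every data disk (a full stripe):
    print("RAID10 full stripe: %d KB, max scale-up ~%dx"
          % (stripe_kb * raid10_data_disks, raid10_data_disks))
    print("RAID5  full stripe: %d KB, max scale-up ~%dx"
          % (stripe_kb * raid5_data_disks, raid5_data_disks))

With 8 disks that prints 512 KB / ~4x for RAID10 and 896 KB / ~7x for
RAID5.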
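And a toy illustration of the two parity-update strategies for a small
RAID5 write, just XOR over byte strings to show why one path costs 2 reads
and the other N-1; obviously no controller actually works at this level:

    # RAID5 parity is the XOR of the data chunks in a stripe.
    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    stripe = [b"AAAA", b"BBBB", b"CCCC"]   # data chunks on 3 data disks
    parity = xor(xor(stripe[0], stripe[1]), stripe[2])

    new_chunk = b"DDDD"                    # small write replacing chunk 1

    # Read-modify-write: read old chunk + old parity (2 reads), patch parity.
    rmw_parity = xor(xor(parity, stripe[1]), new_chunk)

    # Full reconstruct: read the other data chunks (N-1 reads), recompute.
    full_parity = xor(xor(stripe[0], new_chunk), stripe[2])

    assert rmw_parity == full_parity       # both yield the same parity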
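Finally, the rotational numbers above are just 60/rpm per revolution:

    # Milliseconds per full revolution at common spindle speeds.
    for rpm in (7200, 10000, 15000):
        print("%5d rpm: %.1f ms" % (rpm, 60.0 / rpm * 1000))
    # -> 8.3 ms, 6.0 ms, 4.0 ms respectively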