Re: choosing RAID level for xlogs - Mailing list pgsql-performance

From John A Meinel
Subject Re: choosing RAID level for xlogs
Date
Msg-id 430238A5.5000002@arbash-meinel.com
In response to Re: choosing RAID level for xlogs  ("Anjan Dave" <adave@vantage.com>)
List pgsql-performance
Anjan Dave wrote:
> Yes, that's true, though, I am a bit confused because the Clariion array
> document I am reading talks about how the write cache can eliminate the
> RAID5 Write Penalty for sequential and large IOs...resulting in better
> sequential write performance than RAID10.
>
> anjan
>

Well, say your stripe size is 128k, and you have N disks in the RAID (N
must be even and >= 4 for RAID10).

With RAID5 you have a stripe across N-1 disks, and 1 parity entry.
With RAID10 you have a stripe across N/2 disks, replicated on the second
set.

So if the average write size is >128k*N/2, then you will generally be
using all of the disks during a write, and you can expect a maximum
scale-up of about N/2 for RAID10.

With RAID5, if your average write size is >128k*(N-1), then you can again
write an entire stripe at a time, parity included: since you already know
all of the data in the stripe, you don't have to do any reading. So you
can get a maximum speed-up of N-1.
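
To put numbers on the two thresholds above, here is a rough sketch. The
function name and the disk counts are just illustrative assumptions; the
128k stripe size is the one from the example.

```python
STRIPE_KB = 128  # per-disk stripe size used in the example above


def full_stripe_write(n_disks, level):
    """Return (write size in KB that spans a full stripe, max write scale-up)."""
    if level == "raid10":
        if n_disks < 4 or n_disks % 2:
            raise ValueError("RAID10 needs an even number of disks, at least 4")
        data_disks = n_disks // 2   # the other half holds the mirror copies
    elif level == "raid5":
        if n_disks < 3:
            raise ValueError("RAID5 needs at least 3 disks")
        data_disks = n_disks - 1    # one disk's worth of each stripe is parity
    else:
        raise ValueError(f"unknown RAID level: {level}")
    return STRIPE_KB * data_disks, data_disks


if __name__ == "__main__":
    for n in (4, 6, 8):
        print(n, "disks:",
              "RAID10", full_stripe_write(n, "raid10"),
              "RAID5", full_stripe_write(n, "raid5"))
```

This only models the best case of large sequential writes; it says nothing
about small random writes.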

If you are doing infrequent smallish writes, they can be buffered by the
write cache, and aren't disk limited at all. The controller can write
them out when it feels like it, so it should be able to do more buffered
all-at-once writes.

If you are writing a little bit more often (such that the cache fills
up), depending on your write pattern, it is possible that all of the
stripes are already in the cache, so again there is little penalty for
the parity stripe.

I suppose the worst case is if you were writing lots of very small
chunks, all over the disk in random order. In that case each write
incurs a 2x read penalty with a smart controller, or an Nx read
penalty if you are going for more safety than speed. (You can read the
original value and the parity, and re-compute the parity with the new
value (2x read penalty), but if there is corruption it would not be
detected, so you might want to read all of the stripes in the block and
recompute the parity with the new data (Nx read penalty).)
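
The two ways of updating parity can be sketched like this, assuming plain
XOR parity. The helper names and the tiny 4-byte "blocks" are made up for
illustration; a real controller works on whole stripe units.

```python
from functools import reduce


def xor_blocks(a, b):
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity_of(blocks):
    """Parity block for a stripe: XOR of all of its data blocks."""
    return reduce(xor_blocks, blocks)


def parity_read_modify_write(old_data, old_parity, new_data):
    """2 reads (old data + old parity): cancel the old data, add the new."""
    return xor_blocks(xor_blocks(old_parity, old_data), new_data)


def parity_reconstruct(stripe, index, new_data):
    """Re-read the rest of the stripe and recompute parity from scratch."""
    updated = list(stripe)
    updated[index] = new_data
    return parity_of(updated)


if __name__ == "__main__":
    stripe = [bytes([1, 2, 3, 4]), bytes([5, 6, 7, 8]), bytes([9, 10, 11, 12])]
    parity = parity_of(stripe)
    new = bytes([42, 42, 42, 42])
    # Both approaches agree as long as nothing on disk is corrupt.
    assert parity_read_modify_write(stripe[1], parity, new) == \
        parity_reconstruct(stripe, 1, new)
```

The first approach trusts the old parity, so an already-corrupt parity block
just gets carried forward; the second never looks at the old parity, at the
cost of more reads.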

I think the issue for Postgres is that it writes 8k pages, which is
quite small relative to the stripe size. So you don't tend to build up
big buffers to write out the entire stripe at once.
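
As a rough illustration of how small 8k is here, this sketch counts how many
consecutive 8k pages have to be queued before a full stripe can be written.
The 128k stripe size is from the example above; the function name and the
8-disk array are assumptions.

```python
PAGE_KB = 8      # Postgres page size mentioned above
STRIPE_KB = 128  # per-disk stripe size from the example


def pages_per_full_stripe(data_disks):
    """Number of 8k pages needed to cover one full stripe across data_disks."""
    return STRIPE_KB * data_disks // PAGE_KB


if __name__ == "__main__":
    n = 8  # hypothetical array size
    print("RAID5 :", pages_per_full_stripe(n - 1), "pages")
    print("RAID10:", pages_per_full_stripe(n // 2), "pages")
```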

So if you aren't filling up your write buffer, RAID5 can do quite well
with bulk loads.

I also don't know about the penalties for a read followed immediately by
a write. Since you will be writing to the same location, you know that
you have to wait for the disk to spin back to the same location. At 10k
rpm that is a 6ms wait time. For 7200rpm disks, it is 8.3ms.
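
Those wait times are just the time for one full revolution. A quick sketch
(the function name is mine, and the 15k rpm entry is an extra data point not
discussed above):

```python
def rotation_ms(rpm):
    """Time for one full platter revolution, in milliseconds."""
    return 60_000 / rpm


if __name__ == "__main__":
    for rpm in (7200, 10_000, 15_000):
        print(f"{rpm} rpm: {rotation_ms(rpm):.1f} ms per revolution")
```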

My point is just that there are some specific extra penalties when you are
reading the location that you are going to write right away. Now a
really smart controller with lots of data to write could read the whole
circle on the disk, and then start writing out the entire circle, and
not have any spin delay. But you would have to know the size of the
circle, and that depends on what block you are on, and the heads
arrangement and everything else.

Hard drives also have small caches in them, so you could hide some of the
spin delay, but not a lot. The head has to stay in place until the write
is done, so while the current command would appear to finish quickly, the
next command couldn't start until the first had actually finished.

Writing large buffers hides all of these seek/spin based latencies, so
you can get really good throughput. But a lot of DB action is small
buffers randomly distributed, so you really do need low seek time, and
there RAID10 is probably better than RAID5.

John
=:->
