Thread: choosing RAID level for xlogs

From:
"Anjan Dave"
Date:

Hi,

 

One simple question. For 125 or more checkpoint segments (checkpoint_timeout is 600 seconds, shared_buffers are at 21760 or 170MB) on a very busy database, what is more suitable, a separate 6 disk RAID5 volume, or a RAID10 volume? Databases will be on separate spindles. Disks are 36GB 15KRPM, 2Gb Fibre Channel. Performance is paramount, but I don't want to use RAID0.

 

PG7.4.7 on RHAS 4.0

 

I can provide more info if needed.

 

Appreciate some recommendations!

 

Thanks,

Anjan

 

 
---
This email message and any included attachments constitute confidential and privileged information intended exclusively for the listed addressee(s). If you are not the intended recipient, please notify Vantage by immediately telephoning 215-579-8390, extension 1158. In addition, please reply to this message confirming your receipt of the same in error. A copy of your email reply can also be sent to . Please do not disclose, copy, distribute or take any action in reliance on the contents of this information. Kindly destroy all copies of this message and any attachments. Any other use of this email is prohibited. Thank you for your cooperation. For more information about Vantage, please visit our website at http://www.vantage.com.
---

 

From:
mudfoot@rawbw.com
Date:

Quoting Anjan Dave <>:

> Hi,
>
>
>
> One simple question. For 125 or more checkpoint segments
> (checkpoint_timeout is 600 seconds, shared_buffers are at 21760 or
> 170MB) on a very busy database, what is more suitable, a separate 6 disk
> RAID5 volume, or a RAID10 volume? Databases will be on separate
> spindles. Disks are 36GB 15KRPM, 2Gb Fiber Channel. Performance is
> paramount, but I don't want to use RAID0.
>

RAID10 -- no question.  xlog activity is overwhelmingly sequential 8KB writes.
In order for RAID5 to perform a write, the host (or controller) needs to perform
extra calculations for parity.  This turns into latency.  RAID10 does not
perform those extra calculations.




From:
"Anjan Dave"
Date:

Yes, that's true, though, I am a bit confused because the Clariion array
document I am reading talks about how the write cache can eliminate the
RAID5 Write Penalty for sequential and large IOs...resulting in better
sequential write performance than RAID10.

anjan


---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
       subscribe-nomail command to  so that your
       message can get through to the mailing list cleanly


From:
John A Meinel
Date:

Anjan Dave wrote:
> Yes, that's true, though, I am a bit confused because the Clariion array
> document I am reading talks about how the write cache can eliminate the
> RAID5 Write Penalty for sequential and large IOs...resulting in better
> sequential write performance than RAID10.
>
> anjan
>

Well, suppose your stripe size is 128k, and you have N disks in the RAID (N
must be even, and at least 4, for RAID10).

With RAID5 you have a stripe across N-1 disks, plus 1 parity entry.
With RAID10 you have a stripe across N/2 disks, replicated on the second
set.

So if the average write size is >128k*N/2, then you will generally be
using all of the disks during a write, and you can expect a maximum
scale-up of about N/2 for RAID10.

If your average write size is >128k*(N-1), then you can again write an
entire stripe at a time, parity included: since you already know all of
the information, you don't have to do any reading. So you can get a
maximum speed-up of N-1.
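The stripe arithmetic above can be sketched in a few lines of Python. This is only the idealized best case described in the text, with the 128k stripe unit assumed:

```python
# Idealized best-case write parallelism for RAID10 vs RAID5,
# assuming a 128 KB stripe unit and N disks (N even, N >= 4).

def max_speedup(n_disks: int, stripe_kb: int = 128):
    """Return (raid10_speedup, raid5_speedup,
    raid10_full_stripe_kb, raid5_full_stripe_kb)."""
    raid10 = n_disks // 2        # data striped across N/2 disks, mirrored
    raid5 = n_disks - 1          # data striped across N-1 disks + parity
    # Write sizes needed to touch every data disk in one full stripe:
    return raid10, raid5, stripe_kb * raid10, stripe_kb * raid5

r10, r5, r10_kb, r5_kb = max_speedup(6)
print(f"6 disks: RAID10 ~{r10}x (full stripe {r10_kb} KB), "
      f"RAID5 ~{r5}x (full stripe {r5_kb} KB)")
```

For the 6-disk array discussed in the thread, this gives a best-case ~3x for RAID10 and ~5x for RAID5, which is why full-stripe writes are the one case where RAID5 can look good.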

If you are doing infrequent smallish writes, it can be buffered by the
write cache, and isn't disk limited at all. And the controller can write
it out when it feels like it. So it should be able to do more buffered
all-at-once writes.

If you are writing a little bit more often (such that the cache fills
up), depending on your write pattern, it is possible that all of the
stripes are already in the cache, so again there is little penalty for
the parity stripe.

I suppose the worst case is writing lots of very small chunks, all over
the disk in random order. In that case each write incurs a 2x read
penalty with a smart controller, or an Nx read penalty if you are going
for more safety than speed. (You can read the original value and the
parity, and re-compute the parity with the new value: a 2x read penalty.
But then any corruption would go undetected, so you might instead want to
read all of the stripes in the block and recompute the parity from the
new data: an Nx read penalty.)
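The "smart controller" path here is the classic XOR read-modify-write. A minimal illustrative sketch (not any particular controller's implementation):

```python
# Illustrative RAID5 parity maintenance with XOR.
# Full-stripe parity is the XOR of all data chunks; a small write can
# update parity by reading only the old data chunk and the old parity
# (the "2x read penalty" path described above).

def parity(chunks):
    """Parity of a full stripe: byte-wise XOR of all data chunks."""
    p = bytes(len(chunks[0]))
    for c in chunks:
        p = bytes(a ^ b for a, b in zip(p, c))
    return p

def update_parity(old_parity, old_chunk, new_chunk):
    """Read-modify-write: new_parity = old_parity ^ old_data ^ new_data."""
    return bytes(p ^ o ^ n
                 for p, o, n in zip(old_parity, old_chunk, new_chunk))

stripe = [b"aaaa", b"bbbb", b"cccc"]
p = parity(stripe)
p2 = update_parity(p, stripe[1], b"dddd")   # small write to chunk 1
stripe[1] = b"dddd"
assert p2 == parity(stripe)                 # matches full recomputation
```

The two extra reads (old data, old parity) before the two writes are exactly the latency the thread is arguing about.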

I think the issue for Postgres is that it writes 8k pages, which is
quite small relative to the stripe size. So you don't tend to build up
big buffers to write out the entire stripe at once.

So if you aren't filling up your write buffer, RAID5 can do quite well
with bulk loads.

I also don't know about the penalties for a read followed immediately by
a write. Since you will be writing to the same location, you know that
you have to wait for the disk to spin back to the same location. At 10k
rpm that is a 6ms wait time. For 7200rpm disks, it is 8.3ms.
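The 6ms and 8.3ms figures are just one full platter revolution, which can be checked directly:

```python
# Spin-back wait before rewriting a just-read sector: one full
# revolution, i.e. 60 / rpm seconds.

def full_rotation_ms(rpm: int) -> float:
    return 60_000 / rpm

print(f"15k rpm:  {full_rotation_ms(15_000):.1f} ms")  # the disks in question
print(f"10k rpm:  {full_rotation_ms(10_000):.1f} ms")
print(f"7200 rpm: {full_rotation_ms(7_200):.1f} ms")
```

Note the 15KRPM disks the original poster has would wait only about 4ms, but the penalty exists regardless of spindle speed.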

Just to say that there are some specific extra penalties when you are
reading the location that you are going to write right away. Now a
really smart controller with lots of data to write could read the whole
circle on the disk, and then start writing out the entire circle, and
not have any spin delay. But you would have to know the size of the
circle, and that depends on what block you are on, and the heads
arrangement and everything else.

Though since hard drives also have small caches in them, you could hide
some of the spin delay, but not a lot, since you have to leave the head
there until you are done writing, so while the current command would
finish quickly, the next command couldn't start until the first actually
finished.

Writing large buffers hides all of these seek/spin-based latencies, so
you can get really good throughput. But a lot of DB activity is small
buffers randomly distributed, so you really do need low seek times, and
there RAID10 is probably better than RAID5.

John
=:->

From:
Vivek Khera
Date:

On Aug 16, 2005, at 2:37 PM, Anjan Dave wrote:

> Yes, that's true, though, I am a bit confused because the Clariion
> array
> document I am reading talks about how the write cache can eliminate
> the
> RAID5 Write Penalty for sequential and large IOs...resulting in better
> sequential write performance than RAID10.
>

well, then run your own tests and find out :-)

if I were using LSI MegaRAID controllers, I'd probably go RAID10, but
I don't see why you need 6 disks for this... perhaps just 4 would be
enough?  Or are your logs really that big?


Vivek Khera, Ph.D.
+1-301-869-4449 x806



From:
John A Meinel
Date:

Anjan Dave wrote:
> Yes, that's true, though, I am a bit confused because the Clariion array
> document I am reading talks about how the write cache can eliminate the
> RAID5 Write Penalty for sequential and large IOs...resulting in better
> sequential write performance than RAID10.
>
> anjan
>

To give a shorter statement after my long one...
If you have enough cache that the controller can write out big chunks to
the disk at a time, you can get very good sequential RAID5 performance,
because the stripe size is large (so it can do a parallel write to all
disks).

But for small chunk writes, you suffer the penalty of the read before
write, and possible multi-disk read (depends on what is in cache).

RAID10 generally handles small writes better, and I would guess that
4 disks would perform almost identically to 6 disks, since you aren't
usually writing enough data to span multiple stripes.

If your battery-backed cache is big enough that you don't fill it, they
probably perform about the same (superfast) since the cache hides the
latency of the disks.

If you start filling up your cache, RAID5 probably can do better because
of the parallelization.

But small writes followed by an fsync do favor RAID10 over RAID5.

John
=:->

From:
Alex Turner
Date:

Theoretically RAID 5 can perform better than RAID 10 over the same
number of drives (more members form the stripe in RAID 5 than in RAID
10).  All you have to do is calculate parity faster than the drives
can write.  That doesn't seem like a hard task, really, although most
RAID controllers seem incapable of doing so; it is possible that
Clariion might be able to achieve it.  The other factor is that for
partial block writes, the array has to first read the original block in
order to recalculate the parity, so small random writes are very slow.
If you are writing chunks that are larger than your stripe size*(n-1),
then in theory the controller doesn't have to re-read a block, and can
just overwrite the parity with the new info.

Consider just four drives.  In RAID 10, it is a stripe of two mirrors,
forming two independent units to write to.  In RAID 5, it is a 3-drive
stripe with parity, giving three independent units to write to.
Theoretically the RAID 5 should be faster, but I've yet to benchmark a
controller where this holds true.

Of course, if you ever do have a drive failure, your array grinds to a
halt, because rebuilding a RAID 5 requires reading (n-1) blocks to
rebuild just one block, where n is the number of drives in the array,
whereas a mirror only requires reading from a single spindle of the
RAID.
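The rebuild-cost asymmetry is simple enough to state as a one-line formula per level; a small sketch of the argument above:

```python
# Reads required to reconstruct one lost block after a drive failure:
# RAID5 must read all n-1 surviving members and XOR them; RAID10 just
# copies the block from the surviving half of the mirror.

def rebuild_reads(level: str, n_drives: int) -> int:
    if level == "raid5":
        return n_drives - 1   # XOR of every surviving member
    if level == "raid10":
        return 1              # read from the mirror partner
    raise ValueError(f"unknown level: {level}")

for n in (4, 6, 8):
    print(f"{n} drives: RAID5 reads {rebuild_reads('raid5', n)}, "
          f"RAID10 reads {rebuild_reads('raid10', n)}")
```

So the rebuild penalty for RAID5 grows with array size, while for RAID10 it is constant, which is the point being made here.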

I would suggest running some benchmarks at RAID 5 and RAID 10 to see
what the _real_ performance actually is; that's the only way to really
tell.

Alex Turner
NetEconomist


From:
Alex Turner
Date:

Don't forget that controllers often don't obey fsyncs the way a plain
drive does.  That's the point of having a BBU ;)

Alex Turner
NetEconomist


From:
"Gregory S. Williamson"
Date:

I would be very cautious about ever using RAID5, despite manufacturers' claims to the contrary. The link below is authored by a very knowledgeable fellow whose posts I know (and trust) from Informix land.

<http://www.miracleas.com/BAARF/RAID5_versus_RAID10.txt>

Greg Williamson
DBA
GlobeXplorer LLC



From:
"Anjan Dave"
Date:

Thanks, everyone. I got some excellent replies, including some long explanations. Appreciate the time you guys took out for the responses.

The gist of it, I take it, is to use RAID10. I have 400MB+ of write cache on the controller(s) that the RAID5 LUN(s) could benefit from by filling it up and writing out the complete stripe, but come to think of it, it's shared among the two Storage Processors and all the LUNs, not just the ones holding the pg_xlog directory. The other thing (with Clariion) is the write cache mirroring. A write isn't signalled complete to the host until the cache content is mirrored across the other SP (and vice-versa), which is a good thing, but this operation could potentially become a bottleneck under very high load on the SPs.

Also, one would have to fully trust the controller/manufacturer's claim on signalling the write completion. And performance is a priority over the drive space lost in RAID10 for me.

I can use 4 drives instead of 6.

Thanks,
Anjan





From:
Alex Turner
Date:

The other point that is well made is that with enough drives you will
max out the PCI bus before you max out the drives.  64-bit/66 MHz PCI
can do about 400MB/sec in practice, which can be achieved by two 3-drive
stripes (6 drives in RAID 10).  A true PCI-X card can do better, but can
your controller?  Remember, U320 is only 320MB/sec per channel...
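The bus numbers are easy to sanity-check. Theoretical peak is bus width times clock; the ~400MB/sec figure quoted above is a practical throughput after protocol overhead:

```python
# Theoretical peak bandwidth of a parallel PCI-family bus:
# (width in bits / 8) bytes per transfer * clock in MHz transfers/us.

def pci_peak_mb_s(width_bits: int, clock_mhz: float) -> float:
    return width_bits / 8 * clock_mhz

print(f"PCI   64-bit/66 MHz peak:  {pci_peak_mb_s(64, 66):.0f} MB/s")
print(f"PCI-X 64-bit/133 MHz peak: {pci_peak_mb_s(64, 133):.0f} MB/s")
```

That gives 528 MB/s theoretical for 64-bit/66 MHz, consistent with ~400 MB/s real-world, and 1064 MB/s for full-speed PCI-X.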

Alex Turner
NetEconomist
