Thread: Advice configuring ServeRAID 8k for performance

Advice configuring ServeRAID 8k for performance

From
"Kenneth Cox"
Date:
I am using PostgreSQL 8.3.7 on a dedicated IBM 3660 with 24GB RAM running
CentOS 5.4 x86_64.  I have a ServeRAID 8k controller with 6 SATA 7500RPM
disks in RAID 6, and for the OLAP workload it feels* slow.  I have 6 more
disks to add, and the RAID has to be rebuilt in any case, but first I
would like to solicit general advice.  I know that's little data to go on,
and I believe in the scientific method, but in this case I don't have the
time to make many iterations.

My questions are simple, but in my reading I have not been able to find
definitive answers:

1) Should I switch to RAID 10 for performance?  I see things like "RAID 5
is bad for a DB" and "RAID 5 is slow with <= 6 drives" but I see little on
RAID 6.  RAID 6 was the original choice for more usable space with good
redundancy.  My current performance is 85MB/s write, 151 MB/s reads (using
dd of 2xRAM per
http://www.westnet.com/~gsmith/content/postgresql/pg-disktesting.htm).
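(For anyone following along, the test from that page is roughly the following sketch; the path is illustrative, and the count assumes 8kB blocks and 24GB of RAM, i.e. ~48GB of data:)

```shell
# Sequential write of ~2x RAM, timed including the final sync so the
# controller's write cache can't hide the result
time sh -c "dd if=/dev/zero of=/pgdata/testfile bs=8k count=6000000 && sync"

# Sequential read of the same file back
time dd if=/pgdata/testfile of=/dev/null bs=8k
```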

2) Should I configure the ext3 file system with noatime and/or
data=writeback or data=ordered?  My controller has a battery, the logical
drive has write cache enabled (write-back), and the physical devices have
write cache disabled (write-through).

3) Do I just need to spend more time configuring postgresql?  My
non-default settings were largely generated by pgtune-0.9.3:

     max_locks_per_transaction = 128 # manual; avoiding "out of shared memory"
     default_statistics_target = 100
     maintenance_work_mem = 1GB
     constraint_exclusion = on
     checkpoint_completion_target = 0.9
     effective_cache_size = 16GB
     work_mem = 352MB
     wal_buffers = 32MB
     checkpoint_segments = 64
     shared_buffers = 2316MB
     max_connections = 32

I am happy to take informed opinion.  If you don't have the time to
properly cite all your sources but have suggestions, please send them.

Thanks in advance,
Ken

* I know "feels slow" is not scientific.  What I mean is that any single
query on a fact table, or any 'rm -rf' of a big directory sends disk
utilization to 100% (measured with iostat -x 3).

Re: Advice configuring ServeRAID 8k for performance

From
Alan Hodgson
Date:
On Thursday, August 05, 2010, "Kenneth Cox" <kenstir@gmail.com> wrote:
> 1) Should I switch to RAID 10 for performance?  I see things like "RAID 5
> is bad for a DB" and "RAID 5 is slow with <= 6 drives" but I see little
> on RAID 6.  RAID 6 was the original choice for more usable space with
> good redundancy.  My current performance is 85MB/s write, 151 MB/s reads
> (using dd of 2xRAM per
> http://www.westnet.com/~gsmith/content/postgresql/pg-disktesting.htm).

If you can spare the drive space, go to RAID 10. RAID 5/6 usually look fine
on single-threaded sequential tests (unless your controller really sucks),
but in the real world with multiple processes doing random I/O RAID 10 will
go a lot further on the same drives. Plus your recovery time from disk
failures will be a lot faster.

If you can't spare the drive space ... you should buy more drives.

>
> 2) Should I configure the ext3 file system with noatime and/or
> data=writeback or data=ordered?  My controller has a battery, the logical
> drive has write cache enabled (write-back), and the physical devices have
> write cache disabled (write-through).

noatime is fine, but really, minor filesystem options rarely show much impact.
My best performance comes from XFS filesystems created with stripe options
matching the underlying RAID array. Anything else is just a bonus.
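As a sketch of what that looks like (device name and geometry are made up; here a 6-disk RAID10, i.e. 3 data-bearing spindles, with a 256kB stripe unit):

```shell
# su = per-disk stripe unit, sw = number of data-bearing spindles,
# so XFS aligns allocation with the RAID stripe underneath
mkfs.xfs -d su=256k,sw=3 /dev/sdb1
mount -o noatime /dev/sdb1 /var/lib/pgsql/data
```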

> * I know "feels slow" is not scientific.  What I mean is that any single
> query on a fact table, or any 'rm -rf' of a big directory sends disk
> utilization to 100% (measured with iostat -x 3).

.. and it should. Any modern system can peg a small disk array without much
effort. Disks are slow.

--
"No animals were harmed in the recording of this episode. We tried but that
damn monkey was just too fast."

Re: Advice configuring ServeRAID 8k for performance

From
Scott Marlowe
Date:
On Thu, Aug 5, 2010 at 12:28 PM, Kenneth Cox <kenstir@gmail.com> wrote:
> I am using PostgreSQL 8.3.7 on a dedicated IBM 3660 with 24GB RAM running
> CentOS 5.4 x86_64.  I have a ServeRAID 8k controller with 6 SATA 7500RPM
> disks in RAID 6, and for the OLAP workload it feels* slow.  I have 6 more
> disks to add, and the RAID has to be rebuilt in any case, but first I would
> like to solicit general advice.  I know that's little data to go on, and I
> believe in the scientific method, but in this case I don't have the time to
> make many iterations.
>
> My questions are simple, but in my reading I have not been able to find
> definitive answers:
>
> 1) Should I switch to RAID 10 for performance?  I see things like "RAID 5 is
> bad for a DB" and "RAID 5 is slow with <= 6 drives" but I see little on RAID
> 6.  RAID 6 was the original choice for more usable space with good
> redundancy.  My current performance is 85MB/s write, 151 MB/s reads (using
> dd of 2xRAM per
> http://www.westnet.com/~gsmith/content/postgresql/pg-disktesting.htm).

Sequential read / write is not very useful for a database benchmark.
It does kind of give you a baseline for throughput, but most db access
is mixed enough that random access becomes the important measurement.

RAID6 is basically RAID5 with a hot spare already built into the
array.  This makes rebuild less of an issue since you can reduce the
spare io used to rebuild the array to something really small.
However, it's in the same performance ballpark as RAID 5 with the
accompanying write performance penalty.

RAID-10 is pretty much the only way to go for a DB, and if you need
more space, you need more or bigger drives, not RAID-5/6

--
To understand recursion, one must first understand recursion.

Re: Advice configuring ServeRAID 8k for performance

From
Greg Smith
Date:
Kenneth Cox wrote:
> 1) Should I switch to RAID 10 for performance?  I see things like
> "RAID 5 is bad for a DB" and "RAID 5 is slow with <= 6 drives" but I
> see little on RAID 6.  RAID 6 was the original choice for more usable
> space with good redundancy.  My current performance is 85MB/s write,
> 151 MB/s reads

RAID6 is no better than RAID5 performance-wise, it just has better fault
tolerance.  And the ServeRAID 8k is a pretty underpowered card as RAID
controllers go, so it would not be impossible for computing RAID
parity and the like to be the bottleneck here.  I'd expect a 6-disk
RAID10 with 7200RPM drives to be closer to 120MB/s on writes, so you're
not getting ideal performance there.  Your read figure is more
competitive, but that's the usual RAID5 pattern--decent on reads,
sluggish on writes.

> 2) Should I configure the ext3 file system with noatime and/or
> data=writeback or data=ordered?  My controller has a battery, the
> logical drive has write cache enabled (write-back), and the physical
> devices have write cache disabled (write-through).

data=ordered is the ext3 default and usually a reasonable choice.  Using
writeback instead can be dangerous, I wouldn't advise starting there.
noatime is certainly a good thing, but the speedup is pretty minor if
you have a battery-backed write cache.
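In /etc/fstab terms that advice comes out as something like the entry below (device and mount point are illustrative; data=ordered is the ext3 default and could simply be omitted):

```shell
# battery-backed write cache on the controller, so noatime is the only
# option worth adding; keep the default ordered journaling for safety
/dev/sdb1  /var/lib/pgsql  ext3  noatime,data=ordered  0  2
```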


> 3) Do I just need to spend more time configuring postgresql?  My
> non-default settings were largely generated by pgtune-0.9.3

Those look reasonable enough, except no reason to make wal_buffers
bigger than 16MB.  That work_mem figure might be high too, that's a
known concern with pgtune I need to knock out of it one day soon.  When
you are hitting high I/O wait periods, is the system running out of RAM
and swapping?  That can cause really nasty I/O wait.

Your basic hardware is off a bit, but not so badly that I'd start
there.  Have you turned on slow query logging to see what is hammering
the system when the iowait climbs?  Often tuning those by looking at the
EXPLAIN ANALYZE output can be much more effective than hardware/server
configuration tuning.

> * I know "feels slow" is not scientific.  What I mean is that any
> single query on a fact table, or any 'rm -rf' of a big directory sends
> disk utilization to 100% (measured with iostat -x 3).

"rm -rf" is really slow on ext3 on just about any hardware.  If your
fact tables aren't in RAM and you run a query against them, paging them
back in again will hammer the disks until it's done.  That's not
necessarily indicative of a misconfiguration on its own.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us


Re: Advice configuring ServeRAID 8k for performance

From
"Pierre C"
Date:
> 1) Should I switch to RAID 10 for performance?  I see things like "RAID
> 5 is bad for a DB" and "RAID 5 is slow with <= 6 drives" but I see
> little on RAID 6.

As others said, RAID6 is RAID5 + a hot spare.

Basically when you UPDATE a row, at some point postgres will write the
page which contains that row.

RAID10 : write the page to all mirrors.
RAID5/6 : write the page to the relevant disk. Read the corresponding page
 from all disks (minus one), compute parity, write parity.

As you can see one small write will need to hog all drives in the array.
RAID5/6 performance for small random writes is really, really bad.

Databases like RAID10 for reads too because when you need some random data
you can get it from any of the mirrors, so you get increased parallelism
on reads too.

> with good redundancy.  My current performance is 85MB/s write, 151 MB/s
> reads

FYI, I get 200 MB/s sequential out of the software RAID5 of 3 cheap
desktop consumer SATA drives in my home multimedia server...


Re: Advice configuring ServeRAID 8k for performance

From
Craig James
Date:
On 8/5/10 11:28 AM, Kenneth Cox wrote:
> I am using PostgreSQL 8.3.7 on a dedicated IBM 3660 with 24GB RAM
> running CentOS 5.4 x86_64. I have a ServeRAID 8k controller with 6 SATA
> 7500RPM disks in RAID 6, and for the OLAP workload it feels* slow....
>  My current performance is 85MB/s write, 151 MB/s reads

I get 193MB/sec write and 450MB/sec read on a RAID10 on 8 SATA 7200 RPM disks.  RAID10 seems to scale linearly -- add
disks, get more speed, to the limit of your controller.

Craig

Re: Advice configuring ServeRAID 8k for performance

From
Scott Marlowe
Date:
On Thu, Aug 5, 2010 at 4:27 PM, Pierre C <lists@peufeu.com> wrote:
>
>> 1) Should I switch to RAID 10 for performance?  I see things like "RAID 5
>> is bad for a DB" and "RAID 5 is slow with <= 6 drives" but I see little on
>> RAID 6.
>
> As others said, RAID6 is RAID5 + a hot spare.
>
> Basically when you UPDATE a row, at some point postgres will write the page
> which contains that row.
>
> RAID10 : write the page to all mirrors.
> RAID5/6 : write the page to the relevant disk. Read the corresponding page
> from all disks (minus one), compute parity, write parity.

Actually it's not quite that bad.  You only have to read from two
disks, the data disk and the parity disk, then compute new parity and
write to both disks.  Still 2 reads / 2 writes for every write.
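That read-modify-write parity update can be sketched with plain XOR arithmetic (toy byte values, not real disk blocks):

```shell
# RAID5 small write: read old data + old parity, then
#   new_parity = old_parity XOR old_data XOR new_data
# -- no need to touch the other data disks in the stripe.
old_data=170    # 0xAA, block being overwritten
new_data=85     # 0x55, replacement block
old_parity=204  # 0xCC, parity over the whole stripe
new_parity=$(( old_parity ^ old_data ^ new_data ))
echo "$new_parity"   # 51 (0x33)
```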

> As you can see one small write will need to hog all drives in the array.
> RAID5/6 performance for small random writes is really, really bad.
>
> Databases like RAID10 for reads too because when you need some random data
> you can get it from any of the mirrors, so you get increased parallelism on
> reads too.

Also for sequential access RAID-10 can read both drives in a pair
interleaved so you get 50% of the data you need from each drive and
double the read rate there.  This is even true for linux software md
RAID.

>> with good redundancy.  My current performance is 85MB/s write, 151 MB/s
>> reads
>
> FYI, I get 200 MB/s sequential out of the software RAID5 of 3 cheap desktop
> consumer SATA drives in my home multimedia server...

On a machine NOT configured for max seq throughput (it's used for
mostly OLTP stuff) I get 325M/s both read and write speed with a 26
disk RAID-10.  OTOH, that setup gets ~6000 to 7000 transactions per
second with multi-day runs of pgbench.

Re: Advice configuring ServeRAID 8k for performance

From
Dave Crooke
Date:
Definitely switch to RAID-10 .... it's not merely that it's a fair bit faster on normal operations (less seek contention), it's **WAY** faster than any parity based RAID (RAID-2 through RAID-6) in degraded mode when you lose a disk and have to rebuild it. This is something many people don't test for, and then get bitten badly when they lose a drive under production loads.

Use higher capacity drives if necessary to make your data fit in the number of spindles your controller supports ... the difference in cost is modest compared to an overall setup, especially with SATA. Make sure you still leave at least one hot spare!

In normal operation, RAID-5 has to read and write 2 drives for every write ... not sure about RAID-6 but I suspect it needs to read the entire stripe to recalculate the Hamming parity, and it definitely has to write to 3 drives for each write, which means seeking all 3 of those drives to that position. In degraded mode (a disk rebuilding) with either of those levels, ALL the drives have to seek to that point for every write, and for any reads of the failed drive, so seek contention is horrendous.

RAID-5 and RAID-6 are designed for optimum capacity and protection at the cost of write performance, which is fine for a general file server.

Parity RAID simply isn't suitable for database use .... anyone who claims otherwise either (a) doesn't understand the failure modes of RAID, or (b) is running in a situation where performance simply doesn't matter.

Cheers
Dave

On Thu, Aug 5, 2010 at 1:28 PM, Kenneth Cox <kenstir@gmail.com> wrote:
My questions are simple, but in my reading I have not been able to find definitive answers:

1) Should I switch to RAID 10 for performance?  I see things like "RAID 5 is bad for a DB" and "RAID 5 is slow with <= 6 drives" but I see little on RAID 6.  RAID 6 was the original choice for more usable space with good redundancy.  My current performance is 85MB/s write, 151 MB/s reads (using dd of 2xRAM per http://www.westnet.com/~gsmith/content/postgresql/pg-disktesting.htm).

Re: Advice configuring ServeRAID 8k for performance

From
Scott Marlowe
Date:
On Thu, Aug 5, 2010 at 5:13 PM, Dave Crooke <dcrooke@gmail.com> wrote:
> Definitely switch to RAID-10 .... it's not merely that it's a fair bit
> faster on normal operations (less seek contention), it's **WAY** faster than
> any parity based RAID (RAID-2 through RAID-6) in degraded mode when you lose
> a disk and have to rebuild it. This is something many people don't test for,
> and then get bitten badly when they lose a drive under production loads.

Had a friend with a 600G x 5 disk RAID-5 and one drive died.  It took
nearly 48 hours to rebuild the array.

> Use higher capacity drives if necessary to make your data fit in the number
> of spindles your controller supports ... the difference in cost is modest
> compared to an overall setup, especially with SATA. Make sure you still
> leave at least one hot spare!

Yeah, a lot of chassis hold an even number of drives, and I wind up
with 2 hot spares because of it.

> Parity RAID simply isn't suitable for database use .... anyone who claims
> otherwise either (a) doesn't understand the failure modes of RAID, or (b) is
> running in a situation where performance simply doesn't matter.

The only time it's acceptable is when you're running something like
low write volume report generation / batch processing, where you're
mostly sequentially reading and writing 100s of gigabytes at a time in
one or maybe two threads.

--
To understand recursion, one must first understand recursion.

Re: Advice configuring ServeRAID 8k for performance

From
Mark Kirkwood
Date:
On 06/08/10 06:28, Kenneth Cox wrote:
> I am using PostgreSQL 8.3.7 on a dedicated IBM 3660 with 24GB RAM
> running CentOS 5.4 x86_64.  I have a ServeRAID 8k controller with 6
> SATA 7500RPM disks in RAID 6, and for the OLAP workload it feels*
> slow.  I have 6 more disks to add, and the RAID has to be rebuilt in
> any case, but first I would like to solicit general advice.  I know
> that's little data to go on, and I believe in the scientific method,
> but in this case I don't have the time to make many iterations.
>
> My questions are simple, but in my reading I have not been able to
> find definitive answers:
>
> 1) Should I switch to RAID 10 for performance?  I see things like
> "RAID 5 is bad for a DB" and "RAID 5 is slow with <= 6 drives" but I
> see little on RAID 6.  RAID 6 was the original choice for more usable
> space with good redundancy.  My current performance is 85MB/s write,
> 151 MB/s reads (using dd of 2xRAM per
> http://www.westnet.com/~gsmith/content/postgresql/pg-disktesting.htm).
>

Normally I'd agree with the others and recommend RAID10 - but you say
you have an OLAP workload - if it is *heavily* read biased you may get
better performance with RAID5 (more effective disks to read from).
Having said that, your sequential read performance right now is pretty
low (151 MB/s  - should be double this), which may point to an issue
with this controller. Unfortunately this *may* be important for an OLAP
workload (seq scans of big tables).



> 2) Should I configure the ext3 file system with noatime and/or
> data=writeback or data=ordered?  My controller has a battery, the
> logical drive has write cache enabled (write-back), and the physical
> devices have write cache disabled (write-through).
>

Probably wise to use noatime. If you have a heavy write workload (i.e so
what I just wrote above does *not* apply), then you might find adjusting
the ext3 commit interval upwards from its default of 5 seconds can help
(I'm doing some testing at the moment and commit=20 seemed to improve
performance by about 5-10%).
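That adjustment is just a mount option (mount point is illustrative; the value matches the test mentioned above):

```shell
# raise the ext3 journal commit interval from the default 5s to 20s
mount -o remount,commit=20 /var/lib/pgsql
```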

> 3) Do I just need to spend more time configuring postgresql?  My
> non-default settings were largely generated by pgtune-0.9.3:
>
>     max_locks_per_transaction = 128 # manual; avoiding "out of shared memory"
>     default_statistics_target = 100
>     maintenance_work_mem = 1GB
>     constraint_exclusion = on
>     checkpoint_completion_target = 0.9
>     effective_cache_size = 16GB
>     work_mem = 352MB
>     wal_buffers = 32MB
>     checkpoint_segments = 64
>     shared_buffers = 2316MB
>     max_connections = 32
>

Possibly higher checkpoint_segments and lower wal_buffers (I recall
someone - maybe Greg suggesting that there was no benefit in having the
latter > 10MB). I wonder about setting shared_buffers higher - how large
is the database?

Cheers

Mark


Re: Advice configuring ServeRAID 8k for performance

From
Alan Hodgson
Date:
On Thursday, August 05, 2010, Mark Kirkwood <mark.kirkwood@catalyst.net.nz>
wrote:
> Normally I'd agree with the others and recommend RAID10 - but you say
> you have an OLAP workload - if it is *heavily* read biased you may get
> better performance with RAID5 (more effective disks to read from).
> Having said that, your sequential read performance right now is pretty
> low (151 MB/s  - should be double this), which may point to an issue
> with this controller. Unfortunately this *may* be important for an OLAP
> workload (seq scans of big tables).

Probably a low (default) readahead limitation. ext3 doesn't help but it can
usually get up over 400MB/sec. Doubt it's the controller.

--
"No animals were harmed in the recording of this episode. We tried but that
damn monkey was just too fast."

Re: Advice configuring ServeRAID 8k for performance

From
Mark Kirkwood
Date:
On 06/08/10 11:58, Alan Hodgson wrote:
> On Thursday, August 05, 2010, Mark Kirkwood<mark.kirkwood@catalyst.net.nz>
> wrote:
>
>> Normally I'd agree with the others and recommend RAID10 - but you say
>> you have an OLAP workload - if it is *heavily* read biased you may get
>> better performance with RAID5 (more effective disks to read from).
>> Having said that, your sequential read performance right now is pretty
>> low (151 MB/s  - should be double this), which may point to an issue
>> with this controller. Unfortunately this *may* be important for an OLAP
>> workload (seq scans of big tables).
>>
> Probably a low (default) readahead limitation. ext3 doesn't help but it can
> usually get up over 400MB/sec. Doubt it's the controller.
>
>

Yeah - good suggestion, so cranking up readahead (man blockdev) and
retesting is recommended.
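Concretely (device name illustrative; readahead is measured in 512-byte sectors):

```shell
# check the current readahead -- the default of 256 sectors is only 128kB
blockdev --getra /dev/sda
# raise it to 2MB, then rerun the dd read test
blockdev --setra 4096 /dev/sda
```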

Cheers

Mark

Re: Advice configuring ServeRAID 8k for performance

From
Mark Kirkwood
Date:
On 06/08/10 12:31, Mark Kirkwood wrote:
> On 06/08/10 11:58, Alan Hodgson wrote:
>> On Thursday, August 05, 2010, Mark
>> Kirkwood<mark.kirkwood@catalyst.net.nz>
>> wrote:
>>> Normally I'd agree with the others and recommend RAID10 - but you say
>>> you have an OLAP workload - if it is *heavily* read biased you may get
>>> better performance with RAID5 (more effective disks to read from).
>>> Having said that, your sequential read performance right now is pretty
>>> low (151 MB/s  - should be double this), which may point to an issue
>>> with this controller. Unfortunately this *may* be important for an OLAP
>>> workload (seq scans of big tables).
>> Probably a low (default) readahead limitation. ext3 doesn't help but
>> it can
>> usually get up over 400MB/sec. Doubt it's the controller.
>>
>
> Yeah - good suggestion, so cranking up readahead (man blockdev) and
> retesting is recommended.
>
>

... sorry, it just occurred to me to wonder about the stripe or chunk
size used in the array, as making this too small can also severely
hamper sequential performance.

Cheers

Mark

Re: Advice configuring ServeRAID 8k for performance

From
Matthew Wakeling
Date:
On Thu, 5 Aug 2010, Scott Marlowe wrote:
> RAID6 is basically RAID5 with a hot spare already built into the
> array.

On Fri, 6 Aug 2010, Pierre C wrote:
> As others said, RAID6 is RAID5 + a hot spare.

No. RAID6 is NOT RAID5 plus a hot spare.

RAID5 uses a single parity datum (XOR) to ensure protection against data
loss if one drive fails.

RAID6 uses two different sets of parity (Reed-Solomon) to ensure
protection against data loss if two drives fail simultaneously.

If you have a RAID5 set with a hot spare, and you lose two drives, then
you have data loss. If the same happens to a RAID6 set, then there is no
data loss.

Matthew

--
 And the lexer will say "Oh look, there's a null string. Oooh, there's
 another. And another.", and will fall over spectacularly when it realises
 there are actually rather a lot.
         - Computer Science Lecturer (edited)

Re: Advice configuring ServeRAID 8k for performance

From
Scott Marlowe
Date:
On Fri, Aug 6, 2010 at 3:17 AM, Matthew Wakeling <matthew@flymine.org> wrote:
> On Thu, 5 Aug 2010, Scott Marlowe wrote:
>>
>> RAID6 is basically RAID5 with a hot spare already built into the
>> array.
>
> On Fri, 6 Aug 2010, Pierre C wrote:
>>
>> As others said, RAID6 is RAID5 + a hot spare.
>
> No. RAID6 is NOT RAID5 plus a hot spare.

The original phrase was that RAID 6 was like RAID 5 with a hot spare
ALREADY BUILT IN.

Re: Advice configuring ServeRAID 8k for performance

From
Scott Marlowe
Date:
On Fri, Aug 6, 2010 at 11:32 AM, Justin Pitts <justinpitts@gmail.com> wrote:
>>>> As others said, RAID6 is RAID5 + a hot spare.
>>>
>>> No. RAID6 is NOT RAID5 plus a hot spare.
>>
>> The original phrase was that RAID 6 was like RAID 5 with a hot spare
>> ALREADY BUILT IN.
>
> Built-in, or not - it is neither. It is more than that, actually. RAID
> 6 is like RAID 5 in that it uses parity for redundancy and pays a
> write cost for maintaining those parity blocks, but will maintain data
> integrity in the face of 2 simultaneous drive failures.

Yes, I know that.  I am very familiar with how RAID6 works.  RAID5
with the hot spare already rebuilt / built in is a good enough answer
for management where big words like parity might scare some PHBs.

> In terms of storage cost, it IS like paying for RAID5 + a hot spare,
> but the protection is better.
>
> A RAID 5 with a hot spare built in could not survive 2 simultaneous
> drive failures.

Exactly.  Which is why I had said with the hot spare already built in
/ rebuilt.  Geeze, pedant much?


--
To understand recursion, one must first understand recursion.

Re: Advice configuring ServeRAID 8k for performance

From
Justin Pitts
Date:
> Yes, I know that.  I am very familiar with how RAID6 works.  RAID5
> with the hot spare already rebuilt / built in is a good enough answer
> for management where big words like parity might scare some PHBs.
>
>> In terms of storage cost, it IS like paying for RAID5 + a hot spare,
>> but the protection is better.
>>
>> A RAID 5 with a hot spare built in could not survive 2 simultaneous
>> drive failures.
>
> Exactly.  Which is why I had said with the hot spare already built in
> / rebuilt.

My apologies. The 'rebuilt' slant escaped me. That's a fair way to cast it.

> Geeze, pedant much?

Of course!

Re: Advice configuring ServeRAID 8k for performance

From
Scott Carey
Date:
On Aug 5, 2010, at 4:09 PM, Scott Marlowe wrote:

> On Thu, Aug 5, 2010 at 4:27 PM, Pierre C <lists@peufeu.com> wrote:
>>
>>> 1) Should I switch to RAID 10 for performance?  I see things like "RAID 5
>>> is bad for a DB" and "RAID 5 is slow with <= 6 drives" but I see little on
>>> RAID 6.
>>
>> As others said, RAID6 is RAID5 + a hot spare.
>>
>> Basically when you UPDATE a row, at some point postgres will write the page
>> which contains that row.
>>
>> RAID10 : write the page to all mirrors.
>> RAID5/6 : write the page to the relevant disk. Read the corresponding page
>> from all disks (minus one), compute parity, write parity.
>
> Actually it's not quite that bad.  You only have to read from two
> disks, the data disk and the parity disk, then compute new parity and
> write to both disks.  Still 2 reads / 2 writes for every write.
>
>> As you can see one small write will need to hog all drives in the array.
>> RAID5/6 performance for small random writes is really, really bad.
>>
>> Databases like RAID10 for reads too because when you need some random data
>> you can get it from any of the mirrors, so you get increased parallelism on
>> reads too.
>
> Also for sequential access RAID-10 can read both drives in a pair
> interleaved so you get 50% of the data you need from each drive and
> double the read rate there.  This is even true for linux software md
> RAID.


My experience is that it is ONLY true for software RAID and ZFS.  Most hardware raid controllers read both mirrors and
validate that the data is equal, and thus writing is about as fast as read.  Tested with Adaptec, 3Ware, Dell PERC
4/5/6, and LSI MegaRaid hardware wise.  In all cases it was clear that the hardware raid was not using data from the two
mirrors to improve read performance for sequential or random I/O.
>
>>> with good redundancy.  My current performance is 85MB/s write, 151 MB/s
>>> reads
>>
>> FYI, I get 200 MB/s sequential out of the software RAID5 of 3 cheap desktop
>> consumer SATA drives in my home multimedia server...
>
> On a machine NOT configured for max seq throughput (it's used for
> mostly OLTP stuff) I get 325M/s both read and write speed with a 26
> disk RAID-10.  OTOH, that setup gets ~6000 to 7000 transactions per
> second with multi-day runs of pgbench.
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance


Re: Advice configuring ServeRAID 8k for performance

From
Scott Marlowe
Date:
On Sun, Aug 8, 2010 at 12:46 AM, Scott Carey <scott@richrelevance.com> wrote:
>
> On Aug 5, 2010, at 4:09 PM, Scott Marlowe wrote:
>
>> On Thu, Aug 5, 2010 at 4:27 PM, Pierre C <lists@peufeu.com> wrote:
>>>
>>>> 1) Should I switch to RAID 10 for performance?  I see things like "RAID 5
>>>> is bad for a DB" and "RAID 5 is slow with <= 6 drives" but I see little on
>>>> RAID 6.
>>>
>>> As others said, RAID6 is RAID5 + a hot spare.
>>>
>>> Basically when you UPDATE a row, at some point postgres will write the page
>>> which contains that row.
>>>
>>> RAID10 : write the page to all mirrors.
>>> RAID5/6 : write the page to the relevant disk. Read the corresponding page
>>> from all disks (minus one), compute parity, write parity.
>>
>> Actually it's not quite that bad.  You only have to read from two
>> disks, the data disk and the parity disk, then compute new parity and
>> write to both disks.  Still 2 reads / 2 writes for every write.
>>
>>> As you can see one small write will need to hog all drives in the array.
>>> RAID5/6 performance for small random writes is really, really bad.
>>>
>>> Databases like RAID10 for reads too because when you need some random data
>>> you can get it from any of the mirrors, so you get increased parallelism on
>>> reads too.
>>
>> Also for sequential access RAID-10 can read both drives in a pair
>> interleaved so you get 50% of the data you need from each drive and
>> double the read rate there.  This is even true for linux software md
>> RAID.
>
>
> My experience is that it is ONLY true for software RAID and ZFS.  Most hardware raid controllers read both mirrors
> and validate that the data is equal, and thus writing is about as fast as read.  Tested with Adaptec, 3Ware, Dell
> PERC 4/5/6, and LSI MegaRaid hardware wise.  In all cases it was clear that the hardware raid was not using data from
> the two mirrors to improve read performance for sequential or random I/O.

Interesting.  I'm using an Areca, I'll have to run some tests and see
if a mirror is reading at > 100% read speed of a single drive or not.

Re: Advice configuring ServeRAID 8k for performance

From
Justin Pitts
Date:
>>> As others said, RAID6 is RAID5 + a hot spare.
>>
>> No. RAID6 is NOT RAID5 plus a hot spare.
>
> The original phrase was that RAID 6 was like RAID 5 with a hot spare
> ALREADY BUILT IN.

Built-in, or not - it is neither. It is more than that, actually. RAID
6 is like RAID 5 in that it uses parity for redundancy and pays a
write cost for maintaining those parity blocks, but will maintain data
integrity in the face of 2 simultaneous drive failures.

In terms of storage cost, it IS like paying for RAID5 + a hot spare,
but the protection is better.

A RAID 5 with a hot spare built in could not survive 2 simultaneous
drive failures.

Re: Advice configuring ServeRAID 8k for performance

From
Bruce Momjian
Date:
Greg Smith wrote:
> > 2) Should I configure the ext3 file system with noatime and/or
> > data=writeback or data=ordered?  My controller has a battery, the
> > logical drive has write cache enabled (write-back), and the physical
> > devices have write cache disabled (write-through).
>
> data=ordered is the ext3 default and usually a reasonable choice.  Using
> writeback instead can be dangerous, I wouldn't advise starting there.
> noatime is certainly a good thing, but the speedup is pretty minor if
> you have a battery-backed write cache.

We recommend 'data=writeback' for ext3 in our docs:

    http://www.postgresql.org/docs/9.0/static/wal-intro.html

    Tip:  Because WAL restores database file contents after a crash,
    journaled file systems are not necessary for reliable storage of the
    data files or WAL files. In fact, journaling overhead can reduce
    performance, especially if journaling causes file system data  to be
    flushed to disk. Fortunately, data flushing during journaling can often
    be disabled with a file system mount option, e.g. data=writeback on a
    Linux ext3 file system. Journaled file systems do improve boot speed
    after a crash.

Should this be changed?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: Advice configuring ServeRAID 8k for performance

From
Greg Smith
Date:
Bruce Momjian wrote:
> We recomment 'data=writeback' for ext3 in our docs
>

Only for the WAL though, which is fine, and I think that's spelled out
clearly enough in the doc section you quoted.  Ken's system has one big RAID
volume, which means he'd be mounting the data files with 'writeback'
too; that's the thing to avoid.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us


Re: Advice configuring ServeRAID 8k for performance

From
Scott Carey
Date:
Don't ever have WAL and data on the same OS volume as ext3.

If data=writeback, performance will be fine, data integrity will be ok for
WAL, but data integrity will not be sufficient for the data partition.
If data=ordered, performance will be very bad, but data integrity will be OK.

This is because an fsync on ext3 flushes _all dirty pages in the file
system_ to disk, not just those for the file being fsync'd.

One partition for WAL, one for data.  If using ext3 this is essentially a
performance requirement no matter how your array is set up underneath.
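A sketch of what that split might look like in /etc/fstab -- the device
names and mount points here are hypothetical, and with PostgreSQL 8.3 the
pg_xlog directory would then be symlinked onto the WAL mount:

```
# /etc/fstab fragment -- hypothetical devices and mount points
# WAL: data=writeback is acceptable here because the WAL's own
# checksums and write ordering protect it.
/dev/sdb1  /pg/wal   ext3  noatime,data=writeback  0 0
# Data files: keep the safer data=ordered default.
/dev/sdc1  /pg/data  ext3  noatime,data=ordered    0 0
```

With that layout a data-file fsync only has to flush the dirty pages of
the data filesystem, not the WAL's.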

On Aug 13, 2010, at 11:41 AM, Greg Smith wrote:

> Bruce Momjian wrote:
>> We recommend 'data=writeback' for ext3 in our docs
>>
>
> Only for the WAL though, which is fine, and I think spelled out clearly
> enough in the doc section you quoted.  Ken's system has one big RAID
> volume, which means he'd be mounting the data files with 'writeback'
> too; that's the thing to avoid.
>
> --
> Greg Smith  2ndQuadrant US  Baltimore, MD
> PostgreSQL Training, Services and Support
> greg@2ndQuadrant.com   www.2ndQuadrant.us
>
>


Re: Advice configuring ServeRAID 8k for performance

From
Greg Smith
Date:
Scott Carey wrote:
> This is because an fsync on ext3 flushes _all dirty pages in the file
> system_ to disk, not just those for the file being fsync'd.
> One partition for WAL, one for data.  If using ext3 this is essentially a
> performance requirement no matter how your array is set up underneath.
>

Unless you want the opposite of course.  Some systems split out the WAL
onto a second disk, only to discover checkpoint I/O spikes become a
problem all of the sudden after that.  The fsync calls for the WAL
writes keep the write cache for the data writes from ever getting too
big.  This slows things down on average, but makes the worst case less
stressful.  Free lunches are so hard to find nowadays...

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us


Re: Advice configuring ServeRAID 8k for performance

From
Andres Freund
Date:
On Mon, Aug 16, 2010 at 01:46:21PM -0400, Greg Smith wrote:
> Scott Carey wrote:
> >This is because an fsync on ext3 flushes _all dirty pages in the file
> >system_ to disk, not just those for the file being fsync'd.
> >One partition for WAL, one for data.  If using ext3 this is
> >essentially a performance requirement no matter how your array is
> >set up underneath.
>
> Unless you want the opposite of course.  Some systems split out the
> WAL onto a second disk, only to discover checkpoint I/O spikes
> become a problem all of the sudden after that.  The fsync calls for
> the WAL writes keep the write cache for the data writes from ever
> getting too big.  This slows things down on average, but makes the
> worst case less stressful.  Free lunches are so hard to find
> nowadays...
Or use -o sync. Or configure a ridiculously low dirty_memory amount
(which has a problem on large systems because 1% can still be too
much. Argh.)...
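For reference, the coarse 1% granularity Andres is complaining about is
why later kernels (2.6.29 and up) grew byte-based knobs alongside the
percentage ones.  A sketch of the relevant sysctl settings -- the values
are illustrative, not recommendations:

```
# /etc/sysctl.conf fragment -- illustrative values only
# Percentage knobs: on a 24GB box, 1% is still ~240MB of dirty data.
# vm.dirty_background_ratio = 1
# vm.dirty_ratio = 2
# Byte-based knobs (kernel 2.6.29+) allow finer control:
vm.dirty_background_bytes = 67108864    # start background writeback at 64MB
vm.dirty_bytes = 268435456              # block writers at 256MB dirty
```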

Andres

Re: Advice configuring ServeRAID 8k for performance

From
Greg Smith
Date:
Andres Freund wrote:
> Or use -o sync. Or configure a ridiculously low dirty_memory amount
> (which has a problem on large systems because 1% can still be too
> much. Argh.)...
>

-o sync completely trashes performance, and trying to set the
dirty_ratio values to even 1% doesn't really work due to things like the
"congestion avoidance" code in the kernel.  If you sync a lot more
often, which putting the WAL on the same disk as the database
accidentally does for you, that works surprisingly well at avoiding this
whole class of problem on ext3.  A really good solution is going to take
a full rewrite of the PostgreSQL checkpoint logic though, which will get
sorted out during 9.1 development.  (cue dramatic foreshadowing music here)

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us


Re: Advice configuring ServeRAID 8k for performance

From
Andres Freund
Date:
On Mon, Aug 16, 2010 at 04:13:22PM -0400, Greg Smith wrote:
> Andres Freund wrote:
> >Or use -o sync. Or configure a ridiculously low dirty_memory amount
> >(which has a problem on large systems because 1% can still be too
> >much. Argh.)...
>
> -o sync completely trashes performance, and trying to set the
> dirty_ratio values to even 1% doesn't really work due to things like
> the "congestion avoidance" code in the kernel.  If you sync a lot
> more often, which putting the WAL on the same disk as the database
> accidentally does for you, that works surprisingly well at avoiding
> this whole class of problem on ext3.  A really good solution is
> going to take a full rewrite of the PostgreSQL checkpoint logic
> though, which will get sorted out during 9.1 development.  (cue
> dramatic foreshadowing music here)
-o sync works well enough for the data partition (surely not the WAL) if
you make the background writer less aggressive.

But yes. A new checkpointing logic + a new syncing logic
(prepare_fsync() earlier and then fsync() later) would be a nice
thing. Do you plan to work on that?

Andres

Re: Advice configuring ServeRAID 8k for performance

From
Greg Smith
Date:
Andres Freund wrote:
> A new checkpointing logic + a new syncing logic
> (prepare_fsync() earlier and then fsync() later) would be a nice
> thing. Do you plan to work on that?
>

The background writer already caches fsync calls into a queue, so
the prepare step you're thinking of is already there.  The
problem is that the actual fsync calls happen in a tight loop.  That's
what we're busy fixing.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us


Re: Advice configuring ServeRAID 8k for performance

From
Andres Freund
Date:
On Mon, Aug 16, 2010 at 04:54:19PM -0400, Greg Smith wrote:
> Andres Freund wrote:
> >A new checkpointing logic + a new syncing logic
> >(prepare_fsync() earlier and then fsync() later) would be a nice
> >thing. Do you plan to work on that?
> The background writer already caches fsync calls into a queue, so
> the prepare step you're thinking of is already there.  The
> problem is that the actual fsync calls happen in a tight loop.  That's
> what we're busy fixing.
That doesn't help that much on many systems with a somewhat deep
queue. An fsync() equals a barrier, so it has the effect of stopping
reordering around it - especially on systems with larger multi-disk
arrays that's pretty expensive.
You can achieve surprising speedups, at least in my experience, by
forcing the kernel to start writing out pages *without enforcing
barriers* first and then later enforcing a barrier to be sure it's
actually written out. Which, in a simplified case, turns the earlier
needed multiple barriers into a single one (in practice you want to
call fsync() anyway, but that's not a big problem if it's already
written out).

Andres

Re: Advice configuring ServeRAID 8k for performance

From
Bruce Momjian
Date:
Scott Carey wrote:
> Don't ever have WAL and data on the same OS volume as ext3.
>
> If data=writeback, performance will be fine, data integrity will be ok
> for WAL, but data integrity will not be sufficient for the data
> partition.  If data=ordered, performance will be very bad, but data
> integrity will be OK.
>
> This is because an fsync on ext3 flushes _all dirty pages in the file
> system_ to disk, not just those for the file being fsync'd.
>
> One partition for WAL, one for data.  If using ext3 this is essentially
> a performance requirement no matter how your array is set up underneath.

Do we need to document this?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: Advice configuring ServeRAID 8k for performance

From
Greg Smith
Date:
Andres Freund wrote:
> An fsync() equals a barrier, so it has the effect of stopping
> reordering around it - especially on systems with larger multi-disk
> arrays that's pretty expensive.
> You can achieve surprising speedups, at least in my experience, by
> forcing the kernel to start writing out pages *without enforcing
> barriers* first and then later enforcing a barrier to be sure it's
> actually written out.

Standard practice on high performance systems with good filesystems and
a battery-backed controller is to turn off barriers anyway.  That's one
of the first things to tune on XFS for example, when you have a reliable
controller.  I don't have enough data on ext4 to comment on tuning for
it yet.

The sole purpose for the whole Linux write barrier implementation in my
world is to flush the drive's cache, when the database does writes onto
cheap SATA drives that will otherwise cache dangerously.  Barriers don't
have any place on a serious system that I can see.  The battery-backed
RAID controller you have to use to make fsync calls fast can do
some simple write reordering, but the operating system doesn't ever have
enough visibility into what it's doing to make intelligent decisions
about that anyway.
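On XFS that tuning is just a mount option.  A sketch, assuming a
controller with a working battery-backed write cache (device and mount
point hypothetical):

```
# Only safe when a battery/flash-backed controller cache protects writes!
mount -o noatime,nobarrier /dev/sdb1 /pg/data
```

Without the protected cache, nobarrier trades away crash safety for
speed, which is exactly the wrong deal for a database.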

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us


Re: Advice configuring ServeRAID 8k for performance

From
Greg Smith
Date:
Bruce Momjian wrote:
> Scott Carey wrote:
> > Don't ever have WAL and data on the same OS volume as ext3.
> >
> > ...
> > One partition for WAL, one for data.  If using ext3 this is essentially
> > a performance requirement no matter how your array is set up underneath.
>
> Do we need to document this?

Not for 9.0.  What Scott is suggesting is often the case, but not always;
I can produce a counter example at will now that I know exactly which
closets have the skeletons in them.  The underlying situation is more
complicated due to some limitations in the whole "spread checkpoint" code
that is turning really sour on newer hardware with large amounts of RAM.
I have about 5 pages of written notes on this specific issue so far, and
that keeps growing every week.  That's all leading toward a proposed 9.1
change to the specific fsync behavior, and I expect to dump a large stack
of documentation supporting that patch that will address this whole area.
I'll put the whole thing onto the wiki as soon as my 9.0 related work
settles down.

-- 
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us

Re: Advice configuring ServeRAID 8k for performance

From
Andres Freund
Date:
On Tuesday 17 August 2010 10:29:10 Greg Smith wrote:
> Andres Freund wrote:
> > An fsync() equals a barrier so it has the effect of stopping
> > reordering around it - especially on systems with larger multi-disk
> > arrays thats pretty expensive.
> > You can achieve surprising speedups, at least in my experience, by
> > forcing the kernel to start writing out pages *without enforcing
> > barriers* first and then later enforce a barrier to be sure its
> > actually written out.
>
> Standard practice on high performance systems with good filesystems and
> a battery-backed controller is to turn off barriers anyway.  That's one
> of the first things to tune on XFS for example, when you have a reliable
> controller.  I don't have enough data on ext4 to comment on tuning for
> it yet.
>
> The sole purpose for the whole Linux write barrier implementation in my
> world is to flush the drive's cache, when the database does writes onto
> cheap SATA drives that will otherwise cache dangerously.  Barriers don't
> have any place on a serious system that I can see.  The battery-backed
> RAID controller you have to use to make fsync calls fast can do
> some simple write reordering, but the operating system doesn't ever have
> enough visibility into what it's doing to make intelligent decisions
> about that anyway.
Even if we're not talking about a write barrier in an "ensure it's written
out of the cache" way, it still stops the io-scheduler from reordering. I
benchmarked it (custom app) and it was very noticeable on a bunch of
different systems (with a good BBU'd RAID).

Andres