Thread: Optimal settings for RAID controller - optimized for writes
Hi, I’m kind of a noob when it comes to setting up RAID controllers and tweaking them, so I need some advice here.

I’m just about to set up my newly rented Dell R720 12th-gen server. It’s running a single Intel Xeon E5-2620 v2 processor and 64GB ECC RAM. I have installed 8x 300GB SSDs in it. It has a PERC H710 RAID controller (based on the LSI SAS 2208 dual-core ROC).

Now my database should be optimized for writing. UPDATEs are by far my biggest bottleneck.

Firstly: Should I just put all 8 drives in one single RAID 10 array, or would it be better to have six of them in one RAID 10 array, and then the remaining two in a separate RAID 1 array, e.g. for having the WAL log dir on its own drives?

Secondly: What settings should I pay attention to when setting this up, if I want it to have optimal write performance (cache behavior, write back etc.)?

THANKS!
Hi,
I configured a similar architecture some months ago, and this is the best choice I found after some pgbench and Bonnie++ tests.
Server: DELL R720d
CPU: dual Xeon 8-core
RAM: 32GB ECC
Controller PERC H710
Disks:
2xSSD (MLC) Raid1 for Operating System (CentOS 6.4)
4xSSD (SLC) Raid10 for WAL archive and a dedicated "fast tablespace", where we have most UPDATE actions (+ Hot spare).
10xHDD 15kRPM Raid5 for "default tablespace" (optimized for space, instead of Raid10) (+ Hot spare).
Our application does more than 200 UPDATEs/sec (on the "fast tablespace") and writes more than 15GB of records per day (on the "default tablespace").
After the testing phase I came to the following conclusions:
4xSSD (SLC) RAID 10 vs. 10xHDD RAID 5: comparable I/O performance in sequential read and write, but much better performance on random access (obviously!!). BUT, as far as I know, the PostgreSQL I/O processes are not heavily involved in random I/O, so at the same price I would prefer to buy 10xHDD instead of 4xSSD (SLC), getting about 10x the available space for the same money!!
10xHDD RAID 10 vs. 10xHDD RAID 5: with Bonnie++ I noticed a very small difference in I/O performance, so I decided to use RAID 5 + a dedicated hot spare instead of RAID 10.
If I could go back, I would have spent the money for the SLC drives on more HDDs.
regards.
2014-02-17 16:03 GMT+01:00 Niels Kristian Schjødt <nielskristian@autouncle.com>:
Hi,
I’m kind of a noob when it comes to setting up RAID controllers and tweaking them so I need some advice here.
I’m just about to set up my newly rented Dell R720 12th-gen server. It’s running a single Intel Xeon E5-2620 v2 processor and 64GB ECC RAM. I have installed 8x 300GB SSDs in it. It has a PERC H710 RAID controller (based on the LSI SAS 2208 dual-core ROC).
Now my database should be optimized for writing. UPDATEs are by far my biggest bottleneck.
Firstly: Should I just put all 8 drives in one single RAID 10 array, or would it be better to have six of them in one RAID 10 array, and then the remaining two in a separate RAID 1 array, e.g. for having the WAL log dir on its own drives?
Secondly: What settings should I pay attention to when setting this up, if I want it to have optimal write performance (cache behavior, write back etc.)?
THANKS!
--
On 17 February 2014, 16:03, Niels Kristian Schjødt wrote:
> Now my database should be optimized for writing. UPDATEs are by far my
> biggest bottleneck.

I think it's pretty difficult to answer this without a clear idea of how much data the UPDATEs modify, etc. Locating the data may require a lot of reads too.

> Firstly: Should I just put all 8 drives in one single RAID 10 array, or
> would it be better to have six of them in one RAID 10 array, and then
> the remaining two in a separate RAID 1 array, e.g. for having the WAL
> log dir on its own drives?

This used to be done to separate WAL and data files onto separate disks, as the workloads are very different (WAL is almost entirely sequential writes, access to data files is often random). With spinning drives this made a huge difference, as the WAL drives were doing just sequential writes, but with SSDs it's not that important anymore. If you can do some testing, do it. I'd probably create a RAID 10 on all 8 disks.

> Secondly: What settings should I pay attention to when setting this up,
> if I want it to have optimal write performance (cache behavior, write
> back etc.)?

I'm wondering whether the controller (H710) actually handles TRIM well or not. I know a lot of hardware controllers tend not to pass TRIM to the drives, which results in poor write performance after some time, but I've been unable to google anything about TRIM on the H710.

Tomas
The thing is, it's difficult to transfer these experiences without a clear idea of the workloads.

For example, I wouldn't say 200 updates/second is a write-heavy workload. A single 15k drive should handle that just fine, assuming the data fit into RAM (which seems to be the case, but maybe I got that wrong).

Niels, what amounts of data are we talking about? What is the total database size? How much data are you updating? Are those updates random, or are you updating a lot of data in a sequential manner? How did you determine that UPDATEs are the bottleneck?

Tomas

On 17 February 2014, 16:29, DFE wrote:
> Our application does more than 200 UPDATEs/sec (on the "fast tablespace")
> and writes more than 15GB of records per day (on the "default tablespace").
Hi,

I don't have a PERC H710 RAID controller, but I think he would like to know which RAID striping/chunk size and which read/write cache ratio in the write-back cache setting are best. I'd like to know that too :)

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

(2014/02/18 0:54), Tomas Vondra wrote:
> The thing is, it's difficult to transfer these experiences without a
> clear idea of the workloads.
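For reference, the H710 is an LSI-based controller, so the cache and stripe settings being asked about can usually be inspected from the OS. A minimal sketch, assuming the LSI MegaCli64 tool (or Dell's perccli equivalent) is installed at its usual path; exact flags and paths vary by version, so treat this as illustrative only:

    # show logical drives: strip size, current cache policy, state
    /opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -LAll -aAll

    # show just the cache policy (WriteBack/WriteThrough, ReadAhead, Direct/Cached)
    /opt/MegaRAID/MegaCli/MegaCli64 -LDGetProp -Cache -LAll -aAll

    # enable write-back caching (only safe with a healthy BBU / flash-backed cache)
    /opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp WB -LAll -aAll

    # disable controller read-ahead and use direct I/O for reads
    /opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp NORA -LAll -aAll
    /opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp -Direct -LAll -aAll

    # check the battery before trusting write-back
    /opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -GetBbuStatus -aAll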
On Mon, Feb 17, 2014 at 8:03 AM, Niels Kristian Schjødt <nielskristian@autouncle.com> wrote:
> Firstly: Should I just put all 8 drives in one single RAID 10 array, or
> would it be better to have six of them in one RAID 10 array, and then
> the remaining two in a separate RAID 1 array, e.g. for having the WAL
> log dir on its own drives?
>
> Secondly: What settings should I pay attention to when setting this up,
> if I want it to have optimal write performance (cache behavior, write
> back etc.)?

Pick the base configuration that's the simplest, i.e. all 8 in a RAID-10. Benchmark it to get a baseline, using a load similar to your own. You can use pgbench's ability to run scripts to make some pretty realistic benchmarks. Once you've got your baseline, start experimenting. If you can't prove that moving two drives to a RAID-1 for xlog makes it faster, then don't do it.

Recently I was testing MLC FusionIO cards (1.2TB), and no matter how I sliced things up, one big partition was just as fast as or faster than any other configuration (separate spinners for xlog, etc.) I could come up with. On this machine, sequential I/O to a RAID-1 pair of those was ~1GB/s. Random access during various pgbench runs was limited to ~200MB/s random throughput. Moving half of that (pg_xlog) onto other media didn't make things any faster and just made setup more complicated.

I'll be testing 6x 600GB SSDs in the next few weeks under an LSI card, and we'll likely have a spinning-drive RAID-1 for pg_xlog there, at least to compare. If you want, I can post what I see from that benchmark next week.

So how many updates/second do you need to push through this thing?
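A minimal sketch of that baseline-then-experiment approach with pgbench on a 9.x-era cluster; the scale factor, client counts and the custom UPDATE script below are placeholders to be adapted to the real workload:

    # build a test database roughly sized to the real one (scale 1000 ~ 15GB)
    createdb pgbench
    pgbench -i -s 1000 pgbench

    # baseline: standard TPC-B-like run, 10 minutes, 32 clients
    pgbench -c 32 -j 8 -T 600 pgbench

    # something closer to an UPDATE-heavy workload: custom script
    cat > update_only.sql <<'EOF'
    \setrandom aid 1 100000000
    UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid = :aid;
    EOF
    pgbench -n -c 32 -j 8 -T 600 -f update_only.sql pgbench

    # re-run the same commands after each controller/array change and compare tps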
On Mon, Feb 17, 2014 at 04:29:10PM +0100, DFE wrote:
> 2xSSD (MLC) Raid1 for Operating System (CentOS 6.4)
> 4xSSD (SLC) Raid10 for WAL archive and a dedicated "fast tablespace",
> where we have most UPDATE actions (+ Hot spare).
> 10xHDD 15kRPM Raid5 for "default tablespace" (optimized for space,
> instead of Raid10) (+ Hot spare).

That's basically backwards. The WAL is basically a sequential write-only workload, and there's generally no particular advantage to having it on an SSD, unless you've got a workload that's writing WAL faster than the sequential write speed of your spinning disk array (fairly unusual). Putting indexes on the SSD and the WAL on the spinning disks would probably give more bang for the buck.

One thing I've found to help performance in some workloads is to move the xlog to a simple ext2 partition. There's no reason for that data to be on a journaled fs, and isolating it can keep the synchronous xlog operations from interfering with the async table operations. (E.g., it enforces separate per-filesystem queues, metadata flushes, etc.; note that there will be essentially no metadata updates on the xlog if there are sufficient log segments allocated.)

Mike Stone
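A minimal sketch of moving pg_xlog onto a dedicated ext2 partition as described above. The device name, service name and data directory are placeholders for whatever the installation actually uses, and the server must be stopped while the files are moved:

    # format the dedicated partition as ext2 (no journal) and mount it
    mkfs.ext2 /dev/sdb1
    mkdir -p /pg_xlog
    mount -o noatime /dev/sdb1 /pg_xlog
    echo '/dev/sdb1 /pg_xlog ext2 noatime 0 2' >> /etc/fstab

    # relocate the WAL directory and leave a symlink behind
    service postgresql stop
    mv /var/lib/pgsql/data/pg_xlog/* /pg_xlog/
    rmdir /var/lib/pgsql/data/pg_xlog
    ln -s /pg_xlog /var/lib/pgsql/data/pg_xlog
    chown -R postgres:postgres /pg_xlog
    service postgresql start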
On 18.2.2014 02:23, KONDO Mitsumasa wrote:
> I don't have a PERC H710 RAID controller, but I think he would like to
> know which RAID striping/chunk size and which read/write cache ratio in
> the write-back cache setting are best. I'd like to know that too :)

We do have dozens of H710 controllers, but not with SSDs. I've been unable to find reliable answers about how it handles TRIM, and how that works with wearout reporting (using SMART).

The stripe size is actually a very good question. On spinning drives it usually does not matter too much - unless you have a very specialized workload, the 'medium size' is the right choice (AFAIK we're using 64kB on the H710, which is the default).

With SSDs this might actually matter much more, as SSDs work with "erase blocks" (mostly 512kB), and I suspect using a small stripe might result in repeated writes to the same block - overwriting one block repeatedly and thus increasing wearout. But maybe the controller will handle that just fine, e.g. by coalescing the writes and sending them to the drive as a single write. Or maybe the drive can do that in its local write cache (all SSDs have that).

The other thing is filesystem alignment - a few years ago this was a major issue causing poor write performance. Nowadays this works fine, most tools are able to create partitions properly aligned to the 512kB automatically. But if the controller discards this information, it might be worth messing with the stripe size a bit to get it right.

But those are mostly speculations / curious questions I've been asking myself recently, as we've been considering SSDs with the H710/H710p too.

As for the controller cache - my opinion is that using this for caching reads is just plain wrong. If you need to cache reads, buy more RAM - it's much cheaper, so you can buy more of it. Cache on the controller (with a BBU) is designed especially for caching writes safely. (And maybe it could help with some of the erase-block issues too?)

regards
Tomas
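A small sketch of how partition alignment against the stripe / erase-block size might be checked on a recent Linux; the device name is a placeholder, and the values reported depend on what the controller exposes to the kernel:

    # what the block layer reports as preferred I/O sizes / alignment
    cat /sys/block/sda/queue/minimum_io_size
    cat /sys/block/sda/queue/optimal_io_size
    cat /sys/block/sda/alignment_offset

    # ask parted whether partition 1 starts on an optimally aligned boundary;
    # partitions started at a 1MiB boundary are aligned to both 64kB stripes
    # and 512kB erase blocks
    parted /dev/sda align-check optimal 1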
(2014/02/19 5:41), Tomas Vondra wrote:
> The stripe size is actually a very good question. On spinning drives it
> usually does not matter too much - unless you have a very specialized
> workload, the 'medium size' is the right choice (AFAIK we're using 64kB
> on the H710, which is the default).

It is interesting that the RAID stripe size of the PERC H710 is 64kB. On HP RAID cards the default chunk size is 256kB; if we use two disks in RAID 0, the stripe size will be 512kB. I thought that might be too big, but perhaps it is optimized inside the RAID card - in practice it isn't bad with those settings. I'm interested in the RAID card's internal behavior. Fortunately, the Linux RAID card driver is open source, so we might have a look at the source code when we have time.

> With SSDs this might actually matter much more, as SSDs work with "erase
> blocks" (mostly 512kB), and I suspect using a small stripe might result
> in repeated writes to the same block - overwriting one block repeatedly
> and thus increasing wearout. But maybe the controller will handle that
> just fine, e.g. by coalescing the writes and sending them to the drive
> as a single write. Or maybe the drive can do that in its local write
> cache (all SSDs have that).

I have heard that a genuine RAID card with genuine SSDs is optimized for those SSDs. It is important to use SSDs that are compatible with the card for performance; in the worst case the lifetime of the SSD will be short and performance will be bad.

> As for the controller cache - my opinion is that using this for caching
> reads is just plain wrong. If you need to cache reads, buy more RAM -
> it's much cheaper, so you can buy more of it. Cache on the controller
> (with a BBU) is designed especially for caching writes safely.

I'm wondering about the effectiveness of readahead in the OS vs. in the RAID card. In general, data read ahead by the RAID card is stored in the RAID cache and not in the OS cache; data read ahead by the OS is stored in the OS cache. I'd like to use all of the RAID cache as write cache only, because fsync() becomes faster - but then the RAID card can't do much readahead. If we want to use it more effectively we have to sort this out, but it seems difficult :(

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
On Tue, Feb 18, 2014 at 2:41 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
> We do have dozens of H710 controllers, but not with SSDs. I've been
> unable to find reliable answers about how it handles TRIM, and how that
> works with wearout reporting (using SMART).

AFAIK (I haven't looked for a few months), they don't support TRIM. The only hardware RAID vendor that has even basic TRIM support is Intel, and that's no accident; I have a theory that enterprise storage vendors are deliberately holding back SSDs. SSDs (at least the newer, better ones) destroy the business model for "enterprise storage equipment" in a large percentage of applications. A 2U server with, say, 10 S3700 drives gives *far* superior performance to most SANs that cost under $100k. For about 1/10th of the price.

If you have a server that is I/O-constrained as opposed to storage-constrained (AKA a database), hard drives make zero economic sense. If your vendor is jerking you around by charging large multiples of market rates for storage and/or disallowing drives that actually perform well in their storage gear, choose a new vendor. And consider using software RAID.

merlin
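A quick sketch of how one might verify from the OS whether TRIM/discard actually reaches the drives; device and mount-point names are placeholders, and behind most hardware RAID controllers these checks will simply report no support:

    # for a directly attached SATA SSD: does the drive advertise TRIM?
    hdparm -I /dev/sda | grep -i trim

    # what discard granularity the kernel sees for each block device
    # (all zeroes in DISC-GRAN/DISC-MAX means no discard support exposed)
    lsblk --discard

    # try an explicit trim of a mounted filesystem; fails with
    # "the discard operation is not supported" if it cannot get through
    fstrim -v /data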
On Wed, Feb 19, 2014 at 8:13 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> If you have a server that is I/O-constrained as opposed to
> storage-constrained (AKA a database), hard drives make zero economic
> sense. If your vendor is jerking you around by charging large multiples
> of market rates for storage and/or disallowing drives that actually
> perform well in their storage gear, choose a new vendor. And consider
> using software RAID.

You can also do the old trick of underprovisioning and/or underutilizing all the space on the SSDs. I.e. put 10x 600GB SSDs under a HW RAID controller in RAID-10, then only partition out half the storage you get from that. So you get 1.5TB of storage and the drives are underutilized enough to have spare space.

Right now I'm testing on a machine with 2x Intel E5-2690s (http://ark.intel.com/products/64596/intel-xeon-processor-e5-2690-20m-cache-2_90-ghz-8_00-gts-intel-qpi), 512GB RAM and 6x 600GB Intel SSDs (not sure which ones) under an LSI MegaRAID 9266. I'm able to crank out 6500 to 7200 TPS under pgbench on a scale 1000 db at 8 to 60 clients on that machine. It's not cheap, but storage-wise it's WAY cheaper than most SANs and very fast. pg_xlog is on a pair of nondescript SATA spinners, btw.
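A minimal sketch of that underprovisioning trick on the controller's virtual disk; /dev/sdb is a placeholder for the array device. Note that leaving half the space unpartitioned only helps wear leveling if those areas were never written (or were trimmed / secure-erased beforehand), which a controller without TRIM passthrough cannot guarantee:

    # partition only the first half of the virtual disk, leave the rest untouched
    parted -s /dev/sdb mklabel gpt
    parted -s -a optimal /dev/sdb mkpart primary 1MiB 50%

    # filesystem on the provisioned half only
    mkfs.ext4 /dev/sdb1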
Hi, On 19.2.2014 03:45, KONDO Mitsumasa wrote: > (2014/02/19 5:41), Tomas Vondra wrote: >> On 18.2.2014 02:23, KONDO Mitsumasa wrote: >>> Hi, >>> >>> I don't have PERC H710 raid controller, but I think he would like to >>> know raid striping/chunk size or read/write cache ratio in >>> writeback-cache setting is the best. I'd like to know it, too:) >> >> The stripe size is actually a very good question. On spinning drives it >> usually does not matter too much - unless you have a very specialized >> workload, the 'medium size' is the right choice (AFAIK we're using 64kB >> on H710, which is the default). > > I am interested that raid stripe size of PERC H710 is 64kB. In HP > raid card, default chunk size is 256kB. If we use two disks with raid > 0, stripe size will be 512kB. I think that it might too big, but it > might be optimized in raid card... In actually, it isn't bad in that > settings. With HP controllers this depends on RAID level (and maybe even controller). Which HP controller are you talking about? I have some basic experience with P400/P800, and those have 16kB (RAID6), 64kB (RAID5) or 128kB (RAID10) defaults. None of them has 256kB. See http://bit.ly/1bN3gIs (P800) and http://bit.ly/MdsEKN (P400). > I'm interested in raid card internal behavior. Fortunately, linux raid > card driver is open souce, so we might good at looking the source code > when we have time. What do you mean by "linux raid card driver"? Afaik the admin tools may be available, but the interesting stuff happens inside the controller, and that's still proprietary. >> With SSDs this might actually matter much more, as the SSDs work with >> "erase blocks" (mostly 512kB), and I suspect using small stripe might >> result in repeated writes to the same block - overwriting one block >> repeatedly and thus increased wearout. But maybe the controller will >> handle that just fine, e.g. by coalescing the writes and sending them to >> the drive as a single write. Or maybe the drive can do that in local >> write cache (all SSDs have that). > > I have heard that genuine raid card with genuine ssds are optimized in > these ssds. It is important that using compatible with ssd for > performance. If the worst case, life time of ssd is be short, and will > be bad performance. Well, that's the main question here, right? Because if the "worst case" actually happens to be true, then what's the point of SSDs? You have a disk that does not provite the performance you expected, died much sooner than you expected and maybe suddenly so it interrupted the operation. So instead of paying more for higher performance, you paid more for bad performance and much shorter life of the disk. Coincidentally we're currently trying to find the answer to this question too. That is - how long will the SSD endure in that particular RAID level? Does that pay off? BTW what you mean by "genuine raid card" and "genuine ssds"? > I'm wondering about effective of readahead in OS and raid card. In > general, readahead data by raid card is stored in raid cache, and > not stored in OS caches. Readahead data by OS is stored in OS cache. > I'd like to use all raid cache for only write cache, because fsync() > becomes faster. But then, it cannot use readahead very much by raid > card.. 
If we hope to use more effectively, we have to clear it, but > it seems difficult:( I've done a lot of testing of this on H710 in 2012 (~18 months ago), measuring combinations of * read-ahead on controller (adaptive, enabled, disabled) * read-ahead in kernel (with various sizes) * scheduler The test was the simplest and most suitable workload for this - just "dd" with 1MB block size (AFAIK, would have to check the scripts). In short, my findings are that: * read-ahead in kernel matters - tweak this * read-ahead on controller sucks - either makes no difference, or actually harms performance (adaptive with small values set for kernel read-ahead) * scheduler made no difference (at least for this workload) So we disable readahead on the controller, use 24576 for kernel and it works fine. I've done the same test with fusionio iodrive (attached to PCIe, not through controller) - absolutely no difference. Tomas
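A short sketch of adjusting and testing kernel readahead as described above; the device name is a placeholder, and 24576 sectors (24576 x 512B = 12MB) is simply the value mentioned in this thread, not a general recommendation:

    # current readahead, in 512-byte sectors
    blockdev --getra /dev/sda

    # set it to 24576 sectors = 12MB
    # (equivalently: echo 12288 > /sys/block/sda/queue/read_ahead_kb)
    blockdev --setra 24576 /dev/sda

    # crude sequential-read check: drop caches, then time a large sequential read
    echo 3 > /proc/sys/vm/drop_caches
    dd if=/dev/sda of=/dev/null bs=1M count=16384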
On Wed, Feb 19, 2014 at 12:09 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> You can also do the old trick of underprovisioning and/or underutilizing
> all the space on the SSDs.
>
> Right now I'm testing on a machine with 512GB RAM and 6x 600GB Intel
> SSDs (not sure which ones) under an LSI MegaRAID 9266.

Yeah -- underprovisioning certainly helps, but for any write-heavy configuration, all else being equal, TRIM support will perform faster and have less wear. Those drives are likely the older 320 600GB. The newer S3700 are much faster, although they cost around twice as much.

merlin
On 19.2.2014 16:13, Merlin Moncure wrote:
> AFAIK (I haven't looked for a few months), they don't support TRIM. The
> only hardware RAID vendor that has even basic TRIM support is Intel, and
> that's no accident; I have a theory that enterprise storage vendors are
> deliberately holding back SSDs. SSDs (at least the newer, better ones)
> destroy the business model for "enterprise storage equipment" in a large
> percentage of applications. A 2U server with, say, 10 S3700 drives gives
> *far* superior performance to most SANs that cost under $100k. For about
> 1/10th of the price.

Yeah, maybe. I'm generally a bit skeptical when it comes to conspiracy theories like this, but for about a year we've all known that they might easily happen to be true. So maybe ...

Nevertheless, I'd guess this is another case of "nobody ever got fired for buying X", where X is a storage product based on spinning drives, proven to be reliable, with known operational statistics and a pretty good understanding of how it works. While "Y" is a new thing based on SSDs, which got a rather bad reputation initially because of hype and premature use of consumer-grade products for unsuitable workloads. Also, each vendor of Y uses different tricks, which makes applying experience across vendors (or even across generations of drives from the same vendor) very difficult.

Factor in how conservative DBAs happen to be, and I think it might be this particular feedback loop forcing the vendors not to push this.

> If you have a server that is I/O-constrained as opposed to
> storage-constrained (AKA a database), hard drives make zero economic
> sense. If your vendor is jerking you around by charging large multiples
> of market rates for storage and/or disallowing drives that actually
> perform well in their storage gear, choose a new vendor. And consider
> using software RAID.

Yeah, exactly.

Tomas
On 19.2.2014 19:09, Scott Marlowe wrote:
> You can also do the old trick of underprovisioning and/or underutilizing
> all the space on the SSDs. I.e. put 10x 600GB SSDs under a HW RAID
> controller in RAID-10, then only partition out half the storage you get
> from that. So you get 1.5TB of storage and the drives are underutilized
> enough to have spare space.

Yeah. AFAIK that's basically what Intel did with S3500 -> S3700.

What I'm trying to find is the 'sweet spot' considering lifespan, capacity, performance and price. That's why I'm still wondering if there are some experiences with the current generation of SSDs and RAID controllers, with RAID levels other than RAID-10.

Say I have 8x 400GB SSD, 75k/32k read/write IOPS each (i.e. it's basically the S3700 from Intel). Assuming the writes are ~25% of the workload, this is what I get for RAID10 vs. RAID6 (math done using http://www.wmarow.com/strcalc/):

            | capacity GB | bandwidth MB/s | IOPS
  -----------------------------------------------
  RAID-10   |        1490 |           2370 | 300k
  RAID-6    |        2230 |           1070 | 130k

Let's say the app can't really generate 130k IOPS (we'll saturate the CPU way before that), so even if the real-world numbers are less than 50% of this, we're not going to hit the disks as the main bottleneck. So let's assume there's no observable performance difference between RAID10 and RAID6 in our case. But we could put 1.5x the amount of data on the RAID6, making it much cheaper (we're talking about non-trivial numbers of such machines).

The question is - how long will it last before the SSDs die because of wearout? Will the RAID controller make it worse due to (not) handling TRIM? Will we know how much time we have left, i.e. will the controller provide the info the drives provide through SMART?

> Right now I'm testing on a machine with 2x Intel E5-2690s
> (http://ark.intel.com/products/64596/intel-xeon-processor-e5-2690-20m-cache-2_90-ghz-8_00-gts-intel-qpi)
> 512GB RAM and 6x 600GB Intel SSDs (not sure which ones) under an LSI
> MegaRAID 9266. I'm able to crank out 6500 to 7200 TPS under pgbench on
> a scale 1000 db at 8 to 60 clients on that machine. It's not cheap, but
> storage-wise it's WAY cheaper than most SANs and very fast. pg_xlog is
> on a pair of nondescript SATA spinners, btw.

Most likely S3500. S3700 are not offered with 600GB capacity.

Nice. I've done some testing with a FusionIO ioDrive Duo (2 devices in RAID0) about a year ago, and I got 12k TPS (or ~15k with WAL on a SAS RAID). So considering the price, the 7.2k TPS is really good IMHO.

regards
Tomas
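For the curious, a rough sketch of the kind of model behind numbers like those: a plain harmonic-mean mixed-workload estimate with write penalties of 2 for RAID-10 and 6 for RAID-6. It lands near the calculator's 300k/130k figures, but it is not the calculator's exact method:

    awk 'BEGIN {
      n = 8; riops = 75000; wiops = 32000; wfrac = 0.25
      split("RAID-10 RAID-6", lvl, " "); pen[1] = 2; pen[2] = 6
      for (i = 1; i <= 2; i++) {
        rcap = n * riops              # reads can hit all drives
        wcap = n * wiops / pen[i]     # writes pay the RAID write penalty
        mixed = 1 / ((1 - wfrac) / rcap + wfrac / wcap)
        printf "%-8s ~%dk mixed IOPS (read cap %dk, write cap %dk)\n",
               lvl[i], mixed / 1000, rcap / 1000, wcap / 1000
      }
    }'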
On Wed, Feb 19, 2014 at 6:10 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
> Most likely S3500. S3700 are not offered with 600GB capacity.
>
> Nice. I've done some testing with a FusionIO ioDrive Duo (2 devices in
> RAID0) about a year ago, and I got 12k TPS (or ~15k with WAL on a SAS
> RAID). So considering the price, the 7.2k TPS is really good IMHO.

The part number reported by the LSI is SSDSC2BB600G4, so I'm assuming it's an SLC drive. I've done some further testing: I keep well over 6k TPS right up to 128 clients. At no time is there any iowait under vmstat, and if I turn off fsync, speed goes up by some tiny amount, so I'm guessing I'm CPU-bound at this point. This machine has dual 8-core HT Intel CPUs.

We have another class of machine running on FusionIO ioDrive2 MLC cards in RAID-1 and 4x 6-core non-HT CPUs. It's a bit slower (1366 versus 1600MHz memory, slower CPU clocks and interconnects etc.) and it can do about 5k TPS, and again, like the other machine, no iowait, all CPU-bound.

I'd say once you get to a certain level of I/O subsystem it gets harder and harder to max it out. I'd love to have a 64-core 4-socket AMD top-of-the-line system to compare here. But honestly both classes of machine are more than fast enough for what we need, and our major load is from SELECT statements, so fitting the db into RAM is more important than IOPS for what we do.
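As an aside, the drive model and wear indicators behind an LSI/PERC controller can often be read with smartmontools; a sketch, where the megaraid device ID and block device name are placeholders that depend on the setup:

    # list devices the controller exposes, with their megaraid IDs
    smartctl --scan

    # identity of the physical drive behind device ID 0 (shows the model,
    # e.g. the Intel SSDSC2BB600G4 = DC S3500 600GB mentioned in this thread)
    smartctl -i -d megaraid,0 /dev/sda

    # SMART attributes, including wear/endurance counters on Intel SSDs
    smartctl -a -d megaraid,0 /dev/sda | grep -i -e wear -e lifetime -e media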
Hi,

(2014/02/20 9:13), Tomas Vondra wrote:
> With HP controllers this depends on the RAID level (and maybe even the
> controller). Which HP controller are you talking about? I have some
> basic experience with the P400/P800, and those have 16kB (RAID6), 64kB
> (RAID5) or 128kB (RAID10) defaults. None of them has 256kB.
> See http://bit.ly/1bN3gIs (P800) and http://bit.ly/MdsEKN (P400).

I use the P410 and P420, which come in the DL360 gen7 and DL360 gen8. They are relatively recent. I checked the RAID stripe size (RAID 1+0) using the hpacucli tool, and it is indeed a 256kB chunk size. The P420 also allows setting larger or smaller chunk sizes, in a range of 8kB - 1024kB or more. But I don't know the best parameter for Postgres :(

> What do you mean by "Linux RAID card driver"? AFAIK the admin tools may
> be available, but the interesting stuff happens inside the controller,
> and that's still proprietary.

I meant the open source driver; the HP drivers are at the following URL: http://cciss.sourceforge.net/ However, having only read the driver source roughly, the core part of the RAID card programming is in the firmware, as you say; the driver just drives it from the OS side. I was interested in the elevator algorithm when I read the driver source code, but the detailed algorithm is probably in the firmware.

> Well, that's the main question here, right? Because if the "worst case"
> actually happens to be true, then what's the point of SSDs?

Sorry, the thread topic is SSD striping size tuning. I'm especially interested in magnetic disks, but also in SSDs.

> So instead of paying more for higher performance, you paid more for bad
> performance and a much shorter life of the disk.

I'm interested in whether changing the RAID chunk size really shortens the life. I had not considered this point; it might be true. I'd like to test it using a SMART checker if we have time.

> Coincidentally we're currently trying to find the answer to this
> question too. That is - how long will the SSD endure in that particular
> RAID level? Does that pay off?
>
> BTW what do you mean by "genuine raid card" and "genuine ssds"?

By "genuine" I mean from the same manufacturer or vendor.

> In short, my findings are that:
>
> * read-ahead in the kernel matters - tweak this
> * read-ahead on the controller sucks - it either makes no difference, or
>   actually harms performance (adaptive, with small values set for kernel
>   read-ahead)
> * the scheduler made no difference (at least for this workload)
>
> So we disable readahead on the controller, use 24576 for the kernel, and
> it works fine.
>
> I've done the same test with a FusionIO ioDrive (attached to PCIe, not
> through a controller) - absolutely no difference.

I'd like to know the random access (8kB) performance; this test doesn't seem to show it. But this is interesting data. What command did you use to set the kernel readahead parameter? If you use blockdev, a value of 256 means 256 * 512B (sector size) = 128kB of readahead, and the 24576 you set means 24576 * 512B = 12MB of readahead. I think that is quite big, but it is probably optimized for your environment. At the end of the day, is a very large readahead better than a small one or none at all? If we have a lot of RAM it seems so, but otherwise is it? It is a difficult problem.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
On 20.2.2014 02:47, Scott Marlowe wrote:
> The part number reported by the LSI is SSDSC2BB600G4, so I'm assuming
> it's an SLC drive. I've done some further testing: I keep well over 6k
> TPS right up to 128 clients.

No, it's the S3500, which is an MLC drive:
http://ark.intel.com/products/74944/Intel-SSD-DC-S3500-Series-600GB-2_5in-SATA-6Gbs-20nm-MLC

Tomas