Re: Optimal settings for RAID controller - optimized for writes - Mailing list pgsql-performance

From KONDO Mitsumasa
Subject Re: Optimal settings for RAID controller - optimized for writes
Date
Msg-id 5305BCD8.7060205@lab.ntt.co.jp
In response to Re: Optimal settings for RAID controller - optimized for writes  (Tomas Vondra <tv@fuzzy.cz>)
List pgsql-performance
Hi,

(2014/02/20 9:13), Tomas Vondra wrote:
> Hi,
>
> On 19.2.2014 03:45, KONDO Mitsumasa wrote:
>> (2014/02/19 5:41), Tomas Vondra wrote:
>>> On 18.2.2014 02:23, KONDO Mitsumasa wrote:
>>>> Hi,
>>>>
>>>> I don't have PERC H710 raid controller, but I think he would like to
>>>> know raid striping/chunk size or read/write cache ratio in
>>>> writeback-cache setting is the best. I'd like to know it, too:)
>>>
>>> The stripe size is actually a very good question. On spinning drives it
>>> usually does not matter too much - unless you have a very specialized
>>> workload, the 'medium size' is the right choice (AFAIK we're using 64kB
>>> on H710, which is the default).
>>
>> It is interesting that the RAID stripe size of the PERC H710 is 64kB. On
>> HP RAID cards the default chunk size is 256kB. If we use two disks in
>> RAID 0, the full stripe size will be 512kB. I think that might be too
>> big, but it may be optimized inside the RAID card... In practice it
>> isn't bad with those settings.
>
> With HP controllers this depends on RAID level (and maybe even
> controller). Which HP controller are you talking about? I have some
> basic experience with P400/P800, and those have 16kB (RAID6), 64kB
> (RAID5) or 128kB (RAID10) defaults. None of them has 256kB.
> See http://bit.ly/1bN3gIs (P800) and http://bit.ly/MdsEKN (P400).
I use P410 and P420 controllers, which are equipped in DL360 gen7 and DL360 gen8
servers, so they are relatively recent. I checked the RAID stripe size (RAID 1+0)
with the hpacucli tool, and it is indeed a 256kB chunk size. The P420 also allows
setting larger or smaller chunk sizes, in a range of 8kB - 1024kB if I remember
correctly. But I don't know the best parameter for Postgres :(
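
For reference, the chunk size can be checked with hpacucli roughly like this (a
sketch; the controller slot and logical drive numbers below are just placeholder
values, so adjust them to the actual configuration):

  # show the full controller configuration, including the stripe size of each logical drive
  hpacucli ctrl all show config detail
  # or show a single logical drive, e.g. logical drive 1 on the controller in slot 0
  hpacucli ctrl slot=0 ld 1 show detail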

>> I'm interested in the RAID card's internal behavior. Fortunately, the
>> Linux RAID card driver is open source, so it might be worth looking at
>> the source code when we have time.
>
> What do you mean by "linux raid card driver"? Afaik the admin tools may
> be available, but the interesting stuff happens inside the controller,
> and that's still proprietary.
I meant the open source driver. The HP drivers are available at the following URL:
http://cciss.sourceforge.net/

However, although I have only skimmed the driver source code, the core part of
the RAID card's programming is in the firmware, as you say; the driver seems to
exist just to drive the card from the OS. I was interested in the elevator
algorithm when I read the driver source code, but the detailed algorithm is
probably in the firmware.

>>> With SSDs this might actually matter much more, as the SSDs work with
>>> "erase blocks" (mostly 512kB), and I suspect using small stripe might
>>> result in repeated writes to the same block - overwriting one block
>>> repeatedly and thus increased wearout. But maybe the controller will
>>> handle that just fine, e.g. by coalescing the writes and sending them to
>>> the drive as a single write. Or maybe the drive can do that in local
>>> write cache (all SSDs have that).
>>
>> I have heard that a genuine RAID card with genuine SSDs is optimized for
>> those SSDs. Using SSDs that are compatible with the card is important for
>> performance. In the worst case the SSD's lifetime will be short and the
>> performance will be bad.
>
> Well, that's the main question here, right? Because if the "worst case"
> actually happens to be true, then what's the point of SSDs?
Sorry - the topic here is SSD striping size tuning, but I'm especially interested
in magnetic disks. That said, SSDs interest me as well.

> You have a
> disk that does not provide the performance you expected, died much
> sooner than you expected, and maybe so suddenly that it interrupted the operation.
> So instead of paying more for higher performance, you paid more for bad
> performance and much shorter life of the disk.
I'm interested in the point that changing the RAID chunk size can shorten the
SSD's life. I had not considered this; it might be true. I'd like to test it
with a SMART checker if we have time.
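
For reference, the wear-related SMART attributes can be read roughly like this (a
sketch; /dev/sda is just a placeholder, and the attribute names reported, such as
Wear_Leveling_Count or Media_Wearout_Indicator, depend on the SSD vendor):

  # read the SMART attribute table from the drive
  smartctl -A /dev/sda
  # for a drive behind an HP Smart Array controller, the cciss passthrough is needed
  smartctl -A -d cciss,0 /dev/sda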


> Coincidentally we're currently trying to find the answer to this
> question too. That is - how long will the SSD endure in that particular
> RAID level? Does that pay off?
>
> BTW what you mean by "genuine raid card" and "genuine ssds"?
By "genuine" I mean that the RAID card and the SSDs come from the same
manufacturer or vendor.

>> I'm wondering about the effectiveness of readahead in the OS versus in
>> the RAID card. In general, data read ahead by the RAID card is stored in
>> the RAID cache and not in the OS cache, while data read ahead by the OS
>> is stored in the OS cache. I'd like to use the whole RAID cache for write
>> caching only, because fsync() becomes faster that way. But then the RAID
>> card cannot do much readahead.. If we hope to use it more effectively, we
>> have to clear it, but that seems difficult :(
>
> I've done a lot of testing of this on H710 in 2012 (~18 months ago),
> measuring combinations of
>
>     * read-ahead on controller (adaptive, enabled, disabled)
>     * read-ahead in kernel (with various sizes)
>     * scheduler
>
> The test was the simplest and most suitable workload for this - just
> "dd" with 1MB block size (AFAIK, would have to check the scripts).
>
> In short, my findings are that:
>
>     * read-ahead in kernel matters - tweak this
>     * read-ahead on controller sucks - either makes no difference, or
>       actually harms performance (adaptive with small values set for
>       kernel read-ahead)
>     * scheduler made no difference (at least for this workload)
>
> So we disable readahead on the controller, use 24576 for kernel and it
> works fine.
>
> I've done the same test with fusionio iodrive (attached to PCIe, not
> through controller) - absolutely no difference.
I'd like to know the random access (8kB) performance; dd does not really show
that.. But this is interesting data. What command did you use to set the kernel
readahead parameter? With blockdev, a value of 256 means 256 * 512B (sector
size) = 128kB of readahead, so your setting of 24576 means 24576 * 512B = 12MB
of readahead. I think that is quite big, but it is apparently optimal in your
environment. At the end of the day, is a very large readahead better than a
small one, or none at all? With a lot of RAM it seems to be, but otherwise it
may not be. It is a difficult problem.
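
For reference, the kernel readahead can be checked and changed per block device
roughly like this (a sketch; /dev/sda is just a placeholder for the actual
device, and the value is given in 512-byte sectors):

  # show the current readahead, in 512-byte sectors
  blockdev --getra /dev/sda
  # set the readahead to 24576 sectors (24576 * 512B = 12MB)
  blockdev --setra 24576 /dev/sda

Note that a value set this way is not persistent across reboots, so it usually
goes into a udev rule or an init script.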

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

