Re: RAID and SSD configuration question - Mailing list pgsql-general

From Merlin Moncure
Subject Re: RAID and SSD configuration question
Date
Msg-id CAHyXU0xjKkcwQGm_3oSM82g5JFWEFF7K5SB6hM6fsEsJJj=Unw@mail.gmail.com
In response to Re: RAID and SSD configuration question  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: RAID and SSD configuration question  (Scott Marlowe <scott.marlowe@gmail.com>)
List pgsql-general
On Tue, Oct 20, 2015 at 10:14 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Hi,
>
> On 10/20/2015 03:30 PM, Merlin Moncure wrote:
>>
>> On Tue, Oct 20, 2015 at 3:14 AM, Birta Levente <blevi.linux@gmail.com>
>> wrote:
>>>
>>> Hi
>>>
>>> I have a supermicro SYS-1028R-MCTR, LSI3108 integrated with SuperCap
>>> module
>>> (BTR-TFM8G-LSICVM02)
>>> - 2x300GB 10k spin drive, as raid 1 (OS)
>>> - 2x300GB 15k spin drive, as raid 1 (for xlog)
>>> - 2x200GB Intel DC S3710 SSD (for DB), as raid 1
>>>
>>> So which is better for the SSDs: mdraid or the controller's RAID?
>>
>>
>> I personally always prefer mdraid if given a choice, especially when
>> you have a dedicated boot drive.  It's better in DR scenarios and for
>> hardware migrations.  Personally I find dedicated RAID controllers to
>> be baroque.  Flash SSDs (at least the good ones) are basically big
>> RAID 0s with their own dedicated cache, supercap, and controller
>> optimized to the underlying storage peculiarities.
>
> I don't know - I've always treated mdraid with a bit of suspicion, as it
> does not have any "global" write cache, which might allow failure modes
> akin to the RAID5 write hole (similar issues exist for non-parity RAID
> levels like RAID-1 or RAID-10).

mdadm is pretty smart: it knows when it has been shut down uncleanly and
recalculates parity as needed.  There are some theoretical edge-case
failure scenarios, but they are well understood.  This is md's main
advantage really: its transparency and the huge body of lore around it.
I have a tiny data recovery side business (cost: $0, invitation only)
doing DR on NAS systems that in some cases commercial DR companies said
were unrecoverable.  Simply by googling and following guides I was able
to come up with the data, or at least most of it, every time.  Good luck
with that on proprietary RAID systems.  In fact, there is no reason to
believe that proprietary systems cover the write hole even if they have
a centralized cache.  They may claim they do, and 99 times out of 100
they probably do, but how do you know it's really covered?  Basically,
you don't.  I kind of trust Intel (now; it's been a journey), but I
don't have a lot of confidence in certain enterprise gear vendors.
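
To make the transparency point concrete: everything md knows about an
array is inspectable and rebuildable with stock tools, which is exactly
what makes those recoveries possible.  A minimal sketch (device names
like /dev/md0 and /dev/sdb are placeholders for whatever your system
actually uses):

    # overall array state, including any resync in progress
    cat /proc/mdstat

    # per-array detail: state, event counter, member devices
    mdadm --detail /dev/md0

    # read the md superblock straight off a member disk; this
    # on-disk metadata is what lets another box reassemble the array
    mdadm --examine /dev/sdb

    # after moving the disks to new hardware, reassembly is usually just:
    mdadm --assemble --scan

    # and for the original question, mirroring the two SSDs would be:
    mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdb /dev/sdc

The metadata lives on the member disks themselves, so any Linux box with
mdadm can pick the array up.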

On Tue, Oct 20, 2015 at 9:33 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> We're running LSI MegaRAIDs at work with 10-SSD RAID-5 arrays, and we
> can get ~5k to 7k tps on a -s 10000 pgbench with the write cache on.
>
> When we turn the write cache off, we get 15k to 20k tps. This is on a
> 120GB pgbench db that fits in memory, so it's all writes.

This matches my findings exactly.  I'll double down on my statement:
caching RAID controllers are essentially obsolete technology.  They were
designed to solve a problem that simply doesn't exist any more because
of SSDs.  Unless your database is very, very busy, it's pretty hard to
saturate even a single low-to-mid-tier SSD with zero engineering effort.
It's time to let go: spinning drives are obsolete in the database world,
at least in any scenario where you're measuring IOPS.
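
For anyone who wants to reproduce Scott's comparison, the general shape
is something like this (he only mentioned -s 10000; the client/thread
counts and duration below are illustrative, not his actual settings):

    # build the ~120GB pgbench database at scale factor 10000
    pgbench -i -s 10000 bench

    # write-heavy TPC-B-style run; note the reported tps figure
    pgbench -c 32 -j 8 -T 600 bench

    # then toggle the controller's write cache and rerun; on MegaRAID
    # that's the logical drive's write policy, e.g. via storcli
    # (exact syntax varies by tool version):
    #   storcli /c0/vall set wrcache=wt    # write-through (cache off)
    #   storcli /c0/vall set wrcache=wb    # write-back (cache on)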

merlin

