Re: raid writethrough mode (WT), ssds and your DB. (was Performances issues with SSD volume ?) - Mailing list pgsql-admin

From Graeme B. Bell
Subject Re: raid writethrough mode (WT), ssds and your DB. (was Performances issues with SSD volume ?)
Msg-id 5F6C17BC-7487-4E24-B322-39FEC6373C92@skogoglandskap.no
In response to Re: raid writethrough mode (WT), ssds and your DB. (was Performances issues with SSD volume ?)  (Bruce Momjian <bruce@momjian.us>)
Responses Re: raid writethrough mode (WT), ssds and your DB. (was Performances issues with SSD volume ?)
List pgsql-admin
Hi Bruce

I'm *extremely* certain of what I say when I say WB+BBU=good and direct WT=bad.

WB on the controller uses the battery backed RAID controller cache for writes to ensure all writes do eventually get
written to the disk in the event of a power failure.

WT on the controller bypasses the battery backed cache and sends writes directly to the SSD. If the SSD doesn't have
sufficient capacitor backing of its own, those writes are gone when the power fails.
See the manual page quote below.

- With H710 WT, ssd cache enabled, the SSDs I tested were proven to lose data that was meant to have been already
fsync'd. The capacitor was insufficient and the firmware lied about performing an fsync.

- With H710 WB, ssd cache enabled, the SSDs didn't lose writes. I have yet to see a failed fsync in any of the many
dozens of tests I ran on several machines and disks*.

- Without the H710, with ssd cache enabled (i.e. WT direct to drive), I always lost writes that were meant to be fsync'd.

- Without the H710, with ssd cache disabled, I never lost writes.


There are two possible reasons the writes always hit the drive successfully in every test with controller WB and ssd
disk cache enabled.

Either a) the SSDs perhaps did lose fsync'd data, but the controller didn't. The battery backed raid controller
512MB-1024MB cache ensured fsync'd writes were completed by reinitiating the write after power up, as it would with a
hard drive after power loss. I have not been able to find sufficiently detailed technical documentation for these cards
to find out exactly what they do after power loss in terms of disk communication, replaying writes, etc. I only have my
measured results.

Or b) from extensive plug-pull testing it appears the capacitors in the Crucial drives are just *slightly* too small
to save all data in flight; there is always a very tiny number of fsync'd writes that don't make it to disk. So it is
entirely possible that the *fairly slow* writeback cache on the Dell controller, which substantially *reduces* the
IOPS of the ssd, is consistently limiting the amount of data held in the cache on the ssd, such that all the data can
be saved using the caps on the ssd. By effectively running at half-speed with the raid controller cache as a choke
point, you are never in a situation where the last few writes just don't quite make it to disk, because half the ssd
cache is sitting empty.
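
To make the difference between a) and b) concrete, here is a toy model in Python. It is only a sketch of the two
hypotheses as I have described them, not a description of what the PERC firmware or the Crucial firmware really does.
All function names, parameters and numbers are made up for illustration.

```python
def lost_under_hypothesis_a(policy, drive_loses):
    """Hypothesis a): the drive drops its last few fsync'd writes either way,
    but in WB mode the battery backed controller cache writes them again
    after power up (an assumption, not documented behaviour).

    policy      -- "WT" or "WB", the controller write cache policy
    drive_loses -- hypothetical number of acknowledged writes the drive drops
    """
    if policy not in ("WT", "WB"):
        raise ValueError("policy must be 'WT' or 'WB'")
    return 0 if policy == "WB" else drive_loses


def lost_under_hypothesis_b(writes_in_drive_cache, capacitor_can_save):
    """Hypothesis b): the capacitor can flush only so many cached writes.
    Anything beyond that is lost, whatever the controller does."""
    return max(0, writes_in_drive_cache - capacitor_can_save)


if __name__ == "__main__":
    # a) same drive behaviour, different controller policy
    print("a) WT:", lost_under_hypothesis_a("WT", drive_loses=3))
    print("a) WB:", lost_under_hypothesis_a("WB", drive_loses=3))
    # b) hypothetical numbers: full speed fills the drive cache with 8 writes,
    #    the choked WB path only ever has 4 in it, the capacitor can save 6
    print("b) full speed:", lost_under_hypothesis_b(8, capacitor_can_save=6))
    print("b) half speed:", lost_under_hypothesis_b(4, capacitor_can_save=6))
```

Both models give the same observable result (loss with WT, none with WB), which is why plug-pull results alone cannot
tell a) and b) apart.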

Also, keep in mind I am reporting results from many dozens of runs of diskchecker.pl. It is possible that writes may be
lost in a way that diskchecker.pl does not detect. It is also possible that there is e.g. a 1 in 1000 situation I
haven't found yet. For example, really heavy sequential writes with interspersed fsyncs, rather than just heaps of
fsyncs, might behave differently.
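
For anyone who wants to see the idea behind this kind of test, here is a minimal Python sketch. It is not
diskchecker.pl and does not reproduce how that script works; it only illustrates the principle of "write, fsync,
record what was acknowledged, pull the plug, check". The function names and the record format are my own
assumptions. In a real test the acknowledgements must be recorded on a different machine (or written down by hand),
since the machine under test loses power.

```python
import os
import struct

RECORD = struct.Struct("<Q")  # one 8-byte little-endian sequence number


def write_fsynced(path, count, on_ack=print):
    """Append `count` sequence numbers to `path`, calling fsync after each.

    on_ack(seq) runs only after os.fsync() has returned for that record, so
    every reported seq is a write the storage stack claimed was durable.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        for seq in range(count):
            os.write(fd, RECORD.pack(seq))
            os.fsync(fd)
            on_ack(seq)
    finally:
        os.close(fd)


def missing_after_restart(path, last_acked):
    """Return acknowledged sequence numbers (0..last_acked) absent from `path`."""
    try:
        with open(path, "rb") as f:
            data = f.read()
    except FileNotFoundError:
        data = b""
    usable = len(data) - len(data) % RECORD.size  # ignore a torn final record
    found = {
        RECORD.unpack_from(data, off)[0]
        for off in range(0, usable, RECORD.size)
    }
    return [seq for seq in range(last_acked + 1) if seq not in found]
```

Run write_fsynced() on the disk under test, pull the plug while it is running, and after reboot call
missing_after_restart() with the last sequence number that was acknowledged. Any number it returns is a write that
was reported as fsync'd but is not on the disk.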


For the avoidance of doubt:

In several afternoons of testing, I have *never* managed to lose fsync'd data from Crucial M500/M550 disks combined
with a battery backed raid controller in writeback mode (WB).
In several afternoons of testing, I have *always* lost a small amount of fsync'd data from Crucial M500/M550 disks
combined with a battery backed raid controller in write-through mode (WT).


I and others bought the M500/M550 on the back of the advertised capacitor backed cache, but I wouldn't ever trust
manufacturer capacitor claims in future. Drive power failure recovery really is something that needs testing by many
customers to ascertain the truth of the matter. (Long story short, I now recommend people buy an Intel model, since
it has proven most trustworthy in terms of manufacturer claims.)


Graeme Bell


https://www.dell.com/downloads/global/products/pvaul/en/perc-technical-guidebook.pdf

"DELL PERC H700 and H800 Technical Guide 20
4.10 Virtual Disk Write Cache Policies
The write cache policy of a virtual disk determines how the controller handles writes to that virtual disk. Write-Back
and Write-Through are the two write cache policies and can be set on virtual disks individually.
All RAID volumes will be presented as Write-Through (WT) to the operating system (Windows and Linux) independent of the
actual write cache policy of the virtual disk. The PERC cards manage the data in cache independently of the operating
system or any applications. You can use OpenManage or the BIOS configuration utility to view and manage virtual disk
cache settings.
In Write-Through caching, the controller sends a data-transfer completion signal to the host system when the disk
subsystem has received all the data in a transaction. In Write-Back caching, the controller sends a data transfer
completion signal to the host when the controller cache has received all the data in a transaction. The controller then
writes the cached data to the storage device in the background.
The risk of using Write-Back cache is that the cached data can be lost if there is a power failure before it is written
to the storage device. This risk is mitigated by using a BBU on PERC H700 or H800 cards. Write-Back caching has a
performance advantage over Write-Through caching. The default cache setting for virtual disks is Write-Back caching.
Certain data patterns and configurations perform better with a Write-Through cache policy.
Write-Back caching is used under all conditions in which the battery is present and in good condition."






On 28 May 2015, at 01:20, Bruce Momjian <bruce@momjian.us> wrote:

> On Thu, May 21, 2015 at 11:21:49AM +0000, Graeme B. Bell wrote:
>>> Not using your raid controllers write cache then?  Not sure just how
>>> important that is with SSDs these days, but if you've got a BBU set
>>> it to "WriteBack". Also change "Cache if Bad BBU" to "No Write Cache
>>> if Bad BBU" if you do that.
>>
>> I did quite a few tests with WB and WT last year.
>>
>> - WT should be OK with e.g. Intel SSDs.  From memory I saw write
>> performance gains of about 20-30% with Crucial M500/M550 writes on a
>> Dell H710 RAID controller. BUT that controller didn't have WT fastpath
>> though which is absolutely essential to see substantial gains WT. I
>> expect with WT and a fastpath enabled RAID you'd see much higher
>> numbers, e.g. 100%+ higher IOPS.
>>
>> (So, if you don't have fastpath on your controller, you might as
>> well plan to leave WB on and just buy cheaper SSD drives rather than
>> expensive ones - the raid controller will be your choke point for
>> performance on WT and it's a source of risk).
>>
>> - WT with most SSDs will likely corrupt your postgres database the
>> first time you lose power. (on all the drives I've tested)
>>
>> - WB is the only safe option unless you have done lots of plug pull
>> tests on a drive that is guaranteed to protect data "in flight" during
>> power loss (Intel disks + maybe the new samsung pcie).
>
> I think you have WT (write-through) and WB (write-back) reversed above.
>
> --
>  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
>  EnterpriseDB                             http://enterprisedb.com
>
>  + Everyone has their own god. +


