Re: raid writethrough mode (WT), ssds and your DB. (was Performances issues with SSD volume ?) - Mailing list pgsql-admin
From | Graeme B. Bell |
---|---|
Subject | Re: raid writethrough mode (WT), ssds and your DB. (was Performances issues with SSD volume ?) |
Date | |
Msg-id | 5F6C17BC-7487-4E24-B322-39FEC6373C92@skogoglandskap.no |
In response to | Re: raid writethrough mode (WT), ssds and your DB. (was Performances issues with SSD volume ?) (Bruce Momjian <bruce@momjian.us>) |
Responses | Re: raid writethrough mode (WT), ssds and your DB. (was Performances issues with SSD volume ?) |
List | pgsql-admin |
Hi Bruce

I'm *extremely* certain of what I say when I say WB+BBU=good and direct WT=bad.

WB on the controller uses the battery backed RAID controller cache for writes to ensure all writes do eventually get written to the disk in the event of a power failure.

WT on the controller bypasses the battery backed cache and sends writes directly to the SSD. If the SSD doesn't have its own sufficient capacitor backing, they're gone. See the manual page quote below.

- With H710 WT, ssd cache enabled, the SSDs I tested were proven to lose data that was meant to have been already fsync'd. The capacitor was insufficient and the firmware lied about performing an fsync.
- With H710 WB, ssd cache enabled, the SSDs didn't lose writes, I have yet to see a failed fsync in any of the many dozens of tests I ran on several machines and disks*.
- Without H710 and ssd cache enabled i.e. WT direct to drive, I always lost writes that were meant to be fsync'd.
- Without H710 and ssd cache disabled, I never lost writes.

There are two possible reasons the writes always hit the drive successfully in every test with controller WB and ssd disk cache enabled.

Either

a) SSDs perhaps did lose fsync'd data, but the controller didn't. The battery backed raid controller 512MB-1024MB cache ensured fsync'd writes were completed by reinitiating the write after power up, as it would with a hard drive after power loss. I have not been able to find sufficiently detailed technical documentation for these cards to find out exactly what they do after power-loss in terms of disk communication, replaying writes, etc. I only have my measured results.

However, it's also possible that ...

b) From extensive plug-pull testing it appears the capacitors in the Crucial drives are just *slightly* too small to save all data in flight, there is always a very tiny number of fsync'd writes that don't make it to disk.
So it is entirely possible that the *fairly slow* writeback cache on the Dell controller, which substantially *reduces* the IOPS of the ssd, is consistently limiting the amount of data held in the cache on the ssd, such that all the data can be saved using the caps on the ssd disk. By effectively running at half-speed with the raid controller cache as a choke point, you are never in a situation where the last few writes just don't quite make it to disk, because half the ssd cache is sitting empty.

Also, keep in mind I am reporting results from many dozens of runs of diskchecker.pl. It is possible that writes may be lost in a way that diskchecker.pl does not detect. It is also possible that there is e.g. a 1 in 1000 situation I haven't found yet. For example, really heavy sequential writes with interspersed fsyncs rather than just heaps of fsyncs.

For the avoidance of doubt:

In several afternoons of testing, I have *never* managed to lose fsync'd data from Crucial M500/M550 disks combined with a battery backed raid controller in writeback mode (WB).

In several afternoons of testing, I have *always* lost a small amount of fsync'd data from Crucial M500/M550 disks combined with a battery backed raid controller in write-through mode (WT).

I and others bought the M500/M550 on the back of the advertised capacitor backed cache, but I wouldn't ever trust manufacturer capacitor claims in future. Drive power failure recovery really is something that needs testing by many customers to ascertain the truth of the matter. (e.g. long story short, I recommend people buy an Intel model now, since it has proven most trustworthy in terms of manufacturer claims).

Graeme Bell

https://www.dell.com/downloads/global/products/pvaul/en/perc-technical-guidebook.pdf

"DELL PERC H700 and H800 Technical Guide 20

4.10 Virtual Disk Write Cache Policies

The write cache policy of a virtual disk determines how the controller handles writes to that virtual disk.
Write-Back and Write-Through are the two write cache policies and can be set on virtual disks individually.

All RAID volumes will be presented as Write-Through (WT) to the operating system (Windows and Linux) independent of the actual write cache policy of the virtual disk. The PERC cards manage the data in cache independently of the operating system or any applications. You can use OpenManage or the BIOS configuration utility to view and manage virtual disk cache settings.

In Write-Through caching, the controller sends a data-transfer completion signal to the host system when the disk subsystem has received all the data in a transaction. In Write-Back caching, the controller sends a data transfer completion signal to the host when the controller cache has received all the data in a transaction. The controller then writes the cached data to the storage device in the background.

The risk of using Write-Back cache is that the cached data can be lost if there is a power failure before it is written to the storage device. This risk is mitigated by using a BBU on PERC H700 or H800 cards. Write-Back caching has a performance advantage over Write-Through caching. The default cache setting for virtual disks is Write-Back caching. Certain data patterns and configurations perform better with a Write-Through cache policy. Write-Back caching is used under all conditions in which the battery is present and in good condition."

On 28 May 2015, at 01:20, Bruce Momjian <bruce@momjian.us> wrote:

> On Thu, May 21, 2015 at 11:21:49AM +0000, Graeme B. Bell wrote:
>>> Not using your raid controllers write cache then? Not sure just how
>>> important that is with SSDs these days, but if you've got a BBU set
>>> it to "WriteBack". Also change "Cache if Bad BBU" to "No Write Cache
>>> if Bad BBU" if you do that.
>>
>> I did quite a few tests with WB and WT last year.
>>
>> - WT should be OK with e.g. Intel SSDs.
>> From memory I saw write
>> performance gains of about 20-30% with Crucial M500/M550 writes on a
>> Dell H710 RAID controller. BUT that controller didn't have WT fastpath
>> though which is absolutely essential to see substantial gains WT. I
>> expect with WT and a fastpath enabled RAID you'd see much higher
>> numbers, e.g. 100%+ higher IOPS.
>>
>> (So, if you don't have fastpath on your controller, you might as
>> well plan to leave WB on and just buy cheaper SSD drives rather than
>> expensive ones - the raid controller will be your choke point for
>> performance on WT and it's a source of risk).
>>
>> - WT with most SSDs will likely corrupt your postgres database the
>> first time you lose power. (on all the drives I've tested)
>>
>> - WB is the only safe option unless you have done lots of plug pull
>> tests on a drive that is guaranteed to protect data "in flight" during
>> power loss (Intel disks + maybe the new samsung pcie).
>
> I think you have WT (write-through) and WB (write-back) reversed above.
>
> --
> Bruce Momjian <bruce@momjian.us> http://momjian.us
> EnterpriseDB http://enterprisedb.com
>
> + Everyone has their own god. +
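
For readers who want to see what "losing fsync'd data" means in practice, here is a minimal sketch of the idea behind the plug-pull tests described in this message. It is not diskchecker.pl and does not reproduce its protocol; the function names and the record format are hypothetical, chosen only for illustration. The writer appends a sequence number and calls fsync after each one, and the checker reports any acknowledged sequence number that is missing afterwards. In a real power-loss test the list of acknowledged writes would have to be kept on a different machine, since local state is lost along with the power.

```python
import os
import struct
import tempfile

# Hypothetical record format: one 8-byte little-endian sequence number.
RECORD = struct.Struct("<Q")


def write_acked_records(path, count):
    """Append `count` records, fsync after each, and return the acked sequence numbers."""
    acked = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        for seq in range(count):
            os.write(fd, RECORD.pack(seq))
            os.fsync(fd)       # once this returns, the storage stack claims durability
            acked.append(seq)  # a real test would report this to another machine
    finally:
        os.close(fd)
    return acked


def missing_after_restart(path, acked):
    """Return the acknowledged sequence numbers that are absent from the file."""
    with open(path, "rb") as f:
        data = f.read()
    usable = len(data) - len(data) % RECORD.size  # ignore a torn trailing record
    found = {
        RECORD.unpack_from(data, offset)[0]
        for offset in range(0, usable, RECORD.size)
    }
    return [seq for seq in acked if seq not in found]


if __name__ == "__main__":
    # Demo without a power cut: nothing should be reported as missing.
    with tempfile.TemporaryDirectory() as tmp:
        log_path = os.path.join(tmp, "fsync_log.bin")
        acked = write_acked_records(log_path, 100)
        print("missing:", missing_after_restart(log_path, acked))
```

Any non-empty result after a real power cut means the drive or controller acknowledged an fsync that it did not honour, which is the failure reported above for WT with the Crucial drives.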