Thread: raid writethrough mode (WT), ssds and your DB. (was Performances issues with SSD volume ?)
raid writethrough mode (WT), ssds and your DB. (was Performances issues with SSD volume ?)
From
"Graeme B. Bell"
Date:
> Not using your raid controllers write cache then? Not sure just how important that is with SSDs these days, but if you've got a BBU set it to "WriteBack". Also change "Cache if Bad BBU" to "No Write Cache if Bad BBU" if you do that.

I did quite a few tests with WB and WT last year.

- WT should be OK with e.g. Intel SSDs. From memory I saw write performance gains of about 20-30% with Crucial M500/M550 writes on a Dell H710 RAID controller. BUT that controller didn't have WT fastpath, which is absolutely essential to see substantial gains from WT. I expect with WT and a fastpath-enabled RAID you'd see much higher numbers, e.g. 100%+ higher IOPS.

(So, if you don't have fastpath on your controller, you might as well plan to leave WB on and just buy cheaper SSD drives rather than expensive ones - the RAID controller will be your choke point for performance on WT and it's a source of risk.)

- WT with most SSDs will likely corrupt your postgres database the first time you lose power (on all the drives I've tested).

- WB is the only safe option unless you have done lots of plug-pull tests on a drive that is guaranteed to protect data "in flight" during power loss (Intel disks + maybe the new Samsung PCIe).

A relevant anecdatum...

A certain company makes SSD drives; for talking's sake let's call two of their models the XY00 and the XY50. These were popular SSD drives that were advertised everywhere as having power loss protection throughout 2013-2014. We bought lots of them here because of that 'power loss protection' aspect + a good price + performance + a good reliability record + the good name of the company.

When I tested with the famous 'diskchecker.pl' tool (* link at end), I found that they don't actually provide full power loss protection. Some data in flight (even fsyncs!) was lost.

I tested using several computers, several copies of each disk model, with "XY00" and "XY50" models, and with and without RAID controllers.

The only way I could keep the data safe for fsyncs and DB use with these drives during power failure was either a) use a RAID controller with WB, or b) disable the SSD cache, which is horrifyingly bad for performance.

So I wrote to the company's engineering in early August 2014 about this (because we had spent quite a lot of money on these disks) and corresponded with a QA engineer to show them my results and show them how to reproduce the data loss problem, asking if maybe they could produce a firmware patch or some other fix.

At first they were extremely interested to know more. Then once they had the information to fully reproduce the bug, they went silent and wouldn't reply to any emails.

About 1-2 months later, articles started appearing on enthusiast tech sites. Not new firmware, just company product reps explaining that "power loss protection" doesn't really mean all your data is protected from power loss, and that it's unreasonable to expect the drive to do what it says on the box. :-(

Lessons to take away:

--- WT + many SSDs + power loss = likely DB corruption.

--- No RAID card + many SSDs + power loss = likely DB corruption.

--- WB + many SSDs + power loss = should be fine, but you must test it a few times.

--- Never use WT mode on any production system until you've run a ton of tests on the drive's ability to honor fsyncs.

--- Never trust any vendor to provide correctly working equipment, regardless of how often they make promises in advertising. Buy the smallest amount possible and test it first yourself in the most realistic environment possible. That goes for RAID controllers advertised as having fastpath, which actually didn't, and SSDs heavily advertised as having power loss protection, which actually didn't protect all data from power loss.

--- Oh, and NEVER do a power loss test by holding the power button. On every machine I've tested with SSDs, a power button shutdown (e.g. hold power for 5 seconds till it turns off) did not create lost data, whereas a plug-pull test (yank the power cord out of the power supply) always produced lost data. The plug-pull test reproduces a real-life power failure more accurately. The power button test will only give you an illusion of safety.

Graeme Bell

p.s. https://gist.github.com/bradfitz/3172656 - diskchecker.pl
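For anyone curious what diskchecker.pl is actually doing, here is a stripped-down sketch of the same idea in Python - not the real tool, just an illustration, and the file path and record size are assumptions. Run the write phase over ssh from a second machine so the "acked" log survives the plug pull, then run the verify phase after power is restored:

import hashlib, os, struct, sys

TEST_FILE = "/mnt/ssdtest/fsync_ledger.dat"   # hypothetical path on the volume under test
RECORD = 4096                                 # one fsync'd record per 4 KiB slot
HEADER = struct.Struct("<Q32s")               # sequence number + sha256 of the payload

def write_phase():
    # Write records forever; each one is fsync'd before we report it as acknowledged.
    # Capture the 'acked' output on the *other* machine (e.g. via ssh).
    fd = os.open(TEST_FILE, os.O_WRONLY | os.O_CREAT, 0o600)
    seq = 0
    while True:
        payload = os.urandom(RECORD - HEADER.size)
        rec = HEADER.pack(seq, hashlib.sha256(payload).digest()) + payload
        os.pwrite(fd, rec, seq * RECORD)
        os.fsync(fd)                          # the storage stack now claims this record is durable
        print("acked", seq, flush=True)
        seq += 1

def verify_phase(last_acked):
    # After rebooting, check that every record reported as 'acked' really made it to disk.
    lost = 0
    with open(TEST_FILE, "rb") as f:
        for seq in range(last_acked + 1):
            rec = f.read(RECORD)
            ok = False
            if len(rec) == RECORD:
                stored_seq, stored_digest = HEADER.unpack(rec[:HEADER.size])
                ok = (stored_seq == seq and
                      hashlib.sha256(rec[HEADER.size:]).digest() == stored_digest)
            if not ok:
                lost += 1
                print("LOST: record", seq, "was acknowledged but is missing or corrupt")
    print("verify complete,", lost, "acknowledged records lost")

if __name__ == "__main__":
    if sys.argv[1] == "write":
        write_phase()
    else:
        verify_phase(int(sys.argv[2]))        # the last 'acked' number you captured

If any acknowledged record is missing after a real plug pull, the drive (or something between it and the application) is not honoring fsync.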
Re: raid writethrough mode (WT), ssds and your DB. (was Performances issues with SSD volume ?)
From
Bruce Momjian
Date:
On Thu, May 21, 2015 at 11:21:49AM +0000, Graeme B. Bell wrote:

> > Not using your raid controllers write cache then? Not sure just how important that is with SSDs these days, but if you've got a BBU set it to "WriteBack". Also change "Cache if Bad BBU" to "No Write Cache if Bad BBU" if you do that.
>
> I did quite a few tests with WB and WT last year.
>
> - WT should be OK with e.g. Intel SSDs. From memory I saw write performance gains of about 20-30% with Crucial M500/M550 writes on a Dell H710 RAID controller. BUT that controller didn't have WT fastpath though which is absolutely essential to see substantial gains WT. I expect with WT and a fastpath enabled RAID you'd see much higher numbers, e.g. 100%+ higher IOPS.
>
> (So, if you don't have fastpath on your controller, you might as well plan to leave WB on and just buy cheaper SSD drives rather than expensive ones - the raid controller will be your choke point for performance on WT and it's a source of risk).
>
> - WT with most SSDs will likely corrupt your postgres database the first time you lose power. (on all the drives I've tested)
>
> - WB is the only safe option unless you have done lots of plug pull tests on a drive that is guaranteed to protect data "in flight" during power loss (Intel disks + maybe the new samsung pcie).

I think you have WT (write-through) and WB (write-back) reversed above.

--
Bruce Momjian <bruce@momjian.us>        http://momjian.us
EnterpriseDB                            http://enterprisedb.com

+ Everyone has their own god. +
Re: raid writethrough mode (WT), ssds and your DB. (was Performances issues with SSD volume ?)
From
"Graeme B. Bell"
Date:
Hi Bruce

I'm *extremely* certain of what I say when I say WB+BBU=good and direct WT=bad.

WB on the controller uses the battery-backed RAID controller cache for writes, to ensure all writes do eventually get written to the disk in the event of a power failure.

WT on the controller bypasses the battery-backed cache and sends writes directly to the SSD. If the SSD doesn't have its own sufficient capacitor backing, they're gone. See the manual page quote below.

- With H710 WT and SSD cache enabled, the SSDs I tested were proven to lose data that was meant to have been already fsync'd. The capacitor was insufficient and the firmware lied about performing an fsync.

- With H710 WB and SSD cache enabled, the SSDs didn't lose writes; I have yet to see a failed fsync in any of the many dozens of tests I ran on several machines and disks*.

- Without the H710 and with SSD cache enabled, i.e. WT direct to drive, I always lost writes that were meant to be fsync'd.

- Without the H710 and with SSD cache disabled, I never lost writes.

There are two possible reasons the writes always hit the drive successfully in every test with controller WB and SSD disk cache enabled. Either

a) The SSDs perhaps did lose fsync'd data, but the controller didn't. The battery-backed RAID controller 512MB-1024MB cache ensured fsync'd writes were completed by reinitiating the write after power up, as it would with a hard drive after power loss. I have not been able to find sufficiently detailed technical documentation for these cards to find out exactly what they do after power loss in terms of disk communication, replaying writes, etc. I only have my measured results. However, it's also possible that ...

b) From extensive plug-pull testing it appears the capacitors in the Crucial drives are just *slightly* too small to save all data in flight; there is always a very tiny number of fsync'd writes that don't make it to disk. So it is entirely possible that the *fairly slow* writeback cache on the Dell controller, which substantially *reduces* the IOPS of the SSD, is consistently limiting the amount of data held in the cache on the SSD, such that all the data can be saved using the caps on the SSD. By effectively running at half speed with the RAID controller cache as a choke point, you are never in a situation where the last few writes just don't quite make it to disk, because half the SSD cache is sitting empty.

Also, keep in mind I am reporting results from many dozens of runs of diskchecker.pl. It is possible that writes may be lost in a way that diskchecker.pl does not detect. It is also possible that there is e.g. a 1-in-1000 situation I haven't found yet. For example, really heavy sequential writes with interspersed fsyncs rather than just heaps of fsyncs.

For the avoidance of doubt:

In several afternoons of testing, I have *never* managed to lose fsync'd data from Crucial M500/M550 disks combined with a battery-backed RAID controller in writeback mode (WB).

In several afternoons of testing, I have *always* lost a small amount of fsync'd data from Crucial M500/M550 disks combined with a battery-backed RAID controller in write-through mode (WT).

I and others bought the M500/M550 on the back of the advertised capacitor-backed cache, but I wouldn't ever trust manufacturer capacitor claims in future. Drive power-failure recovery really is something that needs testing by many customers to ascertain the truth of the matter. (Long story short: I now recommend people buy an Intel model, since it has proven most trustworthy in terms of manufacturer claims.)
Graeme Bell

https://www.dell.com/downloads/global/products/pvaul/en/perc-technical-guidebook.pdf

"DELL PERC H700 and H800 Technical Guide, 4.10 Virtual Disk Write Cache Policies

The write cache policy of a virtual disk determines how the controller handles writes to that virtual disk. Write-Back and Write-Through are the two write cache policies and can be set on virtual disks individually.

All RAID volumes will be presented as Write-Through (WT) to the operating system (Windows and Linux) independent of the actual write cache policy of the virtual disk. The PERC cards manage the data in cache independently of the operating system or any applications. You can use OpenManage or the BIOS configuration utility to view and manage virtual disk cache settings.

In Write-Through caching, the controller sends a data-transfer completion signal to the host system when the disk subsystem has received all the data in a transaction. In Write-Back caching, the controller sends a data-transfer completion signal to the host when the controller cache has received all the data in a transaction. The controller then writes the cached data to the storage device in the background.

The risk of using Write-Back cache is that the cached data can be lost if there is a power failure before it is written to the storage device. This risk is mitigated by using a BBU on PERC H700 or H800 cards. Write-Back caching has a performance advantage over Write-Through caching. The default cache setting for virtual disks is Write-Back caching. Certain data patterns and configurations perform better with a Write-Through cache policy.

Write-Back caching is used under all conditions in which the battery is present and in good condition."

On 28 May 2015, at 01:20, Bruce Momjian <bruce@momjian.us> wrote:

> On Thu, May 21, 2015 at 11:21:49AM +0000, Graeme B. Bell wrote:
>>> Not using your raid controllers write cache then? Not sure just how important that is with SSDs these days, but if you've got a BBU set it to "WriteBack". Also change "Cache if Bad BBU" to "No Write Cache if Bad BBU" if you do that.
>>
>> I did quite a few tests with WB and WT last year.
>>
>> - WT should be OK with e.g. Intel SSDs. From memory I saw write performance gains of about 20-30% with Crucial M500/M550 writes on a Dell H710 RAID controller. BUT that controller didn't have WT fastpath though which is absolutely essential to see substantial gains WT. I expect with WT and a fastpath enabled RAID you'd see much higher numbers, e.g. 100%+ higher IOPS.
>>
>> (So, if you don't have fastpath on your controller, you might as well plan to leave WB on and just buy cheaper SSD drives rather than expensive ones - the raid controller will be your choke point for performance on WT and it's a source of risk).
>>
>> - WT with most SSDs will likely corrupt your postgres database the first time you lose power. (on all the drives I've tested)
>>
>> - WB is the only safe option unless you have done lots of plug pull tests on a drive that is guaranteed to protect data "in flight" during power loss (Intel disks + maybe the new samsung pcie).
>
> I think you have WT (write-through) and WB (write-back) reversed above.
>
> --
> Bruce Momjian <bruce@momjian.us>        http://momjian.us
> EnterpriseDB                            http://enterprisedb.com
>
> + Everyone has their own god. +
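As an aside on the "SSD cache enabled/disabled" test matrix above: on a Linux host with a SATA SSD attached directly (not behind the PERC - drives behind a RAID controller generally need the controller's own management tool instead), the drive's volatile write cache can be inspected and switched off with hdparm. A minimal sketch, run as root, assuming hdparm is installed and using a hypothetical test device /dev/sdb - never point this at a production disk:

import subprocess

DEVICE = "/dev/sdb"   # hypothetical test drive, adjust for your system

def show_write_cache(dev):
    # 'hdparm -W <device>' reports whether the drive's volatile write cache is enabled.
    subprocess.run(["hdparm", "-W", dev], check=True)

def disable_write_cache(dev):
    # 'hdparm -W 0 <device>' turns the drive's volatile write cache off
    # (the slow-but-honest option discussed above); '-W 1' turns it back on.
    subprocess.run(["hdparm", "-W", "0", dev], check=True)

if __name__ == "__main__":
    show_write_cache(DEVICE)
    disable_write_cache(DEVICE)
    show_write_cache(DEVICE)

Note the setting does not necessarily persist across power cycles, so re-check it after any plug-pull test.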
Re: raid writethrough mode (WT), ssds and your DB. (was Performances issues with SSD volume ?)
From
Bruce Momjian
Date:
On Thu, May 28, 2015 at 01:37:48PM +0000, Graeme B. Bell wrote:
> Hi Bruce
>
> I'm *extremely* certain of what I say when I say WB+BBU=good and direct WT=bad.

It is my understanding that write-through is always safe as it writes through to the layer below and waits for acknowledgement. Write-back doesn't, so when you say:

> WT should be OK with e.g. Intel SSDs.

I assume you mean Write-Back is OK because the drive has a BBU, while Write-Through is always safe. You also say:

> >> - WT with most SSDs will likely corrupt your postgres database the
> >> first time you lose power. (on all the drives I've tested)

which contradicts what you said above, and I assume you mean write-back here.

--
Bruce Momjian <bruce@momjian.us>        http://momjian.us
EnterpriseDB                            http://enterprisedb.com

+ Everyone has their own god. +
Re: raid writethrough mode (WT), ssds and your DB. (was Performances issues with SSD volume ?)
From
"Graeme B. Bell"
Date:
Hi Bruce,

> It is my understanding that write-through is always safe as it writes
> through to the layer below and waits for acknowledgement. Write-back
> doesn't, so when you say:

I said WT twice, and I assure you for a third time - in the tests I've carried out on Crucial SSD disks, **WT** was not safe or reliable, whereas BATTERY-BACKED WB was.

I believe it is confusing/surprising to you because your beliefs about WT and WB, and about the persistence of fsyncs to SSD disks, are inconsistent with what happens in reality. Specifically: SSDs tell lies to the SATA/RAID controller about what they're really doing, but unfortunately WT mode trusts the SSD to be honest.

Hopefully this will help explain what's happening:

1. Historically, RAID writeback caches were unreliable for fsyncs because with no battery to persist data (and no non-volatile WB cache, as we see on modern RAID controllers), anything in the WB cache would be lost during a power fail. So, historically your safe options were: WB cache with battery (safe since the cache 'never loses power'), or WT to disk (safe if the disk can be trusted to persist writes through a power loss).

2. If you use WT at the RAID controller level, then 'in principle' an fsync call should not return until the data is safely on the drive. For postgres, the fsync call is the most important thing. Regular data writes can fail and crash recovery is possible via WAL replay. But if fsyncs don't work properly, crash recovery is probably not possible.

3. In reality, on most SSDs, if you use WT on the RAID controller to go direct to disk and make an fsync call, the drive's controller tells you the data is persisted while behind the scenes the data is actually held in a volatile writeback cache and is vulnerable to catastrophic loss during a power failure.

4. So basically: the SSD drive's controller lies about fsync and does not honor the 'fsync contract'.

5. It's not just SSDs. If you are using e.g. a NAS system, perhaps storing your DB over NFS, then chances are the NAS is almost certainly lying to you when you make fsync calls.

6. It's not just SSDs and NAS systems. If you use a virtual server / VMware, then to assist performance, the virtualised disk may be lying about fsync persistence.

7. Why do SSDs, NAS systems and VMs lie about fsync persistence? Because it improves the apparent performance, benchmarks and so on, at the expense of corruption which happens only rarely, during power failures. In some applications this is acceptable, e.g. network file shares for powerpoints and word documents, mailservers... but for postgres it's not acceptable.

8. The only way to get a real idea about WT and WB and what is happening under the hood is to fire up diskchecker.pl and measure the result in the real world when you pull the plug (many times). Then you'll know for sure what happens in terms of performance and write persistence. You should not trust anything you read online written by other people, especially by drive manufacturers - test for yourself if your database matters to you. Remember to use a 'real' plug pull - yank the cord out of the back - don't simply power off with the button or you'll get incorrect results. That said, the Intel drives have a great reputation and every report I've read indicates they work correctly with WT or direct connections.

9. That said - with non-Intel SSD drives, you may be able to use WT if you turn off all caching on the SSD, which may stop the SSD lying about what it's doing. On the disks I've tested, this will allow fsync writes to hit the disk correctly in WT mode. However! The costs are enormous - a substantial increase in SSD wear (e.g. much earlier failure) and a massive decrease in performance (a 96% drop, i.e. you get 4% of the performance you expect out of your disk - it can actually be slower than an HDD!).

Example (from memory, but I'm very certain about these numbers): I measured ~100 disk operations per second with diskchecker.pl and Crucial M500s when all disk cache was disabled and WT was used. This compared with ~2000-2500 diskchecker operations per second with a *battery-backed* WB cache and disk cache enabled. In both cases, data persisted correctly and was not corrupted during repeated plug-pull testing. (A rough sketch for measuring this kind of figure yourself follows below.)

For what it's worth: a similar problem comes up with RAID controllers and the probabilities of RAID failure, and the massive gap between theory and practice. You can read any statistic you like about the likelihood of e.g. RAID6 failing due to consecutive disk failures (e.g. this ACM paper taking into account UBE errors with RAID5/6 suggests a 0.00639% risk of loss per year, http://queue.acm.org/detail.cfm?id=1670144), but in practice what kills your database is probably the RAID controller card failing, or having a bug, or an unreliable battery... (e.g. http://www.webhostingtalk.com/showthread.php?t=1075803, e.g. 25% failure rates on some models!).

Graeme Bell
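If you want a ballpark figure like the ones above for your own hardware, a minimal fsync-rate probe is only a few lines of Python. This is not a durability test (it proves nothing about what survives a plug pull, that's what diskchecker.pl is for); it only counts acknowledged write+fsync cycles per second. The file path and duration are assumptions - point it at a file on the volume under test:

import os, time

TEST_FILE = "/mnt/ssdtest/fsync_probe.dat"    # hypothetical path on the volume under test
DURATION = 10.0                               # seconds to run the probe for

def fsyncs_per_second(path, seconds):
    # Rewrite the same 4 KiB block and fsync it in a tight loop, counting how
    # many acknowledged write+fsync cycles the storage stack completes.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    block = b"x" * 4096
    count = 0
    deadline = time.monotonic() + seconds
    try:
        while time.monotonic() < deadline:
            os.pwrite(fd, block, 0)
            os.fsync(fd)
            count += 1
    finally:
        os.close(fd)
    return count / seconds

if __name__ == "__main__":
    rate = fsyncs_per_second(TEST_FILE, DURATION)
    print("~%.0f write+fsync cycles per second" % rate)
    # A suspiciously high number on a plain consumer SSD usually means the
    # drive is acknowledging fsync from its volatile cache; only plug-pull
    # testing tells you whether those fsyncs actually survive a power loss.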