Thread: raid writethrough mode (WT), ssds and your DB. (was: Performance issues with SSD volume?)

> Not using your raid controller's write cache then?  Not sure just how important that is with SSDs these days, but if
> you've got a BBU, set it to "WriteBack". Also change "Cache if Bad BBU" to "No Write Cache if Bad BBU" if you do that.

I did quite a few tests with WB and WT last year.

- WT should be OK with e.g. Intel SSDs. From memory, I saw write performance gains of about 20-30% with Crucial
M500/M550 writes on a Dell H710 RAID controller. BUT that controller didn't have WT fastpath, which is absolutely
essential to see substantial gains from WT. I expect that with WT and a fastpath-enabled RAID controller you'd see
much higher numbers, e.g. 100%+ higher IOPS.

(So, if you don't have fastpath on your controller, you might as well plan to leave WB on and just buy cheaper SSD
drives rather than expensive ones - the raid controller will be your choke point for performance in WT mode, and it's
a source of risk.)

- WT with most SSDs will likely corrupt your postgres database the first time you lose power (on all the drives I've
tested).

- WB is the only safe option unless you have done lots of plug-pull tests on a drive that is guaranteed to protect
data "in flight" during power loss (Intel disks + maybe the new Samsung PCIe).
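To see why honest fsyncs matter so much here: postgres only acknowledges a commit after the WAL record has been
fsync'd, and crash recovery assumes every acknowledged record really reached the disk. A toy sketch of that pattern -
plain Python, hypothetical file name, nothing like the real postgres code:

    # fsync_contract_sketch.py -- illustrative only. The durability rule
    # every DB relies on: acknowledge a commit ONLY after fsync returns.
    # If the drive returns from fsync before the data is truly persistent,
    # an acknowledged commit can vanish in a power cut -> corruption.
    import os

    WAL_PATH = "/mnt/testdisk/wal.log"   # hypothetical WAL file

    wal_fd = os.open(WAL_PATH, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)

    def commit(record: bytes) -> None:
        os.write(wal_fd, record + b"\n")
        os.fsync(wal_fd)   # the 'fsync contract': durable once this returns
        # only now is it safe to tell the client the commit succeeded
        print("committed:", record.decode())

    commit(b"INSERT 42")

A drive that lies about fsync breaks the one assumption this pattern rests on.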


A relevant anecdatum...


A certain company makes ssd drives; for talking's sake, let's call two of their models the XY00 and the XY50. These
were popular SSD drives that were advertised everywhere as having power loss protection throughout 2013-2014. We
bought lots of them here because of that 'power loss protection' aspect + a good price + performance + a good
reliability record + the good name of the company.

When I tested with the famous 'diskchecker.pl' tool (* link at end), I found that they don't actually provide full
power loss protection. Some data in flight (even fsyncs!) was lost.
I tested using several computers and several copies of each disk model ("XY00" and "XY50"), with and without RAID
controllers.

The only way I could keep the data safe for fsyncs and DB use with these drives during power failure was either a) use
a RAID controller with WB, or b) disable the ssd cache, which is horrifyingly bad for performance.

So I wrote to the company's engineering team in early August 2014 about this (because we had spent quite a lot of
money on these disks) and corresponded with a QA engineer, showing them my results and how to reproduce the data loss
problem, and asking if maybe they could produce a firmware patch or some other fix.

At first they were extremely interested to know more. Then, once they had the information to fully reproduce the bug,
they went silent and wouldn't reply to any emails.

About 1-2 months later, articles started appearing on enthusiast tech sites. Not new firmware, just company product
reps explaining that "power loss protection" doesn't really mean all your data is protected from power loss, and that
it's unreasonable to expect the drive to do what it says on the box.

:-(


Lessons to take away:

--- WT + many SSDs + power loss = likely DB corruption.

--- No raid card + many SSDs + power loss = likely DB corruption.

--- WB + many SSDs + power loss = should be fine but you must test it a few times.

--- Never use WT mode on any production system until you've run a ton of tests on the drive's ability to honor fsyncs.

--- Never trust any vendor to provide correctly working equipment, regardless of how often they make promises in
advertising. Buy the smallest amount possible and test it first yourself in the most realistic environment possible.
That goes for RAID controllers advertised as having fastpath, which actually didn't, and ssds heavily advertised as
having power loss protection, which actually didn't protect all data from power loss.

--- Oh, and NEVER do a power loss test by holding the power button. On every machine I've tested with SSDs, a power
button shutdown (e.g. hold power for 5 seconds till it turns off) did not create lost data, whereas a plug-pull test
(yank the power out of the power supply) always produced lost data. The plug-pull test reproduces a real-life power
failure more accurately. The power button test will only give you an illusion of safety.

Graeme Bell



p.s. https://gist.github.com/bradfitz/3172656      - diskchecker.pl
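p.p.s. Roughly how diskchecker.pl gets used (from memory of the gist's usage text - double-check against the script
itself before relying on this; host and file names below are just placeholders): run the listener on a second machine
that keeps its power, point the machine under test at it, yank the plug mid-run, then verify after reboot.

    on the safe machine:   diskchecker.pl -l
    on the test machine:   diskchecker.pl -s safehost create testfile 500
    ... pull the plug on the test machine, reboot ...
    on the test machine:   diskchecker.pl -s safehost verify testfile

verify reports writes the listener saw acknowledged (i.e. fsync had returned) that aren't actually on disk after the
power cut - those are the lies.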





On Thu, May 21, 2015 at 11:21:49AM +0000, Graeme B. Bell wrote:
> > Not using your raid controller's write cache then?  Not sure just how
> > important that is with SSDs these days, but if you've got a BBU, set
> > it to "WriteBack". Also change "Cache if Bad BBU" to "No Write Cache
> > if Bad BBU" if you do that.
>
> I did quite a few tests with WB and WT last year.
>
> - WT should be OK with e.g. Intel SSDs.  From memory, I saw write
> performance gains of about 20-30% with Crucial M500/M550 writes on a
> Dell H710 RAID controller. BUT that controller didn't have WT fastpath,
> which is absolutely essential to see substantial gains from WT. I
> expect that with WT and a fastpath-enabled RAID controller you'd see
> much higher numbers, e.g. 100%+ higher IOPS.
>
> (So, if you don't have fastpath on your controller, you might as
> well plan to leave WB on and just buy cheaper SSD drives rather than
> expensive ones - the raid controller will be your choke point for
> performance in WT mode, and it's a source of risk.)
>
> - WT with most SSDs will likely corrupt your postgres database the
> first time you lose power (on all the drives I've tested).
>
> - WB is the only safe option unless you have done lots of plug-pull
> tests on a drive that is guaranteed to protect data "in flight" during
> power loss (Intel disks + maybe the new Samsung PCIe).

I think you have WT (write-through) and WB (write-back) reversed above.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +


Hi Bruce

I'm *extremely* certain of what I say when I say WB+BBU=good and direct WT=bad.

WB on the controller uses the battery-backed RAID controller cache for writes, to ensure all writes do eventually get
written to the disk in the event of a power failure.

WT on the controller bypasses the battery-backed cache and sends writes directly to the SSD. If the SSD doesn't have
sufficient capacitor backing of its own, those writes are gone.
See the manual page quote below.

- With H710 WT, ssd cache enabled, the SSDs I tested were proven to lose data that was meant to have been already
fsync'd. The capacitor was insufficient and the firmware lied about performing an fsync.

- With H710 WB, ssd cache enabled, the SSDs didn't lose writes. I have yet to see a failed fsync in any of the many
dozens of tests I ran on several machines and disks*.

- Without the H710 (i.e. WT direct to drive) and with ssd cache enabled, I always lost writes that were meant to be
fsync'd.

- Without the H710 and with ssd cache disabled, I never lost writes.


There are two possible reasons the writes always hit the drive successfully in every test with controller WB and ssd
disk cache enabled.

Either a) the SSDs perhaps did lose fsync'd data, but the controller didn't. The battery-backed raid controller's
512MB-1024MB cache ensured fsync'd writes were completed by reinitiating the write after power-up, as it would with a
hard drive after power loss. I have not been able to find sufficiently detailed technical documentation for these
cards to find out exactly what they do after power loss in terms of disk communication, replaying writes, etc. I only
have my measured results.

However, it's also possible that... b) from extensive plug-pull testing, it appears the capacitors in the Crucial
drives are just *slightly* too small to save all the data in flight; there is always a very tiny number of fsync'd
writes that don't make it to disk. So it is entirely possible that the *fairly slow* writeback cache on the Dell
controller, which substantially *reduces* the IOPS of the ssd, is consistently limiting the amount of data held in
the cache on the ssd, such that all of it can be saved using the caps on the ssd disk. By effectively running at half
speed with the raid controller cache as a choke point, you are never in a situation where the last few writes just
don't quite make it to disk, because half the ssd cache is sitting empty.

Also, keep in mind I am reporting results from many dozens of runs of diskchecker.pl. It is possible that writes may
be lost in a way that diskchecker.pl does not detect. It is also possible that there is e.g. a 1-in-1000 situation I
haven't found yet - for example, really heavy sequential writes with interspersed fsyncs rather than just heaps of
fsyncs (see the sketch below).
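If anyone wants to probe that pattern, here's roughly the kind of workload I mean - a minimal Python sketch with a
hypothetical mount point, not the actual diskchecker.pl. It streams big sequential writes and fsyncs every few MB, so
a plug pull mid-run tests whether the fsync'd prefix survives:

    # seqwrite_sketch.py -- illustrative only: heavy sequential writes
    # with interspersed fsyncs. Pull the plug mid-run; after reboot,
    # check whether everything up to the last reported fsync survived.
    import os

    PATH = "/mnt/testdisk/seqwrite.dat"   # hypothetical mount point
    CHUNK = 1024 * 1024                   # 1 MB sequential chunks
    FSYNC_EVERY = 8                       # fsync every 8 MB

    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT, 0o644)
    written_mb = 0
    while True:
        os.write(fd, b"\xab" * CHUNK)
        written_mb += 1
        if written_mb % FSYNC_EVERY == 0:
            os.fsync(fd)   # drive claims durability up to here
            # report the acknowledged high-water mark somewhere that is
            # NOT on the disk under test (e.g. a remote terminal).
            print("fsync'd up to %d MB" % written_mb, flush=True)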


For the avoidance of doubt:

In several afternoons of testing, I have *never* managed to lose fsync'd data from Crucial M500/M550 disks combined
with a battery-backed raid controller in write-back mode (WB).
In several afternoons of testing, I have *always* lost a small amount of fsync'd data from Crucial M500/M550 disks
combined with a battery-backed raid controller in write-through mode (WT).


I and others bought the M500/M550 on the back of the advertised capacitor-backed cache, but I wouldn't ever trust
manufacturer capacitor claims in future. Drive power-failure recovery really is something that needs testing by many
customers to ascertain the truth of the matter. (E.g., long story short, I recommend people buy an Intel model now,
since it has proven most trustworthy in terms of manufacturer claims.)


Graeme Bell


https://www.dell.com/downloads/global/products/pvaul/en/perc-technical-guidebook.pdf

"DELL PERC H700 and H800 Technical Guide 20
4.10 Virtual Disk Write Cache Policies
The write cache policy of a virtual disk determines how the controller handles writes to that virtual disk.
Write-Back and Write-Through are the two write cache policies and can be set on virtual disks individually.
All RAID volumes will be presented as Write-Through (WT) to the operating system (Windows and Linux) independent of
the actual write cache policy of the virtual disk. The PERC cards manage the data in cache independently of the
operating system or any applications. You can use OpenManage or the BIOS configuration utility to view and manage
virtual disk cache settings.
In Write-Through caching, the controller sends a data-transfer completion signal to the host system when the disk
subsystem has received all the data in a transaction. In Write-Back caching, the controller sends a data-transfer
completion signal to the host when the controller cache has received all the data in a transaction. The controller
then writes the cached data to the storage device in the background.
The risk of using Write-Back cache is that the cached data can be lost if there is a power failure before it is
written to the storage device. This risk is mitigated by using a BBU on PERC H700 or H800 cards. Write-Back caching
has a performance advantage over Write-Through caching. The default cache setting for virtual disks is Write-Back
caching. Certain data patterns and configurations perform better with a Write-Through cache policy.
Write-Back caching is used under all conditions in which the battery is present and in good condition."








On Thu, May 28, 2015 at 01:37:48PM +0000, Graeme B. Bell wrote:
> Hi Bruce
>
> I'm *extremely* certain of what I say when I say WB+BBU=good and direct WT=bad.

It is my understanding that write-through is always safe as it writes
through to the layer below and waits for acknowledgement.  Write-back
doesn't, so when you say:

> WT should be OK with e.g. Intel SSDs.

I assume you mean Write-Back is OK because the drive has a BBU, while
Write-Through is always safe.

You also say:

> >> - WT with most SSDs will likely corrupt your postgres database the
> >> first time you lose power. (on all the drives I've tested)

which contradicts what you said above, and I assume you mean write-back
here.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +



Hi Bruce,

> It is my understanding that write-through is always safe as it writes
> through to the layer below and waits for acnoledgement.  Write-back
> doesn't, so when you say:

I said WT twice, and I assure you for a third time - in the tests I've carried out on Crucial SSD disks, **WT** was
not safe or reliable, whereas BATTERY-BACKED WB was.

I believe it is confusing/surprising to you because your beliefs about WT and WB, and about persistence of fsyncs to
SSD disks, are inconsistent with what happens in reality.
Specifically: ssds tell lies to the sata/raid controller about what they're really doing, but unfortunately WT mode
trusts the ssd to be honest.

Hopefully this will help explain what's happening:

1. Historically, RAID writeback caches were unreliable for fsyncs because, with no battery to persist data (and no
nonvolatile WB cache, as we see on modern raid controllers), anything in the WB cache would be lost during a power
failure. So, historically, your safe options were: WB cache with battery (safe since the cache 'never loses power'),
or WT to disk (safe if the disk can be trusted to persist writes through a power loss).

2. If you use WT at the raid controller level, then 'in principle' an fsync call should not return until the data is
safely on the drive. For postgres, the fsync call is the most important thing. Regular data writes can fail and crash
recovery is possible via WAL replay. But if fsyncs don't work properly, crash recovery is probably not possible.

3. In reality, on most SSDs, if you use WT on the RAID controller to go direct to disk and make an fsync call, then
the drive's controller tells you the data is persisted, while behind the scenes the data is actually held in a
volatile writeback cache and is vulnerable to catastrophic loss during a power failure.

4. So basically: the SSD drive's controller lies about fsync and does not honor the 'fsync contract'.

5. It's not just SSDs. If you are using e.g. a NAS system, perhaps storing your DB over NFS, then the NAS is almost
certainly lying to you when you make fsync calls.

6. It's not just SSDs and NAS systems. If you use a virtual server / VMware, then to assist performance, the
virtualised disk may be lying about fsync persistence.

7. Why do SSDs, NAS systems and VMs lie about fsync persistence? Because it improves the apparent performance,
benchmarks and so on, at the expense of corruption which happens only rarely, during power failures. In some
applications this is acceptable, e.g. network file shares for powerpoints and word documents, mailservers... but for
postgres, it's not acceptable.

8. The only way to get a real idea about WT and WB and what is happening under the hood is to fire up diskchecker.pl
and measure the result in the real world when you pull the plug (many times). Then you'll know for sure what happens
in terms of performance and write persistence. You should not trust anything you read online written by other people,
especially by drive manufacturers - test for yourself if your database matters to you. Remember to use a 'real' plug
pull - yank the cord out of the back - don't simply power off with the button, or you'll get incorrect results. That
said, the Intel drives have a great reputation, and every report I've read indicates they work correctly with WT or
direct connections.

9. That said - with non-Intel SSD drives, you may be able to use WT if you turn off all caching on the SSD, which may
stop the SSD lying about what it's doing. On the disks I've tested, this will allow fsync'd writes to hit the disk
correctly in WT mode.

However! The costs are enormous: a substantial increase in ssd disk wear (e.g. much earlier failure) and a massive
decrease in performance (a 96% drop, i.e. you get 4% of the performance you expect out of your disk - it can actually
be slower than an HDD!).

Example (from memory, but I'm very certain about these numbers): I measured ~100 disk operations per second with
diskchecker.pl and Crucial M500s when all disk cache was disabled and WT was used. This compared with ~2000-2500
diskchecker operations per second with a *battery-backed* WB cache and disk cache enabled. In both cases, data
persisted correctly and was not corrupted during repeated plug-pull testing.
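For a quick feel for those numbers on your own hardware before doing full plug-pull runs, a trivial timing loop is
enough. A minimal sketch - plain Python with a hypothetical path, not diskchecker.pl itself; note it measures only
throughput, not honesty:

    # fsync_rate_sketch.py -- illustrative only: counts small write+fsync
    # cycles per second. Says nothing about whether the fsyncs are honest;
    # only plug-pull testing can tell you that.
    import os, time

    PATH = "/mnt/testdisk/fsync_probe.dat"   # hypothetical mount point
    N = 2000

    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT, 0o644)
    start = time.time()
    for _ in range(N):
        os.write(fd, b"x" * 512)
        os.fsync(fd)
    elapsed = time.time() - start
    os.close(fd)
    print("%d fsyncs in %.1fs = %.0f ops/sec" % (N, elapsed, N / elapsed))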



For what it's worth: a similar problem comes up with raid controllers and the probabilities of raid failure, and the
massive gap between theory and practice.

You can read any statistic you like about the likelihood of e.g. RAID6 failing due to consecutive disk failures (e.g.
this ACM paper, taking into account UBE errors with RAID5/6, suggests a 0.00639% risk of loss per year:
http://queue.acm.org/detail.cfm?id=1670144), but in practice what kills your database is probably the raid controller
card failing or having a bug or an unreliable battery... (see e.g. http://www.webhostingtalk.com/showthread.php?t=1075803,
with 25% failure rates on some models!).

Graeme Bell