Thread: SSD + RAID
Hello,

I'm about to buy SSD drive(s) for a database. For decision making, I used this tech report:

http://techreport.com/articles.x/16255/9
http://techreport.com/articles.x/16255/10

Here are my concerns:

* I need at least 32GB of disk space, so a DRAM-based SSD is not a real option. I would have to buy 8x4GB memory modules, which costs a fortune, and it would still not have redundancy.
* I could buy two X25-E drives and have 32GB of disk space, with some redundancy. This would cost about $1600, not counting the RAID controller. It is on the edge.
* I could also buy many cheaper MLC SSD drives. They cost about $140 each, so even with 10 drives I'm at $1400. I could put them in RAID6 and have much more disk space (256GB), high redundancy and POSSIBLY good read/write speed. Of course then I need to buy a good RAID controller.

My question is about the last option. Are there any good RAID cards that are optimized (or can be optimized) for SSD drives? Do any of you have experience in using many cheaper SSD drives? Is it a bad idea?

Thank you,

Laszlo
Laszlo Nagy wrote: > Hello, > > I'm about to buy SSD drive(s) for a database. For decision making, I > used this tech report: > > http://techreport.com/articles.x/16255/9 > http://techreport.com/articles.x/16255/10 > > Here are my concerns: > > * I need at least 32GB disk space. So DRAM based SSD is not a real > option. I would have to buy 8x4GB memory, costs a fortune. And > then it would still not have redundancy. > * I could buy two X25-E drives and have 32GB disk space, and some > redundancy. This would cost about $1600, not counting the RAID > controller. It is on the edge. > * I could also buy many cheaper MLC SSD drives. They cost about > $140. So even with 10 drives, I'm at $1400. I could put them in > RAID6, have much more disk space (256GB), high redundancy and > POSSIBLY good read/write speed. Of course then I need to buy a > good RAID controller. > > My question is about the last option. Are there any good RAID cards > that are optimized (or can be optimized) for SSD drives? Do any of you > have experience in using many cheaper SSD drives? Is it a bad idea? > > Thank you, > > Laszlo > Note that some RAID controllers (3Ware in particular) refuse to recognize the MLC drives, in particular, they act as if the OCZ Vertex series do not exist when connected. I don't know what they're looking for (perhaps some indication that actual rotation is happening?) but this is a potential problem.... make sure your adapter can talk to these things! BTW I have done some benchmarking with Postgresql against these drives and they are SMOKING fast. -- Karl
> Note that some RAID controllers (3Ware in particular) refuse to > recognize the MLC drives, in particular, they act as if the OCZ Vertex > series do not exist when connected. > > I don't know what they're looking for (perhaps some indication that > actual rotation is happening?) but this is a potential problem.... make > sure your adapter can talk to these things! > > BTW I have done some benchmarking with Postgresql against these drives > and they are SMOKING fast. > I was thinking about an ARECA 1320 with 2GB memory + BBU. Unfortunately, I cannot find information about using ARECA cards with SSD drives, and I'm not sure how they would work together. I guess the RAID cards are optimized for conventional disks: they read/write data in bigger blocks and optimize the order of reading/writing for physical cylinders. I know for sure that this particular Areca card has an Intel dual-core I/O processor and its own embedded operating system. I guess it could be tuned for SSD drives, but I don't know how. I was hoping that with a RAID 6 setup, write speed (which is slower for cheaper flash-based SSD drives) would increase dramatically, because information is written simultaneously to 10 drives. With a very small block size, that would probably be true. But what if the RAID card uses bigger block sizes and - say - I want to update much smaller blocks in the database? My other option is to buy two SLC SSD drives and use RAID1. It would cost about the same, but with less redundancy and less capacity. Which is faster: 8-10 MLC disks in RAID 6 with a good caching controller, or two SLC disks in RAID1? Thanks, Laszlo
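One data point worth checking when weighing controller block sizes against database updates: PostgreSQL writes fixed-size pages, and the page size is easy to query (a minimal sketch, assuming shell access to the database host; 8 kB is the compile-time default):

    # The RAID stripe/chunk size should be weighed against the page
    # size PostgreSQL actually writes.
    psql -c "SHOW block_size;"    # typically prints 8192 (bytes)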
This is very fast. On IT Toolbox there are many whitepapers about it, in the ERP and DataCenter sections specifically. Whatever tests we do, we should share them on the project wiki. Regards On Nov 13, 2009, at 7:02 AM, Karl Denninger wrote: > Laszlo Nagy wrote: >> Hello, >> >> I'm about to buy SSD drive(s) for a database. For decision making, I >> used this tech report: >> >> http://techreport.com/articles.x/16255/9 >> http://techreport.com/articles.x/16255/10 >> >> Here are my concerns: >> >> * I need at least 32GB disk space. So DRAM based SSD is not a real >> option. I would have to buy 8x4GB memory, costs a fortune. And >> then it would still not have redundancy. >> * I could buy two X25-E drives and have 32GB disk space, and some >> redundancy. This would cost about $1600, not counting the RAID >> controller. It is on the edge. >> * I could also buy many cheaper MLC SSD drives. They cost about >> $140. So even with 10 drives, I'm at $1400. I could put them in >> RAID6, have much more disk space (256GB), high redundancy and >> POSSIBLY good read/write speed. Of course then I need to buy a >> good RAID controller. >> >> My question is about the last option. Are there any good RAID cards >> that are optimized (or can be optimized) for SSD drives? Do any of >> you >> have experience in using many cheaper SSD drives? Is it a bad idea? >> >> Thank you, >> >> Laszlo >> > Note that some RAID controllers (3Ware in particular) refuse to > recognize the MLC drives, in particular, they act as if the OCZ Vertex > series do not exist when connected. > > I don't know what they're looking for (perhaps some indication that > actual rotation is happening?) but this is a potential problem.... > make > sure your adapter can talk to these things! > > BTW I have done some benchmarking with Postgresql against these drives > and they are SMOKING fast. > > -- Karl
2009/11/13 Laszlo Nagy <gandalf@shopzeus.com>: > Hello, > > I'm about to buy SSD drive(s) for a database. For decision making, I used > this tech report: > > http://techreport.com/articles.x/16255/9 > http://techreport.com/articles.x/16255/10 > > Here are my concerns: > > * I need at least 32GB disk space. So DRAM based SSD is not a real > option. I would have to buy 8x4GB memory, costs a fortune. And > then it would still not have redundancy. > * I could buy two X25-E drives and have 32GB disk space, and some > redundancy. This would cost about $1600, not counting the RAID > controller. It is on the edge. I'm not sure a RAID controller brings much of anything to the table with SSDs. > * I could also buy many cheaper MLC SSD drives. They cost about > $140. So even with 10 drives, I'm at $1400. I could put them in > RAID6, have much more disk space (256GB), high redundancy and I think RAID6 is gonna reduce the throughput due to overhead to something far less than what a software RAID-10 would achieve. > POSSIBLY good read/write speed. Of course then I need to buy a > good RAID controller. I'm guessing that if you spent whatever money you were gonna spend on more SSDs you'd come out ahead, assuming you had somewhere to put them. > My question is about the last option. Are there any good RAID cards that are > optimized (or can be optimized) for SSD drives? Do any of you have > experience in using many cheaper SSD drives? Is it a bad idea? This I don't know. Some quick googling shows the Areca 1680ix and Adaptec 5 Series to be able to handle Samsung SSDs.
On Fri, Nov 13, 2009 at 9:48 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote: > I think RAID6 is gonna reduce the throughput due to overhead to > something far less than what a software RAID-10 would achieve. I was wondering about this. I think raid 5/6 might be a better fit for SSD than traditional drive arrays. Here's my thinking: *) flash SSD reads are cheaper than writes. With 6 or more drives, less total data has to be written in Raid 5 than Raid 10. The main component of the raid 5 performance penalty is that each written block has to be read first, then written, incurring rotational latency, etc. SSD does not have this problem. *) flash is much more expensive per unit of storage. *) flash (at least the intel stuff) is so fast relative to what we are used to, that the point of using flash in raid is more for fault tolerance than performance enhancement. I don't have data to support this, but I suspect that even with a relatively small number of the slower MLC drives in raid, postgres will become cpu bound for most applications. merlin
Laszlo Nagy wrote: > * I need at least 32GB disk space. So DRAM based SSD is not a real > option. I would have to buy 8x4GB memory, costs a fortune. And > then it would still not have redundancy. At 32GB database size, I'd seriously consider just buying a server with a regular hard drive or a small RAID array for redundancy, and stuffing 16 or 32 GB of RAM into it to ensure everything is cached. That's tried and tested technology. I don't know how you came to the 32 GB figure, but keep in mind that administration is a lot easier if you have plenty of extra disk space for things like backups, dumps+restore, temporary files, upgrades etc. So if you think you'd need 32 GB of disk space, I'm guessing that 16 GB of RAM would be enough to hold all the hot data in cache. And if you choose a server with enough DIMM slots, you can expand easily if needed. Just my 2 cents, I'm not really an expert on hardware.. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
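A quick sanity check before sizing RAM this way (a sketch; assumes psql access, and 'mydb' is a placeholder name):

    # On-disk size of the database; if this fits comfortably in RAM,
    # the OS buffer cache can hold the entire hot data set.
    psql -c "SELECT pg_size_pretty(pg_database_size('mydb'));"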
2009/11/13 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>: > Laszlo Nagy wrote: >> * I need at least 32GB disk space. So DRAM based SSD is not a real >> option. I would have to buy 8x4GB memory, costs a fortune. And >> then it would still not have redundancy. > > At 32GB database size, I'd seriously consider just buying a server with > a regular hard drive or a small RAID array for redundancy, and stuffing > 16 or 32 GB of RAM into it to ensure everything is cached. That's tried > and tested technology. lots of ram doesn't help you if: *) your database gets written to a lot and you have high performance requirements *) your data is important (if either of the above is not true or even partially true, then your advice is spot on) merlin
In order for a drive to work reliably for database use such as for PostgreSQL, it cannot have a volatile write cache. You either need a write cache with a battery backup (and a UPS doesn't count), or to turn the cache off. The SSD performance figures you've been looking at are with the drive's write cache turned on, which means they're completely fictitious and exaggerated upwards for your purposes. In the real world, that will result in database corruption after a crash one day. No one on the drive benchmarking side of the industry seems to have picked up on this, so you can't use any of those figures. I'm not even sure right now whether drives like Intel's will even meet their lifetime expectations if they aren't allowed to use their internal volatile write cache. Here's two links you should read and then reconsider your whole design: http://www.mysqlperformanceblog.com/2009/03/02/ssd-xfs-lvm-fsync-write-cache-barrier-and-lost-transactions/ http://petereisentraut.blogspot.com/2009/07/solid-state-drive-benchmarks-and-write.html I can't even imagine how bad the situation would be if you decide to wander down the "use a bunch of really cheap SSD drives" path; these things are barely usable for databases with Intel's hardware. The needs of people who want to throw SSD in a laptop and those of the enterprise database market are really different, and if you believe doom forecasting like the comments at http://blogs.sun.com/BestPerf/entry/oracle_peoplesoft_payroll_sun_sparc that gap is widening, not shrinking. -- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
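For reference, turning the volatile cache off on a SATA drive under Linux usually looks like this (a sketch; whether a given SSD actually honors the setting is drive-specific and worth verifying with a plug test):

    # Disable the drive's volatile write cache so fsync() means
    # "on stable media", not "in DRAM on the drive".
    hdparm -W 0 /dev/sda    # -W 1 re-enables it
    hdparm -W /dev/sda      # query the current setting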
On 11/13/09 7:29 AM, "Merlin Moncure" <mmoncure@gmail.com> wrote: > On Fri, Nov 13, 2009 at 9:48 AM, Scott Marlowe <scott.marlowe@gmail.com> > wrote: >> I think RAID6 is gonna reduce the throughput due to overhead to >> something far less than what a software RAID-10 would achieve. > > I was wondering about this. I think raid 5/6 might be a better fit > for SSD than traditional drives arrays. Here's my thinking: > > *) flash SSD reads are cheaper than writes. With 6 or more drives, > less total data has to be written in Raid 5 than Raid 10. The main > component of raid 5 performance penalty is that for each written > block, it has to be read first than written...incurring rotational > latency, etc. SSD does not have this problem. > For random writes, RAID 5 writes as much as RAID 10 (parity + data), and more if the raid block size is larger than 8k. With RAID 6 it writes 50% more than RAID 10. For streaming writes RAID 5 / 6 has an advantage however. For SLC drives, there is really not much of a write performance penalty.
Greg Smith wrote: > In order for a drive to work reliably for database use such as for > PostgreSQL, it cannot have a volatile write cache. You either need a > write cache with a battery backup (and a UPS doesn't count), or to > turn the cache off. The SSD performance figures you've been looking > at are with the drive's write cache turned on, which means they're > completely fictitious and exaggerated upwards for your purposes. In > the real world, that will result in database corruption after a crash > one day. If power is "unexpectedly" removed from the system, this is true. But the caches on the SSD controllers are BUFFERS. An operating system crash does not disrupt the data in them or cause corruption. An unexpected disconnection of the power source from the drive (due to unplugging it or a power supply failure for whatever reason) is a different matter. > No one on the drive benchmarking side of the industry seems to have > picked up on this, so you can't use any of those figures. I'm not > even sure right now whether drives like Intel's will even meet their > lifetime expectations if they aren't allowed to use their internal > volatile write cache. > > Here's two links you should read and then reconsider your whole design: > http://www.mysqlperformanceblog.com/2009/03/02/ssd-xfs-lvm-fsync-write-cache-barrier-and-lost-transactions/ > > http://petereisentraut.blogspot.com/2009/07/solid-state-drive-benchmarks-and-write.html > > > I can't even imagine how bad the situation would be if you decide to > wander down the "use a bunch of really cheap SSD drives" path; these > things are barely usable for databases with Intel's hardware. The > needs of people who want to throw SSD in a laptop and those of the > enterprise database market are really different, and if you believe > doom forecasting like the comments at > http://blogs.sun.com/BestPerf/entry/oracle_peoplesoft_payroll_sun_sparc > that gap is widening, not shrinking. Again, it depends. With the write cache off on these disks they still are huge wins for very-heavy-read applications, which many are. The issue is (as always) operation mix - if you do a lot of inserts and updates then you suffer, but a lot of database applications are in the high 90%+ SELECTs both in frequency and data flow volume. The lack of rotational and seek latency in those applications is HUGE. -- Karl Denninger
Karl Denninger wrote: > If power is "unexpectedly" removed from the system, this is true. But > the caches on the SSD controllers are BUFFERS. An operating system > crash does not disrupt the data in them or cause corruption. An > unexpected disconnection of the power source from the drive (due to > unplugging it or a power supply failure for whatever reason) is a > different matter. > As standard operating procedure, I regularly get something writing heavily to the database on hardware I'm suspicious of and power the box off hard. If at any time I suffer database corruption from this, the hardware is unsuitable for database use; that should never happen. This is what I mean when I say something meets the mythical "enterprise" quality. Companies whose data is worth something can't operate in a situation where money has been exchanged because a database commit was recorded, only to lose that commit just because somebody tripped over the power cord and it was in the buffer rather than on permanent disk. That's just not acceptable, and the even bigger danger of the database perhaps not coming up altogether even after such a tiny disaster is also very real with a volatile write cache. > With the write cache off on these disks they still are huge wins for > very-heavy-read applications, which many are. Very read-heavy applications would do better to buy a ton of RAM instead and just make sure they populate from permanent media (say by reading everything in early at sequential rates to prime the cache). There is an extremely narrow use-case where SSDs are the right technology, and it's only in a subset even of read-heavy apps where they make sense. -- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
Greg Smith wrote: > Karl Denninger wrote: >> If power is "unexpectedly" removed from the system, this is true. But >> the caches on the SSD controllers are BUFFERS. An operating system >> crash does not disrupt the data in them or cause corruption. An >> unexpected disconnection of the power source from the drive (due to >> unplugging it or a power supply failure for whatever reason) is a >> different matter. >> > As standard operating procedure, I regularly get something writing > heavy to the database on hardware I'm suspicious of and power the box > off hard. If at any time I suffer database corruption from this, the > hardware is unsuitable for database use; that should never happen. > This is what I mean when I say something meets the mythical > "enterprise" quality. Companies whose data is worth something can't > operate in a situation where money has been exchanged because a > database commit was recorded, only to lose that commit just because > somebody tripped over the power cord and it was in the buffer rather > than on permanent disk. That's just not acceptable, and the even > bigger danger of the database perhaps not coming up altogether even > after such a tiny disaster is also very real with a volatile write cache. Yep. The "plug test" is part of my standard "is this stable enough for something I care about" checkout. >> With the write cache off on these disks they still are huge wins for >> very-heavy-read applications, which many are. > Very read-heavy applications would do better to buy a ton of RAM > instead and just make sure they populate from permanent media (say by > reading everything in early at sequential rates to prime the cache). > There is an extremely narrow use-case where SSDs are the right > technology, and it's only in a subset even of read-heavy apps where > they make sense. I don't know about that in the general case - I'd say "it depends." 250GB of SSD for read-nearly-always applications is a LOT cheaper than 250GB of ECC'd DRAM. The write performance issues can be handled by clever use of controller technology as well (that is, turn off the drive's "write cache" and use the BBU on the RAID adapter.) I have a couple of applications where two 250GB SSD disks in a Raid 1 array with a BBU'd controller, with the disk drive cache off, is, all-in, a fraction of the cost of sticking 250GB of volatile storage in a server and reading in the data set (plus managing the occasional updates) from "stable storage." It is not as fast as stuffing the 250GB of RAM in a machine but it's a hell of a lot faster than a big array of small conventional drives in a setup designed for maximum IO-Ops. One caution for those thinking of doing this - the incremental improvement of this setup on PostgreSQL in a WRITE SIGNIFICANT environment isn't NEARLY as impressive. Indeed the performance in THAT case for many workloads may only be 20 or 30% faster than even "reasonably pedestrian" rotating media in a high-performance (lots of spindles and thus stripes) configuration and it's more expensive (by a lot.) If you step up to the fast SAS drives on the rotating side there's little argument for the SSD at all (again, assuming you don't intend to "cheat" and risk data loss.) Know your application and benchmark it. -- Karl
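For the record, the "plug test" amounts to something like this (a sketch; the scale and client counts are placeholders, and the box must be one you can afford to crash):

    pgbench -i -s 100 testdb          # build a write-heavy data set
    pgbench -c 50 -t 100000 testdb    # sustained mixed read/write load
    # ...pull the power cord mid-run, power back on, then:
    pg_ctl -D /path/to/data start     # WAL recovery must complete cleanly
    # spot-check committed rows (the table is pgbench_history on 8.4+,
    # history on older releases):
    psql testdb -c "SELECT count(*) FROM pgbench_history;"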
On Fri, Nov 13, 2009 at 12:22 PM, Scott Carey <scott@richrelevance.com> wrote: > On 11/13/09 7:29 AM, "Merlin Moncure" <mmoncure@gmail.com> wrote: > >> On Fri, Nov 13, 2009 at 9:48 AM, Scott Marlowe <scott.marlowe@gmail.com> >> wrote: >>> I think RAID6 is gonna reduce the throughput due to overhead to >>> something far less than what a software RAID-10 would achieve. >> >> I was wondering about this. I think raid 5/6 might be a better fit >> for SSD than traditional drives arrays. Here's my thinking: >> >> *) flash SSD reads are cheaper than writes. With 6 or more drives, >> less total data has to be written in Raid 5 than Raid 10. The main >> component of raid 5 performance penalty is that for each written >> block, it has to be read first than written...incurring rotational >> latency, etc. SSD does not have this problem. >> > > For random writes, RAID 5 writes as much as RAID 10 (parity + data), and > more if the raid block size is larger than 8k. With RAID 6 it writes 50% > more than RAID 10. how does raid 5 write more if the block size is > 8k? raid 10 is also striped, so has the same problem, right? IOW, if the block size is 8k and you need to write 16k sequentially the raid 5 might write out 24k (two blocks + parity). raid 10 always writes out 2x your data in terms of blocks (raid 5 does only in the worst case). For a SINGLE block, it's always 2x your data for both raid 5 and raid 10, so what I said above was not quite correct. raid 6 is not going to outperform raid 10 ever IMO. It's just a slightly safer raid 5. I was just wondering out loud if raid 5 might give similar performance to raid 10 on flash based disks since there is no rotational latency. even if it did, I probably still wouldn't use it... merlin
2009/11/13 Greg Smith <greg@2ndquadrant.com>: > In order for a drive to work reliably for database use such as for > PostgreSQL, it cannot have a volatile write cache. You either need a write > cache with a battery backup (and a UPS doesn't count), or to turn the cache > off. The SSD performance figures you've been looking at are with the > drive's write cache turned on, which means they're completely fictitious and > exaggerated upwards for your purposes. In the real world, that will result > in database corruption after a crash one day. No one on the drive > benchmarking side of the industry seems to have picked up on this, so you > can't use any of those figures. I'm not even sure right now whether drives > like Intel's will even meet their lifetime expectations if they aren't > allowed to use their internal volatile write cache. hm. I never understood why Peter was only able to turn up 400 iops when others were turning up 4000+ (measured from bonnie). This would explain it. Is it authoritatively known that the Intel drives' true random write rate is not what they are claiming? If so, then you are right... flash doesn't make sense, at least not without a NV cache on the device. merlin
Greg Smith wrote: > Karl Denninger wrote: >> With the write cache off on these disks they still are huge wins for >> very-heavy-read applications, which many are. > Very read-heavy applications would do better to buy a ton of RAM > instead and just make sure they populate from permanent media (say by > reading everything in early at sequential rates to prime the cache). > There is an extremely narrow use-case where SSDs are the right > technology, and it's only in a subset even of read-heavy apps where > they make sense. Out of curiosity, what are those narrow use cases where you think SSDs are the correct technology? -- Brad Nicholson 416-673-4106 Database Administrator, Afilias Canada Corp.
Itching to jump in here :-) There are a lot of things to trade off when choosing storage for a database: performance for different parts of the workload, reliability, performance in degraded mode (when a disk dies), backup methodologies, etc. ... the mistake many people make is to overlook the sub-optimal operating conditions, failure modes and recovery paths. Some thoughts: - RAID-5 and RAID-6 have poor write performance, and terrible performance in degraded mode - there are a few edge cases, but in almost all cases you should be using RAID-10 for a database. - Like most apps, the ultimate way to make a database perform is to have most of it (or at least the working set) in RAM, preferably the DB server buffer cache. This is why big banks run Oracle on an HP Superdome with 1TB of RAM ... the $15m Hitachi data array is just backing store :-) - Personally, I'm an SSD skeptic ... the technology just isn't mature enough for the data center. If you apply a typical OLTP workload, they are going to die early deaths. The only case in which they will materially improve performance is where you have a large data set with lots of **totally random** reads, i.e. where buffer cache is ineffective. In the words of TurboTax, "this is not common". - If you're going to use synchronous write with a significant number of small transactions, then you need some reliable RAM (not SSD) to commit log files into, which means a proper battery-backed RAID controller / external SAN with write-back cache. For many apps though, a synchronous commit simply isn't necessary: losing a few rows of data during a crash is relatively harmless. For these apps, turning off synchronous writes is an often overlooked performance tweak. In summary, don't get distracted by shiny new objects like SSD and RAID-6 :-) 2009/11/13 Brad Nicholson <bnichols@ca.afilias.info>: > Greg Smith wrote: >> >> Karl Denninger wrote: >>> >>> With the write cache off on these disks they still are huge wins for >>> very-heavy-read applications, which many are. >> >> Very read-heavy applications would do better to buy a ton of RAM instead >> and just make sure they populate from permanent media (say by reading >> everything in early at sequential rates to prime the cache). There is an >> extremely narrow use-case where SSDs are the right technology, and it's only >> in a subset even of read-heavy apps where they make sense. > > Out of curiosity, what are those narrow use cases where you think SSD's are > the correct technology? > > -- > Brad Nicholson 416-673-4106 > Database Administrator, Afilias Canada Corp.
> -----Original message----- > Laszlo Nagy > > My question is about the last option. Are there any good RAID > cards that are optimized (or can be optimized) for SSD > drives? Do any of you have experience in using many cheaper > SSD drives? Is it a bad idea? > > Thank you, > > Laszlo > Never had an SSD to try yet, but I still wonder whether software RAID + fsync on SSD drives could be regarded as a sound solution. Shouldn't their write performance more than make up for the cost of fsync? You could benchmark this setup yourself before purchasing a RAID card.
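A cheap way to run exactly that experiment before buying a controller (a sketch; device names and mount points are placeholders):

    # Software RAID-10 across four SSDs, no hardware controller.
    mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sd[b-e]
    mkfs.ext3 /dev/md0 && mount /dev/md0 /mnt/ssd
    # Crude fsync-rate probe at the WAL's page size: every 8 kB write
    # is forced to stable storage before dd issues the next one.
    dd if=/dev/zero of=/mnt/ssd/testfile bs=8k count=10000 oflag=dsync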
Brad Nicholson wrote: > Out of curiosity, what are those narrow use cases where you think > SSD's are the correct technology? Dave Crooke did a good summary already, I see things like this: * You need to have a read-heavy app that's bigger than RAM, but not too big so it can still fit on SSD * You need reads to be dominated by random-access and uncached lookups, so that system RAM used as a buffer cache doesn't help you much. * Writes have to be low to moderate, as the true write speed is much lower for database use than you'd expect from benchmarks derived from other apps. And it's better if writes are biased toward adding data rather than changing existing pages As far as what real-world apps have that profile, I like SSDs for small to medium web applications that have to be responsive, where the user shows up and wants their randomly distributed and uncached data with minimal latency. SSDs can also be used effectively as second-tier targeted storage for things that have a performance-critical but small and random bit as part of a larger design that doesn't have those characteristics; putting indexes on SSD can work out well for example (and there the write durability stuff isn't quite as critical, as you can always drop an index and rebuild if it gets corrupted). -- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
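The index-on-SSD arrangement needs nothing more than a tablespace (a sketch; the paths and object names are made up):

    # Put only the latency-critical indexes on SSD-backed storage.
    psql -c "CREATE TABLESPACE ssd LOCATION '/mnt/ssd/pgdata';"
    psql mydb -c "CREATE INDEX orders_customer_idx
                  ON orders (customer_id) TABLESPACE ssd;"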
2009/11/13 Greg Smith <greg@2ndquadrant.com>: > As far as what real-world apps have that profile, I like SSDs for small to > medium web applications that have to be responsive, where the user shows up > and wants their randomly distributed and uncached data with minimal latency. > SSDs can also be used effectively as second-tier targeted storage for things > that have a performance-critical but small and random bit as part of a > larger design that doesn't have those characteristics; putting indexes on > SSD can work out well for example (and there the write durability stuff > isn't quite as critical, as you can always drop an index and rebuild if it > gets corrupted). Here's a bonnie++ result for Intel showing 14k seeks: http://www.wlug.org.nz/HarddiskBenchmarks bonnie++ only writes data back 10% of the time. Why is Peter's benchmark showing only 400 seeks? Is this all attributable to write barrier? I'm not sure I'm buying that... merlin
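For anyone trying to reproduce those numbers, the usual invocation looks something like this (a sketch; the file size should be at least twice RAM so the seeks cannot be served from cache):

    # Random-seek test on the SSD mount; bonnie++ reports seeks/sec,
    # of which ~90% are reads and ~10% re-writes.
    bonnie++ -d /mnt/ssd -s 16384 -n 0 -u postgres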
Fernando Hevia wrote: > Shouldn't their write performance more than make up for the cost of fsync? > Not if you have sequential writes that are regularly fsync'd--which is exactly how the WAL writes things out in PostgreSQL. I think there's a potential for SSDs to reach a point where they can give good performance even with their write caches turned off. But it will require a more robust software stack, like filesystems that really implement the write barrier concept effectively for this use-case, for that to happen. -- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
The FusionIO products are a little different. They are card-based vs trying to emulate a traditional disk. In terms of volatility, they have an on-board capacitor that allows power to be supplied until all writes drain. They do not have a cache in front of them like a disk-type SSD might. I don't sell these things, I am just a fan. I verified all this with the Fusion IO techs before I replied. Perhaps older versions didn't have this functionality? I am not sure. I have already done some cold power-off tests w/o problems, but I could up the workload a bit and retest. I will do a couple of 'pull the cable' tests on Monday or Tuesday and report back how it goes.
Re the performance #'s... Here is my post:
http://www.kennygorman.com/wordpress/?p=398
-kg
>In order for a drive to work reliably for database use such as for
>PostgreSQL, it cannot have a volatile write cache. You either need a
>write cache with a battery backup (and a UPS doesn't count), or to turn
>the cache off. The SSD performance figures you've been looking at are
>with the drive's write cache turned on, which means they're completely
>fictitious and exaggerated upwards for your purposes. In the real
>world, that will result in database corruption after a crash one day.
>No one on the drive benchmarking side of the industry seems to have
>picked up on this, so you can't use any of those figures. I'm not even
>sure right now whether drives like Intel's will even meet their lifetime
>expectations if they aren't allowed to use their internal volatile write
>cache.
>
>Here's two links you should read and then reconsider your whole design:
>
>http://www.mysqlperformanceblog.com/2009/03/02/ssd-xfs-lvm-fsync-write-cache-barrier-and-lost-transactions/
>http://petereisentraut.blogspot.com/2009/07/solid-state-drive-benchmarks-and-write.html
>
>I can't even imagine how bad the situation would be if you decide to
>wander down the "use a bunch of really cheap SSD drives" path; these
>things are barely usable for databases with Intel's hardware. The needs
>of people who want to throw SSD in a laptop and those of the enterprise
>database market are really different, and if you believe doom
>forecasting like the comments at
>http://blogs.sun.com/BestPerf/entry/oracle_peoplesoft_payroll_sun_sparc
>that gap is widening, not shrinking.
Laszlo Nagy wrote: > Hello, > > I'm about to buy SSD drive(s) for a database. For decision making, I > used this tech report: > > http://techreport.com/articles.x/16255/9 > http://techreport.com/articles.x/16255/10 > > Here are my concerns: > > * I need at least 32GB disk space. So DRAM based SSD is not a real > option. I would have to buy 8x4GB memory, costs a fortune. And > then it would still not have redundancy. > * I could buy two X25-E drives and have 32GB disk space, and some > redundancy. This would cost about $1600, not counting the RAID > controller. It is on the edge. This was the solution I went with (4 drives in a raid 10 actually). Not a cheap solution, but the performance is amazing. > * I could also buy many cheaper MLC SSD drives. They cost about > $140. So even with 10 drives, I'm at $1400. I could put them in > RAID6, have much more disk space (256GB), high redundancy and > POSSIBLY good read/write speed. Of course then I need to buy a > good RAID controller. > > My question is about the last option. Are there any good RAID cards > that are optimized (or can be optimized) for SSD drives? Do any of you > have experience in using many cheaper SSD drives? Is it a bad idea? > > Thank you, > > Laszlo > >
Lists wrote: > Laszlo Nagy wrote: >> Hello, >> >> I'm about to buy SSD drive(s) for a database. For decision making, I >> used this tech report: >> >> http://techreport.com/articles.x/16255/9 >> http://techreport.com/articles.x/16255/10 >> >> Here are my concerns: >> >> * I need at least 32GB disk space. So DRAM based SSD is not a real >> option. I would have to buy 8x4GB memory, costs a fortune. And >> then it would still not have redundancy. >> * I could buy two X25-E drives and have 32GB disk space, and some >> redundancy. This would cost about $1600, not counting the RAID >> controller. It is on the edge. > This was the solution I went with (4 drives in a raid 10 actually). Not > a cheap solution, but the performance is amazing. I've come across this article: http://www.mysqlperformanceblog.com/2009/03/02/ssd-xfs-lvm-fsync-write-cache-barrier-and-lost-transactions/ It's from a Linux MySQL user so it's a bit confusing, but it looks like he has some reservations about performance vs reliability of the Intel drives - apparently they have their own write cache, and when it's disabled performance drops sharply.
Merlin Moncure wrote: > 2009/11/13 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>: >> Laszlo Nagy wrote: >>> * I need at least 32GB disk space. So DRAM based SSD is not a real >>> option. I would have to buy 8x4GB memory, costs a fortune. And >>> then it would still not have redundancy. >> At 32GB database size, I'd seriously consider just buying a server with >> a regular hard drive or a small RAID array for redundancy, and stuffing >> 16 or 32 GB of RAM into it to ensure everything is cached. That's tried >> and tested technology. > > lots of ram doesn't help you if: > *) your database gets written to a lot and you have high performance > requirements When all the (hot) data is cached, all writes are sequential writes to the WAL, with the occasional flushing of the data pages at checkpoint. The sequential write bandwidth of SSDs and HDDs is roughly the same. I presume the fsync latency is a lot higher with HDDs, so if you're running a lot of small write transactions, and don't want to risk losing any recently committed transactions by setting synchronous_commit=off, the usual solution is to get a RAID controller with a battery-backed up cache. With a BBU cache, the fsync latency should be in the same ballpark as with SSDs. > *) your data is important Huh? The data is safely on the hard disk in case of a crash. The RAM is just for caching. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
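That trade-off can also be made per-transaction rather than server-wide (a sketch; synchronous_commit is available in 8.3 and later, and the table here is hypothetical):

    psql mydb <<'SQL'
    -- Opt just this transaction out of the synchronous WAL flush: a
    -- crash may lose it, but it can never corrupt the database.
    BEGIN;
    SET LOCAL synchronous_commit = off;
    INSERT INTO audit_log (msg) VALUES ('event');
    COMMIT;
    SQL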
On Sat, Nov 14, 2009 at 6:17 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: >> lots of ram doesn't help you if: >> *) your database gets written to a lot and you have high performance >> requirements > > When all the (hot) data is cached, all writes are sequential writes to > the WAL, with the occasional flushing of the data pages at checkpoint. > The sequential write bandwidth of SSDs and HDDs is roughly the same. > > I presume the fsync latency is a lot higher with HDDs, so if you're > running a lot of small write transactions, and don't want to risk losing > any recently committed transactions by setting synchronous_commit=off, > the usual solution is to get a RAID controller with a battery-backed up > cache. With a BBU cache, the fsync latency should be in the same > ballpark as with SSDs. BBU raid controllers might only give better burst performance. If you are writing data randomly all over the volume, the cache will overflow and performance will degrade. Raid controllers degrade in different fashions; at least one (perc 5) halted ALL access to the volume and spun out the cache (a bug, IMO). >> *) your data is important > > Huh? The data is safely on the hard disk in case of a crash. The RAM is > just for caching. I was alluding to not being able to lose any transactions... in this case you have to run with synchronous fsync. You are then bound by the volume's write capabilities; RAM only buffers reads. merlin
Merlin Moncure wrote: > On Sat, Nov 14, 2009 at 6:17 AM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >>> lots of ram doesn't help you if: >>> *) your database gets written to a lot and you have high performance >>> requirements >> When all the (hot) data is cached, all writes are sequential writes to >> the WAL, with the occasional flushing of the data pages at checkpoint. >> The sequential write bandwidth of SSDs and HDDs is roughly the same. >> >> I presume the fsync latency is a lot higher with HDDs, so if you're >> running a lot of small write transactions, and don't want to risk losing >> any recently committed transactions by setting synchronous_commit=off, >> the usual solution is to get a RAID controller with a battery-backed up >> cache. With a BBU cache, the fsync latency should be in the same >> ballpark as with SSDs. > > BBU raid controllers might only give better burst performance. If you > are writing data randomly all over the volume, the cache will overflow > and performance will degrade. We're discussing a scenario where all the data fits in RAM. That's what the large amount of RAM is for. The only thing that's being written to disk is the WAL, which is sequential, and the occasional flush of data pages from the buffer cache at checkpoints, which doesn't happen often and will be spread over a period of time. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote: > Laszlo Nagy wrote: > >> * I need at least 32GB disk space. So DRAM based SSD is not a real >> option. I would have to buy 8x4GB memory, costs a fortune. And >> then it would still not have redundancy. >> > > At 32GB database size, I'd seriously consider just buying a server with > a regular hard drive or a small RAID array for redundancy, and stuffing > 16 or 32 GB of RAM into it to ensure everything is cached. That's tried > and tested technology. > 32GB is for one table only. This server runs other applications, and you need to leave space for sort memory, shared buffers etc. Buying 128GB of memory would solve the problem, maybe... but it is too expensive. And it is not safe. Power out -> data loss. > I don't know how you came to the 32 GB figure, but keep in mind that > administration is a lot easier if you have plenty of extra disk space > for things like backups, dumps+restore, temporary files, upgrades etc. > This disk space would be dedicated to a smaller tablespace, holding one or two bigger tables with index scans. Of course I would never use an SSD disk for storing database backups. It would be a waste of money. L
2009/11/14 Laszlo Nagy <gandalf@shopzeus.com>: > 32GB is for one table only. This server runs other applications, and you > need to leave space for sort memory, shared buffers etc. Buying 128GB memory > would solve the problem, maybe... but it is too expensive. And it is not > safe. Power out -> data loss. Huh? ...Robert
On Sat, Nov 14, 2009 at 8:47 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Merlin Moncure wrote: >> On Sat, Nov 14, 2009 at 6:17 AM, Heikki Linnakangas >> <heikki.linnakangas@enterprisedb.com> wrote: >>>> lots of ram doesn't help you if: >>>> *) your database gets written to a lot and you have high performance >>>> requirements >>> When all the (hot) data is cached, all writes are sequential writes to >>> the WAL, with the occasional flushing of the data pages at checkpoint. >>> The sequential write bandwidth of SSDs and HDDs is roughly the same. >>> >>> I presume the fsync latency is a lot higher with HDDs, so if you're >>> running a lot of small write transactions, and don't want to risk losing >>> any recently committed transactions by setting synchronous_commit=off, >>> the usual solution is to get a RAID controller with a battery-backed up >>> cache. With a BBU cache, the fsync latency should be in the same >>> ballpark as with SSDs. >> >> BBU raid controllers might only give better burst performance. If you >> are writing data randomly all over the volume, the cache will overflow >> and performance will degrade. > > We're discussing a scenario where all the data fits in RAM. That's what > the large amount of RAM is for. The only thing that's being written to > disk is the WAL, which is sequential, and the occasional flush of data > pages from the buffer cache at checkpoints, which doesn't happen often > and will be spread over a period of time. We are basically in agreement, but regardless of the effectiveness of your WAL implementation, raid controller, etc, if you have to write data to what approximates random locations on a disk-based volume in a sustained manner, you must eventually degrade to whatever the drive can handle, plus whatever efficiency the checkpoint process and the OS can gain by grouping writes together. Extra RAM mainly helps because it can shave precious iops off the read side so you can use them for writing. merlin
Robert Haas wrote: > 2009/11/14 Laszlo Nagy <gandalf@shopzeus.com>: > >> 32GB is for one table only. This server runs other applications, and you >> need to leave space for sort memory, shared buffers etc. Buying 128GB memory >> would solve the problem, maybe... but it is too expensive. And it is not >> safe. Power out -> data loss. >> I'm sorry, I thought he was talking about keeping the database in memory with fsync=off. Now I see he was only talking about the OS disk cache. My server has 24GB RAM, and I cannot easily expand it unless I throw out some 2GB modules, and buy more 4GB or 8GB modules. But... buying 4x8GB ECC RAM (+throwing out 4x2GB RAM) is a lot more expensive than buying some 64GB SSD drives. 95% of the table in question is not modified. Only read (mostly with index scan). Only 5% is actively updated. This is why I think using an SSD in my case would be effective. Sorry for the confusion. L
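Given that 95/5 split, one way to use a small SSD without touching the rest of the system is to relocate just the hot table (a sketch; it assumes an 'ssd' tablespace like the one above already exists, and the move holds an exclusive lock while the data is copied):

    # Move the big, read-mostly table onto the SSD-backed tablespace.
    psql mydb -c "ALTER TABLE product SET TABLESPACE ssd;"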
>>> >>> * I could buy two X25-E drives and have 32GB disk space, and some >>> redundancy. This would cost about $1600, not counting the RAID >>> controller. It is on the edge. >> This was the solution I went with (4 drives in a raid 10 actually). >> Not a cheap solution, but the performance is amazing. > > I've come across this article: > > http://www.mysqlperformanceblog.com/2009/03/02/ssd-xfs-lvm-fsync-write-cache-barrier-and-lost-transactions/ > > It's from a Linux MySQL user so it's a bit confusing but it looks like > he has some reservations about performance vs reliability of the Intel > drives - apparently they have their own write cache and when it's > disabled performance drops sharply. Ok, I'm getting confused here. There is the WAL, which is written sequentially. If the WAL is not corrupted, then it can be replayed on next database startup. Please somebody enlighten me! In my mind, fsync is only needed for the WAL. If I could configure postgresql to put the WAL on a real hard drive that has BBU and write cache, then I cannot lose data. Meanwhile, product table data could be placed on the SSD drive, and I should be able to turn on write cache safely. Am I wrong? L
On 15/11/2009 11:57 AM, Laszlo Nagy wrote: > Ok, I'm getting confused here. There is the WAL, which is written > sequentially. If the WAL is not corrupted, then it can be replayed on > next database startup. Please somebody enlighten me! In my mind, fsync > is only needed for the WAL. If I could configure postgresql to put the > WAL on a real hard drive that has BBU and write cache, then I cannot > loose data. Meanwhile, product table data could be placed on the SSD > drive, and I sould be able to turn on write cache safely. Am I wrong? A change has been written to the WAL and fsync()'d, so Pg knows it's hit disk. It can now safely apply the change to the tables themselves, and does so, calling fsync() to tell the drive containing the tables to commit those changes to disk. The drive lies, returning success for the fsync when it's just cached the data in volatile memory. Pg carries on, shortly deleting the WAL archive the changes were recorded in or recycling it and overwriting it with new change data. The SSD is still merrily buffering data to write cache, and hasn't got around to writing your particular change yet. The machine loses power. Oops! A hole just appeared in history. A WAL replay won't re-apply the changes that the database guaranteed had hit disk, but the changes never made it onto the main database storage. Possible fixes for this are: - Don't let the drive lie about cache flush operations, ie disable write buffering. - Give Pg some way to find out, from the drive, when particular write operations have actually hit disk. AFAIK there's no such mechanism at present, and I don't think the drives are even capable of reporting this data. If they were, Pg would have to be capable of applying entries from the WAL "sparsely" to account for the way the drive's write cache commits changes out-of-order, and Pg would have to maintain a map of committed / uncommitted WAL records. Pg would need another map of tablespace blocks to WAL records to know, when a drive write cache commit notice came in, what record in what WAL archive was affected. It'd also require Pg to keep WAL archives for unbounded and possibly long periods of time, making disk space management for WAL much harder. So - "not easy" is a bit of an understatement here. You still need to turn off write caching. -- Craig Ringer
> A change has been written to the WAL and fsync()'d, so Pg knows it's hit > disk. It can now safely apply the change to the tables themselves, and > does so, calling fsync() to tell the drive containing the tables to > commit those changes to disk. > > The drive lies, returning success for the fsync when it's just cached > the data in volatile memory. Pg carries on, shortly deleting the WAL > archive the changes were recorded in or recycling it and overwriting it > with new change data. The SSD is still merrily buffering data to write > cache, and hasn't got around to writing your particular change yet. > All right. I believe you. In the current Pg implementation, I need to turn off the disk cache. But.... I would like to ask some theoretical questions. It is just an idea from me, and probably I'm wrong. Here is a scenario:

#1. user wants to change something, resulting in a write_to_disk(data) call
#2. data is written into the WAL and fsync()-ed
#3. at this point the write_to_disk(data) call CAN RETURN, and the user can continue his work (the WAL is already written, changes cannot be lost)
#4. Pg can continue writing data onto the disk, and fsync() it
#5. then WAL archive data can be deleted

Now maybe I'm wrong, but between #3 and #5, the data to be written is kept in memory. This is basically a write cache, implemented in OS memory. We could really handle it like a write cache. E.g. everything would remain the same, except that we add some latency. We can wait some time after the last modification of a given block, and then write it out. Is it possible to do? If so, then we can turn off the write cache for all drives, except the one holding the WAL. And write speed would still remain the same. I don't think that any SSD drive has more than some megabytes of write cache. The same amount of write cache could easily be implemented in OS memory, and then Pg would always know what hit the disk. Thanks, Laci
On 15/11/2009 2:05 PM, Laszlo Nagy wrote: > >> A change has been written to the WAL and fsync()'d, so Pg knows it's hit >> disk. It can now safely apply the change to the tables themselves, and >> does so, calling fsync() to tell the drive containing the tables to >> commit those changes to disk. >> >> The drive lies, returning success for the fsync when it's just cached >> the data in volatile memory. Pg carries on, shortly deleting the WAL >> archive the changes were recorded in or recycling it and overwriting it >> with new change data. The SSD is still merrily buffering data to write >> cache, and hasn't got around to writing your particular change yet. >> > All right. I believe you. In the current Pg implementation, I need to > turn off the disk cache. That's certainly my understanding. I've been wrong many times before :S > #1. user wants to change something, resulting in a write_to_disk(data) call > #2. data is written into the WAL and fsync()-ed > #3. at this point the write_to_disk(data) call CAN RETURN, the user can > continue his work (the WAL is already written, changes cannot be lost) > #4. Pg can continue writing data onto the disk, and fsync() it. > #5. Then WAL archive data can be deleted. > > Now maybe I'm wrong, but between #3 and #5, the data to be written is > kept in memory. This is basically a write cache, implemented in OS > memory. We could really handle it like a write cache. E.g. everything > would remain the same, except that we add some latency. We can wait some > time after the last modification of a given block, and then write it out. I don't know enough about the whole affair to give you a good explanation (I tried, and it just showed me how much I didn't know) but here are a few issues: - Pg doesn't know the erase block sizes or positions. It can't group writes up by erase block except by hoping that, within a given file, writing in page order will get the blocks to the disk in roughly erase-block order. So your write caching isn't going to do anywhere near as good a job as the SSD's can. - The only way to make this help the SSD out much would be to use a LOT of RAM for write cache and maintain a LOT of WAL archives. That's RAM not being used for caching read data. The large number of WAL archives means incredibly long WAL replay times after a crash. - You still need a reliable way to tell the SSD "really flush your cache now" after you've flushed the changes from your huge chunks of WAL files and are getting ready to recycle them. I was thinking that write ordering would be an issue too, as some changes in the WAL would hit main disk before others that were earlier in the WAL. However, I don't think that matters if full_page_writes are on. If you replay from the start, you'll reapply some changes with older versions, but they'll be corrected again by a later WAL record. So ordering during WAL replay shouldn't be a problem. On the other hand, the INCREDIBLY long WAL replay times during recovery would be a nightmare. -- Craig Ringer
> - Pg doesn't know the erase block sizes or positions. It can't group > writes up by erase block except by hoping that, within a given file, > writing in page order will get the blocks to the disk in roughly > erase-block order. So your write caching isn't going to do anywhere near > as good a job as the SSD's can. > Okay, I see. We cannot query erase block size from an SSD drive. :-( >> I don't think that any SSD drive has more than some >> megabytes of write cache. >> > > The big, lots-of-$$ ones have HUGE battery backed caches for exactly > this reason. > Heh, this is why they are so expensive. :-) >> The same amount of write cache could easily be >> implemented in OS memory, and then Pg would always know what hit the disk. >> > > Really? How does Pg know what order the SSD writes things out from its > cache? > I got the point. We cannot implement an efficient write cache without much more knowledge about how that particular drive works. So... the only solution that works well is to have much more RAM for read cache, and much more RAM for write cache inside the RAID controller (with BBU). Thank you, Laszlo
I've wondered whether this would work for a read-mostly application: Buy a big RAM machine, like 64GB, with a crappy little single disk. Build the database, then make a really big RAM disk, big enough to hold the DB and the WAL. Then build a duplicate DB on another machine with a decent disk (maybe a 4-disk RAID10), and turn on WAL logging. The system would be blazingly fast, and you'd just have to be sure before you shut it off to shut down Postgres and copy the RAM files back to the regular disk. And if you didn't, you could always recover from the backup. Since it's a read-mostly system, the WAL logging bandwidth wouldn't be too high, so even a modest machine would be able to keep up. Any thoughts? Craig
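Spelled out, that experiment is only a few commands (a sketch; sizes and paths are placeholders, and everything on tmpfs vanishes at power-off, which is exactly why the WAL-shipped duplicate matters):

    # RAM disk large enough for the whole cluster; run the DB from it.
    mount -t tmpfs -o size=48g tmpfs /mnt/ramdb
    initdb -D /mnt/ramdb/data
    pg_ctl -D /mnt/ramdb/data start
    # Before any planned shutdown: stop cleanly, copy back to real disk.
    pg_ctl -D /mnt/ramdb/data stop
    rsync -a /mnt/ramdb/data/ /var/lib/pgsql/data.backup/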
Craig James wrote: > I've wondered whether this would work for a read-mostly application: Buy > a big RAM machine, like 64GB, with a crappy little single disk. Build > the database, then make a really big RAM disk, big enough to hold the DB > and the WAL. Then build a duplicate DB on another machine with a decent > disk (maybe a 4-disk RAID10), and turn on WAL logging. > > The system would be blazingly fast, and you'd just have to be sure > before you shut it off to shut down Postgres and copy the RAM files back > to the regular disk. And if you didn't, you could always recover from > the backup. Since it's a read-mostly system, the WAL logging bandwidth > wouldn't be too high, so even a modest machine would be able to keep up. Should work, but I don't see any advantage over attaching the RAID array directly to the 1st machine with the RAM and turning synchronous_commit=off. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
2009/11/13 Greg Smith <greg@2ndquadrant.com>: > As far as what real-world apps have that profile, I like SSDs for small to > medium web applications that have to be responsive, where the user shows up > and wants their randomly distributed and uncached data with minimal latency. > SSDs can also be used effectively as second-tier targeted storage for things > that have a performance-critical but small and random bit as part of a > larger design that doesn't have those characteristics; putting indexes on > SSD can work out well for example (and there the write durability stuff > isn't quite as critical, as you can always drop an index and rebuild if it > gets corrupted). I am right now talking to someone on postgresql irc who is measuring 15k iops from x25-e and no data loss following power plug test. I am becoming increasingly suspicious that peter's results are not representative: given that 90% of bonnie++ seeks are read only, the math doesn't add up, and they contradict broadly published tests on the internet. Has anybody independently verified the results? merlin
On Tue, 2009-11-17 at 11:36 -0500, Merlin Moncure wrote:
> 2009/11/13 Greg Smith <greg@2ndquadrant.com>:
> > As far as what real-world apps have that profile, I like SSDs for small to
> > medium web applications that have to be responsive, where the user shows up
> > and wants their randomly distributed and uncached data with minimal latency.
> > SSDs can also be used effectively as second-tier targeted storage for things
> > that have a performance-critical but small and random bit as part of a
> > larger design that doesn't have those characteristics; putting indexes on
> > SSD can work out well for example (and there the write durability stuff
> > isn't quite as critical, as you can always drop an index and rebuild if it
> > gets corrupted).
>
> I am right now talking to someone on postgresql irc who is measuring
> 15k iops from x25-e and no data loss following power plug test. I am
> becoming increasingly suspicious that Peter's results are not
> representative: given that 90% of bonnie++ seeks are read only, the
> math doesn't add up, and they contradict broadly published tests on
> the internet. Has anybody independently verified the results?

How many times have they run the plug test? I've read other reports of
people (not on Postgres) losing data on this drive with the write cache
on.

-- 
Brad Nicholson 416-673-4106
Database Administrator, Afilias Canada Corp.
On Tue, Nov 17, 2009 at 9:54 AM, Brad Nicholson
<bnichols@ca.afilias.info> wrote:
> On Tue, 2009-11-17 at 11:36 -0500, Merlin Moncure wrote:
>> 2009/11/13 Greg Smith <greg@2ndquadrant.com>:
>> > As far as what real-world apps have that profile, I like SSDs for small to
>> > medium web applications that have to be responsive, where the user shows up
>> > and wants their randomly distributed and uncached data with minimal latency.
>> > SSDs can also be used effectively as second-tier targeted storage for things
>> > that have a performance-critical but small and random bit as part of a
>> > larger design that doesn't have those characteristics; putting indexes on
>> > SSD can work out well for example (and there the write durability stuff
>> > isn't quite as critical, as you can always drop an index and rebuild if it
>> > gets corrupted).
>>
>> I am right now talking to someone on postgresql irc who is measuring
>> 15k iops from x25-e and no data loss following power plug test. I am
>> becoming increasingly suspicious that Peter's results are not
>> representative: given that 90% of bonnie++ seeks are read only, the
>> math doesn't add up, and they contradict broadly published tests on
>> the internet. Has anybody independently verified the results?
>
> How many times have they run the plug test? I've read other reports of
> people (not on Postgres) losing data on this drive with the write cache
> on.

When I run the plug test, it's on a pgbench database that's as big as
possible (scale factor ~4000), and I remove memory if there's a lot in the
server so that RAM is smaller than the db. I run 100+ concurrent clients,
set checkpoint_timeout to 30 minutes, create a lot of checkpoint segments
(100 or so), and set completion target to 0. Then, after about half a
checkpoint_timeout has passed, I issue a checkpoint from the command line,
take a deep breath, and pull the cord.
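For concreteness, a rough sketch of that procedure (the specific settings
are from the message above; the pgbench invocation details are my
assumption, using 8.4-era flags):

pgbench -i -s 4000 bench            # build a db bigger than the stripped-down RAM
# in postgresql.conf:
#   checkpoint_timeout = 30min
#   checkpoint_segments = 100
#   checkpoint_completion_target = 0
pgbench -c 100 -T 1800 bench &      # 100+ concurrent clients
sleep 900                           # wait about half a checkpoint_timeout
psql -d bench -c "CHECKPOINT;"      # force a checkpoint mid-run
# ...take a deep breath and pull the cord.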
On Tue, 2009-11-17 at 11:36 -0500, Merlin Moncure wrote:
> I am right now talking to someone on postgresql irc who is measuring
> 15k iops from x25-e and no data loss following power plug test. I am
> becoming increasingly suspicious that Peter's results are not
> representative: given that 90% of bonnie++ seeks are read only, the
> math doesn't add up, and they contradict broadly published tests on
> the internet. Has anybody independently verified the results?

Notably, between my two blog posts and this email thread, there have been
claims of 400, 1800, 4000, 7000, 14000, 15000, and 35000 I/O operations (of
some kind) per second. That alone should be cause for concern.
Merlin Moncure wrote: > I am right now talking to someone on postgresql irc who is measuring > 15k iops from x25-e and no data loss following power plug test. The funny thing about Murphy is that he doesn't visit when things are quiet. It's quite possible the window for data loss on the drive is very small. Maybe you only see it one out of 10 pulls with a very aggressive database-oriented write test. Whatever the odd conditions are, you can be sure you'll see them when there's a bad outage in actual production though. A good test program that is a bit better at introducing and detecting the write cache issue is described at http://brad.livejournal.com/2116715.html -- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
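As I recall that page (check the link for the authoritative instructions;
this is from memory and the host name is illustrative), the test runs with a
second machine acting as the record-keeper:

./diskchecker.pl -l                                   # on a helper machine
# on the machine under test:
./diskchecker.pl -s helper-host create test_file 500  # stream tracked writes
# pull the plug on the machine under test, reboot it, then:
./diskchecker.pl -s helper-host verify test_file      # reports any lost writes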
On Tue, Nov 17, 2009 at 1:51 PM, Greg Smith <greg@2ndquadrant.com> wrote: > Merlin Moncure wrote: >> >> I am right now talking to someone on postgresql irc who is measuring >> 15k iops from x25-e and no data loss following power plug test. > > The funny thing about Murphy is that he doesn't visit when things are quiet. > It's quite possible the window for data loss on the drive is very small. > Maybe you only see it one out of 10 pulls with a very aggressive > database-oriented write test. Whatever the odd conditions are, you can be > sure you'll see them when there's a bad outage in actual production though. > > A good test program that is a bit better at introducing and detecting the > write cache issue is described at http://brad.livejournal.com/2116715.html Sure, not disputing that...I don't have one to test myself, so I can't vouch for the data being safe. But what's up with the 400 iops measured from bonnie++? That's an order of magnitude slower than any other published benchmark on the 'net, and I'm dying to get a little clarification here. merlin
On 11/17/2009 01:51 PM, Greg Smith wrote:
> Merlin Moncure wrote:
>> I am right now talking to someone on postgresql irc who is measuring
>> 15k iops from x25-e and no data loss following power plug test.
> The funny thing about Murphy is that he doesn't visit when things are
> quiet. It's quite possible the window for data loss on the drive is
> very small. Maybe you only see it one out of 10 pulls with a very
> aggressive database-oriented write test. Whatever the odd conditions
> are, you can be sure you'll see them when there's a bad outage in
> actual production though.
>
> A good test program that is a bit better at introducing and detecting
> the write cache issue is described at
> http://brad.livejournal.com/2116715.html
>
I've been following this thread with great interest in your results...
Please continue to share...

On the write cache issue: is it possible that the reduced power draw of an
SSD allows a capacitor to complete all scheduled writes, even with a large
cache? And is it this particular drive that you're suggesting is known to
be insufficient, or is it the technology itself, or just its maturity?

Cheers,
mark

-- 
Mark Mielke<mark@mielke.cc>
Merlin Moncure wrote:
> But what's up with the 400 iops measured from bonnie++?
I don't know really. SSD writes are really sensitive to block size and the
ability to chunk writes into larger chunks, so it may be that Peter has
just found the worst-case behavior and everybody else is seeing something
better than that.

When the reports I get back from people I believe are competent--Vadim,
Peter--show worst-case results that are lucky to beat RAID10, I feel I have
to dismiss the higher values reported by people who haven't been so
careful. And that's just about everybody else, which leaves me quite
suspicious of the true value of the drives. The whole thing really sets
off my vendor hype reflex, and short of someone loaning me a drive to test
I'm not sure how to get past that. The Intel drives are still just a bit
too expensive to buy one on a whim, knowing I'll just toss it if the drive
doesn't live up to expectations.

-- 
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.com
On Wed, 18 Nov 2009, Greg Smith wrote:

> Merlin Moncure wrote:
>> But what's up with the 400 iops measured from bonnie++?
> I don't know really. SSD writes are really sensitive to block size and the
> ability to chunk writes into larger chunks, so it may be that Peter has just
> found the worst-case behavior and everybody else is seeing something better
> than that.
>
> When the reports I get back from people I believe are competent--Vadim,
> Peter--show worst-case results that are lucky to beat RAID10, I feel I have
> to dismiss the higher values reported by people who haven't been so careful.
> And that's just about everybody else, which leaves me quite suspicious of the
> true value of the drives. The whole thing really sets off my vendor hype
> reflex, and short of someone loaning me a drive to test I'm not sure how to
> get past that. The Intel drives are still just a bit too expensive to buy
> one on a whim, knowing I'll just toss it if the drive doesn't live up to
> expectations.

keep in mind that bonnie++ isn't always going to reflect your real
performance.

I have run tests on some workloads that were definitely I/O limited where
bonnie++ results that differed by a factor of 10x made no measurable
difference in the application performance, so I can easily believe in
cases where bonnie++ numbers would not change but application performance
could be drastically different.

as always it can depend heavily on your workload. you really do need to
figure out how to get your hands on one for your own testing.

David Lang
I found a bit of time to play with this. I started up a test with 20
concurrent processes all inserting into the same table and committing after
each insert. The db was achieving about 5000 inserts per second, and I
kept it running for about 10 minutes. The host was doing about 5MB/s of
physical I/O to the Fusion IO drive. I set checkpoint segments very small
(10). I observed the following message in the log: checkpoints are
occurring too frequently (16 seconds apart). Then I pulled the cord.

On reboot I noticed that Fusion IO replayed its log, then the filesystem
(vxfs) did the same. Then I started up the DB and observed it perform
auto-recovery:

Nov 18 14:33:53 frutestdb002 postgres[5667]: [6-1] 2009-11-18 14:33:53 PST
LOG: database system was not properly shut down; automatic recovery in
progress
Nov 18 14:33:53 frutestdb002 postgres[5667]: [7-1] 2009-11-18 14:33:53 PST
LOG: redo starts at 2A/55F9D478
Nov 18 14:33:54 frutestdb002 postgres[5667]: [8-1] 2009-11-18 14:33:54 PST
LOG: record with zero length at 2A/56692F38
Nov 18 14:33:54 frutestdb002 postgres[5667]: [9-1] 2009-11-18 14:33:54 PST
LOG: redo done at 2A/56692F08
Nov 18 14:33:54 frutestdb002 postgres[5667]: [10-1] 2009-11-18 14:33:54 PST
LOG: database system is ready

Thanks
Kenny

On Nov 13, 2009, at 1:35 PM, Kenny Gorman wrote:

> The FusionIO products are a little different. They are card based
> vs trying to emulate a traditional disk. In terms of volatility,
> they have an on-board capacitor that allows power to be supplied
> until all writes drain. They do not have a cache in front of them
> like a disk-type SSD might. I don't sell these things, I am just a
> fan. I verified all this with the Fusion IO techs before I
> replied. Perhaps older versions didn't have this functionality? I
> am not sure. I have already done some cold power off tests w/o
> problems, but I could up the workload a bit and retest. I will do a
> couple of 'pull the cable' tests on monday or tuesday and report
> back how it goes.
>
> Re the performance #'s... Here is my post:
>
> http://www.kennygorman.com/wordpress/?p=398
>
> -kg
>
> > In order for a drive to work reliably for database use such as for
> > PostgreSQL, it cannot have a volatile write cache. You either need a
> > write cache with a battery backup (and a UPS doesn't count), or to turn
> > the cache off. The SSD performance figures you've been looking at are
> > with the drive's write cache turned on, which means they're completely
> > fictitious and exaggerated upwards for your purposes. In the real
> > world, that will result in database corruption after a crash one day.
> > No one on the drive benchmarking side of the industry seems to have
> > picked up on this, so you can't use any of those figures. I'm not even
> > sure right now whether drives like Intel's will even meet their lifetime
> > expectations if they aren't allowed to use their internal volatile write
> > cache.
> >
> > Here's two links you should read and then reconsider your whole design:
> >
> > http://www.mysqlperformanceblog.com/2009/03/02/ssd-xfs-lvm-fsync-write-cache-barrier-and-lost-transactions/
> > http://petereisentraut.blogspot.com/2009/07/solid-state-drive-benchmarks-and-write.html
> >
> > I can't even imagine how bad the situation would be if you decide to
> > wander down the "use a bunch of really cheap SSD drives" path; these
> > things are barely usable for databases with Intel's hardware.
> > The needs of people who want to throw SSD in a laptop and those of the
> > enterprise database market are really different, and if you believe doom
> > forecasting like the comments at
> > http://blogs.sun.com/BestPerf/entry/oracle_peoplesoft_payroll_sun_sparc
> > that gap is widening, not shrinking.
On 11/13/09 10:21 AM, "Karl Denninger" <karl@denninger.net> wrote:
>
> One caution for those thinking of doing this - the incremental
> improvement of this setup on PostGresql in WRITE SIGNIFICANT environment
> isn't NEARLY as impressive. Indeed the performance in THAT case for
> many workloads may only be 20 or 30% faster than even "reasonably
> pedestrian" rotating media in a high-performance (lots of spindles and
> thus stripes) configuration and it's more expensive (by a lot.) If you
> step up to the fast SAS drives on the rotating side there's little
> argument for the SSD at all (again, assuming you don't intend to "cheat"
> and risk data loss.)

For your database DATA disks, leaving the write cache on is 100%
acceptable, even with power loss, and without a RAID controller. And even
in high write environments. That is what the XLOG is for, isn't it? That
is where this behavior is critical. But that has completely different
performance requirements and need not be on the same volume, array, or
drive.

>
> Know your application and benchmark it.
>
> -- Karl
>
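A sketch of the split Scott describes, with the cache off only where the
commit guarantee matters (device names and paths are hypothetical, and note
that the replies below dispute the data-disk half of this):

pg_ctl -D /var/lib/postgresql/data stop
mv /var/lib/postgresql/data/pg_xlog /wal-disk/pg_xlog     # dedicated WAL device
ln -s /wal-disk/pg_xlog /var/lib/postgresql/data/pg_xlog
hdparm -W 0 /dev/sdb    # write cache off on the WAL device
hdparm -W 1 /dev/sdc    # data-disk cache stays on, per Scott's argument
pg_ctl -D /var/lib/postgresql/data start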
On 11/15/09 12:46 AM, "Craig Ringer" <craig@postnewspapers.com.au> wrote:
> Possible fixes for this are:
>
> - Don't let the drive lie about cache flush operations, ie disable write
> buffering.
>
> - Give Pg some way to find out, from the drive, when particular write
> operations have actually hit disk. AFAIK there's no such mechanism at
> present, and I don't think the drives are even capable of reporting this
> data. If they were, Pg would have to be capable of applying entries from
> the WAL "sparsely" to account for the way the drive's write cache
> commits changes out-of-order, and Pg would have to maintain a map of
> committed / uncommitted WAL records. Pg would need another map of
> tablespace blocks to WAL records to know, when a drive write cache
> commit notice came in, what record in what WAL archive was affected.
> It'd also require Pg to keep WAL archives for unbounded and possibly
> long periods of time, making disk space management for WAL much harder.
> So - "not easy" is a bit of an understatement here.

3: Have PG wait a half second (configurable) after the checkpoint fsync()
completes before deleting/overwriting any WAL segments. This would be a
trivial "feature" to add to a postgres release, I think. Actually, it
already exists! Turn on log archiving, and have the script it runs sleep()
after each checkpoint.

BTW, the information I have seen indicates that the write cache is 256K on
the Intel drives, the 32MB/64MB of other RAM is working memory for the
drive block mapping / wear leveling algorithms (tracking 160GB of 4k blocks
takes space).

4: Yet another solution: The drives DO adhere to write barriers properly.
A filesystem that used these in the process of fsync() would be fine too.
So XFS without LVM or MD (or the newer versions of those that don't ignore
barriers) would work too.

So, I think it may not be necessary to turn off write caching for the
non-xlog disks.

>
> You still need to turn off write caching.
>
> --
> Craig Ringer
>
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>
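For concreteness, Scott's log-archiving variant might look like the sketch
below (the script name and archive path are made up, and note the replies
that follow on why the delay doesn't actually buy any guarantee):

# postgresql.conf (8.3+ style):
#   archive_mode = on
#   archive_command = '/usr/local/bin/archive-and-nap.sh %p %f'
cat > /usr/local/bin/archive-and-nap.sh <<'EOF'
#!/bin/sh
# copy the finished WAL segment somewhere safe, then linger briefly
cp "$1" "/mnt/archive/$2" && sleep 1
EOF
chmod +x /usr/local/bin/archive-and-nap.sh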
Scott Carey <scott@richrelevance.com> writes: > For your database DATA disks, leaving the write cache on is 100% acceptable, > even with power loss, and without a RAID controller. And even in high write > environments. Really? How hard have you tested that configuration? > That is what the XLOG is for, isn't it? Once we have fsync'd a data change, we discard the relevant XLOG entries. If the disk hasn't actually put the data on stable storage before it claims the fsync is done, you're screwed. XLOG only exists to centralize the writes that have to happen before a transaction can be reported committed (in particular, to avoid a lot of random-access writes at commit). It doesn't make any fundamental change in the rules of the game: a disk that lies about write complete will still burn you. In a zero-seek-cost environment I suspect that XLOG wouldn't actually be all that useful. I gather from what's been said earlier that SSDs don't fully eliminate random-access penalties, though. regards, tom lane
On 11/17/09 10:51 AM, "Greg Smith" <greg@2ndquadrant.com> wrote:

> Merlin Moncure wrote:
>> I am right now talking to someone on postgresql irc who is measuring
>> 15k iops from x25-e and no data loss following power plug test.
> The funny thing about Murphy is that he doesn't visit when things are
> quiet. It's quite possible the window for data loss on the drive is
> very small. Maybe you only see it one out of 10 pulls with a very
> aggressive database-oriented write test. Whatever the odd conditions
> are, you can be sure you'll see them when there's a bad outage in actual
> production though.

Yes, but there is nothing foolproof. Murphy visited me recently, and the
RAID card with BBU cache that the WAL logs were on crapped out. Data was
fine. Had to fix up the system without any WAL logs. Luckily, out of
10TB, only 200GB or so could have been in the process of being written to
(yay! partitioning by date!), and we could restore just that part rather
than initiating a full restore. Then there were fun times in single user
mode to fix corrupted system tables (about half the system indexes were
dead, and the statistics table was corrupt, but that could be truncated
safely). It's all fine now with all data validated.

Moral of the story: Nothing is 100% safe, so sometimes a small bit of KNOWN
risk is perfectly fine. There is always UNKNOWN risk. If one risks losing
256K of cached data on an SSD if you're really unlucky with timing, how
dangerous is that versus the chance that the raid card or other hardware
barfs and takes out your whole WAL? Nothing is safe enough to avoid a full
DR plan of action. The individual tradeoffs are very application and data
dependent.

>
> A good test program that is a bit better at introducing and detecting
> the write cache issue is described at
> http://brad.livejournal.com/2116715.html
>
> --
> Greg Smith 2ndQuadrant Baltimore, MD
> PostgreSQL Training, Services and Support
> greg@2ndQuadrant.com www.2ndQuadrant.com
>
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>
On 11/17/09 10:58 PM, "david@lang.hm" <david@lang.hm> wrote:
>
> keep in mind that bonnie++ isn't always going to reflect your real
> performance.
>
> I have run tests on some workloads that were definitely I/O limited where
> bonnie++ results that differed by a factor of 10x made no measurable
> difference in the application performance, so I can easily believe in
> cases where bonnie++ numbers would not change but application performance
> could be drastically different.
>

Well, that is sort of true for all benchmarks, but I do find that bonnie++
is the worst of the bunch. I consider it relatively useless compared to
fio. It's just not a great benchmark for server type load and I find it
lacking in the ability to simulate real applications.

> as always it can depend heavily on your workload. you really do need to
> figure out how to get your hands on one for your own testing.
>
> David Lang
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>
On 19/11/2009 12:22 PM, Scott Carey wrote:
> 3: Have PG wait a half second (configurable) after the checkpoint fsync()
> completes before deleting/overwriting any WAL segments. This would be a
> trivial "feature" to add to a postgres release, I think.

How does that help? It doesn't provide any guarantee that the data has
hit main storage - it could lurk in SSD cache for hours.

> 4: Yet another solution: The drives DO adhere to write barriers properly.
> A filesystem that used these in the process of fsync() would be fine too.
> So XFS without LVM or MD (or the newer versions of those that don't ignore
> barriers) would work too.

*if* the WAL is also on the SSD.

If the WAL is on a separate drive, the write barriers do you no good,
because they won't ensure that the data hits the main drive storage
before the WAL recycling hits the WAL disk storage. The two drives
operate independently and the write barriers don't interact.

You'd need some kind of inter-drive write barrier.

--
Craig Ringer
Scott Carey wrote:
> For your database DATA disks, leaving the write cache on is 100% acceptable,
> even with power loss, and without a RAID controller. And even in high write
> environments.
>
> That is what the XLOG is for, isn't it? That is where this behavior is
> critical. But that has completely different performance requirements and
> need not be on the same volume, array, or drive.
>
At checkpoint time, writes to the main data files are done that are
followed by fsync calls to make sure those blocks have been written to
disk. Those writes have exactly the same consistency requirements as the
more frequent pg_xlog writes. If the drive ACKs the write, but it's not
on physical disk yet, it's possible for the checkpoint to finish and the
underlying pg_xlog segments needed to recover from a crash at that point
to be deleted. The end of the checkpoint can wipe out many WAL segments,
presuming they're not needed anymore because the data blocks they were
intended to fix during recovery are now guaranteed to be on disk.

-- 
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.com
Greg Smith wrote:
> Scott Carey wrote:
>> For your database DATA disks, leaving the write cache on is 100%
>> acceptable,
>> even with power loss, and without a RAID controller. And even in
>> high write
>> environments.
>>
>> That is what the XLOG is for, isn't it? That is where this behavior is
>> critical. But that has completely different performance requirements
>> and
>> need not be on the same volume, array, or drive.
>>
> At checkpoint time, writes to the main data files are done that are
> followed by fsync calls to make sure those blocks have been written to
> disk. Those writes have exactly the same consistency requirements as
> the more frequent pg_xlog writes. If the drive ACKs the write, but
> it's not on physical disk yet, it's possible for the checkpoint to
> finish and the underlying pg_xlog segments needed to recover from a
> crash at that point to be deleted. The end of the checkpoint can wipe
> out many WAL segments, presuming they're not needed anymore because
> the data blocks they were intended to fix during recovery are now
> guaranteed to be on disk.
Guys, read that again.

IF THE DISK OR DRIVER ACK'S A FSYNC CALL THE WAL ENTRY IS LIKELY GONE, AND
YOU ARE SCREWED IF THE DATA IS NOT REALLY ON THE DISK.

-- Karl
Scott Carey wrote:
> Moral of the story: Nothing is 100% safe, so sometimes a small bit of KNOWN
> risk is perfectly fine. There is always UNKNOWN risk. If one risks losing
> 256K of cached data on an SSD if you're really unlucky with timing, how
> dangerous is that versus the chance that the raid card or other hardware
> barfs and takes out your whole WAL?
>
I think the point of the paranoia in this thread is that if you're
introducing a component with a known risk in it, you're really asking for
trouble: as you point out, it's hard enough to keep a system running
through the unexpected failures that shouldn't have happened at all. No
need to make that even harder by introducing something that is *known* to
fail under some conditions.

-- 
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.com
On Wed, Nov 18, 2009 at 11:39 PM, Scott Carey <scott@richrelevance.com> wrote:
> Well, that is sort of true for all benchmarks, but I do find that bonnie++
> is the worst of the bunch. I consider it relatively useless compared to
> fio. It's just not a great benchmark for server type load and I find it
> lacking in the ability to simulate real applications.

I agree. My biggest gripe with bonnie actually is that 99% of the time is
spent measuring sequential tests, which are not that important in the
database world. Dedicated wal volume uses ostensibly sequential io, but
it's fairly difficult to outrun a dedicated wal volume even if it's on a
vanilla sata drive.

pgbench is actually a pretty awesome i/o tester assuming you have a big
enough scaling factor, because:
a) it's much closer to the environment you will actually run in
b) you get to see what effect i/o-affecting options have on the load
c) you have a broad array of options regarding what gets done (select
only, -f, etc)
d) once you build the test database, you can do multiple runs without
rebuilding it

merlin
On Thu, Nov 19, 2009 at 10:01 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Wed, Nov 18, 2009 at 11:39 PM, Scott Carey <scott@richrelevance.com> wrote:
>> Well, that is sort of true for all benchmarks, but I do find that bonnie++
>> is the worst of the bunch. I consider it relatively useless compared to
>> fio. It's just not a great benchmark for server type load and I find it
>> lacking in the ability to simulate real applications.
>
> I agree. My biggest gripe with bonnie actually is that 99% of the time
> is spent measuring sequential tests, which are not that important in the
> database world. Dedicated wal volume uses ostensibly sequential io, but
> it's fairly difficult to outrun a dedicated wal volume even if it's on a
> vanilla sata drive.
>
> pgbench is actually a pretty awesome i/o tester assuming you have a big
> enough scaling factor, because:
> a) it's much closer to the environment you will actually run in
> b) you get to see what effect i/o-affecting options have on the load
> c) you have a broad array of options regarding what gets done (select
> only, -f, etc)
> d) once you build the test database, you can do multiple runs without
> rebuilding it

Seeing as how pgbench only goes to a scaling factor of 4000, are there any
plans on enlarging that number?
On Thursday, 19 November 2009 13:29:56, Craig Ringer wrote:
> On 19/11/2009 12:22 PM, Scott Carey wrote:
> > 3: Have PG wait a half second (configurable) after the checkpoint
> > fsync() completes before deleting/overwriting any WAL segments. This
> > would be a trivial "feature" to add to a postgres release, I think.
>
> How does that help? It doesn't provide any guarantee that the data has
> hit main storage - it could lurk in SSD cache for hours.
>
> > 4: Yet another solution: The drives DO adhere to write barriers
> > properly. A filesystem that used these in the process of fsync() would be
> > fine too. So XFS without LVM or MD (or the newer versions of those that
> > don't ignore barriers) would work too.
>
> *if* the WAL is also on the SSD.
>
> If the WAL is on a separate drive, the write barriers do you no good,
> because they won't ensure that the data hits the main drive storage
> before the WAL recycling hits the WAL disk storage. The two drives
> operate independently and the write barriers don't interact.
>
> You'd need some kind of inter-drive write barrier.
>
> --
> Craig Ringer

Hello!

As I understand this: SSD performance is great, but caching is the problem.

Questions:

1. What about conventional disks with 32/64 MB cache? How do they handle
the plug test if their caches are on?

2. What about using a separate power supply for the disks? Is it possible
to write back the cache after switching the SATA drive to another
machine/controller?

3. What about making a statement about the lacking enterprise feature (aka
an emergency-battery-equipped SSD) and submitting it to the producers?

I found that one of them (OCZ) seems to handle suggestions from customers
(see the write speed discussions on the Vertex, for example)

and another (Intel) seems to handle serious problems with its disks by
rewriting and sometimes redesigning its products - if you tell them, and
the market dictates a reaction (see the performance degradation before the
1.11 firmware).

Perhaps it's time to act, and not only to complain about the facts.

(BTW: I got funny bonnie++ results for my Intel 160 GB Postville and my
Samsung PB22 after using the Samsung for approx. 3 months now... my
conclusion: NOT all SSDs are equal...)

best regards

anton

-- 
ATRSoft GmbH
Bivetsweg 12
D 41542 Dormagen
Deutschland
Tel .: +49(0)2182 8339951
Mobil: +49(0)172 3490817

Geschäftsführer Anton Rommerskirchen

Köln HRB 44927
STNR 122/5701 - 2030
USTID DE213791450
On Thu, 2009-11-19 at 19:01 +0100, Anton Rommerskirchen wrote:
> On Thursday, 19 November 2009 13:29:56, Craig Ringer wrote:
> > On 19/11/2009 12:22 PM, Scott Carey wrote:
> > > 3: Have PG wait a half second (configurable) after the checkpoint
> > > fsync() completes before deleting/overwriting any WAL segments. This
> > > would be a trivial "feature" to add to a postgres release, I think.
> >
> > How does that help? It doesn't provide any guarantee that the data has
> > hit main storage - it could lurk in SSD cache for hours.
> >
> > > 4: Yet another solution: The drives DO adhere to write barriers
> > > properly. A filesystem that used these in the process of fsync() would be
> > > fine too. So XFS without LVM or MD (or the newer versions of those that
> > > don't ignore barriers) would work too.
> >
> > *if* the WAL is also on the SSD.
> >
> > If the WAL is on a separate drive, the write barriers do you no good,
> > because they won't ensure that the data hits the main drive storage
> > before the WAL recycling hits the WAL disk storage. The two drives
> > operate independently and the write barriers don't interact.
> >
> > You'd need some kind of inter-drive write barrier.
> >
> > --
> > Craig Ringer
>
> Hello!
>
> As I understand this: SSD performance is great, but caching is the problem.
>
> Questions:
>
> 1. What about conventional disks with 32/64 MB cache? How do they handle
> the plug test if their caches are on?

If they aren't battery backed, they can lose data. This is not specific
to SSD.

> 2. What about using a separate power supply for the disks? Is it possible
> to write back the cache after switching the SATA drive to another
> machine/controller?

Not sure. I only use devices with battery backed caches or no cache. I
would be concerned however about the drive not flushing itself and still
running out of power.

> 3. What about making a statement about the lacking enterprise feature
> (aka an emergency-battery-equipped SSD) and submitting it to the
> producers?

The producers aren't making Enterprise products, they are using caches to
accelerate the speeds of consumer products to make their drives more
appealing to consumers. They aren't going to slow them down to make them
more reliable, especially when the core consumer doesn't know about this
issue, and is even less likely to understand it if explained. They may
stamp the word Enterprise on them, but it's nothing more than marketing.

> I found that one of them (OCZ) seems to handle suggestions from customers
> (see the write speed discussions on the Vertex, for example)
>
> and another (Intel) seems to handle serious problems with its disks by
> rewriting and sometimes redesigning its products - if you tell them, and
> the market dictates a reaction (see the performance degradation before
> the 1.11 firmware).
>
> Perhaps it's time to act, and not only to complain about the facts.

Or, you could just buy higher quality equipment that was designed with
this in mind. There is nothing unique to SSD here IMHO. I wouldn't run
my production grade databases on consumer grade HDD, I wouldn't run them
on consumer grade SSD either.

-- 
Brad Nicholson 416-673-4106
Database Administrator, Afilias Canada Corp.
Scott Carey wrote:
> Have PG wait a half second (configurable) after the checkpoint fsync()
> completes before deleting/overwriting any WAL segments. This would be a
> trivial "feature" to add to a postgres release, I think. Actually, it
> already exists! Turn on log archiving, and have the script it runs
> sleep() after each checkpoint.
>
That won't help. Once the checkpoint is done, the problem isn't just that
the WAL segments are recycled. The server isn't going to use them even if
they were there. The reason why you can erase/recycle them is that you're
doing so *after* writing out a checkpoint record that says you don't have
to ever look at them again.

What you'd actually have to do is hack the server code to insert that
delay after every fsync--there are none that you can cheat on and not
introduce a corruption possibility. The whole WAL/recovery mechanism in
PostgreSQL doesn't make a lot of assumptions about what the underlying
disk has to actually do beyond the fsync requirement; the flip side to
that robustness is that it's the one thing you can't ever violate safely.

> BTW, the information I have seen indicates that the write cache is 256K on
> the Intel drives, the 32MB/64MB of other RAM is working memory for the drive
> block mapping / wear leveling algorithms (tracking 160GB of 4k blocks takes
> space).
>
Right. It's not used like the write cache on a regular hard drive, where
they're buffering 8MB-32MB worth of writes just to keep seek overhead
down. It's there primarily to allow combining writes into large chunks,
to better match the block size of the underlying SSD flash cells (128K).
Having enough space for two full cells allows spooling out the flash write
to a whole block while continuing to buffer the next one.

This is why turning the cache off can tank performance so badly--you're
going to be writing a whole 128K block no matter what if it's forced to
disk without caching, even if it's just to write an 8K page to it. That's
only going to reach 1/16 of the usual write speed on single page writes.
And that's why you should also be concerned about whether disabling the
write cache impacts drive longevity: lots of small writes going out in
small chunks is going to wear flash out much faster than if the drive is
allowed to wait until it's got a full sized block to write every time.

The fact that the cache is so small is also why it's harder to catch the
drive doing the wrong thing here. The plug test is pretty sensitive to a
problem when you've got megabytes worth of cached writes that are spooling
to disk at spinning hard drive speeds. The window for loss on an SSD with
no seek overhead and only a moderate number of KB worth of cached data is
much, much smaller. Doesn't mean it's gone though. It's a shame that the
design wasn't improved just a little bit; a cheap capacitor and blocking
new writes once the incoming power dropped is all it would take to make
these much more reliable for database use. But that would raise the
price, and not really help anybody but the small subset of the market that
cares about durable writes.

> 4: Yet another solution: The drives DO adhere to write barriers properly.
> A filesystem that used these in the process of fsync() would be fine too.
> So XFS without LVM or MD (or the newer versions of those that don't ignore
> barriers) would work too.
>
If I really trusted anything beyond the very basics of the filesystem to
really work well on Linux, this whole issue would be moot for most of the
production deployments I do.
Ideally, fsync would just push out the minimum of what's needed; it would
call the appropriate write cache flush mechanism the way the barrier
implementation does when that all works, and life would be good.
Alternately, you might even switch to using O_SYNC writes instead, which
on a good filesystem implementation are both accelerated and safe compared
to write/fsync (I've seen that work as expected on Veritas VxFS for
example).

Meanwhile, in the actual world we live in, patches that make writes more
durable by default are dropped by the Linux community because they tank
performance for too many types of loads, I'm frightened to turn on O_SYNC
at all on ext3 because of reports of corruption on the lists here, fsync
does way more work than it needs to, and the way the filesystem and block
drivers have been separated makes it difficult to do any sort of device
write cache control from userland. This is why I try to use the simplest,
best tested approach out there whenever possible.

-- 
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.com
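As a practical matter, you can at least measure which sync method is fast
(and which is suspiciously fast, i.e. probably lying) on a given filesystem
with the test_fsync tool from the source tree mentioned later in this
thread. A sketch; the exact flags and paths may vary by version:

cd postgresql-8.4.1/src/tools/fsync
make
./test_fsync -f /path/on/the/filesystem/under/test
# compare the timings, then set wal_sync_method in postgresql.conf, e.g.:
#   wal_sync_method = open_sync    # only if it proves both fast and honest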
Scott Marlowe wrote:
> On Thu, Nov 19, 2009 at 10:01 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>
>> pgbench is actually a pretty awesome i/o tester assuming you have a big
>> enough scaling factor
> Seeing as how pgbench only goes to a scaling factor of 4000, are there
> any plans on enlarging that number?
>
I'm doing pgbench tests now on a system large enough for this limit to
matter, so I'm probably going to have to fix that for 8.5 just to complete
my own work.

You can use pgbench to either get interesting peak read results, or peak
write ones, but it's not real useful for things in between. The standard
test basically turns into a huge stack of writes to a single table, and
the select-only one is interesting to gauge either cached or uncached read
speed (depending on the scale). It's not very useful for getting a feel
for how something with a mixed read/write workload does though, which is
unfortunate because I think that scenario is much more common than what it
does test.

-- 
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.com
On Thu, Nov 19, 2009 at 4:10 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> You can use pgbench to either get interesting peak read results, or peak
> write ones, but it's not real useful for things in between. The standard
> test basically turns into a huge stack of writes to a single table, and the
> select-only one is interesting to gauge either cached or uncached read speed
> (depending on the scale). It's not very useful for getting a feel for how
> something with a mixed read/write workload does though, which is unfortunate
> because I think that scenario is much more common than what it does test.

all true, but it's pretty easy to rig custom (-f) commands for virtually
any test you want.

merlin
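For example, a hypothetical custom read-mostly test using 8.4-era script
syntax (the table name matches 8.4's pgbench; the scale assumption is mine,
since a scale-100 database has 10,000,000 accounts):

cat > readmix.sql <<'EOF'
\setrandom aid 1 10000000
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
EOF
pgbench -n -c 16 -T 300 -f readmix.sql bench
# swap the SELECT for an UPDATE, or pass several -f scripts, to shift the mix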
On Thu, Nov 19, 2009 at 2:39 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Thu, Nov 19, 2009 at 4:10 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>> You can use pgbench to either get interesting peak read results, or peak
>> write ones, but it's not real useful for things in between. The standard
>> test basically turns into a huge stack of writes to a single table, and the
>> select-only one is interesting to gauge either cached or uncached read speed
>> (depending on the scale). It's not very useful for getting a feel for how
>> something with a mixed read/write workload does though, which is unfortunate
>> because I think that scenario is much more common than what it does test.
>
> all true, but it's pretty easy to rig custom (-f) commands for
> virtually any test you want.

My primary use of pgbench is to exercise a machine as part of acceptance
testing. After using it to do power plug pulls, I run it for a week or two
to exercise the drive array and controller mainly. If a machine runs
smoothly for a week at a load factor of 20 or 30, and the volume of updates
pgbench generates doesn't overwhelm it, I'm pretty happy.
On 13.11.2009, at 14:57, Laszlo Nagy wrote:

> I was thinking about ARECA 1320 with 2GB memory + BBU.
> Unfortunately, I cannot find information about using ARECA cards
> with SSD drives.
They told me: currently not supported, but they have positive customer
reports. No date yet for implementation of the TRIM command in firmware.
...
> My other option is to buy two SLC SSD drives and use RAID1. It would
> cost about the same, but has less redundancy and less capacity.
> Which is the faster? 8-10 MLC disks in RAID 6 with a good caching
> controller, or two SLC disks in RAID1?
I just went the MLC path with X25-Ms, mainly to save energy. The freshly
assembled box has one SSD for WAL and one RAID 0 with four SSDs as table
space. Everything runs smoothly on an areca 1222 with BBU, which turned
all write caches off. OS is FreeBSD 8.0. I aligned all partitions on 1 MB
boundaries.

Next week I will install 8.4.1 and run pgbench for pull-the-plug testing.
I would like to get some advice from the list for testing the SSDs!

Axel
---
axel.rau@chaos1.de PGP-Key:29E99DD6 +49 151 2300 9283 computing @ chaos claudius
On Thu, 19 Nov 2009, Greg Smith wrote: > This is why turning the cache off can tank performance so badly--you're going > to be writing a whole 128K block no matter what if it's force to disk without > caching, even if it's just to write a 8K page to it. Theoretically, this does not need to be the case. Now, I don't know what the Intel drives actually do, but remember that for flash, it is the *erase* cycle that has to be done in large blocks. Writing itself can be done in small blocks, to previously erased sites. The technology for combining small writes into sequential writes has been around for 17 years or so in http://portal.acm.org/citation.cfm?id=146943&dl= so there really isn't any excuse for modern flash drives not giving really fast small writes. Matthew -- for a in past present future; do for b in clients employers associates relatives neighbours pets; do echo "The opinions here in no way reflect the opinions of my $a $b." done; done
On Wed, Nov 18, 2009 at 8:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Scott Carey <scott@richrelevance.com> writes: >> For your database DATA disks, leaving the write cache on is 100% acceptable, >> even with power loss, and without a RAID controller. And even in high write >> environments. > > Really? How hard have you tested that configuration? > >> That is what the XLOG is for, isn't it? > > Once we have fsync'd a data change, we discard the relevant XLOG > entries. If the disk hasn't actually put the data on stable storage > before it claims the fsync is done, you're screwed. > > XLOG only exists to centralize the writes that have to happen before > a transaction can be reported committed (in particular, to avoid a > lot of random-access writes at commit). It doesn't make any > fundamental change in the rules of the game: a disk that lies about > write complete will still burn you. > > In a zero-seek-cost environment I suspect that XLOG wouldn't actually > be all that useful. You would still need it to guard against partial page writes, unless we have some guarantee that those can't happen. And once your transaction has scattered its transaction id into various xmin and xmax over many tables, you need an atomic, durable repository to decide if that id has or has not committed. Maybe clog fsynced on commit would serve this purpose? Jeff
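For what it's worth, the partial-page-write guard Jeff describes already
has a user-facing knob, full_page_writes; a sketch of where to look
(relaxing it is only safe if the device really does write 8K pages
atomically, which is exactly what is in doubt in this thread):

psql -c "SHOW full_page_writes;"
# postgresql.conf:
#   full_page_writes = off    # assumes atomic 8K page writes; risky otherwise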
Axel Rau wrote:
>
> On 13.11.2009, at 14:57, Laszlo Nagy wrote:
>
>> I was thinking about ARECA 1320 with 2GB memory + BBU. Unfortunately,
>> I cannot find information about using ARECA cards with SSD drives.
> They told me: currently not supported, but they have positive customer
> reports. No date yet for implementation of the TRIM command in firmware.
> ...
>> My other option is to buy two SLC SSD drives and use RAID1. It would
>> cost about the same, but has less redundancy and less capacity. Which
>> is the faster? 8-10 MLC disks in RAID 6 with a good caching
>> controller, or two SLC disks in RAID1?

Despite my other problems, I've found that the Intel X25-Es work remarkably
well. The key issue for short, fast transactions seems to be how fast an
fdatasync() call can run, forcing the commit to disk, and allowing the
transaction to return to userspace. With all the caches off, the Intel
X25-E beat a standard disk by a factor of about 10. Attached is a short C
program which may be of use.

For what it's worth, we have actually got a pretty decent (and redundant)
setup using a RAIS array of RAID1:

[primary server]
  SSD }
      } RAID1 ---} DRBD --- /var/lib/postgresql
  SSD }             |
                    | (back-to-back gigE, DRBD protocol B)
[secondary server]  |
  SSD }             |
      } RAID1 ---}--+
  SSD }

The servers connect back-to-back with a dedicated Gigabit ethernet cable,
and DRBD is running in protocol B. We can pull the power out of 1 server,
and be using the next within 30 seconds, and with no dataloss.

Richard

/* fdatasync timing test: write a small record and force it to disk,
 * NUM_ITER times in a row. Total runtime / NUM_ITER approximates the
 * per-commit cost of the storage. */
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

#define NUM_ITER 1024

int main ( int argc, char **argv ) {
	const char data[] = "Liberate";
	size_t data_len = strlen ( data );
	const char *filename;
	int fd;
	unsigned int i;

	if ( argc != 2 ) {
		fprintf ( stderr, "Syntax: %s output_file\n", argv[0] );
		exit ( 1 );
	}
	filename = argv[1];

	/* refuse to clobber an existing file */
	fd = open ( filename, ( O_WRONLY | O_CREAT | O_EXCL ), 0666 );
	if ( fd < 0 ) {
		fprintf ( stderr, "Could not create \"%s\": %s\n",
			  filename, strerror ( errno ) );
		exit ( 1 );
	}

	for ( i = 0 ; i < NUM_ITER ; i++ ) {
		if ( write ( fd, data, data_len ) != data_len ) {
			fprintf ( stderr, "Could not write: %s\n",
				  strerror ( errno ) );
			exit ( 1 );
		}
		/* force the write to stable storage before continuing */
		if ( fdatasync ( fd ) != 0 ) {
			fprintf ( stderr, "Could not fdatasync: %s\n",
				  strerror ( errno ) );
			exit ( 1 );
		}
	}
	return 0;
}
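To run Richard's test, something like the following should work (the source
file name is my assumption; point it at a file on the device under test):

gcc -O2 -o fdatasync_test fdatasync_test.c
time ./fdatasync_test /mnt/testdrive/syncfile
# commits/sec is roughly 1024 / elapsed seconds; a 7200RPM disk with its
# cache off should land near 120, per the figures in the next message
rm /mnt/testdrive/syncfile          # the program refuses to overwrite it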
Richard Neill wrote: > The key issue for short,fast transactions seems to be > how fast an fdatasync() call can run, forcing the commit to disk, and > allowing the transaction to return to userspace. > Attached is a short C program which may be of use. Right. I call this the "commit rate" of the storage, and on traditional spinning disks it's slightly below the rotation speed of the media (i.e. 7200RPM = 120 commits/second). If you've got a battery-backed cache in front of standard disks, you can easily clear 10K commits/second. I normally test that out with sysbench, because I use that for some other tests anyway: sysbench --test=fileio --file-fsync-freq=1 --file-num=1 --file-total-size=16384 --file-test-mode=rndwr run | grep "Requests/sec" -- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
On Fri, Nov 20, 2009 at 7:27 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Richard Neill wrote:
>>
>> The key issue for short, fast transactions seems to be
>> how fast an fdatasync() call can run, forcing the commit to disk, and
>> allowing the transaction to return to userspace.
>> Attached is a short C program which may be of use.
>
> Right. I call this the "commit rate" of the storage, and on traditional
> spinning disks it's slightly below the rotation speed of the media (i.e.
> 7200RPM = 120 commits/second). If you've got a battery-backed cache in
> front of standard disks, you can easily clear 10K commits/second.

...until you overflow the cache. A battery backed cache does not break
the laws of physics...it just provides a higher burst rate (plus whatever
advantages can be gained by peeking into the write queue and
re-arranging/grouping writes). I learned the hard way that how your raid
controller behaves in overflow situations can cause catastrophic
performance degradations...

merlin
Greg Smith wrote: > Merlin Moncure wrote: > > I am right now talking to someone on postgresql irc who is measuring > > 15k iops from x25-e and no data loss following power plug test. > The funny thing about Murphy is that he doesn't visit when things are > quiet. It's quite possible the window for data loss on the drive is > very small. Maybe you only see it one out of 10 pulls with a very > aggressive database-oriented write test. Whatever the odd conditions > are, you can be sure you'll see them when there's a bad outage in actual > production though. > > A good test program that is a bit better at introducing and detecting > the write cache issue is described at > http://brad.livejournal.com/2116715.html Wow, I had not seen that tool before. I have added a link to it from our documentation, and also added a mention of our src/tools/fsync test tool to our docs. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. + Index: doc/src/sgml/config.sgml =================================================================== RCS file: /cvsroot/pgsql/doc/src/sgml/config.sgml,v retrieving revision 1.233 diff -c -c -r1.233 config.sgml *** doc/src/sgml/config.sgml 13 Nov 2009 22:43:39 -0000 1.233 --- doc/src/sgml/config.sgml 28 Nov 2009 16:12:46 -0000 *************** *** 1432,1437 **** --- 1432,1439 ---- The default is the first method in the above list that is supported by the platform. The <literal>open_</>* options also use <literal>O_DIRECT</> if available. + The utility <filename>src/tools/fsync</> in the PostgreSQL source tree + can do performance testing of various fsync methods. This parameter can only be set in the <filename>postgresql.conf</> file or on the server command line. </para> Index: doc/src/sgml/wal.sgml =================================================================== RCS file: /cvsroot/pgsql/doc/src/sgml/wal.sgml,v retrieving revision 1.59 diff -c -c -r1.59 wal.sgml *** doc/src/sgml/wal.sgml 9 Apr 2009 16:20:50 -0000 1.59 --- doc/src/sgml/wal.sgml 28 Nov 2009 16:12:57 -0000 *************** *** 86,91 **** --- 86,93 ---- ensure data integrity. Avoid disk controllers that have non-battery-backed write caches. At the drive level, disable write-back caching if the drive cannot guarantee the data will be written before shutdown. + You can test for reliable I/O subsystem behavior using <ulink + url="http://brad.livejournal.com/2116715.html">diskchecker.pl</ulink>. </para> <para>
Bruce Momjian wrote:
> Greg Smith wrote:
>> A good test program that is a bit better at introducing and detecting
>> the write cache issue is described at
>> http://brad.livejournal.com/2116715.html
>
> Wow, I had not seen that tool before. I have added a link to it from
> our documentation, and also added a mention of our src/tools/fsync test
> tool to our docs.

One challenge with many of these test programs is that some filesystems
(ext3 is one) will flush drive caches on fsync() *sometimes*, but not
always. If your test program happens to do a sequence of commands that
makes an fsync() actually flush a disk's caches, it might mislead you if
your actual application has a different series of system calls.

For example, ext3 fsync() will issue write barrier commands if the inode
was modified; but not if the inode wasn't.

See test program here:
http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg272253.html
and read two paragraphs further to see how touching the inode makes ext3
fsync behave differently.
Ron Mayer wrote:
> Bruce Momjian wrote:
>> Greg Smith wrote:
>>> A good test program that is a bit better at introducing and detecting
>>> the write cache issue is described at
>>> http://brad.livejournal.com/2116715.html
>>
>> Wow, I had not seen that tool before. I have added a link to it from
>> our documentation, and also added a mention of our src/tools/fsync test
>> tool to our docs.
>
> One challenge with many of these test programs is that some filesystems
> (ext3 is one) will flush drive caches on fsync() *sometimes*, but not
> always. If your test program happens to do a sequence of commands that
> makes an fsync() actually flush a disk's caches, it might mislead you if
> your actual application has a different series of system calls.
>
> For example, ext3 fsync() will issue write barrier commands
> if the inode was modified; but not if the inode wasn't.
>
> See test program here:
> http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg272253.html
> and read two paragraphs further to see how touching
> the inode makes ext3 fsync behave differently.

I thought our only problem was testing the I/O subsystem --- I never
suspected the file system might lie too. That email indicates that a
large percentage of our install base is running on unreliable file
systems --- why have I not heard about this before? Do the write
barriers allow data loss but prevent data inconsistency? It sounds like
they are effectively running with synchronous_commit = off.

-- 
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote:
> I thought our only problem was testing the I/O subsystem --- I never
> suspected the file system might lie too. That email indicates that a
> large percentage of our install base is running on unreliable file
> systems --- why have I not heard about this before? Do the write
> barriers allow data loss but prevent data inconsistency? It sounds like
> they are effectively running with synchronous_commit = off.
>
You might occasionally catch me ranting here that Linux write barriers
are not a useful solution at all for PostgreSQL, and that you must turn
the disk write cache off rather than expect the barrier implementation
to do the right thing. This sort of bugginess is why. The reason why it
doesn't bite more people is that most Linux systems don't turn on write
barrier support by default, and there's a number of situations that can
disable barriers even if you did try to enable them. It's still pretty
unusual to have a working system with barriers turned on nowadays; I
really doubt it's "a large percentage of our install base".

I've started keeping most of my notes about where ext3 is vulnerable to
issues in Wikipedia, specifically
http://en.wikipedia.org/wiki/Ext3#No_checksumming_in_journal ; I just
updated that section to point out the specific issue Ron pointed out.
Maybe we should point people toward that in the docs, I try to keep that
article correct.

-- 
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.com
Greg Smith wrote:
> Bruce Momjian wrote:
>> I thought our only problem was testing the I/O subsystem --- I never
>> suspected the file system might lie too. That email indicates that a
>> large percentage of our install base is running on unreliable file
>> systems --- why have I not heard about this before? Do the write
>> barriers allow data loss but prevent data inconsistency? It sounds like
>> they are effectively running with synchronous_commit = off.
>>
> The reason why it doesn't bite more people is that most Linux systems
> don't turn on write barrier support by default, and there's a number of
> situations that can disable barriers even if you did try to enable them.
> It's still pretty unusual to have a working system with barriers turned
> on nowadays; I really doubt it's "a large percentage of our install base".

Ah, so it is only when write barriers are enabled, and they are not
enabled by default --- OK, that makes sense.

> I've started keeping most of my notes about where ext3 is vulnerable to
> issues in Wikipedia, specifically
> http://en.wikipedia.org/wiki/Ext3#No_checksumming_in_journal ; I just
> updated that section to point out the specific issue Ron pointed out.
> Maybe we should point people toward that in the docs, I try to keep that
> article correct.

Yes, good idea.

-- 
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote: >> For example, ext3 fsync() will issue write barrier commands >> if the inode was modified; but not if the inode wasn't. >> >> See test program here: >> http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg272253.html >> and read two paragraphs further to see how touching >> the inode makes ext3 fsync behave differently. > > I thought our only problem was testing the I/O subsystem --- I never > suspected the file system might lie too. That email indicates that a > large percentage of our install base is running on unreliable file > systems --- why have I not heard about this before? It came up on these lists a few times in the past. Here's one example. http://archives.postgresql.org/pgsql-performance/2008-08/msg00159.php As far as I can tell, most of the threads ended with people still suspecting lying hard drives. But to the best of my ability I can't find any drives that actually lie when sent the commands to flush their caches. But there are various combinations of ext3 & linux MD that decide not to send IDE FLUSH_CACHE_EXT (nor the similar SCSI SYNCHRONIZE CACHE command) under various situations. I wonder if there are enough ext3 users out there that postgres should touch the inodes before doing an fsync. > Do the write barriers allow data loss but prevent data inconsistency? If I understand right, data inconsistency could occur too. One aspect of the write barriers is flushing a hard drive's caches. > It sounds like they are effectively running with synchronous_commit = off. And with the (mythical?) hard drive with lying caches.
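To make the inode-touching suggestion concrete, here is a minimal sketch of what such a workaround could look like (the function name is illustrative, not anything in the postgres source; the two mode values are arbitrary and only exist to guarantee the inode actually changes on every call). It complements the detection program Ron posts next.

////////////////////////////////////////////////////////////////////
/* Sketch of the ext3 workaround discussed above: force an inode
   change so that ext3's fsync() issues the write barrier / drive
   cache flush it would otherwise skip when only data changed. */
#include <sys/stat.h>
#include <unistd.h>

int fsync_with_inode_touch(int fd)
{
    static int flip = 0;

    /* alternate between two harmless modes so the inode really changes */
    if (fchmod(fd, flip ? 0644 : 0664) != 0)
        return -1;
    flip = !flip;
    return fsync(fd);
}
////////////////////////////////////////////////////////////////////

On a filesystem that honors fsync() by itself, the fchmod() is just harmless overhead; on ext3 as described above, it is what forces the flush to actually happen.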
Bruce Momjian wrote: > Greg Smith wrote: >> Bruce Momjian wrote: >>> I thought our only problem was testing the I/O subsystem --- I never >>> suspected the file system might lie too. That email indicates that a >>> large percentage of our install base is running on unreliable file >>> systems --- why have I not heard about this before? >>> >> The reason why it >> doesn't bite more people is that most Linux systems don't turn on write >> barrier support by default, and there's a number of situations that can >> disable barriers even if you did try to enable them. It's still pretty >> unusual to have a working system with barriers turned on nowadays; I >> really doubt it's "a large percentage of our install base". > > Ah, so it is only when write barriers are enabled, and they are not > enabled by default --- OK, that makes sense. The test program I linked up-thread shows that fsync does nothing unless the inode's touched on an out-of-the-box Ubuntu 9.10 using ext3 on a straight from Dell system. Surely that's a common config, no? If I uncomment the fchmod lines below I can see that even with ext3 and write caches enabled on my drives it does indeed wait. Note that EXT4 doesn't show the problem on the same system. Here's a slightly modified test program that's a bit easier to run. If you run the program and it exits right away, your system isn't waiting for platters to spin.

////////////////////////////////////////////////////////////////////
/*
** based on http://article.gmane.org/gmane.linux.file-systems/21373
** http://thread.gmane.org/gmane.linux.kernel/646040
** If this program returns instantly, the fsync() lied.
** If it takes a second or so, fsync() probably works.
** On ext3 and drives that cache writes, you probably need
** to uncomment the fchmod's to make fsync work right.
*/
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
  if (argc < 2) {
    printf("usage: fs <filename>\n");
    exit(1);
  }
  int fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0666);
  int i;
  for (i = 0; i < 100; i++) {
    char byte = 0;          /* the data written doesn't matter here */
    pwrite(fd, &byte, 1, 0);
    // fchmod(fd, 0644);    /* uncomment both to dirty the inode ... */
    // fchmod(fd, 0664);    /* ... and force ext3 to really flush    */
    fsync(fd);
  }
  close(fd);
  return 0;
}
////////////////////////////////////////////////////////////////////

ron@ron-desktop:/tmp$ /usr/bin/time ./a.out foo
0.00user 0.00system 0:00.01elapsed 21%CPU (0avgtext+0avgdata 0maxresident)k
Ron Mayer wrote: > Bruce Momjian wrote: > > Greg Smith wrote: > >> Bruce Momjian wrote: > >>> I thought our only problem was testing the I/O subsystem --- I never > >>> suspected the file system might lie too. That email indicates that a > >>> large percentage of our install base is running on unreliable file > >>> systems --- why have I not heard about this before? > >>> > >> The reason why it > >> doesn't bite more people is that most Linux systems don't turn on write > >> barrier support by default, and there's a number of situations that can > >> disable barriers even if you did try to enable them. It's still pretty > >> unusual to have a working system with barriers turned on nowadays; I > >> really doubt it's "a large percentage of our install base". > > > > Ah, so it is only when write barriers are enabled, and they are not > > enabled by default --- OK, that makes sense. > > The test program I linked up-thread shows that fsync does nothing > unless the inode's touched on an out-of-the-box Ubuntu 9.10 using > ext3 on a straight from Dell system. > > Surely that's a common config, no? Yea, this certainly suggests that the problem is widespread. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On 11/19/09 1:04 PM, "Greg Smith" <greg@2ndquadrant.com> wrote: > That won't help. Once the checkpoint is done, the problem isn't just > that the WAL segments are recycled. The server isn't going to use them > even if they were there. The reason why you can erase/recycle them is > that you're doing so *after* writing out a checkpoint record that says > you don't have to ever look at them again. What you'd actually have to > do is hack the server code to insert that delay after every fsync--there > are none that you can cheat on and not introduce a corruption > possibility. The whole WAL/recovery mechanism in PostgreSQL doesn't > make a lot of assumptions about what the underlying disk has to actually > do beyond the fsync requirement; the flip side to that robustness is > that it's the one you can't ever violate safely. Yeah, I guess it's not so easy. Having the system "hold" one extra checkpoint worth of segments and then during recovery, always replay that previous one plus the current might work, but I don't know if that could cause corruption. I assume replaying a log twice won't, so replaying the N-1 checkpoint, then the current one, might work. If so, that would be a cool feature -- so long as the N-2 checkpoint is no longer in the OS or I/O hardware caches when checkpoint N completes, you're safe! It's probably more complicated though, especially with respect to things like MVCC on DDL changes. > Right. It's not used like the write-cache on a regular hard drive, > where they're buffering 8MB-32MB worth of writes just to keep seek > overhead down. It's there primarily to allow combining writes into > large chunks, to better match the block size of the underlying SSD flash > cells (128K). Having enough space for two full cells allows spooling > out the flash write to a whole block while continuing to buffer the next > one. > > This is why turning the cache off can tank performance so badly--you're > going to be writing a whole 128K block no matter what if it's forced to > disk without caching, even if it's just to write a 8K page to it. As others mentioned, flash must erase a whole block at once, but it can write sequentially to a block in much smaller chunks. I believe that MLC and SLC differ a bit here; SLC can write smaller subsections of the erase block. A little old but still very useful: http://research.microsoft.com/apps/pubs/?id=63596 > That's only going to reach 1/16 of the usual write speed on single page > writes. And that's why you should also be concerned at whether > disabling the write cache impacts the drive longevity, lots of small > writes going out in small chunks is going to wear flash out much faster > than if the drive is allowed to wait until it's got a full sized block > to write every time. This is still a concern, since even if the SLC cells are technically capable of writing sequentially in smaller chunks, with the write cache off they may not do so. > > The fact that the cache is so small is also why it's harder to catch the > drive doing the wrong thing here. The plug test is pretty sensitive to > a problem when you've got megabytes worth of cached writes that are > spooling to disk at spinning hard drive speeds. The window for loss on > a SSD with no seek overhead and only a moderate number of KB worth of > cached data is much, much smaller. Doesn't mean it's gone though.
> It's a shame that the design wasn't improved just a little bit; a cheap > capacitor and blocking new writes once the incoming power dropped is all > it would take to make these much more reliable for database use. But > that would raise the price, and not really help anybody but the small > subset of the market that cares about durable writes. Yup. There are manufacturers who claim no data loss on power failure; hopefully these become more common. http://www.wdc.com/en/products/ssd/technology.asp?id=1 I still contend it's a lot safer than a hard drive. I have not seen one fail yet (out of about 150 heavy use drive-years on X25-Ms). Any system that does not have a battery backed write cache will be faster and safer with an SSD, with write cache on, than hard drives with write cache on. BBU caching is not fail-safe either: batteries wear out, cards die or malfunction. If you need the maximum data integrity, you will probably go with a battery-backed cache raid setup with or without SSDs. If you don't go that route SSD's seem like the best option. The 'middle ground' of software raid with hard drives with their write caches off doesn't seem useful to me at all. I can't think of one use case that isn't better served by a slightly cheaper array of disks with a hardware bbu card (if the data is important or data size is large) OR a set of SSD's (if performance is more important than data safety). >> 4: Yet another solution: The drives DO adhere to write barriers properly. >> A filesystem that used these in the process of fsync() would be fine too. >> So XFS without LVM or MD (or the newer versions of those that don't ignore >> barriers) would work too. >> > If I really trusted anything beyond the very basics of the filesystem to > really work well on Linux, this whole issue would be moot for most of > the production deployments I do. Ideally, fsync would just push out the > minimum of what's needed, it would call the appropriate write cache > flush mechanism the way the barrier implementation does when that all > works, life would be good. Alternately, you might even switch to using > O_SYNC writes instead, which on a good filesystem implementation are > both accelerated and safe compared to write/fsync (I've seen that work > as expected on Veritas VxFS for example). > We could all move to OpenSolaris where that stuff does work right... ;) I think a lot of what makes ZFS slower for some tasks is that it correctly implements and uses write barriers... > Meanwhile, in the actual world we live in, patches that make writes more > durable by default are dropped by the Linux community because they tank > performance for too many types of loads, I'm frightened to turn on > O_SYNC at all on ext3 because of reports of corruption on the lists > here, fsync does way more work than it needs to, and the way the > filesystem and block drivers have been separated makes it difficult to > do any sort of device write cache control from userland. This is why I > try to use the simplest, best tested approach out there whenever possible. > Oh I hear you :) At least ext4 looks like an improvement for the RHEL6/CentOS6 timeframe. Checksums are handy. Many of my systems though don't need the highest data reliability. And a raid 0 of X-25 M's will be much, much safer than the same thing of regular hard drives, and faster. Putting in a few of those on one system soon (yes M, won't put WAL on it).
2 such drives kick the crap out of anything else for the price when performance is most important and the data is just a copy of something stored in a much safer place than any single server. Previously on such systems, a caching raid card would be needed for performance, but without a bbu data loss risk is very high (much higher than an SSD with caching on -- 256K versus 512M cache!). And an SSD costs less than the raid card. So long as the total data size isn't too big they work well. And even then, some tablespaces can be put on a large HD leaving the more critical ones on the SSD. I estimate the likelihood of complete data loss from a 2 SSD raid-0 as the same as a 4-disk RAID 5 of hard drives. There is a big difference between a couple corrupted files and a lost drive... I have recovered postgres systems with corruption by reindexing and restoring single tables from backups. When one drive in a stripe is lost or a pair in a raid 10 goes down, all is lost. I wonder -- has anyone seen an Intel SSD randomly die like a hard drive? I'm still trying to get an "M" to wear out by writing about 120GB a day to it for a year. But rough calculations show that I'm likely years from trouble... By then I'll have upgraded to the gen 3 or 4 drives. > -- > Greg Smith 2ndQuadrant Baltimore, MD > PostgreSQL Training, Services and Support > greg@2ndQuadrant.com www.2ndQuadrant.com > >
On Fri, 13 Nov 2009, Greg Smith wrote: > In order for a drive to work reliably for database use such as for > PostgreSQL, it cannot have a volatile write cache. You either need a write > cache with a battery backup (and a UPS doesn't count), or to turn the cache > off. The SSD performance figures you've been looking at are with the drive's > write cache turned on, which means they're completely fictitious and > exaggerated upwards for your purposes. In the real world, that will result > in database corruption after a crash one day. Seagate are claiming to be on the ball with this one. http://www.theregister.co.uk/2009/12/08/seagate_pulsar_ssd/ Matthew -- The third years are wandering about all worried at the moment because they have to hand in their final projects. Please be sympathetic to them, say things like "ha-ha-ha", but in a sympathetic tone of voice -- Computer Science Lecturer
Matthew Wakeling wrote: > On Fri, 13 Nov 2009, Greg Smith wrote: > > In order for a drive to work reliably for database use such as for > > PostgreSQL, it cannot have a volatile write cache. You either need a write > > cache with a battery backup (and a UPS doesn't count), or to turn the cache > > off. The SSD performance figures you've been looking at are with the drive's > > write cache turned on, which means they're completely fictitious and > > exaggerated upwards for your purposes. In the real world, that will result > > in database corruption after a crash one day. > > Seagate are claiming to be on the ball with this one. > > http://www.theregister.co.uk/2009/12/08/seagate_pulsar_ssd/ I have updated our documentation to mention that even SSD drives often have volatile write-back caches. Patch attached and applied. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do + If your life is a hard drive, Christ can be your backup. + Index: doc/src/sgml/wal.sgml =================================================================== RCS file: /cvsroot/pgsql/doc/src/sgml/wal.sgml,v retrieving revision 1.61 diff -c -c -r1.61 wal.sgml *** doc/src/sgml/wal.sgml 3 Feb 2010 17:25:06 -0000 1.61 --- doc/src/sgml/wal.sgml 20 Feb 2010 18:26:40 -0000 *************** *** 59,65 **** same concerns about data loss exist for write-back drive caches as exist for disk controller caches. Consumer-grade IDE and SATA drives are particularly likely to have write-back caches that will not survive a ! power failure. To check write caching on <productname>Linux</> use <command>hdparm -I</>; it is enabled if there is a <literal>*</> next to <literal>Write cache</>; <command>hdparm -W</> to turn off write caching. On <productname>FreeBSD</> use --- 59,66 ---- same concerns about data loss exist for write-back drive caches as exist for disk controller caches. Consumer-grade IDE and SATA drives are particularly likely to have write-back caches that will not survive a ! power failure. Many solid-state drives also have volatile write-back ! caches. To check write caching on <productname>Linux</> use <command>hdparm -I</>; it is enabled if there is a <literal>*</> next to <literal>Write cache</>; <command>hdparm -W</> to turn off write caching. On <productname>FreeBSD</> use
Bruce Momjian wrote: > Matthew Wakeling wrote: >> On Fri, 13 Nov 2009, Greg Smith wrote: >>> In order for a drive to work reliably for database use such as for >>> PostgreSQL, it cannot have a volatile write cache. You either need a write >>> cache with a battery backup (and a UPS doesn't count), or to turn the cache >>> off. The SSD performance figures you've been looking at are with the drive's >>> write cache turned on, which means they're completely fictitious and >>> exaggerated upwards for your purposes. In the real world, that will result >>> in database corruption after a crash one day. >> Seagate are claiming to be on the ball with this one. >> >> http://www.theregister.co.uk/2009/12/08/seagate_pulsar_ssd/ > > I have updated our documentation to mention that even SSD drives often > have volatile write-back caches. Patch attached and applied. Hmmm. That got me thinking: consider ZFS and HDD with volatile cache. Do the characteristics of ZFS avoid this issue entirely? -- Dan Langille BSDCan - The Technical BSD Conference : http://www.bsdcan.org/ PGCon - The PostgreSQL Conference: http://www.pgcon.org/
Dan Langille wrote: > Bruce Momjian wrote: > > Matthew Wakeling wrote: > >> On Fri, 13 Nov 2009, Greg Smith wrote: > >>> In order for a drive to work reliably for database use such as for > >>> PostgreSQL, it cannot have a volatile write cache. You either need a write > >>> cache with a battery backup (and a UPS doesn't count), or to turn the cache > >>> off. The SSD performance figures you've been looking at are with the drive's > >>> write cache turned on, which means they're completely fictitious and > >>> exaggerated upwards for your purposes. In the real world, that will result > >>> in database corruption after a crash one day. > >> Seagate are claiming to be on the ball with this one. > >> > >> http://www.theregister.co.uk/2009/12/08/seagate_pulsar_ssd/ > > > > I have updated our documentation to mention that even SSD drives often > > have volatile write-back caches. Patch attached and applied. > > Hmmm. That got me thinking: consider ZFS and HDD with volatile cache. > Do the characteristics of ZFS avoid this issue entirely? No, I don't think so. ZFS only avoids partial page writes. ZFS still assumes something sent to the drive is permanent or it would have no way to operate. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do + If your life is a hard drive, Christ can be your backup. +
On Feb 20, 2010, at 3:19 PM, Bruce Momjian wrote: > Dan Langille wrote: >> Bruce Momjian wrote: >>> Matthew Wakeling wrote: >>>> On Fri, 13 Nov 2009, Greg Smith wrote: >>>>> In order for a drive to work reliably for database use such as for >>>>> PostgreSQL, it cannot have a volatile write cache. You either need a write >>>>> cache with a battery backup (and a UPS doesn't count), or to turn the cache >>>>> off. The SSD performance figures you've been looking at are with the drive's >>>>> write cache turned on, which means they're completely fictitious and >>>>> exaggerated upwards for your purposes. In the real world, that will result >>>>> in database corruption after a crash one day. >>>> Seagate are claiming to be on the ball with this one. >>>> >>>> http://www.theregister.co.uk/2009/12/08/seagate_pulsar_ssd/ >>> >>> I have updated our documentation to mention that even SSD drives often >>> have volatile write-back caches. Patch attached and applied. >> >> Hmmm. That got me thinking: consider ZFS and HDD with volatile cache. >> Do the characteristics of ZFS avoid this issue entirely? > > No, I don't think so. ZFS only avoids partial page writes. ZFS still > assumes something sent to the drive is permanent or it would have no way > to operate. > ZFS is write-back cache aware, and safe provided the drive's cache flushing and write barrier related commands work. It will flush data in 'transaction groups' and flush the drive write caches at the end of those transactions. Since it's copy on write, it can ensure that all the changes in the transaction group appear on disk, or all are lost. This all works so long as the cache flush commands do. > -- > Bruce Momjian <bruce@momjian.us> http://momjian.us > EnterpriseDB http://enterprisedb.com > PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do > + If your life is a hard drive, Christ can be your backup. +
Scott Carey wrote: > On Feb 20, 2010, at 3:19 PM, Bruce Momjian wrote: > > > Dan Langille wrote: > >> Bruce Momjian wrote: > >>> Matthew Wakeling wrote: > >>>> On Fri, 13 Nov 2009, Greg Smith wrote: > >>>>> In order for a drive to work reliably for database use such as for > >>>>> PostgreSQL, it cannot have a volatile write cache. You either need a write > >>>>> cache with a battery backup (and a UPS doesn't count), or to turn the cache > >>>>> off. The SSD performance figures you've been looking at are with the drive's > >>>>> write cache turned on, which means they're completely fictitious and > >>>>> exaggerated upwards for your purposes. In the real world, that will result > >>>>> in database corruption after a crash one day. > >>>> Seagate are claiming to be on the ball with this one. > >>>> > >>>> http://www.theregister.co.uk/2009/12/08/seagate_pulsar_ssd/ > >>> > >>> I have updated our documentation to mention that even SSD drives often > >>> have volatile write-back caches. Patch attached and applied. > >> > >> Hmmm. That got me thinking: consider ZFS and HDD with volatile cache. > >> Do the characteristics of ZFS avoid this issue entirely? > > > > No, I don't think so. ZFS only avoids partial page writes. ZFS still > > assumes something sent to the drive is permanent or it would have no way > > to operate. > > > > ZFS is write-back cache aware, and safe provided the drive's > cache flushing and write barrier related commands work. It will > flush data in 'transaction groups' and flush the drive write > caches at the end of those transactions. Since it's copy on > write, it can ensure that all the changes in the transaction > group appear on disk, or all are lost. This all works so long > as the cache flush commands do. Agreed, though I thought the problem was that SSDs lie about their cache flush like SATA drives do, or is there something I am missing? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote: > Agreed, though I thought the problem was that SSDs lie about their > cache flush like SATA drives do, or is there something I am missing? There's exactly one case I can find[1] where this century's IDE drives lied more than any other drive with a cache: Under 120GB Maxtor drives from late 2003 to early 2004. and it's apparently been worked around for years. Those drives claimed to support the "FLUSH_CACHE_EXT" feature (IDE command 0xEA), but did not support sending 48-bit commands which was needed to send the cache flushing command. And for that case a workaround for Linux was quickly identified by checking for *both* the support for 48-bit commands and support for the flush cache extension[2]. Beyond those 2004 drive + 2003 kernel systems, I think most of the rest of such reports have been various misfeatures in some of Linux's filesystems (like EXT3 that only wants to send drives cache-flushing commands when inodes change[3]) and linux software raid misfeatures.... ...and ISTM those would affect SSDs the same way they'd affect SATA drives. [1] http://lkml.org/lkml/2004/5/12/132 [2] http://lkml.org/lkml/2004/5/12/200 [3] http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg272253.html
Ron Mayer wrote:
> Bruce Momjian wrote:
>> Agreed, though I thought the problem was that SSDs lie about their
>> cache flush like SATA drives do, or is there something I am missing?
>
> There's exactly one case I can find[1] where this century's IDE
> drives lied more than any other drive with a cache:
Ron is correct that the problem of mainstream SATA drives accepting the cache flush command but not actually doing anything with it is long gone at this point. If you have a regular SATA drive, it almost certainly supports proper cache flushing. And if your whole software/storage stack understands all that, you should not end up with corrupted data just because there's a volatile write cache in there.
But the point of this whole testing exercise coming back into vogue again is that SSDs have returned this negligent behavior to the mainstream again. See http://opensolaris.org/jive/thread.jspa?threadID=121424 for a discussion of this in a ZFS context just last month. There are many documented cases of Intel SSDs that will fake a cache flush, such that the only way to get good reliable writes is to totally disable their write caches--at which point performance is so bad you might as well have gotten a RAID10 setup instead (and longevity is toast too).
This whole area remains a disaster, and extreme distrust of all the SSD storage vendors is advisable at this point. Basically, if I don't see the capacitor responsible for flushing outstanding writes, and get a clear description from the manufacturer of how the cached writes are going to be handled in the event of a power failure, at this point I have to assume the answer is "badly and your data will be eaten". And the prices for SSDs that meet that requirement are still quite steep. I keep hoping somebody will address this market at something lower than the standard "enterprise" prices. The upcoming SandForce designs seem to have thought this through correctly: http://www.anandtech.com/storage/showdoc.aspx?i=3702&p=6 But the product's not out to the general public yet (just like the Seagate units that claim to have capacitor backups--I heard a rumor those are also SandForce designs actually, so they may be the only ones doing this right and aiming at a lower price).
-- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
On 22-2-2010 6:39 Greg Smith wrote: > But the point of this whole testing exercise coming back into vogue > again is that SSDs have returned this negligent behavior to the > mainstream again. See > http://opensolaris.org/jive/thread.jspa?threadID=121424 for a discussion > of this in a ZFS context just last month. There are many documented > cases of Intel SSDs that will fake a cache flush, such that the only way > to get good reliable writes is to totally disable their write > caches--at which point performance is so bad you might as well have > gotten a RAID10 setup instead (and longevity is toast too). That's weird. Intel's SSD's didn't have a write cache afaik: "I asked Intel about this and it turns out that the DRAM on the Intel drive isn't used for user data because of the risk of data loss, instead it is used as memory by the Intel SATA/flash controller for deciding exactly where to write data (I'm assuming for the wear leveling/reliability algorithms)." http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3403&p=10 But that is the old version; perhaps the second generation does have a bit of write caching. I can understand an SSD might do unexpected things when it loses power all of a sudden. It will probably try to group writes to fill a single block (and those blocks vary in size but are normally way larger than those of a normal spinning disk; they are values like 256 or 512KB) and it might lose that "waiting until a full block can be written"-data, or perhaps it just couldn't complete a full block-write due to the power failure. Although that behavior isn't really what you want, it would be incorrect to blame write caching for the behavior if the device doesn't even have a write cache ;) Best regards, Arjen
Greg Smith wrote: > Ron Mayer wrote: > > Bruce Momjian wrote: > > > >> Agreed, thought I thought the problem was that SSDs lie about their > >> cache flush like SATA drives do, or is there something I am missing? > >> > > > > There's exactly one case I can find[1] where this century's IDE > > drives lied more than any other drive with a cache: > > Ron is correct that the problem of mainstream SATA drives accepting the > cache flush command but not actually doing anything with it is long gone > at this point. If you have a regular SATA drive, it almost certainly > supports proper cache flushing. And if your whole software/storage > stacks understands all that, you should not end up with corrupted data > just because there's a volative write cache in there. OK, but I have a few questions. Is a write to the drive and a cache flush command the same? Which file systems implement both? I thought a write to the drive was always assumed to flush it to the platters, assuming the drive's cache is set to write-through. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do + If your life is a hard drive, Christ can be your backup. +
Ron Mayer wrote: > Bruce Momjian wrote: > > Agreed, though I thought the problem was that SSDs lie about their > > cache flush like SATA drives do, or is there something I am missing? > > There's exactly one case I can find[1] where this century's IDE > drives lied more than any other drive with a cache: > > Under 120GB Maxtor drives from late 2003 to early 2004. > > and it's apparently been worked around for years. > > Those drives claimed to support the "FLUSH_CACHE_EXT" feature (IDE > command 0xEA), but did not support sending 48-bit commands which > was needed to send the cache flushing command. > > And for that case a workaround for Linux was quickly identified by > checking for *both* the support for 48-bit commands and support for the > flush cache extension[2]. > > > Beyond those 2004 drive + 2003 kernel systems, I think most of the rest > of such reports have been various misfeatures in some of Linux's > filesystems (like EXT3 that only wants to send drives cache-flushing > commands when inodes change[3]) and linux software raid misfeatures.... > > ...and ISTM those would affect SSDs the same way they'd affect SATA drives. I think the point is not that drives lie about their write-back and write-through behavior, but rather that many SATA/IDE drives default to write-back, and not write-through, and many administrators and file systems are not aware of this behavior. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote: > Greg Smith wrote: >> .... If you have a regular SATA drive, it almost certainly >> supports proper cache flushing.... > > OK, but I have a few questions. Is a write to the drive and a cache > flush command the same? I believe they're different as of ATAPI-6 from 2001. > Which file systems implement both? Seems ZFS and recent ext4 have thought these interactions out thoroughly. Find a slow ext4 that people complain about, and that's the one doing it right :-). Ext3 has some particularly odd annoyances where it flushes and waits for certain writes (ones involving inode changes) but doesn't bother to flush others (just data changes). As far as I can tell, with ext3 you need userspace utilities to make sure flushes occur when you need them. At one point I was tempted to try to put such userspace hacks into postgres. I know less about other file systems. Apparently the NTFS guys are aware of such stuff - but I don't know what kind of fsync equivalent you'd need to make it happen. Also worth noting - Linux's software raid stuff (MD and LVM) needs to handle this right as well - and last I checked (sometime last year) the default setups didn't. > I thought a > write to the drive was always assumed to flush it to the platters, > assuming the drive's cache is set to write-through. Apparently somewhere around here: http://www.t10.org/t13/project/d1410r3a-ATA-ATAPI-6.pdf they were separated in the IDE world.
Arjen van der Meijden wrote: > That's weird. Intel's SSD's didn't have a write cache afaik: > "I asked Intel about this and it turns out that the DRAM on the Intel > drive isn't used for user data because of the risk of data loss, > instead it is used as memory by the Intel SATA/flash controller for > deciding exactly where to write data (I'm assuming for the wear > leveling/reliability algorithms)." > http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3403&p=10 Read further down: "Despite the presence of the external DRAM, both the Intel controller and the JMicron rely on internal buffers to cache accesses to the SSD...Intel's controller has a 256KB SRAM on-die." That's the problematic part: the Intel controllers have a volatile 256KB write cache stored deep inside the SSD controller, and issuing a standard SATA write cache flush command doesn't seem to clear it. Makes the drives troublesome for database use. > I can understand a SSD might do unexpected things when it loses power > all of a sudden. It will probably try to group writes to fill a single > block (and those blocks vary in size but are normally way larger than > those of a normal spinning disk, they are values like 256 or 512KB) > and it might loose that "waiting until a full block can be > written"-data or perhaps it just couldn't complete a full block-write > due to the power failure. > Although that behavior isn't really what you want, it would be > incorrect to blame write caching for the behavior if the device > doesn't even have a write cache ;) If you write data and that write call returns before the data hits disk, it's a write cache, period. And if that write cache loses its contents if power is lost, it's a volatile write cache that can cause database corruption. The fact that the one on the Intel devices is very small, basically just dealing with the block chunking behavior you describe, doesn't change either of those facts. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
On 02/22/2010 08:04 PM, Greg Smith wrote: > Arjen van der Meijden wrote: >> That's weird. Intel's SSD's didn't have a write cache afaik: >> "I asked Intel about this and it turns out that the DRAM on the Intel >> drive isn't used for user data because of the risk of data loss, >> instead it is used as memory by the Intel SATA/flash controller for >> deciding exactly where to write data (I'm assuming for the wear >> leveling/reliability algorithms)." >> http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3403&p=10 > > Read further down: > > "Despite the presence of the external DRAM, both the Intel controller > and the JMicron rely on internal buffers to cache accesses to the > SSD...Intel's controller has a 256KB SRAM on-die." > > That's the problematic part: the Intel controllers have a volatile > 256KB write cache stored deep inside the SSD controller, and issuing a > standard SATA write cache flush command doesn't seem to clear it. > Makes the drives troublesome for database use. I had read the above when posted, and then looked up SRAM. SRAM seems to suggest it will hold the data even after power loss, but only for a period of time. As long as power can restore within a few minutes, it seemed like this would be ok? >> I can understand a SSD might do unexpected things when it loses power >> all of a sudden. It will probably try to group writes to fill a >> single block (and those blocks vary in size but are normally way >> larger than those of a normal spinning disk, they are values like 256 >> or 512KB) and it might loose that "waiting until a full block can be >> written"-data or perhaps it just couldn't complete a full block-write >> due to the power failure. >> Although that behavior isn't really what you want, it would be >> incorrect to blame write caching for the behavior if the device >> doesn't even have a write cache ;) > > If you write data and that write call returns before the data hits > disk, it's a write cache, period. And if that write cache loses its > contents if power is lost, it's a volatile write cache that can cause > database corruption. The fact that the one on the Intel devices is > very small, basically just dealing with the block chunking behavior > you describe, doesn't change either of those facts. > The SRAM seems to suggest that it does not necessarily lose its contents if power is lost - it just doesn't say how long you have to plug it back in. Isn't this similar to a battery-backed cache or capacitor-backed cache? I'd love to have a better guarantee - but is SRAM really such a bad model? Cheers, mark
Ron Mayer wrote: > I know less about other file systems. Apparently the NTFS guys > are aware of such stuff - but don't know what kinds of fsync equivalent > you'd need to make it happen. > It's actually pretty straightforward--better than ext3. Windows with NTFS has been perfectly aware how to do write-through on drives that support it when you execute _commit for some time: http://msdn.microsoft.com/en-us/library/17618685(VS.80).aspx If you switch the postgresql.conf setting to fsync_writethrough on Windows, it will execute _commit where it would execute fsync on other platforms, and that pushes through the drive's caches as it should (unlike fsync in many cases). More about this at http://archives.postgresql.org/pgsql-hackers/2005-08/msg00227.php and http://www.westnet.com/~gsmith/content/postgresql/TuningPGWAL.htm (which also covers OS X). -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
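Assuming Greg's description of the Windows behavior above, here is a minimal sketch of what that platform split looks like at the code level (flush_to_platters is an illustrative name, not a real API):

////////////////////////////////////////////////////////////////////
/* Portable durable-flush sketch: on Windows, _commit() is what
   fsync_writethrough maps to and pushes writes through the drive
   cache; on Unix it falls back to plain fsync(), which -- as this
   thread shows -- may or may not reach the platters. */
#ifdef _WIN32
#include <io.h>
#define flush_to_platters(fd) _commit(fd)
#else
#include <unistd.h>
#define flush_to_platters(fd) fsync(fd)
#endif
////////////////////////////////////////////////////////////////////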
Mark Mielke wrote: > I had read the above when posted, and then looked up SRAM. SRAM seems > to suggest it will hold the data even after power loss, but only for a > period of time. As long as power can restore within a few minutes, it > seemed like this would be ok? The normal type of RAM everyone uses is DRAM, which requires constant "refresh" cycles to keep it working and is pretty power hungry as a result. Power gone, data gone an instant later. There is also Non-volatile SRAM that includes an integrated battery ( http://www.maxim-ic.com/quick_view2.cfm/qv_pk/2648 is a typical example), and that sort of design can be used to build the sort of battery-backed caches that RAID controllers provide. If Intel's drives were built using a NV-SRAM implementation, I'd be a happy owner of one instead of a constant critic of their drives. But regular old SRAM is still completely volatile and loses its contents very quickly after power fails. I doubt you'd even get minutes of time before it's gone. The ease with which data loss failures with these Intel drives continue to be duplicated in the field says their design isn't anywhere near good enough to be considered non-volatile. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
On Mon, Feb 22, 2010 at 6:39 PM, Greg Smith <greg@2ndquadrant.com> wrote: > Mark Mielke wrote: >> >> I had read the above when posted, and then looked up SRAM. SRAM seems to >> suggest it will hold the data even after power loss, but only for a period >> of time. As long as power can restore within a few minutes, it seemed like >> this would be ok? > > The normal type of RAM everyone uses is DRAM, which requires constrant > "refresh" cycles to keep it working and is pretty power hungry as a result. > Power gone, data gone an instant later. Actually, oddly enough, per bit stored dram is much lower power usage than sram, because it only has something like 2 transistors per bit, while sram needs something like 4 or 5 (it's been a couple decades since I took the classes on each). Even with the constant refresh, dram has a lower power draw than sram.
On Mon, Feb 22, 2010 at 7:21 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote: > On Mon, Feb 22, 2010 at 6:39 PM, Greg Smith <greg@2ndquadrant.com> wrote: >> Mark Mielke wrote: >>> >>> I had read the above when posted, and then looked up SRAM. SRAM seems to >>> suggest it will hold the data even after power loss, but only for a period >>> of time. As long as power can restore within a few minutes, it seemed like >>> this would be ok? >> >> The normal type of RAM everyone uses is DRAM, which requires constrant >> "refresh" cycles to keep it working and is pretty power hungry as a result. >> Power gone, data gone an instant later. > > Actually, oddly enough, per bit stored dram is much lower power usage > than sram, because it only has something like 2 transistors per bit, > while sram needs something like 4 or 5 (it's been a couple decades > since I took the classes on each). Even with the constant refresh, > dram has a lower power draw than sram. Note that's power draw per bit. dram is usually much more densely packed (it can be with fewer transistors per cell) so the individual chips for each may have similar power draws while the dram will be 10 times as densely packed as the sram.
On Mon, 22 Feb 2010, Ron Mayer wrote: > > Also worth noting - Linux's software raid stuff (MD and LVM) > needs to handle this right as well - and last I checked (sometime > last year) the default setups didn't. > I think I saw some stuff in the last few months on this issue on the kernel mailing list. You may want to double-check this when 2.6.33 gets released (probably this week) David Lang
> Note that's power draw per bit. dram is usually much more densely > packed (it can be with fewer transistors per cell) so the individual > chips for each may have similar power draws while the dram will be 10 > times as densely packed as the sram. Differences between SRAM and DRAM:
- price per byte (DRAM much cheaper)
- silicon area per byte (DRAM much smaller)
- random access latency: SRAM = fast, uniform, and predictable, usually 0/1 cycles; DRAM = "a few" up to "a lot" of cycles depending on chip type, which page/row/column you want to access, whether it's R or W, whether the page is already open, etc. In fact, DRAM is the new hard disk. SRAM is used mostly when low latency is needed (caches, etc).
- ease of use: SRAM is very easy to use: address, data, read, write, clock. SDRAM needs a smart controller. SRAM is easier to instantiate on a silicon chip.
- power draw: when used at high speeds, SRAM isn't power-saving at all, it's used for speed. However, when not used, the power draw is really negligible.
While it is true that you can recover *some* data out of a SRAM/DRAM chip that hasn't been powered for a few seconds, you can't really trust that data. It's only a forensics tool. Most DRAM now (especially laptop DRAM) includes special power-saving modes which only keep the data retention logic (refresh, etc) powered, but not the rest of the chip (internal caches, IO buffers, etc). Laptops, PDAs, etc all use this feature in suspend-to-RAM mode. In this mode, the power draw is higher than SRAM, but still pretty minimal, so a laptop can stay in suspend-to-RAM mode for days. Anyway, SRAM vs DRAM isn't really relevant for the debate about SSD data integrity. You can back up both with a small battery or ultra-cap. What is important too is that the entire SSD chipset must have been designed with this in mind: it must detect power loss, and correctly react to it, and especially not reset itself or do funny stuff to the memory when the power comes back. Which means at least some parts of the chipset must stay powered to keep their state. Now I wonder about something. SSDs use wear-leveling, which means the information about which block was written where must be kept somewhere. Which means this information must be updated. I wonder how crash-safe and how atomic these updates are, in the face of a power loss. This is just like a filesystem. You've been talking only about data, but the block layout information (metadata) is subject to the same concerns. If the drive says it's written, not only must the data have been written, but also the information needed to locate that data... Therefore I think the yank-the-power-cord test should be done with random writes happening on an aged and mostly-full SSD... and afterwards, I'd be interested to know not only if the last txn really committed, but if some random parts of other stuff weren't "wear-leveled" into oblivion at the power loss...
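A rough sketch of that random-write plug test might look like the following (the file size and block size are made up, and on ext3 you would likely also need the inode-touch trick from earlier in the thread for the fsync to mean anything):

////////////////////////////////////////////////////////////////////
/* Plug-test sketch: stamp random 8K blocks with a sequence number,
   fsync, and only then announce the number as durable.  After
   yanking the cord mid-run, any announced sequence number whose
   stamp is missing from the file means the drive acknowledged a
   flush it never honored. */
#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK   8192
#define NBLOCKS 100000L   /* ~800MB file; size it to mostly fill the drive */

int main(int argc, char *argv[])
{
    char buf[BLOCK];
    long seq;

    if (argc < 2) {
        fprintf(stderr, "usage: plugtest <filename>\n");
        return 1;
    }
    int fd = open(argv[1], O_RDWR | O_CREAT, 0666);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    for (seq = 0;; seq++) {
        long where = (rand() % NBLOCKS) * BLOCK;

        memset(buf, 0, BLOCK);
        snprintf(buf, BLOCK, "seq %ld", seq);        /* stamp the block */
        if (pwrite(fd, buf, BLOCK, (off_t) where) != BLOCK ||
            fsync(fd) != 0) {
            perror("write/fsync");
            return 1;
        }
        printf("%ld durable at %ld\n", seq, where);  /* the claim to verify */
    }
}
////////////////////////////////////////////////////////////////////

Run it against a file on the device under test, pull the cord mid-run, and compare the stamps that survive against the sequence numbers the program claimed were durable.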
On Feb 23, 2010, at 3:49 AM, Pierre C wrote: > Now I wonder about something. SSDs use wear-leveling, which means the > information about which block was written where must be kept somewhere. > Which means this information must be updated. I wonder how crash-safe and > how atomic these updates are, in the face of a power loss. This is just > like a filesystem. You've been talking only about data, but the block > layout information (metadata) is subject to the same concerns. If the > drive says it's written, not only must the data have been written, but > also the information needed to locate that data... > > Therefore I think the yank-the-power-cord test should be done with random > writes happening on an aged and mostly-full SSD... and afterwards, I'd be > interested to know not only if the last txn really committed, but if some > random parts of other stuff weren't "wear-leveled" into oblivion at the > power loss... > A couple years ago I postulated that SSD's could do random writes fast if they remapped blocks. Microsoft's SSD whitepaper at the time hinted at this too. Persisting the remap data is not hard. It goes in the same location as the data, or a separate area that can be written to linearly. Each block may contain its LBA and a transaction ID or other atomic count. Or another block can have that info. When the SSD powers up, it can build its table of LBA -> block by looking at that data and inverting it and keeping the highest transaction ID for duplicate LBA claims. Although SSD's have to ERASE data in a large block at a time (256K to 2M typically), they can write linearly to an erased block in much smaller chunks. Thus, to commit a write, either: data, LBA tag, and txID go in the same block (may require oddly sized blocks); or data is written to one block (not committed yet), then the LBA tag and txID are written elsewhere (which commits the write). Since it's all copy on write, partial writes can't happen. If a block is being moved or compressed when power fails, data should never be lost since the old data still exists; the new version just didn't commit. But new data that is being written may not be committed yet in the case of a power failure unless other measures are taken.
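An illustrative sketch of that power-up scan (this is not any vendor's actual firmware, just the inversion logic Scott describes, with made-up sizes):

////////////////////////////////////////////////////////////////////
/* Rebuild the LBA -> physical-page map by scanning every page's
   (LBA, transaction ID) tag and letting the highest transaction ID
   win any duplicate claims -- older copies of the same LBA are
   stale versions left behind by copy-on-write. */
#include <stdint.h>
#include <string.h>

#define NPAGES 4096            /* physical flash pages (made up) */
#define NLBAS  4096            /* logical block addresses (made up) */

struct tag {
    uint32_t lba;              /* which logical block this page holds */
    uint64_t txid;             /* atomic counter at time of write */
    int      valid;            /* page has been written at all */
};

static int32_t  map[NLBAS];    /* map[lba] = physical page, -1 = none */
static uint64_t newest[NLBAS];

void rebuild_map(const struct tag tags[NPAGES])
{
    memset(map, 0xff, sizeof map);      /* every entry starts out as -1 */
    memset(newest, 0, sizeof newest);
    for (int p = 0; p < NPAGES; p++) {
        uint32_t l = tags[p].lba;
        if (!tags[p].valid || l >= NLBAS)
            continue;
        if (map[l] == -1 || tags[p].txid > newest[l]) {
            map[l] = p;                 /* newer commit wins */
            newest[l] = tags[p].txid;
        }
    }
}
////////////////////////////////////////////////////////////////////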
On Tue, 23 Feb 2010, david@lang.hm wrote: > On Mon, 22 Feb 2010, Ron Mayer wrote: > >> >> Also worth noting - Linux's software raid stuff (MD and LVM) >> needs to handle this right as well - and last I checked (sometime >> last year) the default setups didn't. >> > > I think I saw some stuff in the last few months on this issue on the kernel > mailing list. You may want to double-check this when 2.6.33 gets released > (probably this week) to clarify further (after getting more sleep ;-) I believe that the linux software raid always did the right thing if you did a fsync/fdatasync. However, barriers that filesystems attempted to use to avoid the need for a hard fsync used to be silently ignored. I believe these are now honored (in at least some configurations) However, one thing that you do not get protection against with software raid is the potential for the writes to hit some drives but not others. If this happens the software raid cannot know what the correct contents of the raid stripe are, and so you could lose everything in that stripe (including contents of other files that are not being modified that happened to be in the wrong place on the array) If you have critical data, you _really_ want to use a raid controller with battery backup so that if you lose power you have a chance of eventually completing the write. David Lang
* david@lang.hm <david@lang.hm> [100223 15:05]: > However, one thing that you do not get protection against with software > raid is the potential for the writes to hit some drives but not others. > If this happens the software raid cannot know what the correct contents > of the raid stripe are, and so you could lose everything in that stripe > (including contents of other files that are not being modified that > happened to be in the wrong place on the array) That's for stripe-based raid. Mirror sets like raid-1 should give you either the old data, or the new data, both acceptable responses since the fsync/barrier hasn't "completed". Or have I missed another subtle interaction? a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Tue, 23 Feb 2010, Aidan Van Dyk wrote: > * david@lang.hm <david@lang.hm> [100223 15:05]: > >> However, one thing that you do not get protection against with software >> raid is the potential for the writes to hit some drives but not others. >> If this happens the software raid cannot know what the correct contents >> of the raid stripe are, and so you could loose everything in that stripe >> (including contents of other files that are not being modified that >> happened to be in the wrong place on the array) > > That's for stripe-based raid. Mirror sets like raid-1 should give you > either the old data, or the new data, both acceptable responses since > the fsync/barreir hasn't "completed". > > Or have I missed another subtle interaction? one problem is that when the system comes back up and attempts to check the raid array, it is not going to know which drive has valid data. I don't know exactly what it does in that situation, but this type of error in other conditions causes the system to take the array offline. David Lang
On 02/23/2010 04:22 PM, david@lang.hm wrote: > On Tue, 23 Feb 2010, Aidan Van Dyk wrote: > >> * david@lang.hm <david@lang.hm> [100223 15:05]: >> >>> However, one thing that you do not get protection against with software >>> raid is the potential for the writes to hit some drives but not others. >>> If this happens the software raid cannot know what the correct contents >>> of the raid stripe are, and so you could lose everything in that >>> stripe >>> (including contents of other files that are not being modified that >>> happened to be in the wrong place on the array) >> >> That's for stripe-based raid. Mirror sets like raid-1 should give you >> either the old data, or the new data, both acceptable responses since >> the fsync/barrier hasn't "completed". >> >> Or have I missed another subtle interaction? > > one problem is that when the system comes back up and attempts to > check the raid array, it is not going to know which drive has valid > data. I don't know exactly what it does in that situation, but this > type of error in other conditions causes the system to take the array > offline. I think the real concern here is that depending on how the data is read later - and depending on which disks it reads from - it could read *either* old or new, at any time in the future. I.e. it reads "new" from disk 1 the first time, and then an hour later it reads "old" from disk 2. I think this concern might be invalid for a properly running system, though. When a RAID array is not cleanly shut down, the RAID array should run in "degraded" mode until it can be sure that the data is consistent. In this case, it should pick one drive, and call it the "live" one, and then rebuild the other from the "live" one. Until it is re-built, it should only satisfy reads from the "live" one, or parts of the "rebuilding" one that are known to be clean. I use mdadm software RAID, and all of my reading (including some of its source code) and experience (shutting down the box uncleanly) tells me it is working properly. In fact, the "rebuild" process can get quite ANNOYING as the whole system becomes much slower during rebuild, and rebuild of large partitions can take hours to complete. For mdadm, there is a not-so-well-known "write-intent bitmap" capability. Once enabled, mdadm will embed a small bitmap (128 bits?) into the partition, and each bit will indicate a section of the partition. Before writing to a section, it will mark that section as dirty using this bitmap. It will leave this bit set for some time after the partition is "clean" (lazy clear). The effect of this is that at any point in time, only certain sections of the drive are dirty, and on recovery, it is a lot cheaper to only rebuild the dirty sections. It works really well. So, I don't think this has to be a problem. There are solutions, and any solution that claims to be complete should offer these sorts of capabilities. Cheers, mark
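A toy sketch of the ordering Mark describes (mdadm's real on-disk bitmap format differs; this only shows the mark-dirty-before-write protocol):

////////////////////////////////////////////////////////////////////
/* Write-intent bitmap sketch: persist the "this region is in flight"
   bit before touching the data, clear it lazily afterwards.  On an
   unclean restart, only regions whose bit is still set need resync. */
#include <stdint.h>

#define SECTIONS 128                   /* coarse regions of the array */
static uint8_t dirty[SECTIONS / 8];

void mark_dirty(int s) { dirty[s / 8] |= (uint8_t)(1 << (s % 8)); }
void mark_clean(int s) { dirty[s / 8] &= (uint8_t)~(1 << (s % 8)); }
int  is_dirty(int s)   { return dirty[s / 8] & (1 << (s % 8)); }

void raid_write(int section /* , const void *data, ... */)
{
    mark_dirty(section);
    /* 1. flush the bitmap to stable storage                  */
    /* 2. write the data to every mirror / stripe member      */
    /* 3. later, lazily: mark_clean(section) and flush again  */
}

int needs_resync(int section)          /* consulted after a crash */
{
    return is_dirty(section);
}
////////////////////////////////////////////////////////////////////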
You *should* never lose a whole stripe ... for example, RAID-5 updates do "read old data / parity, write new data, write new parity" ... there is no need to touch any other data disks, so they will be preserved through the rebuild. Similarly, if only one block is being updated there is no need to update the entire stripe.
David - what caused /dev/md to decide to take an array offline?
Cheers
Dave
I have added documentation about the ATAPI drive flush command, and the
typical SSD behavior.

---------------------------------------------------------------------------

Greg Smith wrote:
> Ron Mayer wrote:
> > Bruce Momjian wrote:
> >> Agreed, though I thought the problem was that SSDs lie about their
> >> cache flush like SATA drives do, or is there something I am missing?
> >
> > There's exactly one case I can find[1] where this century's IDE
> > drives lied more than any other drive with a cache:
>
> Ron is correct that the problem of mainstream SATA drives accepting the
> cache flush command but not actually doing anything with it is long gone
> at this point.  If you have a regular SATA drive, it almost certainly
> supports proper cache flushing.  And if your whole software/storage
> stack understands all that, you should not end up with corrupted data
> just because there's a volatile write cache in there.
>
> But the point of this whole testing exercise coming back into vogue
> again is that SSDs have returned this negligent behavior to the
> mainstream.  See
> http://opensolaris.org/jive/thread.jspa?threadID=121424 for a
> discussion of this in a ZFS context just last month.  There are many
> documented cases of Intel SSDs that will fake a cache flush, such that
> the only way to get good reliable writes is to totally disable their
> write caches--at which point performance is so bad you might as well
> have gotten a RAID10 setup instead (and longevity is toast too).
>
> This whole area remains a disaster area, and extreme distrust of all
> the SSD storage vendors is advisable at this point.  Basically, if I
> don't see the capacitor responsible for flushing outstanding writes,
> and get a clear description from the manufacturer of how the cached
> writes are going to be handled in the event of a power failure, at this
> point I have to assume the answer is "badly and your data will be
> eaten".  And the prices for SSDs that meet that requirement are still
> quite steep.  I keep hoping somebody will address this market at
> something lower than the standard "enterprise" prices.  The upcoming
> SandForce designs seem to have thought this through correctly:
> http://www.anandtech.com/storage/showdoc.aspx?i=3702&p=6  But the
> product's not out to the general public yet (just like the Seagate
> units that claim to have capacitor backups--I heard a rumor those are
> also SandForce designs actually, so they may be the only ones doing
> this right and aiming at a lower price).
>
> --
> Greg Smith  2ndQuadrant US  Baltimore, MD
> PostgreSQL Training, Services and Support
> greg@2ndQuadrant.com   www.2ndQuadrant.us

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com
PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
+ If your life is a hard drive, Christ can be your backup. +

Index: doc/src/sgml/wal.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/wal.sgml,v
retrieving revision 1.62
diff -c -c -r1.62 wal.sgml
*** doc/src/sgml/wal.sgml	20 Feb 2010 18:28:37 -0000	1.62
--- doc/src/sgml/wal.sgml	27 Feb 2010 01:37:03 -0000
***************
*** 59,66 ****
      same concerns about data loss exist for write-back drive caches as
      exist for disk controller caches.  Consumer-grade IDE and SATA drives are
      particularly likely to have write-back caches that will not survive a
!     power failure.  Many solid-state drives also have volatile write-back
!     caches.
      To check write caching on <productname>Linux</> use <command>hdparm
      -I</>;  it is enabled if there is a <literal>*</> next to
      <literal>Write cache</>; <command>hdparm -W</> to turn off
      write caching.  On <productname>FreeBSD</> use
--- 59,69 ----
      same concerns about data loss exist for write-back drive caches as
      exist for disk controller caches.  Consumer-grade IDE and SATA drives are
      particularly likely to have write-back caches that will not survive a
!     power failure, though <acronym>ATAPI-6</> introduced a drive cache
!     flush command that some file systems use, e.g. <acronym>ZFS</>.
!     Many solid-state drives also have volatile write-back
!     caches, and many do not honor cache flush commands by default.
!     To check write caching on <productname>Linux</> use <command>hdparm
      -I</>;  it is enabled if there is a <literal>*</> next to
      <literal>Write cache</>; <command>hdparm -W</> to turn off
      write caching.  On <productname>FreeBSD</> use
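To make the failure mode the new paragraph warns about concrete: the
only durability request a typical application makes is fsync() (or
fdatasync()), and the kernel is then expected to translate that into a
drive cache flush such as FLUSH CACHE EXT on SATA.  A minimal C sketch,
assuming Linux; the file name wal-test.dat is made up for illustration:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char buf[] = "commit record\n";
    int fd = open("wal-test.dat", O_WRONLY | O_CREAT, 0600);

    if (fd < 0)
    {
        perror("open");
        return EXIT_FAILURE;
    }
    if (write(fd, buf, strlen(buf)) != (ssize_t) strlen(buf))
    {
        perror("write");
        return EXIT_FAILURE;
    }

    /* The application's durability guarantee ends here: fsync() asks
     * the kernel to push the data through every cache below it. */
    if (fsync(fd) != 0)
    {
        perror("fsync");
        return EXIT_FAILURE;
    }
    printf("fsync reported success\n");
    close(fd);
    return EXIT_SUCCESS;
}

If the drive accepts the flush command but silently ignores it, this
program still prints success; nothing at the application level can
detect the lie, which is why the documentation has to warn about drive
behavior.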
Bruce Momjian wrote:
> I have added documentation about the ATAPI drive flush command, and the
> typical SSD behavior.

If one of us goes back into that section one day to edit again, it might
be worth mentioning that FLUSH CACHE EXT is the actual ATAPI-6 command a
drive needs to support properly.  I wouldn't bother with another doc
edit commit just for that specific part though; it's pretty obscure.

I find it kind of funny how many parallel discussions of even the most
detailed technical implementation points run around the world.  For
example, doesn't
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg30585.html
look exactly like the exchange between myself and Arjen the other day,
referencing the same AnandTech page?

Could be worse; one of us could be the poor sap at
http://opensolaris.org/jive/thread.jspa;jsessionid=41B679C30D136C059E1BB7C06CA7DCE0?messageID=397730
who installed Windows XP, VirtualBox for Windows, and an OpenSolaris VM
inside of it, then was shocked that cache flushes didn't make their way
all the way through that chain and had his 10TB ZFS pool corrupted as a
result.  Hurray for virtualization!

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us
Greg Smith wrote:
> Bruce Momjian wrote:
> > I have added documentation about the ATAPI drive flush command, and
> > the typical SSD behavior.
>
> If one of us goes back into that section one day to edit again, it
> might be worth mentioning that FLUSH CACHE EXT is the actual ATAPI-6
> command a drive needs to support properly.  I wouldn't bother with
> another doc edit commit just for that specific part though; it's
> pretty obscure.

That command name was not easy to find, so I added it to the
documentation.

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com
PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
Bruce Momjian wrote:
> Greg Smith wrote:
>> Bruce Momjian wrote:
>>> I have added documentation about the ATAPI drive flush command, and the
>>
>> If one of us goes back into that section one day to edit again, it
>> might be worth mentioning that FLUSH CACHE EXT is the actual ATAPI-6
>> command a drive needs to support properly.  I wouldn't bother with
>> another doc edit commit just for that specific part though; it's
>> pretty obscure.
>
> That command name was not easy to find, so I added it to the
> documentation.

If we're spelling out specific IDE commands, it might be worth noting
that the corresponding SCSI command is "SYNCHRONIZE CACHE"[1].

Linux apparently sends FLUSH_CACHE commands to IDE drives in the exact
same places it sends SYNCHRONIZE CACHE commands to SCSI drives[2].

It seems that the same file systems, SW RAID layers, virtualization
platforms, and kernels that have a problem sending FLUSH CACHE commands
to SATA drives have the exact same problem sending SYNCHRONIZE CACHE
commands to SCSI drives, with the exact same effect of not getting
writes all the way through the disk caches.  No?

[1] http://linux.die.net/man/8/sg_sync
[2] http://hardware.slashdot.org/comments.pl?sid=149349&cid=12519114
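For the curious, sg_sync[1] boils down to handing a SYNCHRONIZE
CACHE(10) CDB to the device through the Linux SG_IO ioctl.  A rough C
sketch of that call follows; this is an illustration of the mechanism,
assuming a Linux system with the sg driver, not a replacement for
sg3_utils, and it should only be pointed at a scratch device as root:

#include <fcntl.h>
#include <scsi/sg.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    /* SYNCHRONIZE CACHE(10): opcode 0x35; all-zero fields mean
     * "flush the entire cache". */
    unsigned char cdb[10] = { 0x35, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
    unsigned char sense[32];
    struct sg_io_hdr hdr;
    int fd;

    if (argc != 2)
    {
        fprintf(stderr, "usage: %s /dev/sdX\n", argv[0]);
        return 1;
    }
    fd = open(argv[1], O_RDWR);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    memset(&hdr, 0, sizeof(hdr));
    hdr.interface_id = 'S';
    hdr.cmd_len = sizeof(cdb);
    hdr.cmdp = cdb;
    hdr.dxfer_direction = SG_DXFER_NONE;   /* no data phase, just the flush */
    hdr.sbp = sense;
    hdr.mx_sb_len = sizeof(sense);
    hdr.timeout = 20000;                   /* milliseconds */

    if (ioctl(fd, SG_IO, &hdr) < 0)
    {
        perror("SG_IO");
        return 1;
    }
    if (hdr.status != 0)
        fprintf(stderr, "SYNCHRONIZE CACHE rejected, SCSI status 0x%x\n",
                hdr.status);
    else
        printf("SYNCHRONIZE CACHE accepted\n");
    close(fd);
    return 0;
}

Whether the drive actually empties its cache before returning is, as
this whole thread shows, a separate question.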
Ron Mayer wrote:
> Linux apparently sends FLUSH_CACHE commands to IDE drives in the
> exact same places it sends SYNCHRONIZE CACHE commands to SCSI
> drives[2].
> [2] http://hardware.slashdot.org/comments.pl?sid=149349&cid=12519114

Well, that's old enough to not even be completely right anymore about
SATA disks and kernels.  It's FLUSH_CACHE_EXT that was added to ATA-6 to
do the right thing on modern drives, and that's what gets used nowadays;
most of the SSDs out there don't necessarily honor it.  All of which
Bruce's recent doc additions now talk about correctly.

There's this one specific area we know about that the most popular
systems tend to get really wrong all the time; that's got the
appropriate warning now, with the right magic keywords so people can
look into it more if motivated.  While it would be nice to get super
thorough and document everything, I think there are already more docs in
there than this project would prefer to have to maintain in this area.

Are we going to get into IDE, SATA, SCSI, SAS, FC, and iSCSI?  If the
idea is to be complete, that's where this would go.  I don't know that
the documentation needs to address every possible way every possible
filesystem can be flushed.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us
Ron Mayer wrote:
> Bruce Momjian wrote:
> > Greg Smith wrote:
> >> Bruce Momjian wrote:
> >>> I have added documentation about the ATAPI drive flush command, and the
> >>
> >> If one of us goes back into that section one day to edit again, it
> >> might be worth mentioning that FLUSH CACHE EXT is the actual ATAPI-6
> >> command a drive needs to support properly.  I wouldn't bother with
> >> another doc edit commit just for that specific part though; it's
> >> pretty obscure.
> >
> > That command name was not easy to find, so I added it to the
> > documentation.
>
> If we're spelling out specific IDE commands, it might be worth noting
> that the corresponding SCSI command is "SYNCHRONIZE CACHE"[1].
>
> Linux apparently sends FLUSH_CACHE commands to IDE drives in the exact
> same places it sends SYNCHRONIZE CACHE commands to SCSI drives[2].
>
> It seems that the same file systems, SW RAID layers, virtualization
> platforms, and kernels that have a problem sending FLUSH CACHE
> commands to SATA drives have the exact same problem sending
> SYNCHRONIZE CACHE commands to SCSI drives, with the exact same effect
> of not getting writes all the way through the disk caches.

I always assumed SCSI disks had a write-through cache and therefore
didn't need a drive cache flush command.

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com
PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
Greg Smith wrote:
> Ron Mayer wrote:
> > Linux apparently sends FLUSH_CACHE commands to IDE drives in the
> > exact same places it sends SYNCHRONIZE CACHE commands to SCSI
> > drives[2].
> > [2] http://hardware.slashdot.org/comments.pl?sid=149349&cid=12519114
>
> Well, that's old enough to not even be completely right anymore about
> SATA disks and kernels.  It's FLUSH_CACHE_EXT that was added to ATA-6
> to do the right thing on modern drives, and that's what gets used
> nowadays; most of the SSDs out there don't necessarily honor it.  All
> of which Bruce's recent doc additions now talk about correctly.
>
> There's this one specific area we know about that the most popular
> systems tend to get really wrong all the time; that's got the
> appropriate warning now, with the right magic keywords so people can
> look into it more if motivated.  While it would be nice to get super
> thorough and document everything, I think there are already more docs
> in there than this project would prefer to have to maintain in this
> area.
>
> Are we going to get into IDE, SATA, SCSI, SAS, FC, and iSCSI?  If the
> idea is to be complete, that's where this would go.  I don't know that
> the documentation needs to address every possible way every possible
> filesystem can be flushed.

The bottom line is that the reason we have so much detailed
documentation about this is that mostly only database folks care about
such issues, so we end up having to research and document it
ourselves --- I don't see any alternatives.

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com
PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
Bruce Momjian wrote:
> I always assumed SCSI disks had a write-through cache and therefore
> didn't need a drive cache flush command.

There's more detail on all this mess at
http://wiki.postgresql.org/wiki/SCSI_vs._IDE/SATA_Disks and it includes
this perception, which I've recently come to believe isn't actually
correct anymore.  Like the IDE crowd, it looks like one day somebody
said "hey, we lose every write-heavy benchmark badly because we only
have a write-through cache", and that principle fell by the wayside.

What has been true, and I'm starting to think this is what we've all
been observing rather than a write-through cache, is that the proper
cache flushing commands have been there in working form for so much
longer that it's more likely your SCSI driver and drive do the right
thing if the filesystem asks them to.  SCSI SYNCHRONIZE CACHE has a
much longer and prouder history than IDE's FLUSH_CACHE and SATA's
FLUSH_CACHE_EXT.

It's also worth noting that many current SAS drives, the current SCSI
incarnation, are basically SATA drives with a bridge chipset stuck onto
them, or with just the interface board swapped out.  This is one reason
why top-end SAS capacities lag behind consumer SATA drives.
Manufacturers use the consumers as beta testers to get the really
fundamental firmware issues sorted out, and once things are stable they
start stamping out the version with the SAS interface instead.  (Note
that there's a parallel manufacturing approach that makes much smaller
SAS drives, the 2.5" server models or those at higher RPMs, that
doesn't go through this path.  Those are also the really expensive
models, due to economy of scale issues.)  The idea that these would
have fundamentally different write cache behavior doesn't really follow
from that development model.

At this point, there are only two common differences between "consumer"
and "enterprise" hard drives of the same size and RPM when there are
directly matching ones:

1) You might get SAS instead of SATA as the interface, which provides
the more mature command set I was talking about above--and therefore
may give you a sane write-back cache with proper flushing, which is all
the database really expects.

2) The timeouts when there's a read/write problem are tuned down in the
enterprise version, to be more compatible with RAID setups where you
want to push the drive off-line when this happens rather than presuming
you can fix it.  Consumers would prefer that the drive spend a lot of
time doing heroics to try and save their sole copy of the apparently
missing data.

You might get a slightly higher grade of parts if you're lucky too; I
wouldn't count on it though.  That seems to be saved for the high-RPM
or smaller-size drives only.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us
> I always assumed SCSI disks had a write-through cache and therefore
> didn't need a drive cache flush command.

Maximum performance can only be reached with a write-back cache, so the
drive can reorder and cluster writes according to the realtime position
of the heads and platter rotation.

The problem is not the write cache itself; it is that, for your data to
be safe, the "flush cache" or "barrier" command must get all the way
through the application / filesystem to the hardware, going through a
nondescript number of software/firmware/hardware layers, all of which
may:

- not specify if they honor or ignore flush/barrier commands, and which
  ones
- not specify if they will reorder writes, ignoring barriers/flushes,
  or not
- have been written by people who are not aware of such issues
- have been written by companies who are perfectly aware of such issues
  but chose to ignore them to look good in benchmarks
- have some incompatibilities that result in broken behaviour
- have bugs

As far as I'm concerned, a configuration that doesn't properly respect
the commands needed for data integrity is broken.

The sad truth is that, given a software/hardware IO stack, there's no
way to be sure, and testing isn't easy, if it is possible at all.  Some
cache flushes might be ignored under some circumstances.

For this to change, you don't need a hardware change, but a mentality
change.  Flash filesystem developers use flash simulators which measure
wear leveling, etc.  We'd need a virtual box with a simulated virtual
hard drive which is able to check this.

What a mess.
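One crude probe along those lines: time a loop of rewrite-plus-fsync
against one spot on a plain spinning disk.  A single 7200 RPM drive can
physically complete at most about 120 flushed writes per second to the
same sector (one per revolution), so rates far above that, with no
battery-backed controller in the way, strongly suggest a flush is being
dropped somewhere in the stack.  A sketch, assuming Linux; the file
name is made up for illustration:

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    char buf[512];
    int fd = open("flush-probe.dat", O_WRONLY | O_CREAT, 0600);
    int i, iterations = 1000;
    struct timespec t0, t1;
    double secs;

    if (fd < 0)
    {
        perror("open");
        return EXIT_FAILURE;
    }
    memset(buf, 'x', sizeof(buf));

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < iterations; i++)
    {
        /* Rewrite the same block, then demand it be made durable. */
        if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t) sizeof(buf) ||
            fsync(fd) != 0)
        {
            perror("pwrite/fsync");
            return EXIT_FAILURE;
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.0f fsyncs/sec\n", iterations / secs);
    close(fd);
    return EXIT_SUCCESS;
}

A high number here isn't proof of a problem on its own (an SSD or a
battery-backed cache can legitimately be fast); the real test is
pulling the plug mid-run, which is exactly why this is so painful to
verify.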
Greg Smith wrote:
> Bruce Momjian wrote:
>> I always assumed SCSI disks had a write-through cache and therefore
>> didn't need a drive cache flush command.

Some do.  Some have write-back caches.  Some have both(!): a write-back
cache, but the user can explicitly send write-through requests.

Microsoft explains it well (IMHO) here:
http://msdn.microsoft.com/en-us/library/aa508863.aspx

"For example, suppose that the target is a SCSI device with a
write-back cache.  If the device supports write-through requests, the
initiator can bypass the write cache by setting the force unit access
(FUA) bit in the command descriptor block (CDB) of the write command."

> this perception, which I've recently come to believe isn't actually
> correct anymore. ... I'm starting to think this is what
> we've all been observing rather than a write-through cache

I think what we've been observing is that guys with SCSI drives are
more likely to either (a) have battery-backed RAID controllers that
ensure writes succeed, or (b) have other decent RAID controllers that
understand details like that FUA bit and send write-through requests
even if a SCSI device has a write-back cache.

In contrast, most guys with PATA drives are probably running software
RAID (if any) with a RAID stack (older LVM and MD) known to lose the
cache flushing commands.
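As a small aside on checking what the OS itself believes: on reasonably
recent Linux kernels (4.7 and later, if I remember the sysfs addition
correctly) the block layer exposes its idea of a device's cache mode,
which is at least a starting point for the (a)/(b) distinction above.
A sketch; the device name argument (e.g. sda) is illustrative:

#include <stdio.h>

int main(int argc, char **argv)
{
    char path[256], line[64];
    FILE *f;

    if (argc != 2)
    {
        fprintf(stderr, "usage: %s <block device name, e.g. sda>\n",
                argv[0]);
        return 1;
    }
    /* "write back" here means the kernel will issue flush/FUA
     * requests; "write through" means it believes none are needed. */
    snprintf(path, sizeof(path), "/sys/block/%s/queue/write_cache",
             argv[1]);
    f = fopen(path, "r");
    if (f == NULL)
    {
        perror(path);
        return 1;
    }
    if (fgets(line, sizeof(line), f) != NULL)
        printf("%s: %s", argv[1], line);
    fclose(f);
    return 0;
}

Of course this only reports what the kernel thinks; a drive that lies
about its flushes will look exactly the same as an honest one here.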