Thread: SSDs with Postgresql?
The speed benefits of SSDs as benchmarked would seem incredible. Can anybody comment on SSD benefits and problems in real life use?
I maintain some 100 databases on 3 servers, with 32 GB of RAM each and an extremely rich, complex schema. (300+ normalized tables)
I was wondering if anybody here could comment on the benefits of SSD in similar, high-demand rich schema situations?
-Ben
On 04/13/11 9:19 PM, Benjamin Smith wrote:
> The speed benefits of SSDs as benchmarked would seem incredible. Can anybody comment on SSD benefits and problems in real life use?
>
> I maintain some 100 databases on 3 servers, with 32 GB of RAM each and an extremely rich, complex schema. (300+ normalized tables)
>
> I was wondering if anybody here could comment on the benefits of SSD in similar, high-demand rich schema situations?

Consumer-grade MLC SSDs will crash and burn in short order under a heavy transactional workload characterized by sustained small-block random writes. The enterprise-grade SLC SSDs will perform very nicely, but they are very, very expensive, and are found in high-end enterprise database servers like Oracle's Exadata machines.
On Thu, April 14, 2011 06:19, Benjamin Smith wrote:
> The speed benefits of SSDs as benchmarked would seem incredible. Can anybody comment on SSD benefits and problems in real life use?
>
> I maintain some 100 databases on 3 servers, with 32 GB of RAM each and an extremely rich, complex schema. (300+ normalized tables)
>
> I was wondering if anybody here could comment on the benefits of SSD in similar, high-demand rich schema situations?

Even if you only use SSDs for your indexes, the gains are staggering. We use them on several servers, one of which is extremely busy (xid wraparound stuff) and the performance gains are game-changing.

There is no going back. Hint: don't use cheap SSDs - cough up and use Intel.

Cheers
Henry
On 14/04/2011 4:35 PM, Henry C. wrote: > There is no going back. Hint: don't use cheap SSDs - cough up and use Intel. The server-grade SLC stuff with a supercap, I hope, not the scary consumer-oriented MLC "pray you weren't writing anything during power-loss" devices? -- Craig Ringer Tech-related writing at http://soapyfrogs.blogspot.com/
On 04/14/11 1:35 AM, Henry C. wrote:
> Hint: don't use cheap SSDs - cough up and use Intel.

Aren't most of the Intel SSDs still MLC, and don't they still have performance and reliability issues with sustained small-block random writes such as are generated by database servers? The enterprise-grade SLC SSD drives are things like the STEC ZeusIOPS and Seagate Pulsar, and the majority of them end up in EMC and other big-iron SAN systems.
have a look at http://postgresql.1045698.n5.nabble.com/Intel-SSDs-that-may-not-suck-td4268261.html It looks like those are "safe" to use with a db, and aren't that expensive.
On Thu, April 14, 2011 10:51, Craig Ringer wrote:
> On 14/04/2011 4:35 PM, Henry C. wrote:
>> There is no going back. Hint: don't use cheap SSDs - cough up and use Intel.
>
> The server-grade SLC stuff with a supercap, I hope, not the scary consumer-oriented MLC "pray you weren't writing anything during power-loss" devices?

That's what a UPS and genset are for. Who writes critical stuff to *any* drive without power backup?

You have a valid point about using SLC if that's what you need, though. However, MLC works just fine provided you stick them into RAID1. In fact, we use a bunch of them in RAID0 on top of RAID1.

In our environment (clusters) it's all about using cheap consumer-grade commodity hardware with lots of redundancy to cater for the inevitable failures. The trade-off is huge: performance with low cost. We've been using MLC Intel drives since they came out and have never had a failure. Other SSDs we've tried have failed, and so have hard drives.

The point, though, is that there are tremendous performance gains to be had with commodity h/w if you factor in failure rates and make *sure* you have redundancy.

h
On 14/04/2011 10:54, John R Pierce wrote:
> aren't most of the Intel SSD's still MLC, and still have performance and reliability issues with sustained small block random writes such as are generated by database servers? the enterprise grade SLC SSD drives are things like STEC ZeusIOPS and Seagate Pulsar. and the majority of them end up in EMC and other big iron SAN systems.

I think Henry is referring to Intel's X25-E line. They are SLC, enterprise grade. Quite expensive though, ~700 euros for the 64GB version.

We have one of them in a production server (light load), and it works very well so far. The performance gain versus a WD Raptor RAID array is huge; I never tried to quantify it.

Arnaud
On Thu, April 14, 2011 11:30, Leonardo Francalanci wrote:
> have a look at
>
> http://postgresql.1045698.n5.nabble.com/Intel-SSDs-that-may-not-suck-td4268261.html
>
> It looks like those are "safe" to use with a db, and aren't that expensive.

The new SSDs look great. From our experience, we trust SSDs (even MLC) far more than mechanical hard drives.

I believe this perception that SSDs are less "safe" than failure-prone mechanical hard drives will eventually change. In the meantime, we've embraced them and the advantages are compelling.

h
On 14/04/2011 11:40, Henry C. wrote:
> You have a valid point about using SLC if that's what you need though. However, MLC works just fine provided you stick them into RAID1. In fact, we use a bunch of them in RAID0 on top of RAID1.

AFAIK, you won't have TRIM support on RAID-arrayed SSDs. That might change soon, but I think that RAID boards supporting TRIM are still a work in progress.

Arnaud
On Thu, 14 Apr 2011 11:46:12 +0200, Henry C. wrote:
> On Thu, April 14, 2011 11:30, Leonardo Francalanci wrote:
>> have a look at
>> http://postgresql.1045698.n5.nabble.com/Intel-SSDs-that-may-not-suck-td4268261.html
>> It looks like those are "safe" to use with a db, and aren't that expensive.
>
> The new SSDs look great. From our experience, we trust SSDs (even MLC) far more than mechanical hard drives.
>
> I believe this perception that SSDs are less "safe" than failure-prone mechanical hard drives will eventually change.
>
> In the meantime, we've embraced them and the advantages are compelling.
>
> h

One thing you should care about is the so-called write endurance - the number of writes a single memory region can take before it is destroyed. If your SSD does not do transparent allocation, you can destroy it really fast, because every write of a given "block" lands in the same memory segment; a clog/xlog area could fail after 10k-100k writes. If your SSD does have transparent allocation, the internal controller counts the writes to each memory cell and, when that cell nears the end of its lifetime, "associates" the block with a different cell. With transparent allocation you generally need not fear keeping journaling file systems, logs, or other frequently updated data on the drive.

You can estimate the lifetime of your SSD with:

WritesToDestroyCells = write_endurance * disk_size
AvgLifeTime = WritesToDestroyCells / writes_per_sec

Those are high numbers: even for a simple disk rated at 10,000 cycles with 60GB capacity, you would need to send about 600TB of data to that one SSD (not completely true, as you can't send one byte at a time, only full blocks). Of course, to extend the lifetime of an SSD you should use a file system cache or an SSD with a cache, and turn off FS journaling.

Regards,
Radek
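To make that back-of-the-envelope formula concrete, here is a minimal Python sketch of the same calculation. The 10,000-cycle endurance, 60 GB capacity and 5 MB/s sustained write rate are purely illustrative assumptions, not figures from any particular drive, and real devices complicate the picture with wear levelling and write amplification.

# Rough SSD lifetime estimate following the formula above, assuming
# perfect wear levelling. All input numbers are illustrative only.
def ssd_lifetime_days(erase_cycles, disk_size_gb, write_mb_per_sec):
    total_writable_bytes = erase_cycles * disk_size_gb * 1e9   # endurance budget
    bytes_per_day = write_mb_per_sec * 1e6 * 86400             # sustained write load
    return total_writable_bytes / bytes_per_day

if __name__ == "__main__":
    days = ssd_lifetime_days(erase_cycles=10_000, disk_size_gb=60, write_mb_per_sec=5)
    print(f"Estimated lifetime: {days:,.0f} days (~{days / 365:.1f} years)")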
> I believe this perception that SSDs are less "safe" than failure-prone > mechanical hard drives will eventually change. By "safe" I mean they won't corrupt data in case of crash of the machine.
On 14/04/2011 5:40 PM, Henry C. wrote:
>> The server-grade SLC stuff with a supercap, I hope, not the scary consumer-oriented MLC "pray you weren't writing anything during power-loss" devices?
>
> That's what a UPS and genset are for. Who writes critical stuff to *any* drive without power backup?

Even a server with redundant PSUs on a UPS backed by generators can go down hard and unexpectedly. I'd be extremely nervous unless I could afford to lose data since the last backup, or unless I had a really trustworthy replication setup going.

Of course, it's wise to have one or both of those conditions be true anyway, because no redundant storage system will save you from file system corruption caused by an OS bug, data corruption caused by a Pg bug, or a "DELETE FROM critical_table;" by a careless superuser. So I guess it doesn't cost you more than the risk of some downtime to use potentially corruption-prone non-supercap MLC, and it's probably worth it for the performance in your clustered environment.

All I meant with my post was to raise the concern that the OP needs to be aware of the untrustworthy nature of even the low-end Intel SSDs. They're usable, you just have to compensate for their deficiencies.

> You have a valid point about using SLC if that's what you need though. However, MLC works just fine provided you stick them into RAID1. In fact, we use a bunch of them in RAID0 on top of RAID1.

RAID won't help you if they all drop their caches if the power supply throws a wobbly. That said, it's certainly good for the lifetime issues, especially if the units are upgraded or rotated out in phases.

--
Craig Ringer
Tech-related writing at http://soapyfrogs.blogspot.com/
At 06:07 PM 4/14/2011, Radosław Smogura wrote:
> One thing you should care about is the so-called write endurance - the number of writes a single memory region can take before it is destroyed. If your SSD does not do transparent allocation, you can destroy it really fast, because every write of a given "block" lands in the same memory segment; a clog/xlog area could fail after 10k-100k writes. If your SSD does have transparent allocation, the internal controller counts the writes to each memory cell and, when that cell nears the end of its lifetime, "associates" the block with a different cell. With transparent allocation you generally need not fear keeping journaling file systems, logs, or other frequently updated data on the drive.
>
> You can estimate the lifetime of your SSD with:
> WritesToDestroyCells = write_endurance * disk_size
> AvgLifeTime = WritesToDestroyCells / writes_per_sec
>
> Those are high numbers: even for a simple disk rated at 10,000 cycles with 60GB capacity, you would need to send about 600TB of data to that one SSD (not completely true, as you can't send one byte at a time, only full blocks). Of course, to extend the lifetime of an SSD you should use a file system cache or an SSD with a cache, and turn off FS journaling.

I'm not an expert on SSDs, but I believe modern SSDs are supposed to automatically spread the writes across the entire disk where possible - even to the extent of moving already-written stuff.

So if the drives are full or near full, the tradeoff is between lower performance (because you have to keep moving stuff about) or lower lifespan (one area gets overused). If the drives are mostly empty, the SSD's controller has an easier job - it doesn't have to move as much data around.

Regards,
Link.
After a glowing review at AnandTech (including DB benchmarks!) I decided to spring for an OCZ Vertex 3 Pro 120 for evaluation purposes. It cost about $300 with shipping, etc and at this point, won't be putting any
Considering that I sprang for 96 GB of ECC RAM last spring for around $5000, even if I put the OCZ drives in pairs w/RAID1, I'd still come out well ahead if it allows me to put off buying more servers for a year or two.
-Ben
On Thursday, April 14, 2011 02:30:06 AM Leonardo Francalanci wrote:
> have a look at
>
> http://postgresql.1045698.n5.nabble.com/Intel-SSDs-that-may-not-suck-td426
> 8261.html
>
>
> It looks like those are "safe" to use with a db, and aren't that expensive.
On Thu, Apr 14, 2011 at 3:40 AM, Henry C. <henka@cityweb.co.za> wrote: > > On Thu, April 14, 2011 10:51, Craig Ringer wrote: >> On 14/04/2011 4:35 PM, Henry C. wrote: >> >> >>> There is no going back. Hint: don't use cheap SSDs - cough up and use >>> Intel. >>> >> >> The server-grade SLC stuff with a supercap, I hope, not the scary >> consumer-oriented MLC "pray you weren't writing anything during power-loss" >> devices? > > That's what a UPS and genset are for. Who writes critical stuff to *any* > drive without power backup? Because power supply systems with UPS never fail. (hint, I've seen them fail, more than once)
On Thu, Apr 14, 2011 at 12:27:34PM -0600, Scott Marlowe wrote:
>> That's what a UPS and genset are for. Who writes critical stuff to *any* drive without power backup?
>
> Because power supply systems with UPS never fail.

Right, there's obviously a trade-off here. Some of this has to do with how much your data is worth vs. how much the speed is worth. There's also the issue of whether you can stand to lose a few rows, and whether you can stand to lose them for a short time. For instance, collecting user comments might be a matter of great value, but if you write them to more than one system, you might not care whether one of the systems fails briefly. In that case, maybe big redundancy of cheap disks with power backup is good enough to meet the price:value ratio. On stock trades worth maybe millions of dollars, not so much: you miss your teeny window of opportunity to do a trade and suddenly you're out in the street wearing a barrel.

I can think of lots of different points to be along that continuum, and surely nobody is suggesting that there is one right answer for everything.

A
--
Andrew Sullivan
ajs@crankycanuck.ca
Henry C. wrote:
> I believe this perception that SSDs are less "safe" than failure-prone mechanical hard drives will eventually change.

Only because the manufacturers are starting to care about write durability enough to include the right hardware for it. Many of them are less safe right now on some common database tasks.

Intel's gen 1 and gen 2 drives are garbage for database use. I've had customers lose terabytes of data due to them. Yes, every system can fail, but these *will* fail and corrupt your database the first time there's a serious power problem of some sort. And the idea that a UPS is sufficient to protect against that ever happening is wildly optimistic. See http://wiki.postgresql.org/wiki/Reliable_Writes for more background here, and links to reading on the older Intel drives.

I summarized the situation with their newer 320 series drives at http://blog.2ndquadrant.com/en/2011/04/intel-ssd-now-off-the-sherr-sh.html  Those finally get the write flushing right. But the random seek IOPS is wildly lower than you might expect on read/write workloads. My own tests and other sources have all come up with around 3500 IOPS as being a real-world expectation for the larger sizes of these drives.

Also, it is cheap flash, so durability in a server environment won't be great. Don't put your WAL on them if you have a high transaction rate. Put some indexes there instead.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
On Thu, Apr 14, 2011 at 1:14 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> And the idea that a UPS is sufficient to protect against that ever happening is wildly optimistic.

Note that the real danger in relying on a UPS is that most power conditioning / UPS setups tend to fail in total, not in parts. The two times I've seen it happen, the whole grid shut down completely for a few hours.

The first time, we had Oracle, Ingres, Sybase, SQL Server, etc. database servers across the company corrupted. DAYS of recovery time, and since they all failed at once, the machines in replication got corrupted as well. Restoring production dbs from backups took days. The only machine to survive was the corporate intranet running pgsql on twin 15k SCSI drives with a proven reliable battery-backed controller on it. It was mine.

This was a company that lost something like $10k a minute for downtime. And the downtime was measured not in seconds, minutes or hours, but days, because everyone had said the same thing: "The UPS and power conditioners make power plug pull survivability a non issue." When the only machine with an uncorrupted database is the corporate intranet server, the 24/7 production guys look pretty stupid. They also suddenly decided to start doing power plug pull tests on all database servers.

To make matters worse, the kind of system to NEED the higher throughput from SSDs is likely the kind of system to be the worst kind to suffer downtime due to corruption. OTOH, restores from backups should run pretty fast. :)
On Thu, April 14, 2011 18:56, Benjamin Smith wrote:
> After a glowing review at AnandTech (including DB benchmarks!) I decided to spring for an OCZ Vertex 3 Pro 120 for evaluation purposes. It cost about $300 with shipping, etc and at this point, won't be putting any
>
> Considering that I sprang for 96 GB of ECC RAM last spring for around $5000, even if I put the OCZ drives in pairs w/RAID1, I'd still come out well ahead if it allows me to put off buying more servers for a year or two.

Exactly. Be aware of the risks, plan for failure and reap the rewards.
On Thu, April 14, 2011 20:54, Andrew Sullivan wrote:
> On Thu, Apr 14, 2011 at 12:27:34PM -0600, Scott Marlowe wrote:
>>> That's what a UPS and genset are for. Who writes critical stuff to *any* drive without power backup?
>>
>> Because power supply systems with UPS never fail.
>
> Right, there's obviously a trade-off here. Some of this has to do with how much your data is worth vs. how much the speed is worth. There's also the issue of whether you can stand to lose a few rows, and whether you can stand to lose them for a short time. For instance, collecting user comments might be a matter of great value, but if you write them to more than one system, you might not care whether one of the systems fails briefly. In that case, maybe big redundancy of cheap disks with power backup is good enough to meet the price:value ratio. On stock trades worth maybe millions of dollars, not so much: you miss your teeny window of opportunity to do a trade and suddenly you're out in the street wearing a barrel.
>
> I can think of lots of different points to be along that continuum, and surely nobody is suggesting that there is one right answer for everything.

Exactly. To address the issue of something going wrong (it will, given enough time), we recently started trying out Pg's built-in replication (we were using Londiste - nothing wrong with it, but life is simpler now that the Pg overlords have realised what everyone has known for a long time).

When the super-fast SSD-based machine fails, switching to a (slower) standard hard-drive based machine provides continuity and buys time while we get the primary machine back online.

For us, life is better, thanks to SSDs.

h
On Sunday, April 17, 2011 01:55:02 AM Henry C. wrote:
> On Thu, April 14, 2011 18:56, Benjamin Smith wrote:
> > After a glowing review at AnandTech (including DB benchmarks!) I decided
> > to spring for an OCZ Vertex 3 Pro 120 for evaluation purposes. It cost
> > about $300
> >
> > with shipping, etc and at this point, won't be putting any
> >
> > Considering that I sprang for 96 GB of ECC RAM last spring for around
> > $5000, even if I put the OCZ drives in pairs w/RAID1, I'd still come out
> > well ahead if it allows me to put off buying more servers for a year or
> > two.
>
> Exactly. Be aware of the risks, plan for failure and reap the rewards.
Just curious what your thoughts are with respect to buying SSDs and mirroring them with software RAID 1. (I use Linux/CentOS)
-Ben
On 2011-04-19 19:07, Benjamin Smith wrote:
> On Sunday, April 17, 2011 01:55:02 AM Henry C. wrote:
>> Exactly. Be aware of the risks, plan for failure and reap the rewards.
>
> Just curious what your thoughts are with respect to buying SSDs and mirroring them with software RAID 1. (I use Linux/CentOS)
Since SSDs fail when the write cycles are gone, it wouldn't make sense to buy two identical ones and put them in a RAID 1: under normal circumstances they'd fail simultaneously.

An idea I'm thinking of is to put both an OCZ Vertex 2 Pro and an Intel 320 in a software RAID 1 setup. It would have the benefit that there are no complications to be expected from an extra device layer in a hardware RAID card, such as incompatibilities between controller and SSD firmware, or not being able to access the physical disk's SMART values, which in the SSD case are important to figure out the remaining lifetime. Also, since both drives have a supercap, PostgreSQL data on them would survive power failures.

It would be interesting if md could be configured to do reads on both mirror legs and compare them before returning values, like a continual check, but there doesn't seem to be an option for something like that.
-- Yeb Havinga http://www.mgrid.net/ Mastering Medical Data
On 20/04/11 04:28, Yeb Havinga wrote:
> On 2011-04-19 19:07, Benjamin Smith wrote:
>> On Sunday, April 17, 2011 01:55:02 AM Henry C. wrote:
>>> Exactly. Be aware of the risks, plan for failure and reap the rewards.
>>
>> Just curious what your thoughts are with respect to buying SSDs and mirroring them with software RAID 1. (I use Linux/CentOS)
>
> Since SSDs fail when the write cycles are gone, it wouldn't make sense to buy two identical ones and put them in a RAID 1: under normal circumstances they'd fail simultaneously.

I don't think SSD drives wear out in quite the manner you seem to describe.

The wearing out of blocks is not something that occurs at an exact number; the quoted amounts are more of an average, and will vary from block to block and from drive to drive. Also, all decent SSD drives will remap bad blocks as they wear out, so you don't get just one and then die.

Also, the number of erase cycles you can get, over the whole disk, is quite large on modern disks! So large that you'll probably go decades before you wear the disk out, even with continual writes.

Don't buy into the SSD FUD myths..

Cheers,
Toby
On 14/04/11 23:25, Vick Khera wrote:
> On Thu, Apr 14, 2011 at 12:19 AM, Benjamin Smith <lists@benjamindsmith.com> wrote:
>> I was wondering if anybody here could comment on the benefits of SSD in similar, high-demand rich schema situations?
>
> For the last several months, I've been using Texas Memory Systems RamSAN 620 drives on my main DB servers. Having near zero seek times has been a tremendous boon to our performance, and will have pretty much paid for themselves within the next couple of months. Ie, the "throw hardware at it" solution worked really well :)

Hey, I wonder - could you, or someone else with some SSD drives running their DBs in production, check the SMART attributes for their drives? In particular, the Media_Wearout_Indicator - this starts at 100 and goes down towards 1 as the erase cycles add up.

So you can calculate the total estimated lifetime by looking at how much has been used up over how long you've been using the drive in production.

I have a very cheap 64GB consumer SSD used in a personal server (so not in serious production use, but it does see some traffic), and I note that after a year it's still on 100%!

Toby
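For anyone who wants to script Toby's check rather than eyeball it, here is a rough Python sketch using smartmontools. It assumes an Intel-style drive that exposes the Media_Wearout_Indicator attribute (other vendors use different attribute names) and a hypothetical device path of /dev/sda; smartctl usually needs root.

# Sketch: read an SSD's wear indicator via smartmontools.
# Assumes the drive exposes an Intel-style "Media_Wearout_Indicator"
# attribute; adjust the attribute name for other vendors.
import subprocess

def media_wearout(device="/dev/sda"):
    out = subprocess.check_output(["smartctl", "-A", device], text=True)
    for line in out.splitlines():
        if "Media_Wearout_Indicator" in line:
            # Column layout: ID ATTRIBUTE_NAME FLAG VALUE WORST THRESH ...
            return int(line.split()[3])   # normalized value, 100 = like new
    return None

if __name__ == "__main__":
    print("Media_Wearout_Indicator:", media_wearout())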
On 04/20/2011 01:50 AM, Toby Corkindale wrote:
> Also, the number of erase cycles you can get, over the whole disk, is quite large on modern disks!
>
> So large that you'll probably go decades before you wear the disk out, even with continual writes.
>
> Don't buy into the SSD FUD myths..

There is no FUD being spread here. Particularly given the PostgreSQL WAL write pattern, it's not impossible to wear out a SSD placed there in a small number of years. A system with a trivial but not completely idle workload will generate one 16MB WAL segment every 5 minutes, which works out to 4.5GB/day of writes. That's the baseline--the reality is much, much higher than that on most systems.

The fact that every row update can temporarily use more than 8K means that actual write throughput on the WAL can be shockingly large. The smallest customer I work with regularly has a 50GB database, yet they write 20GB of WAL every day. You can imagine how much WAL is generated daily on systems with terabyte databases.

As for what this translates into in the real world, go read http://archives.postgresql.org/message-id/BANLkTi=GsyBfq+ApWPR_qCA7AN+NqT=zww@mail.gmail.com as one worked-out sample. Anyone deploying PostgreSQL onto MLC can't necessarily ignore this issue.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
* Greg Smith:

> The fact that every row update can temporarily use more than 8K means that actual write throughput on the WAL can be shockingly large. The smallest customer I work with regularly has a 50GB database, yet they write 20GB of WAL every day. You can imagine how much WAL is generated daily on systems with terabyte databases.

Interesting. Is there an easy way to monitor WAL traffic? It does not have to be fine-grained, but it might be helpful to know if we're doing 10 GB, 100 GB or 1 TB of WAL traffic on a particular database, should the question of SSDs ever come up.

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99
On Thursday, April 21, 2011 8:33:45 am Florian Weimer wrote:
> * Greg Smith:
>> The fact that every row update can temporarily use more than 8K means that actual write throughput on the WAL can be shockingly large. The smallest customer I work with regularly has a 50GB database, yet they write 20GB of WAL every day. You can imagine how much WAL is generated daily on systems with terabyte databases.
>
> Interesting. Is there an easy way to monitor WAL traffic? It does not have to be fine-grained, but it might be helpful to know if we're doing 10 GB, 100 GB or 1 TB of WAL traffic on a particular database, should the question of SSDs ever come up.

They are found in $DATA/pg_xlog, so checking the size of that directory regularly would get you the information.

--
Adrian Klaver
adrian.klaver@gmail.com
* Adrian Klaver: >> Interesting. Is there an easy way to monitor WAL traffic in away? It >> does not have to be finegrained, but it might be helpful to know if >> we're doing 10 GB, 100 GB or 1 TB of WAL traffic on a particular >> database, should the question of SSDs ever come up. > > They are found in $DATA/pg_xlog so checking the size of that > directory regularly would get you the information. But log files are recycled, so looking at the directory alone does not seem particularly helpful. -- Florian Weimer <fweimer@bfk.de> BFK edv-consulting GmbH http://www.bfk.de/ Kriegsstraße 100 tel: +49-721-96201-1 D-76133 Karlsruhe fax: +49-721-96201-99
On Apr 21, 2011, at 9:44 AM, Florian Weimer wrote: > But log files are recycled, so looking at the directory alone does not > seem particularly helpful. You have to look at the file timestamps. From that you can get an idea of traffic. -- Scott Ribe scott_ribe@elevated-dev.com http://www.elevated-dev.com/ (303) 722-0567 voice
On 04/21/2011 11:33 AM, Florian Weimer wrote:
> Is there an easy way to monitor WAL traffic? It does not have to be fine-grained, but it might be helpful to know if we're doing 10 GB, 100 GB or 1 TB of WAL traffic on a particular database, should the question of SSDs ever come up.

You can use functions like pg_current_xlog_location():
http://www.postgresql.org/docs/9.0/interactive/functions-admin.html

Save a copy of this periodically:

select now(),pg_current_xlog_location();

And you can see WAL volume over time given any two points from that set of samples.

To convert the internal numbers returned by that into bytes, you'll need to do some math on them. Examples showing how that works and code in a few languages:

http://archives.postgresql.org/pgsql-general/2010-10/msg00077.php (by hand)
http://munin-monitoring.org/browser/trunk/plugins/node.d/postgres_streaming_.in?rev=3905 (in Perl)
http://archives.postgresql.org/pgsql-general/2010-10/msg00079.php (in C)
http://postgresql.1045698.n5.nabble.com/How-can-we-tell-how-far-behind-the-standby-is-td3252297.html (in bash with bc(!), other links)

What I keep meaning to write is something that does that as part of the SQL itself, so it gets pulled out of the database already in bytes.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
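As a rough illustration of the math Greg is describing, here is a small Python sketch that diffs two pg_current_xlog_location() samples. The 0xFF000000 multiplier matches the 16MB-segment WAL layout of the releases current at the time (pre-9.3), and the sample locations at the end are made up; later releases also provide server-side functions to do this subtraction for you.

# Sketch: turn two samples of pg_current_xlog_location() (e.g. '2F/8A36A520')
# into a WAL byte count. The constant matches the pre-9.3 WAL layout.
XLOG_FILE_BYTES = 0xFF000000  # usable bytes per logical xlog file (pre-9.3)

def xlog_location_to_bytes(location):
    logid, offset = (int(part, 16) for part in location.split("/"))
    return logid * XLOG_FILE_BYTES + offset

def wal_bytes_between(loc_start, loc_end):
    return xlog_location_to_bytes(loc_end) - xlog_location_to_bytes(loc_start)

if __name__ == "__main__":
    # Two samples taken e.g. a day apart with:  select pg_current_xlog_location();
    gib = wal_bytes_between("2F/8A36A520", "30/1C003A88") / 1024**3
    print(f"{gib:.2f} GiB of WAL written between samples")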
Florian Weimer <fweimer@bfk.de> writes:
> * Adrian Klaver:
>>> Interesting. Is there an easy way to monitor WAL traffic?
>> They are found in $DATA/pg_xlog so checking the size of that directory regularly would get you the information.
> But log files are recycled, so looking at the directory alone does not seem particularly helpful.

"du" would be useless, but you could check the name of the newest WAL segment file from time to time, and do a bit of math to see how much WAL had been written since the previous time.

regards, tom lane
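A sketch of Tom's filename-based variant, under the same pre-9.3 layout assumptions as above; the two segment names at the end are hypothetical examples, not real files.

# Sketch: derive WAL volume from the names of the newest segment files seen
# at two points in time. Names are 24 hex digits: 8 for timeline, 8 for the
# logical xlog file id, 8 for the segment number (16MB segments, pre-9.3).
SEG_BYTES = 16 * 1024**2
XLOG_FILE_BYTES = 0xFF000000

def segment_start_bytes(filename):
    xlogid = int(filename[8:16], 16)
    segno = int(filename[16:24], 16)
    return xlogid * XLOG_FILE_BYTES + segno * SEG_BYTES

# e.g. newest segment yesterday vs. today (hypothetical names):
delta = segment_start_bytes("00000001000000300000001C") - segment_start_bytes("000000010000002F0000008A")
print(f"~{delta / 1024**3:.2f} GiB of WAL since the previous check")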
On Thu, Apr 21, 2011 at 11:22 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Florian Weimer <fweimer@bfk.de> writes:
>> * Adrian Klaver:
>>>> Interesting. Is there an easy way to monitor WAL traffic?
>>> They are found in $DATA/pg_xlog so checking the size of that directory regularly would get you the information.
>> But log files are recycled, so looking at the directory alone does not seem particularly helpful.
>
> "du" would be useless, but you could check the name of the newest WAL segment file from time to time, and do a bit of math to see how much WAL had been written since the previous time.

I'd think using the sysstat package's sar is the way to see how much work your drives are doing, assuming the sysstat package / daemon is set to monitor disk block activity.
----- Original Message -----
> From: "Greg Smith" <greg@2ndQuadrant.com>
> To: pgsql-general@postgresql.org
> Sent: Friday, 22 April, 2011 12:49:28 AM
> Subject: Re: [GENERAL] SSDs with Postgresql?
>
> On 04/20/2011 01:50 AM, Toby Corkindale wrote:
>> Also, the number of erase cycles you can get, over the whole disk, is quite large on modern disks!
>> So large that you'll probably go decades before you wear the disk out, even with continual writes.
>> Don't buy into the SSD FUD myths..
>
> There is no FUD being spread here. Particularly given the PostgreSQL WAL write pattern, it's not impossible to wear out a SSD placed there in a small number of years. A system with a trivial but not completely idle workload will generate one 16MB WAL segment every 5 minutes, which works out to 4.5GB/day of writes. That's the baseline--the reality is much, much higher than that on most systems. The fact that every row update can temporarily use more than 8K means that actual write throughput on the WAL can be shockingly large. The smallest customer I work with regularly has a 50GB database, yet they write 20GB of WAL every day. You can imagine how much WAL is generated daily on systems with terabyte databases.

The larger the database, the larger the number of disks the DB will be spread over, thus keeping the amount of data written per disk to a manageable amount. (Also, larger SSDs seem to have higher write limits too.)

I see Intel is/was claiming their SLC SSDs had a *minimum* lifetime of 2PB in writes for their 64GB disks; for your customer with a 50GB db and 20GB/day of WAL, that would work out at a minimum lifetime of about 100,000 days, or roughly 273 years!

The cheaper "consumer grade" MLC drives should still last a minimum of 5 years at 20GB/day according to their literature. (And what I found was fairly out of date.)

That doesn't seem too bad to me - I don't think I've worked anywhere that keeps their traditional spinning disks in service beyond 5 years either.

> As for what this translates into in the real world, go read http://archives.postgresql.org/message-id/BANLkTi=GsyBfq+ApWPR_qCA7AN+NqT=zww@mail.gmail.com as one worked out sample. Anyone deploying PostgreSQL onto MLC can't necessarily ignore this issue.

Agreed, it shouldn't be ignored, but it seems like you have to have an early-generation SSD and combine that with very heavy write loads in order to risk wearing them out within a matter of years - and there are SMART attributes that should be able to indicate when the drive has used a lot of its life-span as an early-warning signal.

You've linked to one anecdotal failure, but surely there are many more people using SSDs successfully, who just haven't posted about it because it's working fine..?

-Toby
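Taking the quoted vendor figures at face value and using decimal units, the arithmetic behind that lifetime estimate works out as follows; this is a quick sketch, not an endorsement of the rated endurance.

# Worked version of the endurance arithmetic above.
rated_writes_bytes = 2e15          # 2 PB of rated lifetime writes (vendor claim)
wal_bytes_per_day = 20e9           # 20 GB/day of WAL
days = rated_writes_bytes / wal_bytes_per_day
print(f"{days:,.0f} days ~= {days / 365:.0f} years")   # 100,000 days ~= 274 years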
> * Greg Smith:
>> The fact that every row update can temporarily use more than 8K means that actual write throughput on the WAL can be shockingly large. The smallest customer I work with regularly has a 50GB database, yet they write 20GB of WAL every day. You can imagine how much WAL is generated daily on systems with terabyte databases.
>
> Interesting. Is there an easy way to monitor WAL traffic? It does not have to be fine-grained, but it might be helpful to know if we're doing 10 GB, 100 GB or 1 TB of WAL traffic on a particular database, should the question of SSDs ever come up.

If you archive your WAL files, wouldn't that give you a pretty good indication of write activity?

For example, yesterday I archived 74 WAL files, each 16MB. That's about 1.2 gigabytes for a database that takes up about 58 GB.

--
Mike Nolan
* Michael Nolan:
> If you archive your WAL files, wouldn't that give you a pretty good indication of write activity?

WAL archiving may increase WAL traffic considerably, I think. Fresh table contents (after CREATE TABLE or TRUNCATE) is written to the log if WAL archiving is active. This would have a significant performance impact on some of our loads.

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99
* Greg Smith:
> To convert the internal numbers returned by that into bytes, you'll need to do some math on them. Examples showing how that works and code in a few languages:

Thanks for the pointers. Those examples are slightly inconsistent with one another, but I think I've distilled something that should be reasonably accurate. The numbers look as if they are valid; they match my expectations for different databases with different loads.

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99
One thing to remember in this discussion about SSD longevity is that the underlying value of interest is the total number of erase cycles, per block, on the flash devices. Vendors quote lifetime as a number of bytes, but this is calculated using an assumed write amplification factor. That constant varies with the workload, and I suspect that a database WAL will not produce the value used in the vendor marketing material. I don't think you can simply say that I am writing so many Gb WAL files, therefore according to the vendor's spec the device should have lifetime of X years -- presumably you'd need to know how many fsync() calls per second were being made, and then how those translate to flash block erase rate on the far side of the SSD controller firmware.
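A hedged sketch of David's point: fold an explicit write-amplification factor (WAF) into the earlier lifetime estimate and see how strongly it moves the answer. All the numbers below are illustrative assumptions, not measurements; a fsync-heavy WAL workload may see a very different WAF than the vendor's test workload.

# Lifetime estimate with an explicit write-amplification factor.
def lifetime_years(erase_cycles, capacity_gb, host_gb_per_day, waf):
    flash_writes_per_day = host_gb_per_day * waf          # what the flash actually absorbs
    total_flash_writes_gb = erase_cycles * capacity_gb    # raw endurance budget
    return total_flash_writes_gb / flash_writes_per_day / 365

for waf in (1.1, 3.0, 10.0):
    print(f"WAF {waf:>4}: ~{lifetime_years(5_000, 120, 20, waf):.1f} years")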
On Apr 28, 2011, at 7:21 AM, David Boreham wrote:
> I don't think you can simply say that I am writing so many Gb WAL files, therefore according to the vendor's spec

Also, I fully expect the vendors lie about erase cycles as baldly as they lie about MTBF, so I would divide by a very healthy skepticism factor.

--
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice
On 4/28/2011 8:20 AM, Scott Ribe wrote:
>> I don't think you can simply say that I am writing so many Gb WAL files, therefore according to the vendor's spec
>
> Also, I fully expect the vendors lie about erase cycles as baldly as they lie about MTBF, so I would divide by a very healthy skepticism factor.

As a former card-carrying semiconductor company employee, I'm not so sure about this. I'd expect a healthy guard-band on the erase cycles spec (so if they say 100K, the devices will actually achieve better than 2X).

MTBF otoh is a mythical computed value that everyone takes with a big pinch of salt, in my experience. I'm not sure it is possible to lie about MTBF since it's essentially just the result of a calculation based on the number of contacts, cells, traces, etc and the known failure mechanisms for those things. Therefore it will be "true", but not necessarily useful (e.g. it does not take account of process defects or an aging mechanism that were not known at the time of manufacture).
On Apr 28, 2011, at 8:48 AM, David Boreham wrote:
> As a former card-carrying semiconductor company employee, I'm not so sure about this.

Well, yes, you have a good point that in many, if not all, cases we're dealing with different companies. That really should have occurred to me, that manufacturers of SSDs (or at least some of them) might not have an ingrained culture of extreme cost cutting and deceptive ratings--I'm going to use "feeling under the weather" as my excuse. (Of course reliability of some early consumer-grade SSDs was abysmal, but that should be a fairly easy problem to avoid.)

> MTBF otoh is a mythical computed value...

It's not only mythical, it's not even remotely realistic, to the point that it is no exaggeration to call it a bald-faced lie. Sorry, don't remember the university, but there was a nice study of large numbers of disks in data centers, and the result was that actual lifespans were so far from MTBF specs that the remaining disks would have to just about outlive the universe in order to get the mean near the same order of magnitude as the published numbers.

--
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice
On Thu, Apr 21, 2011 at 12:10 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> On 04/21/2011 11:33 AM, Florian Weimer wrote:
>> Is there an easy way to monitor WAL traffic? It does not have to be fine-grained, but it might be helpful to know if we're doing 10 GB, 100 GB or 1 TB of WAL traffic on a particular database, should the question of SSDs ever come up.
>
> You can use functions like pg_current_xlog_location():
> http://www.postgresql.org/docs/9.0/interactive/functions-admin.html
>
> Save a copy of this periodically:
>
> select now(),pg_current_xlog_location();
>
> And you can see WAL volume over time given any two points from that set of samples.
>
> To convert the internal numbers returned by that into bytes, you'll need to do some math on them. Examples showing how that works and code in a few languages:
>
> http://archives.postgresql.org/pgsql-general/2010-10/msg00077.php (by hand)
> http://munin-monitoring.org/browser/trunk/plugins/node.d/postgres_streaming_.in?rev=3905 (in Perl)
> http://archives.postgresql.org/pgsql-general/2010-10/msg00079.php (in C)
> http://postgresql.1045698.n5.nabble.com/How-can-we-tell-how-far-behind-the-standby-is-td3252297.html (in bash with bc(!), other links)
>
> What I keep meaning to write is something that does that as part of the SQL itself, so it gets pulled out of the database already in bytes.

We have an open task to work on this same problem. What we had cobbled together so far was some SQL which converted the xlog value into an integer (it's pretty ugly, but I could send it over if you think it would help), which we could then stick in a monitoring system and graph. To get an idea of traffic, I just multiplied this by 16MB. End result ended up looking like this:
https://circonus.com/shared/graphs/9497d906-4c5b-e6d2-bf91-d8869e7c1668/OnxdZG

Couldn't decide on exactly where to go from there. That's graphing MB/sec, which does map easily in my mind, since xlog streams are in 16MB bursts. It would make more sense for WAL streaming though (but in that case we'd probably want to measure it more precisely).

Robert Treat
play: http://xzilla.net
work: http://omniti.com
On 2011-04-28 21:34, Robert Treat wrote:
> We have an open task to work on this same problem. What we had cobbled together so far was some SQL which converted the xlog value into an integer (it's pretty ugly, but I could send it over if you think it would help), which we could then stick in a monitoring system and graph. To get an idea of traffic, I just multiplied this by 16MB. End result ended up looking like this:
> https://circonus.com/shared/graphs/9497d906-4c5b-e6d2-bf91-d8869e7c1668/OnxdZG
>
> Couldn't decide on exactly where to go from there. That's graphing MB/sec, which does map easily in my mind, since xlog streams are in 16MB bursts. It would make more sense for WAL streaming though (but in that case we'd probably want to measure it more precisely).

If the goal is predicting the EOL of the SSD, graphing IO to the disk/partition, or perhaps graphing the SMART values containing write cycles / GBs written / lifetime curve, could work. Both monitoring disk IO (and iops) and SMART values can be done with Symon (*): an example picture with SMART attributes graphed is at http://i.imgur.com/T4NAq.png - the actual SMART values for an SSD firmware would have to be configured though, since they vary a lot.

(*) http://www.xs4all.nl/~wpd/symon/

--
Yeb Havinga
http://www.mgrid.net/
Mastering Medical Data
Instead of the usual SSD, you may want to consider the 'ioDrive' products from Fusion-io.
http://www.fusionio.com/

This company makes enterprise-class storage on a board populated with flash and managed by their own supposedly-sophisticated drivers. The board + drivers are meant to get around the problems of SSDs, such as write failures. They make both SLC and MLC products, at different sizes, to meet different budgets and purposes.

I tried them about 3 years ago on a database server (not Postgres). The real-world speed was excellent, yet disappointing in that it performed only as well as our high-end RAID-10 from HP. We were looking for even more speed, and were promised that, but did not see it in our trials.

Caveats:
• This was using their first attempt at Windows drivers.
• We may not have tuned the settings properly.

In the end, we chose to keep our RAID for the time being. So, while I can't specifically recommend their products, I certainly suggest considering them. They have drivers for Linux and Windows, but apparently their rumored Mac OS X drivers never came to fruition.

Other benefits beyond speed include size, power, and heat. An internal board with flash saves on all three, compared to a RAID made of either discs or SSDs.

--Basil Bourque
On Thu, 28 Apr 2011 17:27:04 -0500, Basil Bourque <basil.list@me.com> wrote: > So, while I can't specifically recommend their products, I certainly > suggest considering them. Customer of ours is probably lurking on here. We host their servers in our datacenter -- we had a UPS go "pop" after an amazing surge and their servers all went down (weren't paying for N+1 power). They had several FusionIO cards in servers running Postgres and experienced zero corruption. YMMV.
On 22/04/11 01:33, Florian Weimer wrote:
> * Greg Smith:
>> The fact that every row update can temporarily use more than 8K means that actual write throughput on the WAL can be shockingly large. The smallest customer I work with regularly has a 50GB database, yet they write 20GB of WAL every day. You can imagine how much WAL is generated daily on systems with terabyte databases.
>
> Interesting. Is there an easy way to monitor WAL traffic? It does not have to be fine-grained, but it might be helpful to know if we're doing 10 GB, 100 GB or 1 TB of WAL traffic on a particular database, should the question of SSDs ever come up.

One thought I had on monitoring write usage: if you're on Linux with the ext4 filesystem, then it keeps track of some statistics for you.

Check out /sys/fs/ext4/$DEV/lifetime_write_kbytes
(where $DEV is the device the fs is mounted on, eg. sda1, or dm-0, or whatnot - see /dev/mapper to get mappings from LVMs to dm-numbers)

If you log that value every day, you could get an idea of your daily write load.

-Toby
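A small Python sketch of how that logging might look; the device name sda1 is an assumption (use whatever backs your data directory), and the counter only exists for ext4 filesystems on reasonably recent kernels.

# Sketch: sample ext4's lifetime_write_kbytes counter and diff two samples
# to get the write volume for the interval.
import time

def lifetime_write_kbytes(dev="sda1"):
    with open(f"/sys/fs/ext4/{dev}/lifetime_write_kbytes") as f:
        return int(f.read().strip())

if __name__ == "__main__":
    before = lifetime_write_kbytes()
    time.sleep(24 * 3600)            # or cron two samples a day apart
    written_gib = (lifetime_write_kbytes() - before) / 1024**2
    print(f"~{written_gib:.1f} GiB written in the last day")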
On 04/26/2011 10:30 AM, Toby Corkindale wrote:
> I see Intel is/was claiming their SLC SSDs had a *minimum* lifetime of 2PB in writes for their 64GB disks; for your customer with a 50GB db and 20GB/day of WAL, that would work out at a minimum lifetime of about 100,000 days, or roughly 273 years!
>
> The cheaper "consumer grade" MLC drives should still last a minimum of 5 years at 20GB/day according to their literature. (And what I found was fairly out of date.)
>
> That doesn't seem too bad to me - I don't think I've worked anywhere that keeps their traditional spinning disks in service beyond 5 years either.

The comment I made there was that the 20GB/day system was a very small customer. One busy server - and busy servers are also the ones most likely to want SSD - I recently watched chug through 16MB of WAL every 3 seconds, or about 450GB/day. Now, you're right that those systems also aren't running with a tiny amount of flash, either. But the write volume scales along with the size, too. If you're heavily updating records in particular, the WAL volume can be huge relative to the drive space needed to store the result.

As for the idea that I'm just singling out one anecdote, I have terabytes of lost data on multiple systems behind my negativity here. I was just pointing out a public failure that included some post-mortem I liked. I'm not sure if I have any happy customers who were early adopters of regular SLC or MLC drives really; the disaster rate is very close to 100% for the first few generations of those drives I've seen, and I've been around 50-ish of them. I'm hoping the current models shipping now are better; getting the write cache stuff sorted out better will be a big help. But it's been a scary technology for database use so far. The published numbers from the manufacturer literature are a very rosy best case when you're hitting the disk with this type of workload.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
On 29/04/11 16:35, Greg Smith wrote:
> On 04/26/2011 10:30 AM, Toby Corkindale wrote:
>> I see Intel is/was claiming their SLC SSDs had a *minimum* lifetime of 2PB in writes for their 64GB disks; for your customer with a 50GB db and 20GB/day of WAL, that would work out at a minimum lifetime of about 100,000 days, or roughly 273 years!
>> The cheaper "consumer grade" MLC drives should still last a minimum of 5 years at 20GB/day according to their literature. (And what I found was fairly out of date.)
>> That doesn't seem too bad to me - I don't think I've worked anywhere that keeps their traditional spinning disks in service beyond 5 years either.
>
> The comment I made there was that the 20GB/day system was a very small customer. One busy server - and busy servers are also the ones most likely to want SSD - I recently watched chug through 16MB of WAL every 3 seconds, or about 450GB/day. Now, you're right that those systems also aren't running with a tiny amount of flash, either. But the write volume scales along with the size, too. If you're heavily updating records in particular, the WAL volume can be huge relative to the drive space needed to store the result.

For that sort of workload, it does sound like SSD isn't ideal.

Although if the Intel figure of 2PB is to be believed, it'd still take >200 years to wear the drives out (assuming you'll need at least ten 64 GB disks just to store the DB on).
(Reference: http://download.intel.com/design/flash/nand/extreme/319984.pdf )

But yeah, much sooner on the cheaper MLC drives, as I understand it.. seems harder to get info out of Intel on their max write amounts.

> As for the idea that I'm just singling out one anecdote, I have terabytes of lost data on multiple systems behind my negativity here. I was just pointing out a public failure that included some post-mortem I liked. I'm not sure if I have any happy customers who were early adopters of regular SLC or MLC drives really; the disaster rate is very close to 100% for the first few generations of those drives I've seen, and I've been around 50-ish of them. I'm hoping the current models shipping now are better; getting the write cache stuff sorted out better will be a big help. But it's been a scary technology for database use so far. The published numbers from the manufacturer literature are a very rosy best case when you're hitting the disk with this type of workload.

Ah, thanks - it's interesting to hear more of your experiences there.

I do note that more recent SSD drives have made many improvements for durability - now some of them are doing a sort of internal RAID over their NAND chips, so that if/when bits of them die, you don't lose your data, and can keep operating.
(Reference: http://www.anandtech.com/show/4244/intel-ssd-320-review/2 )

It sounds like the technology is more mature now, but I guess we won't know until more people have been using it, successfully, for a while..

-Toby
On Fri, Apr 29, 2011 at 1:23 AM, Toby Corkindale <toby.corkindale@strategicdata.com.au> wrote:
> On 29/04/11 16:35, Greg Smith wrote:
>> On 04/26/2011 10:30 AM, Toby Corkindale wrote:
>>> I see Intel is/was claiming their SLC SSDs had a *minimum* lifetime of 2PB in writes for their 64GB disks; for your customer with a 50GB db and 20GB/day of WAL, that would work out at a minimum lifetime of about 100,000 days, or roughly 273 years!
>>> The cheaper "consumer grade" MLC drives should still last a minimum of 5 years at 20GB/day according to their literature. (And what I found was fairly out of date.)
>>> That doesn't seem too bad to me - I don't think I've worked anywhere that keeps their traditional spinning disks in service beyond 5 years either.
>>
>> The comment I made there was that the 20GB/day system was a very small customer. One busy server - and busy servers are also the ones most likely to want SSD - I recently watched chug through 16MB of WAL every 3 seconds, or about 450GB/day. Now, you're right that those systems also aren't running with a tiny amount of flash, either. But the write volume scales along with the size, too. If you're heavily updating records in particular, the WAL volume can be huge relative to the drive space needed to store the result.
>
> For that sort of workload, it does sound like SSD isn't ideal.
> Although if the Intel figure of 2PB is to be believed, it'd still take >200 years to wear the drives out (assuming you'll need at least ten 64 GB disks just to store the DB on).
> (Reference: http://download.intel.com/design/flash/nand/extreme/319984.pdf )

I think you misunderstood. He's not storing 480GB on the drives, that's how much WAL is moving across it. It could easily be a single 80GB SSD drive or something like that.
On Thu, Apr 28, 2011 at 7:00 PM, Mark Felder <feld@feld.me> wrote: > On Thu, 28 Apr 2011 17:27:04 -0500, Basil Bourque <basil.list@me.com> wrote: > >> So, while I can't specifically recommend their products, I certainly >> suggest considering them. > > Customer of ours is probably lurking on here. We host their servers in our > datacenter -- we had a UPS go "pop" after an amazing surge and their servers > all went down (weren't paying for N+1 power). They had several FusionIO > cards in servers running Postgres and experienced zero corruption. YMMV. > Yeah, we're running fusion-io on some pretty heavily traffic'd servers, and the performance has been good, and durability there when needed. It's certainly worth checking out for those investigating these options. Robert Treat conjecture: xzilla.net consulting: omniti.com
On 04/29/2011 06:42 AM, Scott Marlowe wrote:
> I think you misunderstood. He's not storing 480GB on the drives, that's how much WAL is moving across it. It could easily be a single 80GB SSD drive or something like that.

Right; that's why you don't necessarily get saved by the fact that larger databases must go onto more flash cells, too. Sometimes, yes, but not always. The WAL is really close to a worst case for flash: lots of redundant information that's constantly overwritten. It's the last thing you want to consider putting onto SSD. There's a good reason why so many of the "enterprise" SSDs try to distinguish themselves with redundancy and wear leveling advancements; it's so this sort of workload doesn't kill them.

Combine that workload possibility with the limitations of MLC flash, and you can see why the lifetimes actually are a serious concern in some situations. Not all of them, of course, but this is why I recommend things like directly measuring your WAL volume.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books