Thread: Completely un-tuned Postgresql benchmark results: SSD vs desktop HDD
If anyone is interested, I just completed a series of benchmarks of stock PostgreSQL running on a normal HDD vs an SSD.

If you don't want to read the post, the summary is that SSDs are 5 to 7 times faster than a 7200 RPM HDD under a pgbench load:

http://it-blog.5amsolutions.com/2010/08/performance-of-postgresql-ssd-vs.html
Is this what everyone else is seeing?
Thanks!
SSDs actually vary quite a bit with typical Postgres benchmark workloads. Many of them also do not guarantee that data which has been sync'd will not be lost if power fails (most hard drives, with a sane OS and file system, do).
On Aug 7, 2010, at 4:47 PM, Michael March wrote:
> If anyone is interested I just completed a series of benchmarks of stock
> Postgresql running on a normal HDD vs a SSD.
>
> If you don't want to read the post, the summary is that SSDs are 5 to 7
> times faster than a 7200RPM HDD drive under a pgbench load.
>
> Is this what everyone else is seeing?
>
> Thanks!
> SSD's actually vary quite a bit with typical postgres benchmark workloads.
You mean various SSDs from different vendors? Or are you saying the same SSD model might vary in performance from drive to drive?
> Many of them also do not guarantee data that has been sync'd will not be lost if power fails (most hard drives with a sane OS and file system do).
What feature does an SSD need to have to ensure that sync'd data is indeed written to the SSD in the case of power loss?
On Sat, Aug 7, 2010 at 5:47 PM, Michael March <mmarch@gmail.com> wrote:
> If anyone is interested I just completed a series of benchmarks of stock
> Postgresql running on a normal HDD vs a SSD.
> If you don't want to read the post, the summary is that SSDs are 5 to 7
> times faster than a 7200RPM HDD drive under a pgbench load.
> http://it-blog.5amsolutions.com/2010/08/performance-of-postgresql-ssd-vs.html
>
> Is this what everyone else is seeing?
> Thanks!

It's a good first swing, but I'd be interested in seeing how it runs with various numbers of clients, and how it runs as the number of clients goes past optimal. I.e. a nice smooth downward trend or a horrible drop-off for whichever drives.

--
To understand recursion, one must first understand recursion.
Yeah, I was thinking the same thing.

I need to automate the tests so I can incrementally increase the scaling of the seed tables and the number of simultaneous clients over time. Put another way, I need to do A LOT more tests that will gently increment all the testable factors one small step at a time.
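A minimal sketch of that kind of sweep, assuming pgbench is on the PATH and a database named pgbench_test exists (both hypothetical names, not from the thread); by default it only prints the commands it would run:

```shell
#!/bin/sh
# Sketch of an automated pgbench sweep over scale factors and client counts.
# The DB name and the ranges below are made-up examples.
DB=pgbench_test

gen_runs() {
    for scale in 10 100 1000; do
        for clients in 1 2 4 8 16 32 64; do
            # -i rebuilds the seed tables at the given scale; the second
            # command then runs 1000 transactions per client at that
            # concurrency level.
            echo "pgbench -i -s $scale $DB && pgbench -c $clients -t 1000 $DB"
        done
    done
}

# Print the planned runs; pipe the output through `sh` to actually execute.
gen_runs
```

Logging each run's tps line alongside its scale and client count makes it easy to spot where throughput falls off a cliff as clients go past optimal.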
On Sun, Aug 8, 2010 at 12:49 AM, Michael March <mmarch@gmail.com> wrote:
>> SSD's actually vary quite a bit with typical postgres benchmark workloads.
>
> You mean various SSDs from different vendors? Or are you saying the same SSD
> model might vary in performance from drive to drive?
>
>> Many of them also do not guarantee data that has been sync'd will not be
>> lost if power fails (most hard drives with a sane OS and file system do).
>
> What feature does an SSD need to have to ensure that sync'd data is indeed
> written to the SSD in the case of power loss?

A big freaking capacitor and the ability to notice power's been cut and start writing out the cache. There are a few that have it that are coming out right about now. There was a post about one such drive a few days ago; it was like 50G and $450 or so, so not cheap, but not that bad compared to the $7000 drive bays with 16 15k6 drives I've used in the past to get good performance (3 to 4k tps).

--
To understand recursion, one must first understand recursion.
Michael March wrote:
> If anyone is interested I just completed a series of benchmarks of
> stock Postgresql running on a normal HDD vs a SSD.
>
> If you don't want to read the post, the summary is that SSDs are 5 to
> 7 times faster than a 7200RPM HDD drive under a pgbench load.
>
> http://it-blog.5amsolutions.com/2010/08/performance-of-postgresql-ssd-vs.html
>
> Is this what everyone else is seeing?

I tested an SSD with a capacitor and posted conclusions here:

http://archives.postgresql.org/pgsql-performance/2010-07/msg00449.php

regards,
Yeb Havinga
On Aug 7, 2010, at 11:49 PM, Michael March wrote:
>> SSD's actually vary quite a bit with typical postgres benchmark workloads.
>
> You mean various SSDs from different vendors? Or are you saying the same SSD model might vary in performance from drive to drive?
Model to model (more specifically, controller chip to controller chip; i.e., most 'Indilinx Barefoot' controller based SSDs perform similarly).
>> Many of them also do not guarantee data that has been sync'd will not be
>> lost if power fails (most hard drives with a sane OS and file system do).
>
> What feature does an SSD need to have to ensure that sync'd data is indeed written to the SSD in the case of power loss?
Either properly flush to storage when the OS / file system asks for it (most SSDs don't; most hard drives do), or have a supercapacitor to flush data on power loss.
The former can be achieved by turning off the write cache on some drives (such as Intel's X25-M and -E), but hurts performance.
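On Linux, the drive's volatile write cache can be inspected and disabled with hdparm. The device name below is a placeholder; the helper only prints the commands, since in real use they must be run as root against the actual drive, and carry the performance cost just mentioned:

```shell
#!/bin/sh
# Sketch: build the hdparm invocations for checking and disabling a drive's
# volatile write cache. /dev/sda is a placeholder example device.
write_cache_cmds() {
    dev=$1
    echo "hdparm -W $dev"     # report the current write-cache setting
    echo "hdparm -W0 $dev"    # disable the write cache (-W1 re-enables it)
}

write_cache_cmds /dev/sda
```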
Also, the amount of data at risk in a power loss varies between drives. For Intel's drives, it's a small chunk of data (< 256K). For some other drives, the cache can be over 30MB of outstanding writes.
For some workloads this is acceptable -- not every application is doing financial transactions. Not every part of the system needs to be on an SSD either -- the WAL, and various table spaces can all have different data integrity and performance requirements.
On Mon, 2010-08-09 at 09:49 -0700, Scott Carey wrote:
> Also, the amount of data at risk in a power loss varies between
> drives. For Intel's drives, its a small chunk of data ( < 256K). For
> some other drives, the cache can be over 30MB of outstanding writes.
>
> For some workloads this is acceptable -- not every application is
> doing financial transactions. Not every part of the system needs to
> be on an SSD either -- the WAL, and various table spaces can all have
> different data integrity and performance requirements.

I don't think it makes sense to speak about the data integrity of a drive in terms of the amount of data at risk, especially with a DBMS. Depending on which 256K you lose, you might as well lose your entire database.

That may be an exaggeration, but the point is that it's not as simple as "this drive is only risking 256K data loss per outage".

Regards,
Jeff Davis
Scott Carey wrote:
> Also, the amount of data at risk in a power loss varies between
> drives. For Intel's drives, its a small chunk of data ( < 256K). For
> some other drives, the cache can be over 30MB of outstanding writes.
> For some workloads this is acceptable

No, it isn't ever acceptable. You can expect the type of data loss you get when a cache fails to honor write flush calls to result in catastrophic database corruption. It's not "I lost the last few seconds"; it's "the database is corrupted and won't start" after a crash. This is why we pound on this topic on this list. An SSD that fails to honor flush requests is completely worthless for anything other than toy databases. You can expect significant work to recover any portion of your data after the first unexpected power loss under heavy write load in this environment, during which you're down. We do database corruption recovery at 2ndQuadrant; while I can't talk about the details of some recent incidents, I am not speaking theoretically when I warn about this.

Michael, I would suggest you read http://www.postgresql.org/docs/current/static/wal-reliability.html and link to it at the end of your article. You are recommending that people consider a configuration that will result in their data being lost. That can be acceptable, if for example your data is possible to recreate from backups or the like. But people should be extremely clear that that trade-off is happening, and your blog post is not doing that yet. Part of the reason for the bang per buck you're seeing here is that cheap SSDs are cheating.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us
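For reference, the crash-safety knobs that page discusses live in postgresql.conf. The setting names below are the real ones; whether relaxing synchronous_commit is tolerable depends entirely on the application:

```
# postgresql.conf -- crash-safety settings covered on the wal-reliability page
fsync = on                # never turn off on storage that lies about flushes
synchronous_commit = on   # 'off' risks losing recent commits, not corruption
full_page_writes = on     # protects against torn pages after a crash
```

Note the asymmetry: turning synchronous_commit off only risks losing the last few transactions, while an SSD that ignores flush requests defeats fsync itself and risks the corruption Greg describes.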
Jeff Davis wrote:
> Depending on which 256K you lose, you might as well lose your entire
> database.

Let's be nice and assume that you only lose one 8K block because of the SSD write cache; that's not so bad, right? Guess what--you could easily be the next lucky person who discovers the block corrupted is actually in the middle of the pg_class system catalog, where the list of tables in the database is at! Enjoy getting your data back again with that piece missing. It's really a fun time, I'll tell you that.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us
On 8/10/2010 12:21 PM, Greg Smith wrote:
> Scott Carey wrote:
>> Also, the amount of data at risk in a power loss varies between
>> drives. For Intel's drives, its a small chunk of data ( < 256K).
>> For some other drives, the cache can be over 30MB of outstanding writes.
>> For some workloads this is acceptable
>
> No, it isn't ever acceptable. You can expect the type of data loss
> you get when a cache fails to honor write flush calls results in
> catastrophic database corruption. [...]

What about putting indexes on them? If the drive fails and drops writes on those, they could be rebuilt - assuming your system can function without the index(es) temporarily.

--
Brad Nicholson  416-673-4106
Database Administrator, Afilias Canada Corp.
Brad Nicholson wrote:
> What about putting indexes on them? If the drive fails and drops
> writes on those, they could be rebuilt - assuming your system can
> function without the index(es) temporarily.

You could put indices on them, but as noted above, Greg is SPOT ON. ANY disk that says "write is complete" when it really is not is entirely unsuitable for ANY real database use. It is simply a matter of time before you have a failure of some sort that results in catastrophic data loss. If you're LUCKY the database won't start and you know you're in trouble. If you're UNLUCKY the database DOES start, but there is undetected and unrecoverable data corruption somewhere inside the data tables, which you WILL discover at the most inopportune moment (like when you desperately NEED that business record for some reason.)

You cannot put either the tables or the logs on such a drive without running the risk of a data corruption problem that WILL lose data and MAY be catastrophic, depending on exactly what fails when. While it is possible to recover that which is not damaged from a database that has corruption like this, it simply is not possible to recover data that never made it to the disk - no matter what you do - and the time and effort expended when things go wrong (not to mention money, if you have to bring in someone with specialized skills you do not possess) are extreme.

Don't do it.

-- Karl
On Tue, Aug 10, 2010 at 12:13 PM, Karl Denninger <karl@denninger.net> wrote:
> ANY disk that says "write is complete" when it really is not is entirely
> unsuitable for ANY real database use. It is simply a matter of time

What about read only slaves where there's a master with 100+ spinning hard drives "getting it right" and you need a half dozen or so read slaves? I can imagine that being ok, as long as you don't restart a server after a crash without checking on it.
Brad Nicholson wrote:
> What about putting indexes on them? If the drive fails and drops
> writes on those, they could be rebuilt - assuming your system can
> function without the index(es) temporarily.

Dumping indexes on SSD is one of the better uses for them, presuming you can survive what is likely to be an outage from a "can the site handle full load?" perspective while they rebuild after a crash. As I'm sure Brad is painfully aware of already, index rebuilding in PostgreSQL can take a while.

To spin my broken record here again, the main thing to note when you consider that--relocating indexes onto SSD--is that the ones you are most concerned about the performance of were likely already sitting in RAM anyway, meaning the SSD speedup doesn't help reads much. So the giant performance boost just isn't there in that case.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us
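One way to do that split, sketched with made-up tablespace, table, and index names (the statements themselves are standard PostgreSQL DDL); the helper just emits the SQL so it can be reviewed before piping into psql:

```shell
#!/bin/sh
# Sketch: generate DDL that places a rebuildable index on an SSD-backed
# tablespace while the table itself stays on reliable storage.
# All object names and the mount path are hypothetical examples.
index_on_ssd_sql() {
    cat <<'EOF'
CREATE TABLESPACE ssd_space LOCATION '/mnt/ssd/pg_tblspc';
-- new index created directly on the SSD tablespace:
CREATE INDEX reports_batch_idx ON reports (batch_id) TABLESPACE ssd_space;
-- or relocate an existing index (rewrites it on the new tablespace):
ALTER INDEX reports_batch_idx SET TABLESPACE ssd_space;
EOF
}

index_on_ssd_sql    # to apply for real: index_on_ssd_sql | psql mydb
```

If the SSD dies, the index can be rebuilt from the table data with REINDEX, which is exactly the recovery property Brad is relying on.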
Scott Marlowe wrote:
> What about read only slaves where there's a master with 100+ spinning
> hard drives "getting it right" and you need a half dozen or so read
> slaves?

A read-only slave isn't read-only, is it?

I mean, c'mon - how does the data get there?
IF you mean "a server that only accepts SELECTs, does not accept UPDATEs or INSERTs, and on a crash **reloads the entire database from the master**", then ok.
Most people who will do this won't reload it after a crash. They'll "inspect" the database and say "ok", and put it back online. Bad Karma will ensue in the future.
Incidentally, that risk is not theoretical either (I know about this one from hard experience. Fortunately the master was still ok and I was able to force a full-table copy.... I didn't like it as the database was a few hundred GB, but I had no choice.)
-- Karl
On 8/10/2010 2:28 PM, Greg Smith wrote:
> Dumping indexes on SSD is one of the better uses for them, presuming
> you can survive what is likely to be an outage from a "can the site
> handle full load?" perspective while they rebuild after a crash. [...]
> So the giant performance boost just isn't there in that case.

The case where I'm thinking they may be of use is for indexes you can afford to lose. I'm thinking of ones that are needed by nightly batch jobs, downstream systems or reporting - the sorts of things that you can turn off during a rebuild, and where the data sets are not likely to be in cache. We have a few such cases, but we don't need the speed of SSDs for them.

Personally, I wouldn't entertain any SSD without a capacitor backing it for anything, even indexes. Not worth the hassle to me.

--
Brad Nicholson  416-673-4106
Database Administrator, Afilias Canada Corp.
On 8/10/2010 2:38 PM, Karl Denninger wrote:
> I mean, c'mon - how does the data get there?

A valid case is a Slony replica if used for query offloading (not for DR). It's considered a read-only subscriber from the perspective of Slony, as only Slony can modify the data (although you are technically correct, it is not read-only; "controlled write" may be more accurate).

In case of failure, a rebuild + resubscribe gets you back to the same consistency. If you have high IO requirements and don't have the budget to rack up extra disk arrays to meet them, it could be an option.
-- Brad Nicholson 416-673-4106 Database Administrator, Afilias Canada Corp.
Brad Nicholson wrote:
> A valid case is a Slony replica if used for query offloading (not for
> DR).

CAREFUL with that model and beliefs.

Specifically, the following will hose you without warning:
1. SLONY gets a change on the master.
2. SLONY commits it to the (read-only) slave.
3. Confirmation comes back to the master that the change was propagated.
4. Slave CRASHES without actually committing the changed data to stable storage.
When the slave restarts it will not know that the transaction was lost. Neither will the master, since it was told that it was committed. Slony will happily go on its way and replicate forward, without any indication of a problem - except that on the slave, there are one or more transactions that are **missing**.
Some time later you issue an update that goes to the slave, but the change previously lost causes the slave commit to violate referential integrity. SLONY will fail to propagate that change and all behind it - it effectively locks at that point in time.
You can recover from this by dropping the slave from replication and re-inserting it, but that forces a full-table copy of everything in the replication set. The bad news is that the queries to the slave in question may have been returning erroneous data for some unknown period of time prior to the lockup in replication (which hopefully you detect reasonably quickly - you ARE watching SLONY queue depth with some automated process, right?)
I can both cause this in the lab and have had it happen in the field. It's a nasty little problem that bit me on a series of disks that claimed to have write caching off, but in fact did not. I was very happy that the data on the master was good at that point, as if I had needed to failover to the slave (thinking it was a "good" copy) I would have been in SERIOUS trouble.
-- Karl
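That queue-depth check can be driven from cron against Slony's sl_status view. The cluster schema name ("_mycluster") and the thresholds below are hypothetical; the view and lag columns come with a standard Slony-I install, but verify the names against your version:

```shell
#!/bin/sh
# Sketch: emit a lag-monitoring query for a Slony cluster. The schema name
# "_mycluster" and the alert thresholds are placeholder examples.
lag_check_sql() {
    cat <<'EOF'
SELECT st_origin, st_received, st_lag_num_events, st_lag_time
FROM _mycluster.sl_status
WHERE st_lag_num_events > 100
   OR st_lag_time > interval '5 minutes';
EOF
}

lag_check_sql   # e.g. from cron: lag_check_sql | psql mydb --tuples-only
```

Any rows returned mean a subscriber is falling behind (or, as in Karl's scenario, has locked up entirely), which is the signal to investigate before trusting its query results.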
On 8/10/2010 3:28 PM, Karl Denninger wrote:
> Specifically, the following will hose you without warning:

What will hose you is assuming that your data will be okay in the case of a failure, which is a very bad assumption to make in the case of unreliable SSDs. You are assuming I am implying that these should be treated like reliable media - I am not.

> 1. SLONY gets a change on the master.
> 2. SLONY commits it to the (read-only) slave.
> 3. Confirmation comes back to the master that the change was propagated.
> 4. Slave CRASHES without actually committing the changed data to stable
> storage.

In case of failure, you need to assume data loss until proven otherwise. If there is a problem, rebuild.

> When the slave restarts it will not know that the transaction was lost.
> Neither will the master, since it was told that it was committed. Slony
> will happily go on its way and replicate forward, without any indication
> of a problem - except that on the slave, there are one or more
> transactions that are **missing**.

Correct.

> Some time later you issue an update that goes to the slave, but the
> change previously lost causes the slave commit to violate referential
> integrity. SLONY will fail to propagate that change and all behind it -
> it effectively locks at that point in time.

It will lock data flow to that subscriber, but not to others.

> You can recover from this by dropping the slave from replication and
> re-inserting it, but that forces a full-table copy of everything in the
> replication set. The bad news is that the queries to the slave in
> question may have been returning erroneous data for some unknown period
> of time prior to the lockup in replication.

There are ways around that - run two subscribers and redirect your queries on failure. Don't bring up the failed replica until it is verified or rebuilt.

> I can both cause this in the lab and have had it happen in the field.

It's very easy to cause those sorts of problems.

What I am saying is that the technology can have a use, if you are aware of the sharp edges, and can both work around them and live with them. Everything you are citing is correct, but it implies that these drives are being blindly thrown in without understanding the risks and mitigating them.

I'm also not suggesting that this is a configuration I would endorse, but it could potentially save a lot of money in certain use cases.
-- Brad Nicholson 416-673-4106 Database Administrator, Afilias Canada Corp.
On Tue, Aug 10, 2010 at 12:38 PM, Karl Denninger <karl@denninger.net> wrote:
> Scott Marlowe wrote:
>> What about read only slaves where there's a master with 100+ spinning
>> hard drives "getting it right" and you need a half dozen or so read
>> slaves?
>
> A read-only slave isn't read-only, is it?
>
> I mean, c'mon - how does the data get there?

Well, duh. However, what I'm looking at is having two big servers in failover running on solid reliable hardware, and then a small army of read-only Slony slaves that are used for things like sending user RSS feeds and creating weekly reports and such. These 1U machines with 12 to 24 cores and a single SSD drive are "disposable" in the sense that if they ever crash, there's a simple script to run that reinits the db and then subscribes them to the set.

My point being, no matter how terrible an idea a certain storage media is, there's always a use case for it. Even if it's very narrow.
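A sketch of what such a re-init script might look like. The database name, cluster name, node ids, and conninfo strings are all placeholder examples, and the slonik grammar should be checked against your Slony-I version; the script only prints its steps so they can be reviewed:

```shell
#!/bin/sh
# Sketch: rebuild a disposable SSD-backed read slave after a crash by
# re-initializing the database and re-subscribing it to the Slony set.
# Every name, id, and host below is a hypothetical example.
rebuild_cmds() {
    db=$1
    echo "dropdb $db"
    echo "createdb $db"
    echo "pg_dump -s -h master $db | psql $db"   # recreate schema only
    # Slony then does the full data copy as part of the subscription:
    cat <<'EOF'
slonik <<'SLONIK'
cluster name = mycluster;
node 1 admin conninfo = 'dbname=appdb host=master';
node 2 admin conninfo = 'dbname=appdb host=slave';
subscribe set (id = 1, provider = 1, receiver = 2, forward = no);
SLONIK
EOF
}

rebuild_cmds appdb    # prints the rebuild steps; run them to rebuild for real
```

Because the subscribe step performs a full copy from the provider, the slave comes back consistent no matter what the cheap SSD lost in the crash, which is the property that makes the "disposable slave" model defensible.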
From: Christopher Browne
On Tue, Aug 10, 2010 at 3:52 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> My point being, no matter how terrible an idea a certain storage media
> is, there's always a use case for it. Even if it's very narrow.

The trouble is, if extra subscribers induce load on the "master," which they presumably will, then that sliver of "use case" may very well get obscured by the cost, such that the sliver should be treated as not existing :-(.

--
http://linuxfinances.info/info/linuxdistributions.html
On Tue, Aug 10, 2010 at 2:00 PM, Christopher Browne <cbbrowne@gmail.com> wrote:
> The trouble is, if extra subscribers induce load on the "master,"
> which they presumably will, then that sliver of "use case" may very
> well get obscured by the cost, such that the sliver should be treated
> as not existing :-(.

One master, one slave: the master handles all writes, and the slave handles all of the other subscribers. I've run a setup like this with as many as 8 or so slaves at the bottom of the pile with no problems at all.
On Aug 10, 2010, at 9:21 AM, Greg Smith wrote:
> Scott Carey wrote:
>> Also, the amount of data at risk in a power loss varies between
>> drives. For Intel's drives, it's a small chunk of data (< 256K). For
>> some other drives, the cache can be over 30MB of outstanding writes.
>> For some workloads this is acceptable
>
> No, it isn't ever acceptable. You can expect the type of data loss you
> get when a cache fails to honor write flush calls results in
> catastrophic database corruption. It's not "I lost the last few
> seconds";

I never said it was.

> it's "the database is corrupted and won't start" after a crash.

Which is sometimes acceptable. There is NO GUARANTEE that you won't
lose data, ever. An increase in the likelihood is an acceptable
tradeoff in some situations, especially when it is small. On ANY power
loss event, with or without battery-backed caches and such, you should
proactively do a consistency check on the system. With less reliable
hardware, that task becomes much more of a burden, and is much more
likely to require restoring data from somewhere.

What is the likelihood that your RAID card fails, or that the battery
that reported 'good health' only lasts 5 minutes and you lose data
before power is restored? What is the likelihood of human error? Not
that far off from the likelihood of power failure in a datacenter with
redundant power. One MUST have a DR plan. Never assume that your
perfect hardware won't fail.

> This is why we pound on this topic on this list. A SSD that
> fails to honor flush requests is completely worthless for anything other
> than toy databases.

Overblown. Not every DB and use case is a financial application or
business-critical app. Many are not toys at all. Slave, read-only DBs
(or simply subset tablespaces) ...

Indexes. (per application, schema)
Tables. (per application, schema)
System tables / indexes.
WAL.

Each has different reliability requirements and consequences from
losing recently written data. Losing less than 8K can be fatal to the
WAL, or table data. Corrupting some tablespaces is not a big deal.
Corrupting others is catastrophic. The problem with the assertion that
this hardware is worthless is that it implies that every user, every
use case, is at the far end of the reliability requirement spectrum.
Yes, that can be a critical requirement for many, perhaps most, DBs.
But there are many uses for slightly unsafe storage systems.

> You can expect significant work to recover any
> portion of your data after the first unexpected power loss under heavy
> write load in this environment, during which you're down. We do
> database corruption recovery at 2ndQuadrant; while I can't talk about
> the details of some recent incidents, I am not speaking theoretically
> when I warn about this.

I've done the single-user-mode, recover-system-tables-by-hand thing
myself at 4 AM, on a system with battery-backed RAID 10, redundant
power, etc. RAID cards die, and 10TB recovery times from backup are
long.

It's a game of balancing your data loss tolerance with the likelihood
of power failure. Both of these variables are highly variable, and not
just with 'toy' dbs. If you know what you are doing, you can use 'fast
but not completely safe' storage for many things safely. Chance of
loss is NEVER zero; do not assume that 'good' hardware is flawless.

Imagine a common internet case where synchronous_commit=false is fine.
Recovery from backups is a pain (but a daily snapshot is taken of the
important tables, and weekly for easily recoverable other stuff). If
you expect one power-related failure every 2 years, it might be
perfectly reasonable to use 'unsafe' SSDs in order to support high
transaction load, on the risk that that once-every-2-years downtime is
12 hours long instead of 30 minutes, and includes losing up to a day's
information. Applications like this exist all over the place.
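For what it's worth, synchronous_commit is an ordinary GUC, so the loss window can be accepted only where it is tolerable. A minimal sketch; the table names here are hypothetical:

```sql
-- Relax the fsync-before-commit-ack for this session only. The WAL is
-- still written; only the flush wait at COMMIT is deferred, so a crash
-- can lose the last fraction of a second of acknowledged commits.
SET synchronous_commit = off;
INSERT INTO page_views (url, seen_at) VALUES ('/home', now());

-- A critical transaction can override it back for itself:
BEGIN;
SET LOCAL synchronous_commit = on;
UPDATE accounts SET balance = balance - 100 WHERE id = 42;
COMMIT;  -- waits for the WAL flush as usual
```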
> --
> Greg Smith 2ndQuadrant US Baltimore, MD
> PostgreSQL Training, Services and Support
> greg@2ndQuadrant.com www.2ndQuadrant.us
On Aug 10, 2010, at 11:28 AM, Greg Smith wrote:
> Brad Nicholson wrote:
>> What about putting indexes on them? If the drive fails and drops
>> writes on those, they could be rebuilt - assuming your system can
>> function without the index(es) temporarily.
>
> Dumping indexes on SSD is one of the better uses for them, presuming you
> can survive what is likely to be an outage from a "can the site handle
> full load?" perspective while they rebuild after a crash. As I'm sure
> Brad is painfully aware of already, index rebuilding in PostgreSQL can
> take a while. To spin my broken record here again, the main thing to
> note when you consider that--relocate indexes onto SSD--is that the ones
> you are most concerned about the performance of were likely to be
> already sitting in RAM anyway, meaning the SSD speedup doesn't help
> reads much. So the giant performance boost just isn't there in that case.

For an OLTP-type system, yeah. But for DW/OLAP and batch processing the
gains are pretty big. Those indexes get kicked out of RAM and then
pulled back in a lot. I'm talking about a server with 72GB of RAM that
can't keep enough indexes in memory to avoid a lot of random access.
Putting the indexes on an SSD has lowered the random I/O load on the
other drives a lot, letting them get through sequential scans a lot
faster.

Estimated power failure: once every 18 months (mostly due to human
error). Rebuild indexes offline for 40 minutes every 18 months? No
problem.

> --
> Greg Smith 2ndQuadrant US Baltimore, MD
> PostgreSQL Training, Services and Support
> greg@2ndQuadrant.com www.2ndQuadrant.us
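Relocating indexes onto the SSD is done with a tablespace; a minimal sketch (the mount point, table, and index names are hypothetical):

```sql
-- Create a tablespace on the SSD mount point.
CREATE TABLESPACE ssd_idx LOCATION '/mnt/ssd/pg_indexes';

-- Put a new index there explicitly...
CREATE INDEX orders_created_idx ON orders (created_at) TABLESPACE ssd_idx;

-- ...or move an existing one (this rewrites the index and takes a lock).
ALTER INDEX orders_customer_idx SET TABLESPACE ssd_idx;

-- After a crash that may have corrupted the SSD, rebuild from table data:
REINDEX INDEX orders_created_idx;
```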
On Aug 10, 2010, at 11:38 AM, Karl Denninger wrote:
> Scott Marlowe wrote:
>> On Tue, Aug 10, 2010 at 12:13 PM, Karl Denninger <karl@denninger.net> wrote:
>>> ANY disk that says "write is complete" when it really is not is entirely
>>> unsuitable for ANY real database use. It is simply a matter of time
>>
>> What about read only slaves where there's a master with 100+ spinning
>> hard drives "getting it right" and you need a half dozen or so read
>> slaves? I can imagine that being ok, as long as you don't restart a
>> server after a crash without checking on it.
>
> A read-only slave isn't read-only, is it?
>
> I mean, c'mon - how does the data get there?
>
> IF you mean "a server that only accepts SELECTs, does not accept UPDATEs
> or INSERTs, and on a crash **reloads the entire database from the
> master**", then ok.
Depends on your tablespace setup and schema usage pattern.
If:
* 90% of your data tables are partitioned by date, and untouched a week after insert. Partitions are backed up incrementally.
* The remaining 10% of it is backed up daily, and of that, 9% can be re-generated from data elsewhere if data is lost.
* System catalog and wal are on 'safest of safe' hardware.
Then your 'bulk' data on a slave can be on less than flawless hardware. Simply restore the tables from the last week from the master or backup when the (rare) power failure occurs. The remaining data is safe, since it is not written to.
Split up your 10% of non-date partitioned data into what needs to be on safe hardware and what does not (maybe some indexes, etc).
Most of the time, the incremental cost of getting a BBU is too small to not do it, so the above hardly applies. But if you have data that is known to be read-only, you can do many unconventional things with it safely.
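The layout above can be sketched with tablespaces and (8.x-style) inheritance partitions. All paths and names below are hypothetical, and WAL placement is handled separately (e.g. at initdb time or via a pg_xlog symlink onto the safe array):

```sql
-- Two tablespaces: one on 'safest of safe' storage, one on the risky SSD.
CREATE TABLESPACE safe LOCATION '/mnt/raid10_bbu/pg';
CREATE TABLESPACE ssd  LOCATION '/mnt/ssd/pg';

-- The parent table (and anything actively written) lives on safe storage.
CREATE TABLE events (id bigint, payload text, created date) TABLESPACE safe;

-- Closed, read-only weekly partitions can sit on the SSD; if it dies,
-- they are restored from the master or the incremental backup.
CREATE TABLE events_2010_w31 (
    CHECK (created >= '2010-08-02' AND created < '2010-08-09')
) INHERITS (events) TABLESPACE ssd;
```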
> Most people who will do this won't reload it after a crash. They'll
> "inspect" the database and say "ok", and put it back online. Bad Karma
> will ensue in the future.
Anyone going with something unconventional better know what they are doing and not just blindly plug it in and think everything will be OK. I'd never recommend unconventional setups for a user that wasn't an expert and understood the tradeoff.
> Incidentally, that risk is not theoretical either (I know about this one
> from hard experience. Fortunately the master was still ok and I was able
> to force a full-table copy.... I didn't like it as the database was a
> few hundred GB, but I had no choice.)
Been there with 10TB with hardware that should have been perfectly safe. 5 days of copying, and wishing that pg_dump supported lzo compression so that the dump portion had a chance at keeping up with the much faster restore portion with some level of compression on to save the copy bandwidth.
> -- Karl
Scott Carey wrote:
>
> On Aug 10, 2010, at 11:38 AM, Karl Denninger wrote:
>
> .....
>>
>> Most people who will do this won't reload it after a crash. They'll
>> "inspect" the database and say "ok", and put it back online. Bad
>> Karma will ensue in the future.
>
> Anyone going with something unconventional better know what they are
> doing and not just blindly plug it in and think everything will be OK.
> I'd never recommend unconventional setups for a user that wasn't an
> expert and understood the tradeoff.

True.

>>
>> Incidentally, that risk is not theoretical either (I know about this
>> one from hard experience. Fortunately the master was still ok and I
>> was able to force a full-table copy.... I didn't like it as the
>> database was a few hundred GB, but I had no choice.)
>
> Been there with 10TB with hardware that should have been perfectly
> safe. 5 days of copying, and wishing that pg_dump supported lzo
> compression so that the dump portion had a chance at keeping up with
> the much faster restore portion with some level of compression on to
> save the copy bandwidth.

Pipe it through ssh -C

PS: This works for SLONY and Bucardo too - set up a tunnel and then
change the port temporarily. This is especially useful when the DB
being COPY'd across has big fat honking BYTEA fields in it, which
otherwise expand about 400% - or more - on the wire.

-- Karl
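A minimal sketch of both tricks; the hostnames, database name, and port are hypothetical:

```shell
# 1. Compressed dump/restore pipe: pg_dump's output is compressed on the
#    wire by ssh -C and fed straight into psql on the target machine.
pg_dump bigdb | ssh -C postgres@newhost "psql bigdb"

# 2. Compressed tunnel for replication traffic (Slony/Bucardo): forward a
#    local port to the remote postmaster, with -C compressing the stream.
ssh -C -N -L 6543:localhost:5432 postgres@newhost &
# ...then temporarily point the subscriber at port 6543 instead of 5432.
```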
Scott Carey wrote:
> What is the likelihood that your RAID card fails, or that the battery
> that reported 'good health' only lasts 5 minutes and you lose data
> before power is restored? What is the likelihood of human error?

These are all things that happen sometimes, sure. The problem with the
cheap SSDs is that they happen downright often if you actually test for
it. If someone is aware of the risk and makes an informed decision,
fine. But most of the time I see articles like the one that started
this thread that are oblivious to the issue, and that's really bad.

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.us
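One common way to actually test for it is Brad Fitzpatrick's diskchecker.pl, which records acknowledged writes on a second machine while the drive under test is written to, then verifies after you pull the plug. A rough sketch of a run follows; the hostname, file path, and size are hypothetical, and the exact flags should be checked against the script itself:

```shell
# On a second machine (the logger that remembers what was acknowledged):
./diskchecker.pl -l

# On the machine with the SSD under test: write sync'd data, then cut
# power to the box mid-run.
./diskchecker.pl -s logger-host create /mnt/ssd/test_file 500

# After rebooting, verify that everything the drive acknowledged survived:
./diskchecker.pl -s logger-host verify /mnt/ssd/test_file
```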
On Aug 11, 2010, at 9:30 PM, Greg Smith wrote:
> Scott Carey wrote:
>> What is the likelihood that your RAID card fails, or that the battery
>> that reported 'good health' only lasts 5 minutes and you lose data
>> before power is restored? What is the likelihood of human error?
>
> These are all things that happen sometimes, sure. The problem with the
> cheap SSDs is that they happen downright often if you actually test for
> it. If someone is aware of the risk and makes an informed decision,
> fine. But most of the time I see articles like the one that started
> this thread that are oblivious to the issue, and that's really bad.

Agreed. There is a HUGE gap between "ooh ssd's are fast, look!" and
engineering a solution that uses them properly with all their strengths
and faults. And as 'gnuoytr' points out, there is a big difference
between an Intel SSD and, say, this thing:
http://www.nimbusdata.com/products/s-class_overview.html

> --
> Greg Smith 2ndQuadrant US Baltimore, MD
> PostgreSQL Training, Services and Support
> greg@2ndQuadrant.com www.2ndQuadrant.us
From: Arjen van der Meijden
On 13-8-2010 1:40, Scott Carey wrote:
> Agreed. There is a HUGE gap between "ooh ssd's are fast, look!" and
> engineering a solution that uses them properly with all their
> strengths and faults. And as 'gnuoytr' points out, there is a big
> difference between an Intel SSD and say, this thing:
> http://www.nimbusdata.com/products/s-class_overview.html

From the description it sounds as if it's either FreeBSD or OpenSolaris
with ZFS plus some webinterface layer. That's not a bad thing per se,
but as the site suggests it's 'only' $25k for the smallest (2.5TB?)
device. That makes it very likely that they are "off the shelf" MLC
flash drives. Given the design of the device and the pricing, they are
probably your average 2.5"-drives with 100, 200 or 400GB capacity
(maybe OCZ Vertex 2 Pro, which do have such a capacitor?), similar to
the Intel SSD you compared it to.

And then we're basically back to square one, unless the devices have a
capacitor or ZFS works better with SSD-drives to begin with (it will at
least know silent data corruption did occur).

There are of course devices that are not built on top of normal disk
form factor SSD-drives, like the RamSan devices or Sun's F5100.

Best regards,

Arjen
As a postscript to these tests, I just tried the 500GB Momentus XT
hybrid SSD/HDD drive. I had this fantasy that it would at least do
better than the 7200 RPM desktop drive.

Oh lord, my gut was wrong. The performance was inconsistent and never
over 2/3rds the performance of the slowest desktop drive.
--
<admiral>
Michael F. March ----- mmarch@gmail.com
On Sat, Aug 7, 2010 at 4:47 PM, Michael March <mmarch@gmail.com> wrote:
> If anyone is interested I just completed a series of benchmarks of stock
> Postgresql running on a normal HDD vs a SSD.
>
> If you don't want to read the post, the summary is that SSDs are 5 to 7
> times faster than a 7200RPM HDD drive under a pgbench load.
>
> Is this what everyone else is seeing?
>
> Thanks!