Thread: Plug-pull testing worked, diskchecker.pl failed
After reading the comments last week about SSDs, I did some testing of the ones we have at work - each of my test-boxes (three with SSDs, one with HDD) subjected to multiple stand-alone plug-pull tests, using pgbench to provide load. So far, there've been no instances of PostgreSQL data corruption, but diskchecker.pl reported huge numbers of errors. What exactly does this mean? Is Postgres doing something that diskchecker isn't, and is thus safe? Could data corruption occur but I've just never pulled the power out at the precise microsecond when it would cause problems? Or is it that we would lose entire transactions, but never experience corruption that the postmaster can't repair? Interestingly, disabling write-caching with 'hdparm -W 0 /dev/sda' (as per the livejournal blog[1]) reduced the SSDs' error rates without eliminating failures entirely, while on the HDD, there were no problems at all with write caching off. ChrisA
On Mon, Oct 22, 2012 at 6:17 AM, Chris Angelico <rosuav@gmail.com> wrote: > After reading the comments last week about SSDs, I did some testing of > the ones we have at work - each of my test-boxes (three with SSDs, one > with HDD) subjected to multiple stand-alone plug-pull tests, using > pgbench to provide load. So far, there've been no instances of > PostgreSQL data corruption, but diskchecker.pl reported huge numbers > of errors. What did you do to look for corruption? That PostgreSQL succeeds at going through crash-recovery and then starting up is not a good indicator that there is no corruption. Did you do something like compute the aggregates on pgbench_history and compare those aggregates to the balances in the other 3 tables? Cheers, Jeff
On Tue, Oct 23, 2012 at 6:26 AM, Jeff Janes <jeff.janes@gmail.com> wrote: > What did you do to look for corruption? That PostgreSQL succeeds at > going through crash-recovery and then starting up is not a good > indicator that there is no corruption. I fired up Postgres and looked at the logs for any signs of failure. > Did you do something like compute the aggregates on pgbench_history > and compare those aggregates to the balances in the other 3 tables? No, didn't do that. My next check will be done over the network (similar to diskchecker), with a script that fires off requests, waits for them to be confirmed committed, and then records a local copy, and will check that local copy once the server's back up again. That'll tell me if transactions are being lost. I'm kinda feeling my way in the dark here. Will check out the aggregates on pgbench_history when I get to work today; thanks for the tip! ChrisA
On Mon, Oct 22, 2012 at 12:31 PM, Chris Angelico <rosuav@gmail.com> wrote: > On Tue, Oct 23, 2012 at 6:26 AM, Jeff Janes <jeff.janes@gmail.com> wrote: >> What did you do to look for corruption? That PostgreSQL succeeds at >> going through crash-recovery and then starting up is not a good >> indicator that there is no corruption. > > I fired up Postgres and looked at the logs for any signs of failure. > >> Did you do something like compute the aggregates on pgbench_history >> and compare those aggregates to the balances in the other 3 tables? > > No, didn't do that. My next check will be done over the network > (similar to diskchecker), with a script that fires off requests, waits > for them to be confirmed committed, and then records a local copy, and > will check that local copy once the server's back up again. That'll > tell me if transactions are being lost. If you like Perl, the count.pl from this message might be a useful starting point: http://archives.postgresql.org/pgsql-hackers/2012-02/msg01227.php It was designed to check consistency after postmaster crashes, not OS crashes, so the checker runs on the same host as postgres does. Obviously for a pull-the-plug test, you need to run it on a different host, so all the DBI->connect(....) calls need to be changed to do that. > I'm kinda feeling my way in the dark here. Will check out the > aggregates on pgbench_history when I get to work today; thanks for the > tip! Here's an example with pgbench_accounts; the other 2 should look analogous. select aid, abalance, count(*) from (select aid,abalance from pgbench_accounts union all select aid, sum(delta) from pgbench_history group by aid) as foo group by aid, abalance having abalance!=0 and count(*)!=2; This should return zero rows. Any other result indicates corruption. pgbench truncates pgbench_history, but does not reset the balances to zero on the other tables. So if you want to run the test repeatedly, you have to do pgbench -i between runs, or manually reset the balance columns. Cheers, Jeff
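[For completeness, the analogous checks for the other two tables would presumably look like the following sketch (column names as in the default pgbench schema, which also records bid and tid in pgbench_history); like the accounts query, each should return zero rows:]

-- branches: bbalance should equal the sum of history deltas for that bid
select bid, bbalance, count(*) from
  (select bid, bbalance from pgbench_branches
   union all
   select bid, sum(delta) from pgbench_history group by bid) as foo
group by bid, bbalance having bbalance != 0 and count(*) != 2;

-- tellers: tbalance should equal the sum of history deltas for that tid
select tid, tbalance, count(*) from
  (select tid, tbalance from pgbench_tellers
   union all
   select tid, sum(delta) from pgbench_history group by tid) as foo
group by tid, tbalance having tbalance != 0 and count(*) != 2;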
On Mon, Oct 22, 2012 at 7:17 AM, Chris Angelico <rosuav@gmail.com> wrote: > After reading the comments last week about SSDs, I did some testing of > the ones we have at work - each of my test-boxes (three with SSDs, one > with HDD) subjected to multiple stand-alone plug-pull tests, using > pgbench to provide load. So far, there've been no instances of > PostgreSQL data corruption, but diskchecker.pl reported huge numbers > of errors. Try starting pgbench, and then, halfway through the checkpoint timeout, issue a checkpoint and WHILE the checkpoint is still running THEN pull the plug. Then after bringing the server up (assuming pg starts up) see if pg_dump generates any errors.
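[A rough sketch of that sequence from a psql session, assuming stock settings; on the 9.x releases of that era the checkpoint counters live in pg_stat_bgwriter:]

-- how long the timed-checkpoint interval is (5min by default)
SHOW checkpoint_timeout;

-- optional: watch completed checkpoints tick over while pgbench runs
SELECT checkpoints_timed, checkpoints_req, buffers_checkpoint FROM pg_stat_bgwriter;

-- roughly halfway through that interval, force a checkpoint, and pull
-- the plug while this statement is still running
CHECKPOINT;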
On Tue, Oct 23, 2012 at 9:51 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote: > On Mon, Oct 22, 2012 at 7:17 AM, Chris Angelico <rosuav@gmail.com> wrote: >> After reading the comments last week about SSDs, I did some testing of >> the ones we have at work - each of my test-boxes (three with SSDs, one >> with HDD) subjected to multiple stand-alone plug-pull tests, using >> pgbench to provide load. So far, there've been no instances of >> PostgreSQL data corruption, but diskchecker.pl reported huge numbers >> of errors. > > Try starting pgbench, and then, halfway through the checkpoint timeout, > issue a checkpoint and WHILE the checkpoint is > still running THEN pull the plug. > > Then after bringing the server up (assuming pg starts up) see if > pg_dump generates any errors. Thanks for the tip. I've been flat-out at work these past few days and haven't gotten around to testing in the middle of a checkpoint, but I have done something that might also be of interest. It's inspired by a combination of diskchecker and pgbench: a harness that puts the database under load and retains a record of what's been done. In brief: Create a table with N (eg 100) rows, then spin as fast as possible, incrementing a counter against one random row and also incrementing the "Total" counter. When the database goes down, wait for it to come up again; when it does, check against the local copy of the counters and report any discrepancies. The code's written in Pike, using the same database connection logic that we use in our actual application (well, some of our code is C++ and some is PHP, so this corresponds to one part of our app), so this is roughly representative of real usage. It's about a page or two of code: http://pastebin.com/UNTj642Y Currently, all the key parameters (database connection info (which has been censored for the pastebin version), pool size, thread count, etc) are just variables visible in the script, simpler than parsing command-line arguments. Is this a useful and plausible testing methodology? It's definitely shown up some failures. On a hard-disk, all is well as long as the write-back cache is disabled; on the SSDs, I can't make them reliable. Is a single table enough to test for corruption with? Chris Angelico
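[In SQL terms, each client in that harness does roughly the following; table and column names here are illustrative guesses, the real logic is in the Pike script at the pastebin link:]

-- one-time setup: N counter rows plus a running total
CREATE TABLE counters (id int PRIMARY KEY, hits bigint NOT NULL DEFAULT 0);
INSERT INTO counters (id) SELECT g FROM generate_series(1, 100) g;
CREATE TABLE total (hits bigint NOT NULL);
INSERT INTO total VALUES (0);

-- each iteration (the id is picked at random by the client):
BEGIN;
UPDATE counters SET hits = hits + 1 WHERE id = 42;
UPDATE total SET hits = hits + 1;
COMMIT;
-- only after COMMIT returns does the client bump its local copy,
-- which is compared against the tables once the server is back up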
On Wed, Oct 24, 2012 at 8:04 AM, Chris Angelico <rosuav@gmail.com> wrote: > On Tue, Oct 23, 2012 at 9:51 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote: >> On Mon, Oct 22, 2012 at 7:17 AM, Chris Angelico <rosuav@gmail.com> wrote: >>> After reading the comments last week about SSDs, I did some testing of >>> the ones we have at work - each of my test-boxes (three with SSDs, one >>> with HDD) subjected to multiple stand-alone plug-pull tests, using >>> pgbench to provide load. So far, there've been no instances of >>> PostgreSQL data corruption, but diskchecker.pl reported huge numbers >>> of errors. >> >> Try starting pgbench, and then, halfway through the checkpoint timeout, >> issue a checkpoint and WHILE the checkpoint is >> still running THEN pull the plug. >> >> Then after bringing the server up (assuming pg starts up) see if >> pg_dump generates any errors. > > Thanks for the tip. I've been flat-out at work these past few days and > haven't gotten around to testing in the middle of a checkpoint, but I > have done something that might also be of interest. It's inspired by a > combination of diskchecker and pgbench: a harness that puts the > database under load and retains a record of what's been done. > > In brief: Create a table with N (eg 100) rows, then spin as fast as > possible, incrementing a counter against one random row and also > incrementing the "Total" counter. When the database goes down, wait > for it to come up again; when it does, check against the local copy of > the counters and report any discrepancies. > > The code's written in Pike, using the same database connection logic > that we use in our actual application (well, some of our code is C++ > and some is PHP, so this corresponds to one part of our app), so this > is roughly representative of real usage. > > It's about a page or two of code: http://pastebin.com/UNTj642Y Very cool. Nice little project. > Currently, all the key parameters (database connection info (which has > been censored for the pastebin version), pool size, thread count, etc) > are just variables visible in the script, simpler than parsing > command-line arguments. > > Is this a useful and plausible testing methodology? It's definitely > shown up some failures. On a hard-disk, all is well as long as the > write-back cache is disabled; on the SSDs, I can't make them reliable. Yes, it seems to be quite a good idea actually. > Is a single table enough to test for corruption with? If it fails, definitely; if it passes, maybe.
On 10/24/12 4:04 PM, Chris Angelico wrote: > Is this a useful and plausible testing methodology? It's definitely > shown up some failures. On a hard-disk, all is well as long as the > write-back cache is disabled; on the SSDs, I can't make them reliable. On Linux systems, you can tell when Postgres is busy writing data out during a checkpoint because the "Dirty:" amount in /proc/meminfo will be dropping rapidly. At most other times, that number goes up. You can try to increase the odds of finding database-level corruption during a pull-the-plug test by trying to yank during that most sensitive moment. Combine a reasonable write-heavy test like you've devised with that "optimization", and systems that don't write reliably will usually corrupt within a few tries. In general, though, diskchecker.pl is the more sensitive test. If it fails, storage is unreliable for PostgreSQL, period. It's good that you've followed up by confirming the real database corruption implied by that is also visible. In general, though, that's not needed. Diskchecker says the drive is bad, you're done--don't put a database on it. Doing the database level tests is more for finding false positives: where diskchecker says the drive is OK, but perhaps there is a filesystem problem that makes it unreliable, one that it doesn't test for. What SSD are you using? The Intel 320 and 710 series models are the only SATA-connected drives still on the market I know of that pass a serious test. The other good models are direct PCI-E storage units, like the FusionIO drives. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On Sat, Oct 27, 2012 at 4:26 PM, Greg Smith <greg@2ndquadrant.com> wrote: > In general, through, diskchecker.pl is the more sensitive test. If it > fails, storage is unreliable for PostgreSQL, period. It's good that you've > followed up by confirming the real database corruption implied by that is > also visible. In general, though, that's not needed. Diskchecker says the > drive is bad, you're done--don't put a database on it. Doing the database > level tests is more for finding false positives: where diskchecker says the > drive is OK, but perhaps there is a filesystem problem that makes it > unreliable, one that it doesn't test for. Thanks. That's the conclusion we were coming to too, though all I've seen is lost transactions and not any other form of damage. > What SSD are you using? The Intel 320 and 710 series models are the only > SATA-connected drives still on the market I know of that pass a serious > test. The other good models are direct PCI-E storage units, like the > FusionIO drives. I don't have the specs to hand, but one of them is a Kingston drive. Our local supplier is out of 320 series drives, so we were looking for others; will check out the 710s. It's crazy that so few drives can actually be trusted. ChrisA
On Sat, Oct 27, 2012 at 05:41:02PM +1100, Chris Angelico wrote: > On Sat, Oct 27, 2012 at 4:26 PM, Greg Smith <greg@2ndquadrant.com> wrote: > > In general, through, diskchecker.pl is the more sensitive test. If it > > fails, storage is unreliable for PostgreSQL, period. It's good that you've > > followed up by confirming the real database corruption implied by that is > > also visible. In general, though, that's not needed. Diskchecker says the > > drive is bad, you're done--don't put a database on it. Doing the database > > level tests is more for finding false positives: where diskchecker says the > > drive is OK, but perhaps there is a filesystem problem that makes it > > unreliable, one that it doesn't test for. > > Thanks. That's the conclusion we were coming to too, though all I've > seen is lost transactions and not any other form of damage. > > > What SSD are you using? The Intel 320 and 710 series models are the only > > SATA-connected drives still on the market I know of that pass a serious > > test. The other good models are direct PCI-E storage units, like the > > FusionIO drives. > > I don't have the specs to hand, but one of them is a Kingston drive. > Our local supplier is out of 320 series drives, so we were looking for > others; will check out the 710s. It's crazy that so few drives can > actually be trusted. Yes. Welcome to our craziness! -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Wed, Nov 7, 2012 at 11:59 AM, Bruce Momjian <bruce@momjian.us> wrote: > On Sat, Oct 27, 2012 at 05:41:02PM +1100, Chris Angelico wrote: >> On Sat, Oct 27, 2012 at 4:26 PM, Greg Smith <greg@2ndquadrant.com> wrote: >> > In general, through, diskchecker.pl is the more sensitive test. If it >> > fails, storage is unreliable for PostgreSQL, period. It's good that you've >> > followed up by confirming the real database corruption implied by that is >> > also visible. In general, though, that's not needed. Diskchecker says the >> > drive is bad, you're done--don't put a database on it. Doing the database >> > level tests is more for finding false positives: where diskchecker says the >> > drive is OK, but perhaps there is a filesystem problem that makes it >> > unreliable, one that it doesn't test for. >> >> Thanks. That's the conclusion we were coming to too, though all I've >> seen is lost transactions and not any other form of damage. >> >> > What SSD are you using? The Intel 320 and 710 series models are the only >> > SATA-connected drives still on the market I know of that pass a serious >> > test. The other good models are direct PCI-E storage units, like the >> > FusionIO drives. >> >> I don't have the specs to hand, but one of them is a Kingston drive. >> Our local supplier is out of 320 series drives, so we were looking for >> others; will check out the 710s. It's crazy that so few drives can >> actually be trusted. > > Yes. Welcome to our craziness! Is there a comprehensive list of drives that have been tested on the wiki somewhere? Our current choices seem to be the Intel 3xx series which STILL suffer from the "whoops I'm now an 8MB drive" bug and the very expensive SLC 7xx series Intel drives, the Hitachi Ultrastar SSD400M, and the OCZ Vertex 2 Pro. Any particular recommendations from those or other series from anyone would be greatly appreciated.
On Wed, Nov 7, 2012 at 01:53:47PM -0700, Scott Marlowe wrote: > On Wed, Nov 7, 2012 at 11:59 AM, Bruce Momjian <bruce@momjian.us> wrote: > > On Sat, Oct 27, 2012 at 05:41:02PM +1100, Chris Angelico wrote: > >> On Sat, Oct 27, 2012 at 4:26 PM, Greg Smith <greg@2ndquadrant.com> wrote: > >> > In general, through, diskchecker.pl is the more sensitive test. If it > >> > fails, storage is unreliable for PostgreSQL, period. It's good that you've > >> > followed up by confirming the real database corruption implied by that is > >> > also visible. In general, though, that's not needed. Diskchecker says the > >> > drive is bad, you're done--don't put a database on it. Doing the database > >> > level tests is more for finding false positives: where diskchecker says the > >> > drive is OK, but perhaps there is a filesystem problem that makes it > >> > unreliable, one that it doesn't test for. > >> > >> Thanks. That's the conclusion we were coming to too, though all I've > >> seen is lost transactions and not any other form of damage. > >> > >> > What SSD are you using? The Intel 320 and 710 series models are the only > >> > SATA-connected drives still on the market I know of that pass a serious > >> > test. The other good models are direct PCI-E storage units, like the > >> > FusionIO drives. > >> > >> I don't have the specs to hand, but one of them is a Kingston drive. > >> Our local supplier is out of 320 series drives, so we were looking for > >> others; will check out the 710s. It's crazy that so few drives can > >> actually be trusted. > > > > Yes. Welcome to our craziness! > > Is there a comprehensive list of drives that have been tested on the > wiki somewhere? Our current choices seem to be the Intel 3xx series > which STILL suffer from the "whoops I'm now an 8MB drive" bug and the > very expensive SLC 7xx series Intel drives, the Hitachi Ultrastar > SSD400M, and the OCZ Vertex 2 Pro. Any particular recommendations > from those or other series from anyone would be greatly appreciated. No, I know of no official list. Greg Smith and I have tried to document some of this on the wiki: http://wiki.postgresql.org/wiki/Reliable_Writes -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Wed, Nov 7, 2012 at 2:01 PM, Bruce Momjian <bruce@momjian.us> wrote: > On Wed, Nov 7, 2012 at 01:53:47PM -0700, Scott Marlowe wrote: >> On Wed, Nov 7, 2012 at 11:59 AM, Bruce Momjian <bruce@momjian.us> wrote: >> > On Sat, Oct 27, 2012 at 05:41:02PM +1100, Chris Angelico wrote: >> >> On Sat, Oct 27, 2012 at 4:26 PM, Greg Smith <greg@2ndquadrant.com> wrote: >> >> > In general, through, diskchecker.pl is the more sensitive test. If it >> >> > fails, storage is unreliable for PostgreSQL, period. It's good that you've >> >> > followed up by confirming the real database corruption implied by that is >> >> > also visible. In general, though, that's not needed. Diskchecker says the >> >> > drive is bad, you're done--don't put a database on it. Doing the database >> >> > level tests is more for finding false positives: where diskchecker says the >> >> > drive is OK, but perhaps there is a filesystem problem that makes it >> >> > unreliable, one that it doesn't test for. >> >> >> >> Thanks. That's the conclusion we were coming to too, though all I've >> >> seen is lost transactions and not any other form of damage. >> >> >> >> > What SSD are you using? The Intel 320 and 710 series models are the only >> >> > SATA-connected drives still on the market I know of that pass a serious >> >> > test. The other good models are direct PCI-E storage units, like the >> >> > FusionIO drives. >> >> >> >> I don't have the specs to hand, but one of them is a Kingston drive. >> >> Our local supplier is out of 320 series drives, so we were looking for >> >> others; will check out the 710s. It's crazy that so few drives can >> >> actually be trusted. >> > >> > Yes. Welcome to our craziness! >> >> Is there a comprehensive list of drives that have been tested on the >> wiki somewhere? Our current choices seem to be the Intel 3xx series >> which STILL suffer from the "whoops I'm now an 8MB drive" bug and the >> very expensive SLC 7xx series Intel drives, the Hitachi Ultrastar >> SSD400M, and the OCZ Vertex 2 Pro. Any particular recommendations >> from those or other series from anyone would be greatly appreciated. > > No, I know of no official list. Greg Smith and I have tried to document > some of this on the wiki: > > http://wiki.postgresql.org/wiki/Reliable_Writes Well I may get a budget at work to do some testing so I'll update that list etc. This has been a good thread to get me motivated to get started.
On Wed, Nov 7, 2012 at 02:12:39PM -0700, Scott Marlowe wrote: > >> >> I don't have the specs to hand, but one of them is a Kingston drive. > >> >> Our local supplier is out of 320 series drives, so we were looking for > >> >> others; will check out the 710s. It's crazy that so few drives can > >> >> actually be trusted. > >> > > >> > Yes. Welcome to our craziness! > >> > >> Is there a comprehensive list of drives that have been tested on the > >> wiki somewhere? Our current choices seem to be the Intel 3xx series > >> which STILL suffer from the "whoops I'm now an 8MB drive" bug and the > >> very expensive SLC 7xx series Intel drives, the Hitachi Ultrastar > >> SSD400M, and the OCZ Vertex 2 Pro. Any particular recommendations > >> from those or other series from anyone would be greatly appreciated. > > > > No, I know of no official list. Greg Smith and I have tried to document > > some of this on the wiki: > > > > http://wiki.postgresql.org/wiki/Reliable_Writes > > Well I may get a budget at work to do some testing so I'll update that > list etc. This has been a good thread to get me motivated to get > started. Yes, it seems database people are the few who care about device sync reliability (or know to care). -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Wed, Nov 7, 2012 at 3:53 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> Is there a comprehensive list of drives that have been tested on the
> wiki somewhere? Our current choices seem to be the Intel 3xx series
> which STILL suffer from the "whoops I'm now an 8MB drive" bug and the
> very expensive SLC 7xx series Intel drives, the Hitachi Ultrastar
> SSD400M, and the OCZ Vertex 2 Pro. Any particular recommendations
> from those or other series from anyone would be greatly appreciated.
My most recent big box(es) are built using all Intel 3xx series drives. Like you said, the 7xx series was way too expensive. The 5xx series looks totally right on paper, until you find out they don't have a durable cache. That just doesn't make sense in any universe... but that's the way they are.
They seem to be doing really well so far. I connected them to LSI RAID controllers, with the Fastpath option. I think they are pretty speedy.
On my general purpose boxes, I now spec the 3xx drives for boot (software RAID) and use other drives such as Seagate Constellation for data with ZFS. Sometimes I think that the ZFS volumes are faster than the SSD RAID volumes, but it is not a fair comparison because the RAID systems are CentOS 6 and the ZFS systems are FreeBSD 9.
On 11/7/2012 3:17 PM, Vick Khera wrote: > My most recent big box(es) are built using all Intel 3xx series > drives. Like you said, the 7xx series was way too expensive. I have to raise my hand to say that for us the 710 series drives are an unbelievable bargain and we buy nothing else now for production servers. When you compare against the setup you'd need to achieve the same tps using rotating media, and especially considering the power and cooling saved, they're really cheap. YMMV of course.