
From Chris Angelico
Subject Re: Plug-pull testing worked, diskchecker.pl failed
Date
Msg-id CAPTjJmpXC+FM5U=kDxv+k-iK9Az=po9agkh34LvihSbrLpz+ug@mail.gmail.com
In response to Re: Plug-pull testing worked, diskchecker.pl failed  (Scott Marlowe <scott.marlowe@gmail.com>)
Responses Re: Plug-pull testing worked, diskchecker.pl failed  (Scott Marlowe <scott.marlowe@gmail.com>)
Re: Plug-pull testing worked, diskchecker.pl failed  (Greg Smith <greg@2ndQuadrant.com>)
List pgsql-general
On Tue, Oct 23, 2012 at 9:51 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> On Mon, Oct 22, 2012 at 7:17 AM, Chris Angelico <rosuav@gmail.com> wrote:
>> After reading the comments last week about SSDs, I did some testing of
>> the ones we have at work - each of my test-boxes (three with SSDs, one
>> with HDD) subjected to multiple stand-alone plug-pull tests, using
>> pgbench to provide load. So far, there've been no instances of
>> PostgreSQL data corruption, but diskchecker.pl reported huge numbers
>> of errors.
>
> Try starting pgbench, and then halfway through the timeout for a
> checkpoint timeout issue a checkpoint and WHILE the checkpoint is
> still running THEN pull the plug.
>
> Then after bringing the server up (assuming pg starts up) see if
> pg_dump generates any errors.

Thanks for the tip. I've been flat-out at work these past few days and
haven't gotten around to testing in the middle of a checkpoint, but I
have done something that might also be of interest. It's inspired by a
combination of diskchecker and pgbench: a harness that puts the
database under load and retains a record of what's been done.

In brief: Create a table with N (eg 100) rows, then spin as fast as
possible, incrementing a counter against one random row and also
incrementing the "Total" counter. When the database goes down, wait
for it to come up again; when it does, check against the local copy of
the counters and report any discrepancies.

The code's written in Pike, using the same database connection logic
that we use in our actual application (well, some of our code is C++
and some is PHP, so this corresponds to one part of our app), so this
is roughly representative of real usage.

It's about a page or two of code: http://pastebin.com/UNTj642Y
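
In case the pastebin link goes stale, here's a rough, single-threaded
sketch of the same logic, rendered in Python with psycopg2 rather than
the Pike of the actual script (the table layout, connection string and
names below are invented purely for illustration):

import random
import time
import psycopg2

DSN = "dbname=plugpull"   # hypothetical connection info
POOL_SIZE = 100           # N counter rows

local = [0] * POOL_SIZE   # in-memory copy of each row's counter
local_total = 0           # in-memory copy of the "Total" counter

def connect():
    # Keep retrying until the server comes back up after the plug-pull.
    while True:
        try:
            return psycopg2.connect(DSN)
        except psycopg2.OperationalError:
            time.sleep(1)

def verify(cur):
    # Compare the on-disk counters against the local copies. A difference
    # of one can just be a commit that was in flight when the power died;
    # anything larger is a lost (or phantom) update.
    cur.execute("SELECT id, count FROM counters ORDER BY id")
    for rowid, count in cur.fetchall():
        if abs(count - local[rowid]) > 1:
            print("Row %d: db=%d, local=%d" % (rowid, count, local[rowid]))
    cur.execute("SELECT total FROM totals")
    (total,) = cur.fetchone()
    if abs(total - local_total) > 1:
        print("Total: db=%d, local=%d" % (total, local_total))

conn = connect()
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS counters (id int PRIMARY KEY, count int NOT NULL)")
cur.execute("CREATE TABLE IF NOT EXISTS totals (total int NOT NULL)")
cur.execute("SELECT count(*) FROM counters")
if cur.fetchone()[0] == 0:
    cur.execute("INSERT INTO counters SELECT g, 0 FROM generate_series(0, %s) g",
                (POOL_SIZE - 1,))
    cur.execute("INSERT INTO totals VALUES (0)")
conn.commit()

while True:
    rowid = random.randrange(POOL_SIZE)
    try:
        cur.execute("UPDATE counters SET count = count + 1 WHERE id = %s", (rowid,))
        cur.execute("UPDATE totals SET total = total + 1")
        conn.commit()
        local[rowid] += 1      # only count it once the commit has returned
        local_total += 1
    except (psycopg2.OperationalError, psycopg2.InterfaceError):
        # Server went away (plug pulled); wait for it, then check for lost updates.
        conn = connect()
        cur = conn.cursor()
        verify(cur)

The local copies are only bumped after COMMIT returns, so a per-row
difference of exactly one can just be a commit that was in flight at the
moment of the plug-pull; anything larger means the drive acknowledged
an fsync for data it never actually persisted.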

Currently, all the key parameters (database connection info, which has
been censored in the pastebin version; pool size; thread count; etc.)
are just variables in the script, which is simpler than parsing
command-line arguments.

Is this a useful and plausible testing methodology? It has certainly
shown up some failures. On a hard disk, all is well as long as the
write-back cache is disabled; on the SSDs, I can't make them reliable.

Is a single table enough to test for corruption with?

Chris Angelico

