Re: Data corruption after SAN snapshot - Mailing list pgsql-admin

From Terry Schmitt
Subject Re: Data corruption after SAN snapshot
Date
Msg-id CAOOcysxktC6qkc0j4cruhtA5Ec5BLDb9Q0hX9YWkb8G7F1_JdA@mail.gmail.com
In response to Re: Data corruption after SAN snapshot  (Craig Ringer <ringerc@ringerc.id.au>)
Responses Re: Data corruption after SAN snapshot  (Stephen Frost <sfrost@snowman.net>)
List pgsql-admin
Thanks Craig.

"# Brad's el-ghetto do-our-storage-stacks-lie?-script" I like it already :)

I may play around with that. Looks interesting. For everyone else, here's a post describing the use of diskchecker: http://brad.livejournal.com/2116715.html
I experimented with sysbench today, which was somewhat enlightening: it clearly shows the impact that fsync/fdatasync has on file system performance. Judging simply by the throughput of each test, it's pretty obvious that fsync is actually writing out to disk.
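For anyone who wants to see the same effect without sysbench, here is a rough stand-in, not what sysbench itself does. It times a batch of 8 kB writes with and without a sync after each write. The function names, the probe file name and the block size are my own choices for illustration. It uses os.fsync for portability; on Linux, os.fdatasync can be swapped in.

```python
import os
import time


def timed_writes(path, count, sync):
    """Write `count` 8 kB blocks to `path` and return the elapsed seconds.

    When `sync` is true, os.fsync() is called after every write, so each
    block should reach stable storage before the next one is issued.
    """
    block = b"x" * 8192
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, block)
            if sync:
                os.fsync(fd)
        return time.perf_counter() - start
    finally:
        os.close(fd)


def compare(directory, count=200):
    """Time buffered vs. synced writes in `directory` (hypothetical helper)."""
    path = os.path.join(directory, "fsync_probe.tmp")  # illustrative name
    try:
        buffered = timed_writes(path, count, sync=False)
        synced = timed_writes(path, count, sync=True)
    finally:
        if os.path.exists(path):
            os.remove(path)
    rate = count / synced if synced > 0 else float("inf")
    return {"buffered_s": buffered, "synced_s": synced, "syncs_per_s": rate}


if __name__ == "__main__":
    # Point this at a directory on the volume under test.
    print(compare("."))
```

If the synced run is not clearly slower than the buffered one, or the sync rate is higher than the storage could plausibly sustain, something in the stack may be acknowledging writes early. A battery-backed write cache can legitimately give high rates, so treat the numbers as a hint only.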
Using pgbench is a good idea, as I can throw a high transaction rate at the database and take a snapshot during the test. So far, executing pg_dumpall seems to be fairly reliable for finding the corrupt objects after my initial data load, but unfortunately much of the corruption has been in indexes, which pg_dump will not expose.

Thanks for the input,
T




On Tue, Aug 7, 2012 at 6:11 PM, Craig Ringer <ringerc@ringerc.id.au> wrote:
On 08/08/2012 06:23 AM, Terry Schmitt wrote:

Anyone have a solid method to test if fdatasync is working correctly or
thoughts on troubleshooting this?

Try diskchecker.pl

https://gist.github.com/3177656

The other obvious step: you've changed three things, so start isolation testing.

- Test Postgres Plus Advanced Server 8.4, which you knew worked, on your new file system and OS.

- Test PP9.1 on your new OS but with ext3, which you knew worked.

- Test PP9.1 on your new OS but with ext4, which should work if ext3 did.

- Test PP9.1 on a copy of your *old* OS with the old file system setup.

- Test mainline PostgreSQL 9.1 on your new setup to see if it's PP specific.

Since each test sounds moderately time consuming, you'll probably need to find a way to automate them. I'd first see if I could reproduce the problem when running pgbench against the same setup that's currently failing, and if that reproduces the fault you can use pgbench with the other tests.
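As a rough idea of what that automation might look like, here is a minimal sketch. It lists the five setups above and, for each one, initializes pgbench, starts a load in the background and runs a check while the load is in flight. The `prepare` and `snapshot_and_check` hooks are hypothetical placeholders you would have to write for your own environment (provisioning the setup, taking the SAN snapshot and verifying it); the database name, scale, client count and duration are arbitrary assumptions.

```python
import subprocess
from dataclasses import dataclass


@dataclass(frozen=True)
class Setup:
    server: str      # database build under test
    os_image: str    # "old" or "new" operating system
    filesystem: str  # file system holding the data directory


# The five isolation tests from the list above (labels are illustrative).
SETUPS = [
    Setup("Postgres Plus Advanced Server 8.4", "new", "new fs"),
    Setup("PP9.1", "new", "ext3"),
    Setup("PP9.1", "new", "ext4"),
    Setup("PP9.1", "old", "old fs"),
    Setup("PostgreSQL 9.1", "new", "new fs"),
]


def pgbench_commands(dbname, clients=8, seconds=300, scale=10):
    """Return (init, load) pgbench command lines; the values are assumptions."""
    init = ["pgbench", "-i", "-s", str(scale), dbname]
    load = ["pgbench", "-c", str(clients), "-T", str(seconds), dbname]
    return init, load


def run_matrix(setups, dbname, prepare, snapshot_and_check,
               run=subprocess.run, launch=subprocess.Popen):
    """Run the same load against every setup and collect the check results."""
    results = {}
    for setup in setups:
        prepare(setup)                 # hypothetical hook: bring this setup up
        init, load = pgbench_commands(dbname)
        run(init, check=True)          # create and populate the pgbench tables
        proc = launch(load)            # start the load in the background
        try:
            # hypothetical hook: snapshot mid-load, then verify the snapshot
            results[setup] = snapshot_and_check(setup)
        finally:
            proc.wait()                # let the load finish before moving on
    return results
```

The `run` and `launch` parameters exist only so the loop can be dry-run with stand-ins before pointing it at real servers.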

--
Craig Ringer

