Re: silent data loss with ext4 / all current versions - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: silent data loss with ext4 / all current versions
Date
Msg-id 565B0CBB.4090406@2ndquadrant.com
In response to Re: silent data loss with ext4 / all current versions  (Craig Ringer <craig@2ndquadrant.com>)
Responses Re: silent data loss with ext4 / all current versions  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
Hi,

On 11/29/2015 02:38 PM, Craig Ringer wrote:
> On 27 November 2015 at 21:28, Greg Stark <stark@mit.edu> wrote:
>
>     On Fri, Nov 27, 2015 at 11:17 AM, Tomas Vondra
>     <tomas.vondra@2ndquadrant.com> wrote:
>     > I plan to do more power failure testing soon, with more complex test
>     > scenarios. I suspect there might be other similar issues (e.g. when we
>     > rename a file before a checkpoint and don't fsync the directory - then the
>     > rename won't be replayed and will be lost).
>
>     I'm curious how you're doing this testing. The easiest way I can think
>     of would be to run a database on an LVM volume and take a large number
>     of LVM snapshots very rapidly and then see if the database can start
>     up from each snapshot. Bonus points for keeping track of the committed
>     transactions before each snapshot and ensuring they're still there I
>     guess.
>
>
> I've had a few tries at implementing a qemu-based crashtester where it
> hard kills the qemu instance at a random point then starts it back up.

I've tried to reproduce the issue by killing a qemu VM, and so far I've
been unsuccessful. On bare HW it was easily reproducible (I'd hit the
issue in 9 out of 10 attempts), so either I'm doing something wrong or
qemu somehow changes the I/O behavior.

> I always got stuck on the validation part - actually ensuring that the
> DB state is how we expect. I think I could probably get that right now,
> it's been a while.

Well, we can't really check all the details, but I guess the checksums
make checking the general consistency somewhat simpler. Beyond that, you
have to design the workload in a way that makes the check easier, for
example by remembering the committed values.
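
As a rough illustration of the "remember the committed values" idea, here
is a minimal Python sketch. All names in it (record_commit, load_committed,
find_lost_commits, the log file) are hypothetical and not part of
PostgreSQL or of any test harness mentioned in this thread. It assumes the
log is kept on a machine that is not the one being crashed, and that a
value is logged only after the server has acknowledged the COMMIT.

```python
import os


def record_commit(log_path, value):
    """Append a value to the client-side log once COMMIT was acknowledged.

    Assumption: log_path lives on a host that survives the simulated
    power failure, so the log itself is trustworthy.
    """
    with open(log_path, "a") as f:
        f.write(f"{value}\n")
        f.flush()
        os.fsync(f.fileno())  # make the log entry durable as well


def load_committed(log_path):
    """Read back every value that was acknowledged before the crash."""
    with open(log_path) as f:
        return [int(line) for line in f if line.strip()]


def find_lost_commits(committed, recovered):
    """Return acknowledged values that are missing after crash recovery.

    Values present in the database but absent from the log are not treated
    as errors: a commit may have become durable just before the crash,
    without the client ever receiving the acknowledgement.
    """
    return sorted(set(committed) - set(recovered))
```

After the database comes back up, the harness would read the surviving
values from the test table (how that is done is left open here) and pass
them in as `recovered`; a non-empty result means acknowledged data was
lost.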

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


