Re: silent data loss with ext4 / all current versions - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: silent data loss with ext4 / all current versions
Date
Msg-id 56A2E7F6.9090902@2ndquadrant.com
In response to Re: silent data loss with ext4 / all current versions  (Michael Paquier <michael.paquier@gmail.com>)
Responses Re: silent data loss with ext4 / all current versions  (Michael Paquier <michael.paquier@gmail.com>)
List pgsql-hackers
On 01/23/2016 02:35 AM, Michael Paquier wrote:
> On Fri, Jan 22, 2016 at 9:41 PM, Greg Stark <stark@mit.edu> wrote:
>> On Fri, Jan 22, 2016 at 8:26 AM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>> On 01/22/2016 06:45 AM, Michael Paquier wrote:
>>>
>>>> So, I have been playing with a Linux VM with VMware Fusion, and on
>>>> ext4 with data=ordered the renames are getting lost if the root
>>>> folder is not fsynced. By kill -9'ing the VM I am able to reproduce
>>>> that really easily.
>>>
>>>
>>> Yep. Same experience here (with qemu-kvm VMs).
>>
>> I still think a better approach for this is to run the database on an
>> LVM volume and take lots of snapshots. No VM needed, though it doesn't
>> hurt. LVM volumes are below the level of the filesystem, and a snapshot
>> captures the state of the raw blocks the filesystem has written to the
>> block layer. The block layer does no caching, though the drive may, but
>> neither the VM solution nor LVM would capture that.
>>
>> LVM snapshots would have the advantage that you can keep running the
>> database and you can take lots of snapshots with relatively little
>> overhead. Having dozens or hundreds of snapshots would be unacceptable
>> performance drain in production but for testing it should be practical
>> and they take relatively little space -- just the blocks changed since
>> the snapshot was taken.
>
> Another idea: hardcode a PANIC just after rename() with
> restart_after_crash = off (this needs IsBootstrapProcess() checks).
> Once the server crashes, kill -9 the VM. Then restart the VM and the
> Postgres instance with a new binary that does not have the PANIC, and
> see how things go. There is a window of up to several seconds after
> the rename() call, so I guess that this would work.

I don't see how that would improve anything, as the PANIC has no impact
on the I/O requests already issued to the system. What you need is some
sort of coordination between the database and the script that kills the
VM (or takes an LVM snapshot).

That can be done by emitting a particular log message and having the
"kill script" watch the log file (for example over SSH). This has the
benefit that you can also watch for additional conditions that are
difficult to check from that particular part of the code, and only kill
the VM when all of them trigger - for example only on the third
checkpoint since the start, and so on.
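
To make that concrete, the "kill script" could look roughly like the
sketch below - just an illustration, assuming libvirt (so "virsh
destroy" does the hard power-off), log_checkpoints = on, and made-up
host/domain/log-path names:

#!/usr/bin/env python3
# Rough sketch of the "kill script": tail the server log over SSH, wait
# for the third "checkpoint complete" message (requires log_checkpoints
# = on), then hard power-off the VM. Host, domain and log path are
# made-up placeholders.
import subprocess

GUEST_HOST = "pgtest-vm"       # SSH name of the guest (assumed)
DOMAIN = "pgtest-vm"           # libvirt domain name (assumed)
LOGFILE = "/var/lib/pgsql/data/pg_log/postgresql.log"   # assumed path
CHECKPOINTS_BEFORE_KILL = 3

tail = subprocess.Popen(["ssh", GUEST_HOST, "tail", "-F", LOGFILE],
                        stdout=subprocess.PIPE, text=True)

seen = 0
for line in tail.stdout:
    if "checkpoint complete" in line:
        seen += 1
        if seen == CHECKPOINTS_BEFORE_KILL:
            # "virsh destroy" is an immediate power-off, i.e. the same
            # effect as kill -9 on the whole VM - nothing gets flushed.
            subprocess.run(["virsh", "destroy", DOMAIN], check=True)
            break

tail.kill()

Taking an LVM snapshot at that point instead of destroying the VM would
be a one-line change in the same place.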

The reason why I was not particularly thrilled about the LVM snapshot
idea is that to identify this particular data loss issue, you need to be
able to reason about the expected state of the database (which
transactions are committed, how many WAL segments there are). And my
understanding was that Greg's idea was merely "try to start the DB on a
snapshot and see if it starts / is not corrupted," which would not work
for this particular issue, as the database seemed just fine - the data
loss is silent. Adding the "last XLOG segment" into pg_controldata would
make it easier to detect, without having to track details about which
transactions got committed.
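
In other words, you pretty much have to compare the recovered database
against a workload with a known expected state. A rough sketch of what
I mean (table name, connection string and state-file path are made up,
and it assumes psycopg2):

#!/usr/bin/env python3
# Sketch of the verification side: a workload that remembers the last
# value whose COMMIT was acknowledged, and a check that runs after the
# crash and recovery. Table name, connection string and state-file path
# are made up.
import psycopg2

def run_workload(conninfo, statefile):
    """Insert sequential values, one transaction each, recording the
    highest value for which COMMIT actually returned. The state file
    must live outside the VM under test."""
    conn = psycopg2.connect(conninfo)
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS t (id bigint PRIMARY KEY)")
    conn.commit()
    i = 0
    while True:                      # runs until the VM gets killed
        i += 1
        cur.execute("INSERT INTO t VALUES (%s)", (i,))
        conn.commit()                # acknowledged => must survive the crash
        with open(statefile, "w") as f:
            f.write(str(i))

def check_after_recovery(conninfo, statefile):
    """Compare the last acknowledged commit with the recovered data."""
    expected = int(open(statefile).read())
    conn = psycopg2.connect(conninfo)
    cur = conn.cursor()
    cur.execute("SELECT max(id) FROM t")
    actual = cur.fetchone()[0] or 0
    if actual < expected:
        print("silent data loss: committed up to %d, found only %d"
              % (expected, actual))
    else:
        print("ok: found %d, expected at least %d" % (actual, expected))

Run the workload until the VM gets killed, then run the check once the
instance is back up - if max(id) is below the last value that got an
acknowledged COMMIT, you have detected the loss even though the database
itself looks perfectly healthy.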

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


