Re: corrupt pages detected by enabling checksums - Mailing list pgsql-hackers

From Jon Nelson
Subject Re: corrupt pages detected by enabling checksums
Msg-id CAKuK5J1p=vUeW63+nhUM01T+i_HzdcfsmX0qeFXWe7ZhH=1=qQ@mail.gmail.com
In response to Re: corrupt pages detected by enabling checksums  (Andres Freund <andres@2ndquadrant.com>)
List pgsql-hackers
On Mon, May 13, 2013 at 8:32 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-05-12 19:41:26 -0500, Jon Nelson wrote:
>> On Sun, May 12, 2013 at 3:46 PM, Jim Nasby <jim@nasby.net> wrote:
>> > On 5/10/13 1:06 PM, Jeff Janes wrote:
>> >>
>> >> Of course the paranoid DBA could turn off restart_after_crash and do a
>> >> manual investigation on every crash, but in that case the database would
>> >> refuse to restart even in the case where it is perfectly clear that all the
>> >> following WAL belongs to the recycled file and not the current file.
>> >
>> >
>> > Perhaps we should also allow for zeroing out WAL files before reuse (or just
>> > disable reuse). I know there's a performance hit there, but the reuse idea
>> > happened before we had bgWriter. Theoretically the overhead of creating a new
>> > file would always fall to bgWriter and therefore not be a big deal.
>>
>> For filesystems like btrfs, re-using a WAL file is suboptimal compared to
>> simply creating a new one and removing the old one when it's no longer
>> required. Using fallocate (or posix_fallocate) (I have a patch for
>> that!) to create a new one is - by my tests - 28 times faster than the
>> currently-used method.
>
> I don't think the comparison between just fallocate()ing and what we
> currently do is fair. fallocate() doesn't guarantee that the file is the
> same size after a crash, so you would still need an fsync() or we
> couldn't use fdatasync() anymore. And I'd guess the benefits aren't all
> that big anymore in that case?

fallocate (16MB) + fsync is still almost certainly faster than
write+write+write... + fsync.
The test I performed at the time did exactly that: posix_fallocate + pg_fsync.
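
For illustration only, the two approaches being compared can be sketched
roughly as follows. This is a hypothetical Python stand-in, not the patch
mentioned above and not PostgreSQL's own code: the function names, the 8 kB
write size, and the use of a plain os.fsync in place of pg_fsync are all
assumptions. Timings will vary widely with filesystem and hardware, so no
numbers are claimed here.

```python
import os
import tempfile
import time

SEGMENT_SIZE = 16 * 1024 * 1024  # 16MB, the size discussed in this thread
BLOCK_SIZE = 8192                # assumed write size for the zero-fill loop


def create_by_zero_fill(path):
    """write+write+write... + fsync: fill the file with zero blocks."""
    zeros = bytes(BLOCK_SIZE)
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    try:
        remaining = SEGMENT_SIZE
        while remaining > 0:
            # os.write may write fewer bytes than requested, so track progress.
            remaining -= os.write(fd, zeros[:min(BLOCK_SIZE, remaining)])
        os.fsync(fd)
    finally:
        os.close(fd)


def create_by_fallocate(path):
    """posix_fallocate + fsync: ask the filesystem to reserve the space."""
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    try:
        os.posix_fallocate(fd, 0, SEGMENT_SIZE)  # Unix only
        os.fsync(fd)
    finally:
        os.close(fd)


def timed(func, path):
    """Return the wall-clock seconds func(path) takes."""
    start = time.perf_counter()
    func(path)
    return time.perf_counter() - start


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        t_fill = timed(create_by_zero_fill, os.path.join(tmpdir, "zero_fill"))
        t_alloc = timed(create_by_fallocate, os.path.join(tmpdir, "fallocate"))
        print(f"zero-fill + fsync:       {t_fill:.4f}s")
        print(f"posix_fallocate + fsync: {t_alloc:.4f}s")
```

Both functions end with an fsync, so the comparison includes the sync cost
raised in the quoted objection rather than measuring allocation alone.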

> That said, using posix_fallocate seems like a good idea in lots of
> places inside pg, it's just not all that easy to do in some of the
> places.

I should not derail this thread any further. Perhaps, if interested
parties would like to discuss the use of fallocate/posix_fallocate, a
new thread might be more appropriate?


--
Jon


