Re: corrupt pages detected by enabling checksums - Mailing list pgsql-hackers

From Andres Freund
Subject Re: corrupt pages detected by enabling checksums
Date
Msg-id 20130513134922.GB27618@awork2.anarazel.de
Whole thread Raw
In response to Re: corrupt pages detected by enabling checksums  (Jon Nelson <jnelson+pgsql@jamponi.net>)
Responses Re: corrupt pages detected by enabling checksums  (Greg Stark <stark@mit.edu>)
List pgsql-hackers
On 2013-05-13 08:45:41 -0500, Jon Nelson wrote:
> On Mon, May 13, 2013 at 8:32 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2013-05-12 19:41:26 -0500, Jon Nelson wrote:
> >> On Sun, May 12, 2013 at 3:46 PM, Jim Nasby <jim@nasby.net> wrote:
> >> > On 5/10/13 1:06 PM, Jeff Janes wrote:
> >> >>
> >> >> Of course the paranoid DBA could turn off restart_after_crash and do a
> >> >> manual investigation on every crash, but in that case the database would
> >> >> refuse to restart even in the case where it perfectly clear that all the
> >> >> following WAL belongs to the recycled file and not the current file.
> >> >
> >> >
> >> > Perhaps we should also allow for zeroing out WAL files before reuse (or just
> >> > disable reuse). I know there's a performance hit there, but the reuse idea
> >> > happened before we had bgWriter. Theoretically the overhead creating a new
> >> > file would always fall to bgWriter and therefore not be a big deal.
> >>
> >> For filesystems like btrfs, re-using a WAL file is suboptimal to
> >> simply creating a new one and removing the old one when it's no longer
> >> required. Using fallocate (or posix_fallocate) (I have a patch for
> >> that!) to create a new one is - by my tests - 28 times faster than the
> >> currently-used method.
> >
> > I don't think the comparison between just fallocate()ing and what we
> > currently do is fair. fallocate() doesn't guarantee that the file is the
> > same size after a crash, so you would still need an fsync() or we
> > couldn't use fdatasync() anymore. And I'd guess the benefits aren't all
> > that big anymore in that case?
> 
> fallocate (16MB) + fsync is still almost certainly faster than
> write+write+write... + fsync.
> The test I performed at the time did exactly that .. posix_fallocate + pg_fsync.
Sure, the initial file creation will be faster. But are the actual
individual wal writes (small, frequently fdatasync()ed) still faster?
That's the critical path currently.
Whether it is pretty much depends on how the filesystem manages
allocated but not initialized blocks...

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



pgsql-hackers by date:

Previous
From: Mark Salter
Date:
Subject: Re: lock support for aarch64
Next
From: Bruce Momjian
Date:
Subject: Re: Re: [GENERAL] pg_upgrade fails, "mismatch of relation OID" - 9.1.9 to 9.2.4