
Re: corrupt pages detected by enabling checksums - Mailing list pgsql-hackers

From	Andres Freund
Subject	Re: corrupt pages detected by enabling checksums
Date	May 13, 2013 13:49:31
Msg-id	20130513134922.GB27618@awork2.anarazel.de
In response to	Re: corrupt pages detected by enabling checksums (Jon Nelson <jnelson+pgsql@jamponi.net>)
Responses	Re: corrupt pages detected by enabling checksums
List	pgsql-hackers


On 2013-05-13 08:45:41 -0500, Jon Nelson wrote:
> On Mon, May 13, 2013 at 8:32 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2013-05-12 19:41:26 -0500, Jon Nelson wrote:
> >> On Sun, May 12, 2013 at 3:46 PM, Jim Nasby <jim@nasby.net> wrote:
> >> > On 5/10/13 1:06 PM, Jeff Janes wrote:
> >> >>
> >> >> Of course the paranoid DBA could turn off restart_after_crash and do a
> >> >> manual investigation on every crash, but in that case the database would
> >> >> refuse to restart even in the case where it is perfectly clear that all the
> >> >> following WAL belongs to the recycled file and not the current file.
> >> >
> >> >
> >> > Perhaps we should also allow for zeroing out WAL files before reuse (or just
> >> > disable reuse). I know there's a performance hit there, but the reuse idea
> >> > happened before we had bgWriter. Theoretically the overhead of creating a new
> >> > file would always fall to bgWriter and therefore not be a big deal.
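
A minimal sketch of the zero-before-reuse idea, assuming a recycled segment is simply overwritten in place with zero-filled chunks and then fsync()ed before it is handed back for new WAL. The function name `zero_recycled_segment` and the 8 kB chunk size are hypothetical illustrations, not PostgreSQL code.

```c
#define _POSIX_C_SOURCE 200809L   /* pwrite, fsync */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define ZERO_CHUNK 8192   /* hypothetical write size */

/*
 * Hypothetical helper: overwrite the first seg_size bytes of an existing
 * (recycled) WAL segment with zeros and flush them to disk, so that records
 * left over from the file's previous use cannot be mistaken for new WAL
 * after a crash. Returns 0 on success, -1 on failure with errno set.
 */
int zero_recycled_segment(const char *path, off_t seg_size)
{
    static const char zeros[ZERO_CHUNK];   /* static storage starts zeroed */
    off_t offset = 0;
    int fd;

    assert(seg_size > 0);

    fd = open(path, O_WRONLY);   /* the file already exists: no O_CREAT */
    if (fd < 0)
        return -1;

    while (offset < seg_size)
    {
        off_t   left = seg_size - offset;
        size_t  len = left < ZERO_CHUNK ? (size_t) left : ZERO_CHUNK;
        ssize_t n = pwrite(fd, zeros, len, offset);

        if (n < 0 && errno == EINTR)
            continue;               /* interrupted by a signal: retry */
        if (n <= 0)
        {
            int save = (n < 0) ? errno : EIO;
            close(fd);
            errno = save;
            return -1;
        }
        offset += n;
    }

    /* Make the zeroed blocks durable before the segment is reused. */
    if (fsync(fd) != 0)
    {
        int save = errno;
        close(fd);
        errno = save;
        return -1;
    }
    return close(fd);
}
```

The cost is one extra full write of every recycled segment, which is the performance hit mentioned above.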
> >>
> >> For filesystems like btrfs, re-using a WAL file is suboptimal compared to
> >> simply creating a new one and removing the old one when it's no longer
> >> required. Using fallocate (or posix_fallocate) (I have a patch for
> >> that!) to create a new one is - by my tests - 28 times faster than the
> >> currently-used method.
> >
> > I don't think the comparison between just fallocate()ing and what we
> > currently do is fair. fallocate() doesn't guarantee that the file is the
> > same size after a crash, so you would still need an fsync() or we
> > couldn't use fdatasync() anymore. And I'd guess the benefits aren't all
> > that big anymore in that case?
> 
> fallocate (16MB) + fsync is still almost certainly faster than
> write+write+write... + fsync.
> The test I performed at the time did exactly that: posix_fallocate + pg_fsync.
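
For concreteness, here is a hedged sketch of the two creation methods being compared: a loop of zero-filled write() calls followed by fsync(), roughly the "write+write+write... + fsync" approach, and posix_fallocate() followed by fsync(). The helper names, the 8 kB chunk size and the timing wrapper are hypothetical; this is not the patch mentioned in the thread, and on its own it says nothing about which method is faster on a given filesystem.

```c
#define _POSIX_C_SOURCE 200809L   /* posix_fallocate, fsync, clock_gettime */
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

#define SEG_SIZE   ((size_t) 16 * 1024 * 1024)   /* 16 MB, as in the thread */
#define ZERO_CHUNK 8192                          /* hypothetical write size */

/* Close fd on an error path without clobbering the errno to report. */
static int fail_close(int fd)
{
    int save = errno;

    close(fd);
    errno = save;
    return -1;
}

/* Method A: write SEG_SIZE bytes of zeros in small pieces, then fsync(). */
int create_segment_zero_fill(const char *path)
{
    static const char zeros[ZERO_CHUNK];
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    for (size_t done = 0; done < SEG_SIZE; )
    {
        size_t  len = SEG_SIZE - done < ZERO_CHUNK ? SEG_SIZE - done : ZERO_CHUNK;
        ssize_t n = write(fd, zeros, len);

        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0)
        {
            if (n == 0)
                errno = EIO;
            return fail_close(fd);
        }
        done += (size_t) n;
    }
    if (fsync(fd) != 0)
        return fail_close(fd);
    return close(fd);
}

/*
 * Method B: reserve the blocks with posix_fallocate(), then fsync(), since
 * fallocate alone does not guarantee the file size survives a crash.
 */
int create_segment_fallocate(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    int rc;

    if (fd < 0)
        return -1;
    rc = posix_fallocate(fd, 0, (off_t) SEG_SIZE);
    if (rc != 0)                   /* returns an error number, not -1/errno */
    {
        errno = rc;
        return fail_close(fd);
    }
    if (fsync(fd) != 0)
        return fail_close(fd);
    return close(fd);
}

/* Wall-clock seconds spent in one creation call; negative on failure. */
double time_create(int (*create)(const char *), const char *path)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (create(path) != 0)
        return -1.0;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double) (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Timing each helper on a fresh path measures only file creation; it does not capture the cost of the WAL writes that later land in the file.
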

Sure, the initial file creation will be faster. But are the actual
individual WAL writes (small, frequently fdatasync()ed) still faster?
That's the critical path currently.
Whether they are depends largely on how the filesystem manages
allocated but not yet initialized blocks...
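
The critical path described here can be sketched as a loop of small sequential pwrite() calls into an existing segment, each followed by fdatasync(). The helper name `time_small_synced_writes`, the record size and the write count are hypothetical; running it on the same filesystem against a zero-filled segment and against a posix_fallocate()d one is one way to probe the question, and the answer is expected to vary with how that filesystem tracks allocated but not yet written blocks.

```c
#define _POSIX_C_SOURCE 200809L   /* pwrite, fdatasync, clock_gettime */
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

#define MAX_REC 8192   /* hypothetical upper bound on one record */

/*
 * Hypothetical micro-benchmark: write nwrites records of rec_size bytes
 * sequentially into an existing, already-sized segment, calling fdatasync()
 * after each one, as a commit-heavy WAL workload would. Returns elapsed
 * seconds, or a negative value on error or if the writes would extend the file.
 */
double time_small_synced_writes(const char *path, size_t rec_size, int nwrites)
{
    char rec[MAX_REC];
    struct timespec t0, t1;
    off_t offset = 0, size;
    int fd;

    if (rec_size == 0 || rec_size > MAX_REC || nwrites < 0)
        return -1.0;
    memset(rec, 'W', rec_size);      /* arbitrary non-zero payload */

    fd = open(path, O_WRONLY);       /* reuse the preallocated file as-is */
    if (fd < 0)
        return -1.0;
    size = lseek(fd, 0, SEEK_END);
    if (size < 0 || (off_t) rec_size * nwrites > size)
    {
        close(fd);                   /* stay inside the existing blocks */
        return -1.0;
    }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < nwrites; i++)
    {
        size_t done = 0;

        while (done < rec_size)
        {
            ssize_t n = pwrite(fd, rec + done, rec_size - done,
                               offset + (off_t) done);

            if (n < 0 && errno == EINTR)
                continue;
            if (n <= 0)
            {
                close(fd);
                return -1.0;
            }
            done += (size_t) n;
        }
        offset += (off_t) rec_size;

        /* The step under discussion: flush each small write before the next. */
        if (fdatasync(fd) != 0)
        {
            close(fd);
            return -1.0;
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    close(fd);
    return (double) (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```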

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
