Re: Checkpoint cost, looks like it is WAL/CRC - Mailing list pgsql-hackers

From Bruce Momjian
Subject Re: Checkpoint cost, looks like it is WAL/CRC
Msg-id 200507161148.j6GBmgA15331@candle.pha.pa.us
In response to Re: Checkpoint cost, looks like it is WAL/CRC  (Kevin Brown <kevin@sysexperts.com>)
Responses Re: Checkpoint cost, looks like it is WAL/CRC
List pgsql-hackers
I don't think our problem is partial writes of WAL, which we already
check, but heap/index page writes, which we currently do not check for
partial writes.

---------------------------------------------------------------------------

Kevin Brown wrote:
> Tom Lane wrote:
> > Simon Riggs <simon@2ndquadrant.com> writes:
> > > I don't think we should care too much about indexes. We can rebuild
> > > them...but losing heap sectors means *data loss*.
> > 
> > If you're so concerned about *data loss* then none of this will be
> > acceptable to you at all.  We are talking about going from a system
> > that can actually survive torn-page cases to one that can only tell
> > you whether you've lost data to such a case.  Arguing about the
> > probability with which we can detect the loss seems beside the
> > point.
> 
> I realize I'm coming into this discussion a bit late, and perhaps my
> thinking on this is simplistically naive.  That said, I think I have
> an idea of how to solve the torn page problem.
> 
> If the hardware lies to you about the data being written to the disk,
> then no amount of work on our part can guarantee data integrity.  So
> the below assumes that the hardware doesn't ever lie about this.
> 
> If you want to prevent a torn page, you have to make the last
> synchronized write to the disk as part of the checkpoint process a
> write that *cannot* result in a torn page.  So it has to be a write of
> a buffer that is no larger than the sector size of the disk.  I'd make
> it 256 bytes, to be sure of accommodating pretty much any disk hardware
> out there.
> 
> In any case, the modified sequence would go something like:
> 
> 1.  write the WAL entry, and encode in it a unique magic number
> 2.  sync()
> 3.  append the unique magic number to the WAL again (or to a separate
>     file if you like, it doesn't matter as long as you know where to
>     look for it during recovery), using a 256 byte (at most) write
>     buffer.
> 4.  sync()
> 
> 
> After the first sync(), the OS guarantees that the data you've written
> so far is committed to the platters, with the possible exception of a
> torn page during the write operation, which will only happen during a
> crash during step 2.  But if a crash happens here, then the second
> occurrence of the unique magic number will not appear in the WAL (or
> separate file, if that's the mechanism chosen), and you'll *know* that
> you can't trust that the WAL entry was completely committed to the
> platter.
> 
> If a crash happens during step 4, then either the appended magic
> number won't appear during recovery, in which case the recovery
> process can assume that the WAL entry is incomplete, or it will
> appear, in which case it's *guaranteed by the hardware* that the WAL
> entry is complete, because you'll know for sure that the previous
> sync() completed successfully.
> 
> 
> The amount of time between steps 2 and 4 should be small enough that
> there should be no significant performance penalty involved, relative
> to the time it takes for the first sync() to complete.
> 
> 
> Thoughts?
> 
> 
> 
> -- 
> Kevin Brown                          kevin@sysexperts.com
> 
> 
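
For reference, the four-step sequence quoted above can be sketched roughly
as follows.  This is only an illustration in Python, not PostgreSQL code:
the names write_entry, recover and MARKER_SIZE, the record layout, and the
use of a separate marker file are all assumptions made for the example.
Like the proposal itself, it assumes fsync() really does push data to the
platters, and it covers only the WAL entry, not heap/index page writes.

```python
import os
import tempfile

MARKER_SIZE = 16  # hypothetical size; well under the 256-byte limit suggested


def write_entry(wal_path, marker_path, payload):
    """Write one entry using the write / sync / confirm / sync sequence."""
    magic = os.urandom(MARKER_SIZE)  # the "unique magic number"
    # Step 1: write the WAL entry with the magic number encoded in it.
    with open(wal_path, "ab") as wal:
        wal.write(magic + len(payload).to_bytes(4, "big") + payload)
        wal.flush()
        os.fsync(wal.fileno())  # Step 2: first sync
    # Step 3: append the same magic number again with one small write.
    with open(marker_path, "ab") as marker:
        marker.write(magic)
        marker.flush()
        os.fsync(marker.fileno())  # Step 4: second sync
    return magic


def recover(wal_path, marker_path):
    """Return payloads of entries whose confirming magic number is present."""
    markers = set()
    if os.path.exists(marker_path):
        with open(marker_path, "rb") as f:
            data = f.read()
        # Only whole markers count; a trailing partial one is ignored.
        for i in range(0, len(data) - MARKER_SIZE + 1, MARKER_SIZE):
            markers.add(data[i:i + MARKER_SIZE])
    with open(wal_path, "rb") as f:
        data = f.read()
    trusted = []
    header = MARKER_SIZE + 4
    pos = 0
    while pos + header <= len(data):
        magic = data[pos:pos + MARKER_SIZE]
        length = int.from_bytes(data[pos + MARKER_SIZE:pos + header], "big")
        if magic not in markers:
            break  # no second copy of the magic number: stop trusting the WAL
        trusted.append(data[pos + header:pos + header + length])
        pos += header + length
    return trusted
```

On recovery, an entry whose magic number has no second copy is treated as
possibly incomplete, and nothing from that point on is replayed.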

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 

