Re: Checkpoint cost, looks like it is WAL/CRC - Mailing list pgsql-hackers

From	Kevin Brown
Subject	Re: Checkpoint cost, looks like it is WAL/CRC
Date	July 16, 2005 03:38:16
Msg-id	20050716063801.GA25389@filer
In response to	Re: Checkpoint cost, looks like it is WAL/CRC (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: Checkpoint cost, looks like it is WAL/CRC
List	pgsql-hackers

Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > I don't think we should care too much about indexes. We can rebuild
> > them...but losing heap sectors means *data loss*.
> 
> If you're so concerned about *data loss* then none of this will be
> acceptable to you at all.  We are talking about going from a system
> that can actually survive torn-page cases to one that can only tell
> you whether you've lost data to such a case.  Arguing about the
> probability with which we can detect the loss seems beside the
> point.

I realize I'm coming into this discussion a bit late, and perhaps my
thinking on this is overly simplistic.  That said, I think I have
an idea of how to solve the torn page problem.

If the hardware lies to you about the data being written to the disk,
then no amount of work on our part can guarantee data integrity.  So
the below assumes that the hardware doesn't ever lie about this.

If you want to prevent a torn page, you have to make the last
synchronized write to the disk as part of the checkpoint process a
write that *cannot* result in a torn page.  So it has to be a write of
a buffer that is no larger than the sector size of the disk.  I'd make
it 256 bytes, to be sure of accommodating pretty much any disk hardware
out there.

In any case, the modified sequence would go something like:

1.  write the WAL entry, and encode in it a unique magic number
2.  sync()
3.  append the unique magic number to the WAL again (or to a separate
    file if you like, it doesn't matter as long as you know where to
    look for it during recovery), using a 256 byte (at most) write
    buffer.
4.  sync()
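
The four steps above can be sketched roughly as follows.  This is only
an illustration under stated assumptions, not PostgreSQL's WAL code:
the function name append_wal_entry, the 16-byte random magic number,
and the entry layout (magic, 4-byte length, payload) are all
hypothetical, and os.fsync() on the file descriptor stands in for the
sync() calls in the list.

```python
import os
import struct
import uuid

MARKER_BUF_SIZE = 256  # upper bound on the final write, as proposed above


def write_all(fd, data):
    """Write every byte of data, looping in case of short writes."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def append_wal_entry(path, payload):
    """Append one entry using the write/sync/marker/sync sequence.

    Assumed layout (hypothetical): magic | length | payload | magic
    """
    magic = uuid.uuid4().bytes  # 16 random bytes as the unique magic number
    entry = magic + struct.pack("<I", len(payload)) + payload
    marker = magic
    assert len(marker) <= MARKER_BUF_SIZE

    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        write_all(fd, entry)   # step 1: entry with the magic number encoded
        os.fsync(fd)           # step 2: first sync
        write_all(fd, marker)  # step 3: small trailing marker
        os.fsync(fd)           # step 4: second sync
    finally:
        os.close(fd)
    return magic
```

The sketch makes no attempt to align the marker to a sector boundary;
it only keeps the final write well under the 256-byte limit.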

After the first sync(), the OS guarantees that the data you've written
so far is committed to the platters, with the possible exception of a
torn page during the write operation, which can happen only if a crash
occurs during step 2.  But if a crash happens here, then the second
occurrence of the unique magic number will not appear in the WAL (or
separate file, if that's the mechanism chosen), and you'll *know* that
you can't trust that the WAL entry was completely committed to the
platter.

If a crash happens during step 4, then either the appended magic
number won't appear during recovery, in which case the recovery
process can assume that the WAL entry is incomplete, or it will
appear, in which case it's *guaranteed by the hardware* that the WAL
entry is complete, because you'll know for sure that the previous
sync() completed successfully.
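
The recovery-side check described above might look something like the
sketch below.  It assumes a purely hypothetical entry layout (a 16-byte
magic number, a 4-byte little-endian payload length, the payload, and
then the appended copy of the magic number); the name
entry_is_complete is likewise made up for illustration.

```python
import struct

MAGIC_LEN = 16  # assumed size of the unique magic number


def entry_is_complete(data):
    """Return True only if the entry is followed by its matching marker.

    Assumed layout (hypothetical): magic | length | payload | magic
    """
    header_len = MAGIC_LEN + 4
    if len(data) < header_len:
        return False  # not even a full header survived
    magic = data[:MAGIC_LEN]
    (payload_len,) = struct.unpack("<I", data[MAGIC_LEN:header_len])
    marker_start = header_len + payload_len
    marker = data[marker_start:marker_start + MAGIC_LEN]
    # A missing or truncated marker is shorter than the magic number,
    # so the comparison fails and the entry is treated as incomplete.
    return marker == magic
```

For example, an entry followed by its marker is accepted, while the
same entry without the marker is rejected:

```python
magic = bytes(range(16))
entry = magic + struct.pack("<I", 3) + b"abc"
entry_is_complete(entry + magic)  # True
entry_is_complete(entry)          # False
```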

The amount of time between steps 2 and 4 should be small enough that
there should be no significant performance penalty involved, relative
to the time it takes for the first sync() to complete.

Thoughts?

-- 
Kevin Brown                          kevin@sysexperts.com
