Re: Checkpoint cost, looks like it is WAL/CRC - Mailing list pgsql-hackers
From | Kevin Brown |
---|---|
Subject | Re: Checkpoint cost, looks like it is WAL/CRC |
Date | |
Msg-id | 20050716063801.GA25389@filer |
In response to | Re: Checkpoint cost, looks like it is WAL/CRC (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Checkpoint cost, looks like it is WAL/CRC
|
List | pgsql-hackers |
Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > I don't think we should care too much about indexes. We can rebuild
> > them...but losing heap sectors means *data loss*.
>
> If you're so concerned about *data loss* then none of this will be
> acceptable to you at all. We are talking about going from a system
> that can actually survive torn-page cases to one that can only tell
> you whether you've lost data to such a case. Arguing about the
> probability with which we can detect the loss seems beside the
> point.

I realize I'm coming into this discussion a bit late, and perhaps my
thinking on this is naive. That said, I think I have an idea of how to
solve the torn page problem.

If the hardware lies to you about the data being written to the disk,
then no amount of work on our part can guarantee data integrity. So
the below assumes that the hardware doesn't ever lie about this.

If you want to prevent a torn page, you have to make the last
synchronized write to the disk as part of the checkpoint process a
write that *cannot* result in a torn page. So it has to be a write of
a buffer that is no larger than the sector size of the disk. I'd make
it 256 bytes, to be sure of accommodating pretty much any disk
hardware out there.

In any case, the modified sequence would go something like:

1. write the WAL entry, and encode in it a unique magic number
2. sync()
3. append the unique magic number to the WAL again (or to a separate
   file if you like, it doesn't matter as long as you know where to
   look for it during recovery), using a 256 byte (at most) write
   buffer.
4. sync()

After the first sync(), the OS guarantees that the data you've written
so far is committed to the platters, with the possible exception of a
torn page during the write operation, which will only happen if there
is a crash during step 2. But if a crash happens here, then the second
occurrence of the unique magic number will not appear in the WAL (or
separate file, if that's the mechanism chosen), and you'll *know* that
you can't trust that the WAL entry was completely committed to the
platter.

If a crash happens during step 4, then either the appended magic
number won't appear during recovery, in which case the recovery
process can assume that the WAL entry is incomplete, or it will
appear, in which case it's *guaranteed by the hardware* that the WAL
entry is complete, because you'll know for sure that the previous
sync() completed successfully.

The amount of time between steps 2 and 4 should be small enough that
there should be no significant performance penalty involved, relative
to the time it takes for the first sync() to complete.

Thoughts?

--
Kevin Brown kevin@sysexperts.com
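
For illustration only, the four-step sequence above could be sketched
roughly as follows. This is a toy under stated assumptions, not
PostgreSQL's WAL format or code: the record layout, the function names
write_entry/recover/demo, and the crash_before_trailer flag are all
hypothetical, and os.fsync() on the file descriptor stands in for the
sync() calls in the list. It shows only the ordering of writes and the
recovery check, and it relies on the same assumption as the post, that
the small final write cannot be torn.

```python
import os
import struct
import tempfile

# Hypothetical on-disk layout, invented for this sketch:
#   header  = magic number (8 bytes) + payload length (4 bytes)
#   payload = the entry itself
#   trailer = the same magic number again (8 bytes)
HEADER = struct.Struct("<QI")
TRAILER = struct.Struct("<Q")
MAX_TRAILER_BYTES = 256  # upper bound on the final write, as suggested in the post


def write_entry(path, payload, magic, crash_before_trailer=False):
    """Append one entry using the write / sync / confirm / sync sequence."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        # Step 1: write the entry with the magic number encoded in it.
        os.write(fd, HEADER.pack(magic, len(payload)) + payload)
        # Step 2: first sync (fsync is an assumption standing in for sync()).
        os.fsync(fd)
        if crash_before_trailer:
            return  # simulate a crash between steps 2 and 3
        # Step 3: append the magic number again using a small write.
        trailer = TRAILER.pack(magic)
        assert len(trailer) <= MAX_TRAILER_BYTES
        os.write(fd, trailer)
        # Step 4: second sync.
        os.fsync(fd)
    finally:
        os.close(fd)


def recover(path):
    """Return the payloads of entries whose trailing magic number matches."""
    with open(path, "rb") as f:
        data = f.read()
    complete, pos = [], 0
    while pos + HEADER.size <= len(data):
        magic, length = HEADER.unpack_from(data, pos)
        end = pos + HEADER.size + length
        if end + TRAILER.size > len(data):
            break  # second magic number missing: entry cannot be trusted
        (confirm,) = TRAILER.unpack_from(data, end)
        if confirm != magic:
            break  # confirmation does not match: stop replaying here
        complete.append(data[pos + HEADER.size:end])
        pos = end + TRAILER.size
    return complete


def demo():
    """Write one complete entry and one interrupted entry, then recover."""
    path = os.path.join(tempfile.mkdtemp(), "wal.log")
    write_entry(path, b"first", magic=0x1111)
    write_entry(path, b"second", magic=0x2222, crash_before_trailer=True)
    return recover(path)
```

In this sketch recovery stops at the first entry that lacks its
confirming magic number, so demo() yields only the first payload. A
separate confirmation file, as the post also allows, would work the
same way as long as recovery knows where to look.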