Re: Checkpoint cost, looks like it is WAL/CRC - Mailing list pgsql-hackers
From | Bruce Momjian |
---|---|
Subject | Re: Checkpoint cost, looks like it is WAL/CRC |
Date | |
Msg-id | 200507161148.j6GBmgA15331@candle.pha.pa.us |
In response to | Re: Checkpoint cost, looks like it is WAL/CRC (Kevin Brown <kevin@sysexperts.com>) |
Responses | Re: Checkpoint cost, looks like it is WAL/CRC |
List | pgsql-hackers |
I don't think our problem is partial writes of WAL, which we already
check, but heap/index page writes, which we currently do not check for
partial writes.

---------------------------------------------------------------------------

Kevin Brown wrote:
> Tom Lane wrote:
> > Simon Riggs <simon@2ndquadrant.com> writes:
> > > I don't think we should care too much about indexes. We can rebuild
> > > them...but losing heap sectors means *data loss*.
> >
> > If you're so concerned about *data loss* then none of this will be
> > acceptable to you at all. We are talking about going from a system
> > that can actually survive torn-page cases to one that can only tell
> > you whether you've lost data to such a case. Arguing about the
> > probability with which we can detect the loss seems beside the
> > point.
>
> I realize I'm coming into this discussion a bit late, and perhaps my
> thinking on this is simplistically naive. That said, I think I have
> an idea of how to solve the torn page problem.
>
> If the hardware lies to you about the data being written to the disk,
> then no amount of work on our part can guarantee data integrity. So
> the below assumes that the hardware doesn't ever lie about this.
>
> If you want to prevent a torn page, you have to make the last
> synchronized write to the disk as part of the checkpoint process a
> write that *cannot* result in a torn page. So it has to be a write of
> a buffer that is no larger than the sector size of the disk. I'd make
> it 256 bytes, to be sure of accommodating pretty much any disk hardware
> out there.
>
> In any case, the modified sequence would go something like:
>
> 1. write the WAL entry, and encode in it a unique magic number
> 2. sync()
> 3. append the unique magic number to the WAL again (or to a separate
>    file if you like, it doesn't matter as long as you know where to
>    look for it during recovery), using a 256 byte (at most) write
>    buffer.
> 4. sync()
>
>
> After the first sync(), the OS guarantees that the data you've written
> so far is committed to the platters, with the possible exception of a
> torn page during the write operation, which will only happen during a
> crash during step 2. But if a crash happens here, then the second
> occurrence of the unique magic number will not appear in the WAL (or
> separate file, if that's the mechanism chosen), and you'll *know* that
> you can't trust that the WAL entry was completely committed to the
> platter.
>
> If a crash happens during step 4, then either the appended magic
> number won't appear during recovery, in which case the recovery
> process can assume that the WAL entry is incomplete, or it will
> appear, in which case it's *guaranteed by the hardware* that the WAL
> entry is complete, because you'll know for sure that the previous
> sync() completed successfully.
>
>
> The amount of time between steps 2 and 4 should be small enough that
> there should be no significant performance penalty involved, relative
> to the time it takes for the first sync() to complete.
>
>
> Thoughts?
>
>
>
> --
> Kevin Brown                                       kevin@sysexperts.com
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
>                http://archives.postgresql.org
>

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
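
For readers following the thread, the four-step sequence quoted above can be sketched roughly as below. This is only an illustration under stated assumptions, not PostgreSQL code: the record framing (a 4-byte length header), the 16-byte magic number, the separate marker file, and the names write_entry, recover, and demo are all hypothetical, and os.fsync() on the file stands in for the sync() calls in the proposal. Like the proposal, it assumes the hardware does not lie about completed writes, and it covers only the WAL side; it does nothing for the heap/index page writes raised in the reply.

```python
import os
import struct
import tempfile
import uuid

MARKER_SIZE = 16               # assumed magic-number size, well under the 256-byte limit
HEADER = struct.Struct("<I")   # hypothetical framing: 4-byte little-endian payload length


def write_entry(wal_path, marker_path, payload):
    """Steps 1-4: write entry with magic number, sync, append magic number, sync."""
    magic = uuid.uuid4().bytes                       # unique magic number (16 bytes)
    with open(wal_path, "ab") as wal:
        wal.write(HEADER.pack(len(payload)) + magic + payload)   # step 1
        wal.flush()
        os.fsync(wal.fileno())                       # step 2
    # buffering=0 so the marker goes out as one small write call
    with open(marker_path, "ab", buffering=0) as markers:
        markers.write(magic)                         # step 3
        os.fsync(markers.fileno())                   # step 4
    return magic


def recover(wal_path, marker_path):
    """Return payloads of WAL entries whose magic number also appears in the marker file."""
    confirmed = set()
    if os.path.exists(marker_path):
        with open(marker_path, "rb") as markers:
            data = markers.read()
        # Ignore a trailing partial marker, if any.
        for i in range(0, len(data) - MARKER_SIZE + 1, MARKER_SIZE):
            confirmed.add(data[i:i + MARKER_SIZE])

    with open(wal_path, "rb") as wal:
        data = wal.read()
    payloads = []
    pos = 0
    while pos + HEADER.size + MARKER_SIZE <= len(data):
        (length,) = HEADER.unpack_from(data, pos)
        magic = data[pos + HEADER.size:pos + HEADER.size + MARKER_SIZE]
        if magic not in confirmed:
            break              # unconfirmed entry: treat it and the rest as incomplete
        start = pos + HEADER.size + MARKER_SIZE
        payloads.append(data[start:start + length])
        pos = start + length
    return payloads


def demo():
    """Write two entries, then simulate a crash during step 2 of a third."""
    with tempfile.TemporaryDirectory() as d:
        wal = os.path.join(d, "wal")
        markers = os.path.join(d, "markers")
        write_entry(wal, markers, b"first")
        write_entry(wal, markers, b"second")
        # Partial entry reaches the WAL, but no marker is ever appended for it.
        with open(wal, "ab") as f:
            f.write(HEADER.pack(100) + uuid.uuid4().bytes + b"torn")
        return recover(wal, markers)
```

Calling demo() returns only the two confirmed payloads; the third entry has no matching marker, so recovery stops there, which matches the "crash during step 2" case described in the quoted message.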