Re: Checkpoint cost, looks like it is WAL/CRC - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Checkpoint cost, looks like it is WAL/CRC
Date
Msg-id 18494.1120415712@sss.pgh.pa.us
Whole thread Raw
In response to Re: Checkpoint cost, looks like it is WAL/CRC  (Greg Stark <gsstark@mit.edu>)
List pgsql-hackers
Greg Stark <gsstark@mit.edu> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> Partial writes.  Without the full-page image, we do not have enough
>> information in WAL to reconstruct the correct page contents.

> Sure, but why not?

> If a 8k page contains 16 low level segments on disk and the old data is
> AAAAAAAAAAAAAAAA and the new data is AAABAAACAAADAAAE then the WAL would
> contain the B, C, D, and E. Shouldn't that be enough to reconstruct the page?

It might contain parts of it ... scattered across a large number of WAL
entries ... but I don't think that's enough to reconstruct the page.
As an example, a btree insert WAL record will say "insert this tuple
at position N, shifting the other entries accordingly"; that does not
give you the ability to reconstruct entries that shifted across sector
boundaries, as they may not be present in the on-disk data of either
sector.  You're also going to have fairly serious problems interpreting
the page contents if what's on disk includes the effects of multiple
WAL records beyond the record you are currently looking at.

We could possibly do it if we added more information to the WAL records,
but that strikes me as a net loss: essentially it would pay the penalty
all the time instead of only on the first write after a checkpoint.

Also, you are assuming that the content of each sector is uniquely ---
and determinably --- either "old data" or "new data", not for example
"unreadable because partially written".
        regards, tom lane


pgsql-hackers by date:

Previous
From: Greg Stark
Date:
Subject: Re: Checkpoint cost, looks like it is WAL/CRC
Next
From: Peter Eisentraut
Date:
Subject: Re: Fix for cross compilation