Spreading full-page writes - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Spreading full-page writes
Date
Msg-id 53826614.7040809@vmware.com
List pgsql-hackers
Here's an idea I tried to explain to Andres and Simon at the pub last 
night, on how to reduce the spikes in the amount of WAL written at the
beginning of a checkpoint that full-page writes cause. I'm just writing
this down for the sake of the archives; I'm not planning to work on this 
myself.


When you are replaying a WAL record that lies between the Redo-pointer 
of a checkpoint and the checkpoint record itself, there are two 
possibilities:

a) You started WAL replay at that checkpoint's Redo-pointer.

b) You started WAL replay at some earlier checkpoint, and are already in 
a consistent state.

In case b), you wouldn't need to replay any full-page images; normal
differential WAL records would be enough. In case a), you do, and you
won't be consistent until you have replayed all the WAL up to the
checkpoint record.

We can exploit those properties to spread out the spike. When you modify 
a page and you're about to write a WAL record, check if the page has the 
BM_CHECKPOINT_NEEDED flag set. If it does, compare the LSN of the page 
against the *previous* checkpoint's redo-pointer, instead of that of the
checkpoint that's currently in progress. If no full-page image is
required based on
that comparison, IOW if the page was modified and a full-page image was 
already written after the earlier checkpoint, write a normal WAL record 
without full-page image and set a new flag in the buffer header 
(BM_NEEDS_FPW). Also set a new flag on the WAL record, XLR_FPW_SKIPPED.
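The write-side decision above can be sketched roughly as follows. This
is a toy model in Python, not PostgreSQL code: Buffer, WalRecord and
log_modification are hypothetical names, LSNs are plain integers, and
the "page LSN <= redo pointer means a full-page image is needed" test is
an assumption about how the comparison would be done.

```python
from dataclasses import dataclass


@dataclass
class Buffer:
    page_lsn: int                    # LSN of the last record that touched the page
    checkpoint_needed: bool = False  # stands in for BM_CHECKPOINT_NEEDED
    needs_fpw: bool = False          # stands in for the proposed BM_NEEDS_FPW


@dataclass
class WalRecord:
    lsn: int
    has_fpi: bool              # record carries a full-page image
    fpw_skipped: bool = False  # stands in for the proposed XLR_FPW_SKIPPED


def log_modification(buf, new_lsn, prev_redo, cur_redo):
    """Decide what kind of WAL record to write for a page modification."""
    # Pages still waiting to be written by the in-progress checkpoint are
    # compared against the previous checkpoint's redo pointer (assumption:
    # an image is needed when page_lsn <= redo pointer).
    redo = prev_redo if buf.checkpoint_needed else cur_redo
    needs_fpi = buf.page_lsn <= redo

    # The image is deferred only for flagged pages that already got an
    # image after the previous checkpoint.
    skipped = buf.checkpoint_needed and not needs_fpi
    if skipped:
        buf.needs_fpw = True

    buf.page_lsn = new_lsn
    return WalRecord(lsn=new_lsn, has_fpi=needs_fpi, fpw_skipped=skipped)
```

With prev_redo=100 and cur_redo=200, a flagged page whose LSN is 150
gets a plain record marked as skipped, while a flagged page whose LSN is
50 still gets a full-page image right away.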

When the checkpointer (or any other backend that needs to evict a
buffer) is about to flush a page from the buffer cache that has the
BM_NEEDS_FPW flag set, write a new WAL record, containing a
full-page-image of the page, before flushing the page.
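A minimal sketch of that flush path, again as a toy Python model with
hypothetical names (Buffer, flush_buffer, a list standing in for the WAL
and a dict standing in for the data files). Clearing the flag once the
image has been logged is my assumption; the description above doesn't
spell out when the flag is reset.

```python
from dataclasses import dataclass


@dataclass
class Buffer:
    page_id: int
    contents: str
    needs_fpw: bool = False  # stands in for the proposed BM_NEEDS_FPW


def flush_buffer(buf, wal, disk):
    """Write a buffer out, logging a deferred full-page image first."""
    if buf.needs_fpw:
        # The separate full-page-image record must reach the WAL before
        # the page itself is written to the data file.
        wal.append(("separate_fpw", buf.page_id, buf.contents))
        buf.needs_fpw = False  # assumption: the flag is cleared here
    disk[buf.page_id] = buf.contents
```

A buffer without the flag is written out exactly as today, with no
extra WAL record.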

Here's how this works out during replay:

a) You start WAL replay from the latest checkpoint's Redo-pointer.

When you see a WAL record that's been marked with XLR_FPW_SKIPPED, don't 
replay that record at all. It's OK because we know that there will be a 
separate record containing the full-page image of the page later in the 
stream.

b) You are continuing WAL replay that started from an earlier 
checkpoint, and have already reached consistency.

When you see a WAL record that's been marked with XLR_FPW_SKIPPED, 
replay it normally. It's OK, because the flag means that the page was 
modified after the earlier checkpoint already, and hence we must have 
seen a full-page image of it already. When you see one of the WAL 
records containing a separate full-page-image, ignore it.

This scheme makes the b-case behave just as if the new checkpoint was
never started. The regular WAL records in the stream are identical to 
what they would've been if the redo-pointer pointed to the earlier 
checkpoint. And the additional FPW records are simply ignored.

In the a-case, it's not safe to replay the records marked with
XLR_FPW_SKIPPED, because they don't contain FPWs, and you have all the
usual torn-page hazards that come with that. However, the separate FPW
records that come later in the stream will fix up those pages.
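To illustrate that both replay cases end up with the same page, here is
a small simulation. It is only a sketch under simplifying assumptions: a
page is a single integer, a normal record adds a delta to it, and the
record kinds ("delta", "delta_fpw_skipped", "separate_fpw") and the
replay function are hypothetical names, not anything that exists in
PostgreSQL.

```python
def replay(wal, pages, started_at_latest_redo):
    """Replay the records between the redo pointer and the checkpoint record.

    started_at_latest_redo=True is case a), False is case b).
    """
    pages = dict(pages)  # don't modify the caller's copy
    for kind, page_id, payload in wal:
        if kind == "delta":
            # Ordinary differential record: always applied.
            pages[page_id] += payload
        elif kind == "delta_fpw_skipped":
            # Case a): the page may be torn, so skip the record and wait
            # for the separate image. Case b): apply it normally.
            if not started_at_latest_redo:
                pages[page_id] += payload
        elif kind == "separate_fpw":
            # Case a): restore the page from the image. Case b): ignore.
            if started_at_latest_redo:
                pages[page_id] = payload
    return pages


wal = [
    ("delta_fpw_skipped", 1, 2),  # page 1: 10 -> 12, image deferred
    ("separate_fpw", 1, 12),      # image logged when the page was flushed
    ("delta", 1, 5),              # later change: 12 -> 17
]

# Case a): on-disk page is possibly torn, modeled as a garbage value.
case_a = replay(wal, {1: -999}, started_at_latest_redo=True)
# Case b): replay is already consistent; the page holds its old value 10.
case_b = replay(wal, {1: 10}, started_at_latest_redo=False)
```

Both case_a and case_b end with page 1 holding 17, even though case a)
started from a garbage page.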


Now, I'm sure there are issues with this scheme I haven't thought about, 
but I wanted to get this written down. Note this does not reduce the 
overall WAL volume - on the contrary - but it ought to reduce the spike.

- Heikki


