Re: Spreading full-page writes - Mailing list pgsql-hackers

From Greg Stark
Subject Re: Spreading full-page writes
Date
Msg-id CAM-w4HPnbzEP0QZrc7ELkAWUEyEmYfGrE0164dEqmt7KhP4a9A@mail.gmail.com
In response to Re: Spreading full-page writes  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: Spreading full-page writes
Re: Spreading full-page writes
List pgsql-hackers
On Tue, May 27, 2014 at 10:07 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
>
> On 05/26/2014 02:26 PM, Greg Stark wrote:
>>
>> Another idea would be to have separate checkpoints for each buffer
>> partition. You would have to start recovery from the oldest checkpoint of
>> any of the partitions.
>
> Yeah. Simon suggested that when we talked about this, but I didn't
> understand how that works at the time. I think I do now. The key to making
> it work is distinguishing, when starting recovery from the latest
> checkpoint, whether a record for a given page can be replayed safely. I
> used flags on WAL records in my proposal to achieve this, but using buffer
> partitions is simpler.

Interesting. I just thought of it independently.

Incidentally you wouldn't actually want to use the buffer partitions
per se since the new server might start up with a different number of
partitions. You would want an algorithm for partitioning the block
space that xlog replay can reliably reproduce regardless of the size
of the buffer lock partition table. It might make sense to set it up
so it coincidentally ensures all the buffers being flushed are in the
same partition or maybe the reverse would be better. Probably it
doesn't actually matter.
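
To illustrate what a replay-stable partitioning could look like, here is a minimal sketch. Everything in it is an assumption for illustration: the function name `checkpoint_partition`, the fixed constant `N_CHECKPOINT_PARTITIONS`, and the use of CRC32 as the hash are hypothetical and not taken from PostgreSQL. The only point is that the mapping depends on the page identity and a fixed partition count, never on the size of the buffer lock partition table.

```python
import zlib

# Hypothetical fixed partition count; it would have to be recorded somewhere
# durable so that a restarted server uses the same value during replay.
N_CHECKPOINT_PARTITIONS = 16


def checkpoint_partition(rel_id: int, block_no: int,
                         n_partitions: int = N_CHECKPOINT_PARTITIONS) -> int:
    """Map a page to a checkpoint partition using only its identity.

    CRC32 is a stand-in hash chosen because it is deterministic across
    processes and restarts; any stable hash would do.
    """
    key = rel_id.to_bytes(8, "little") + block_no.to_bytes(8, "little")
    return zlib.crc32(key) % n_partitions
```

With `n_partitions=2` and a plain `block_no % 2` in place of the hash, this degenerates into the even/odd split discussed below.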

> For simplicity, let's imagine that we have two Redo-pointers for each
> checkpoint record: one for even-numbered pages, and another for
> odd-numbered pages. When checkpoint begins, we first update the Even-redo
> pointer to the current WAL insert location, and then flush all the
> even-numbered buffers in the buffer cache. Then we do the same for Odd.
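
The checkpoint side of the even/odd scheme quoted above can be sketched roughly as follows. This is a toy model, not PostgreSQL code: `staggered_checkpoint`, the `buffers` dict of dirty flags, and the `current_lsn` callable are all hypothetical stand-ins, and "flushing" is modelled by clearing the dirty flag.

```python
def staggered_checkpoint(buffers, current_lsn, n_partitions=2):
    """Checkpoint one partition at a time and return the redo pointers.

    buffers:      dict mapping page number -> dirty flag (True/False)
    current_lsn:  callable returning the current WAL insert location
    """
    redo_ptrs = []
    for part in range(n_partitions):
        # Record this partition's redo pointer before flushing its pages.
        redo_ptrs.append(current_lsn())
        for page_no in sorted(buffers):
            if page_no % n_partitions == part and buffers[page_no]:
                buffers[page_no] = False  # stand-in for writing the page out
    return redo_ptrs
```

With `n_partitions=2`, partition 0 holds the even-numbered pages and partition 1 the odd-numbered ones, so the returned list is the Even-redo pointer followed by the Odd-redo pointer.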

Hm, I had convinced myself that the LSN on the pages would mean you
skip the replay anyway, but I think I was wrong. You would need to
keep a bitmap of which partitions are in recovery mode as you replay,
keep adding partitions until they're all in recovery mode, and then
keep going until you've seen the checkpoint record for all of them.
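
A minimal sketch of that replay loop, under simplifying assumptions: WAL is modelled as a list of `(lsn, partition, change)` tuples in LSN order, each partition has a single redo pointer, and the names `replay`, `redo_ptrs`, and `in_recovery` are hypothetical. It only shows the skip-until-activated logic, not the check for having seen every partition's checkpoint record.

```python
def replay(wal, redo_ptrs):
    """Return the WAL records that would be applied during recovery.

    wal:        list of (lsn, partition, change) tuples, sorted by lsn
    redo_ptrs:  list where redo_ptrs[p] is the redo LSN of partition p
    """
    start = min(redo_ptrs)                   # recovery begins at the oldest pointer
    in_recovery = [False] * len(redo_ptrs)   # the "bitmap" of active partitions
    applied = []
    for lsn, part, change in wal:
        if lsn < start:
            continue
        # Activate every partition whose redo pointer has been reached.
        for p, redo in enumerate(redo_ptrs):
            if not in_recovery[p] and lsn >= redo:
                in_recovery[p] = True
        # Changes to pages in a not-yet-active partition are skipped.
        if in_recovery[part]:
            applied.append((lsn, part, change))
    return applied
```

For example, with `redo_ptrs = [100, 200]`, a record at LSN 150 for partition 1 is skipped, while records for partition 0 from LSN 100 onward and records for partition 1 from LSN 200 onward are applied.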

I'm assuming you would keep N checkpoint positions in the control
file. That also means we can double the checkpoint timeout with only a
marginal increase in the worst case recovery time, since the worst
case will be (1 + 1/n)*timeout's worth of WAL to replay rather than
2*timeout. The amount of time for recovery would be much more predictable.
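
As a quick check of that estimate, the arithmetic can be written out as below. The helper `worst_case_replay` is hypothetical and simply evaluates the (1 + 1/n)*timeout expression from the paragraph above; it measures WAL to replay in units of checkpoint-timeout time and says nothing about actual replay speed.

```python
def worst_case_replay(timeout_min: float, n: int) -> float:
    """Worst-case WAL to replay, in minutes' worth, for n partitions."""
    return (1 + 1 / n) * timeout_min


# One partition, 5-minute timeout: two timeouts' worth of WAL.
single = worst_case_replay(5, 1)
# Sixteen partitions with the timeout doubled to 10 minutes.
doubled = worst_case_replay(10, 16)
```

Under these assumptions, doubling the timeout with 16 partitions raises the worst case only from 10 to 10.625 minutes' worth of WAL.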

> Recovery begins at the Even-redo pointer. Replay works as normal, but
> until you reach the Odd-pointer, you refrain from replaying any changes to
> Odd-numbered pages. After reaching the odd-pointer, you replay everything
> as normal.
>
> Hmm, that seems actually doable...



--
greg
