Re: Spreading full-page writes - Mailing list pgsql-hackers
From | Greg Stark |
---|---|
Subject | Re: Spreading full-page writes |
Date | |
Msg-id | CAM-w4HPnbzEP0QZrc7ELkAWUEyEmYfGrE0164dEqmt7KhP4a9A@mail.gmail.com |
In response to | Re: Spreading full-page writes (Heikki Linnakangas <hlinnakangas@vmware.com>) |
Responses | Re: Spreading full-page writes; Re: Spreading full-page writes |
List | pgsql-hackers |
On Tue, May 27, 2014 at 10:07 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>
> On 05/26/2014 02:26 PM, Greg Stark wrote:
>>
>> Another idea would be to have separate checkpoints for each buffer
>> partition. You would have to start recovery from the oldest checkpoint of
>> any of the partitions.
>
> Yeah. Simon suggested that when we talked about this, but I didn't understand how that works at the time. I think I do now. The key to making it work is distinguishing, when starting recovery from the latest checkpoint, whether a record for a given page can be replayed safely. I used flags on WAL records in my proposal to achieve this, but using buffer partitions is simpler.

Interesting. I just thought of it independently.

Incidentally you wouldn't actually want to use the buffer partitions per se since the new server might start up with a different number of partitions. You would want an algorithm for partitioning the block space that xlog replay can reliably reproduce regardless of the size of the buffer lock partition table. It might make sense to set it up so it coincidentally ensures all the buffers being flushed are in the same partition or maybe the reverse would be better. Probably it doesn't actually matter.

> For simplicity, let's imagine that we have two Redo-pointers for each checkpoint record: one for even-numbered pages, and another for odd-numbered pages. When checkpoint begins, we first update the Even-redo pointer to the current WAL insert location, and then flush all the even-numbered buffers in the buffer cache. Then we do the same for Odd.

Hm, I had convinced myself that the LSN on the pages would mean you skip the replay anyways but I think I was wrong and you would need to keep a bitmap of which partitions were in recovery mode as you replay and keep adding partitions until they're all in recovery mode and then keep going until you've seen the checkpoint record for all of them.

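To make the replay rule concrete, here is a rough Python sketch of the bitmap idea, generalizing the even/odd example to N partitions. It is only an illustration, not PostgreSQL code: `partition_of`, `WalRecord`, and `replay` are hypothetical names, LSNs are plain integers, and the modulo mapping is just one assumed way to partition the block space without depending on the size of the buffer lock partition table.

```python
from dataclasses import dataclass


def partition_of(block_number: int, n_partitions: int) -> int:
    """Hypothetical partitioning rule: depends only on the block number and
    the partition count, so replay can reproduce it on any server."""
    return block_number % n_partitions


@dataclass(frozen=True)
class WalRecord:
    lsn: int           # position of the record in the WAL (simplified to an int)
    block_number: int  # the single page this record modifies (simplification)


def replay(records, redo_pointers):
    """Return the LSNs of the records that would be applied.

    redo_pointers[p] is the redo pointer saved for partition p. Replay starts
    at the oldest pointer and skips changes to a partition until that
    partition's own pointer has been reached.
    """
    n = len(redo_pointers)
    in_recovery = [False] * n          # the "bitmap" of partitions being replayed
    start = min(redo_pointers)         # oldest redo pointer of any partition
    applied = []
    for rec in sorted(records, key=lambda r: r.lsn):
        if rec.lsn < start:
            continue                   # before the recovery start point
        for p in range(n):
            if rec.lsn >= redo_pointers[p]:
                in_recovery[p] = True  # this partition joins recovery
        if in_recovery[partition_of(rec.block_number, n)]:
            applied.append(rec.lsn)
    return applied


# Even/odd case: Even-redo pointer at 100, Odd-redo pointer at 200.
wal = [
    WalRecord(90, 2), WalRecord(110, 2), WalRecord(120, 3),
    WalRecord(150, 5), WalRecord(200, 3), WalRecord(210, 4),
]
print(replay(wal, [100, 200]))  # -> [110, 200, 210]
```

The sketch covers only the decision of which records to apply. It leaves out the final step of continuing until the checkpoint record has been seen for every partition.
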
I'm assuming you would keep N checkpoint positions in the control file.

That also means we can double the checkpoint timeout with only a marginal increase in the worst case recovery time, since the worst case will be (1 + 1/n)*timeout's worth of WAL to replay rather than 2*n. The amount of time for recovery would be much more predictable.

> Recovery begins at the Even-redo pointer. Replay works as normal, but until you reach the Odd-pointer, you refrain from replaying any changes to Odd-numbered pages. After reaching the odd-pointer, you replay everything as normal.
>
> Hmm, that seems actually doable...

--
greg
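
As a quick check of the worst-case arithmetic in the message, the following Python sketch evaluates the stated (1 + 1/n)*timeout bound. It assumes the comparison figure written as "2*n" is meant as 2*timeout for a single unpartitioned checkpoint. That reading, the function names, and the sample numbers are assumptions for illustration only.

```python
def partitioned_worst_case(timeout_s: float, n_partitions: int) -> float:
    """Worst-case span of WAL to replay, in seconds of WAL generation,
    using the (1 + 1/n) * timeout bound stated in the message."""
    return (1 + 1 / n_partitions) * timeout_s


def single_checkpoint_worst_case(timeout_s: float) -> float:
    """Assumed baseline: 2 * timeout for one unpartitioned checkpoint."""
    return 2 * timeout_s


# Hypothetical numbers: a 300 s timeout today versus a doubled 600 s timeout
# spread over 16 partitions.
baseline = single_checkpoint_worst_case(300)   # 600.0
doubled = partitioned_worst_case(600, 16)      # 637.5
print(baseline, doubled)
```

Under these assumptions, doubling the timeout raises the worst case from 600 s to 637.5 s of WAL, which is the "marginal increase" the message describes.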