Thread: Spreading full-page writes

Spreading full-page writes

From
Heikki Linnakangas
Date:
Here's an idea I tried to explain to Andres and Simon at the pub last 
night, on how to reduce the spikes in the amount of WAL written at 
beginning of a checkpoint that full-page writes cause. I'm just writing 
this down for the sake of the archives; I'm not planning to work on this 
myself.


When you are replaying a WAL record that lies between the Redo-pointer 
of a checkpoint and the checkpoint record itself, there are two 
possibilities:

a) You started WAL replay at that checkpoint's Redo-pointer.

b) You started WAL replay at some earlier checkpoint, and are already in 
a consistent state.

In case b), you wouldn't need to replay any full-page images, normal 
differential WAL records would be enough. In case a), you do, and you 
won't be consistent until replaying all the WAL up to the checkpoint record.

We can exploit those properties to spread out the spike. When you modify 
a page and you're about to write a WAL record, check if the page has the 
BM_CHECKPOINT_NEEDED flag set. If it does, compare the LSN of the page 
against the *previous* checkpoint's redo-pointer, instead of the one 
that's currently in progress. If no full-page image is required based on 
that comparison, IOW if the page was modified and a full-page image was 
already written after the earlier checkpoint, write a normal WAL record 
without full-page image and set a new flag in the buffer header 
(BM_NEEDS_FPW). Also set a new flag on the WAL record, XLR_FPW_SKIPPED.

When the checkpointer (or any other backend that needs to evict a buffer) is 
about to flush a page from the buffer cache that has the BM_NEEDS_FPW 
flag set, write a new WAL record, containing a full-page-image of the 
page, before flushing the page.
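
In pseudo-C, the two decision points would look roughly like this (just a 
sketch; PrevRedoRecPtr and the helper function names are made up, nothing 
like this exists in the tree):

/* At WAL insert time: can we defer the full-page image for this page? */
bool
XLogCanSkipFPW(BufferDesc *buf, Page page)
{
    /*
     * BM_CHECKPOINT_NEEDED means the in-progress checkpoint will still write
     * this buffer out.  If the page was already modified - and hence already
     * got a full-page image - after the *previous* checkpoint's redo pointer,
     * the FPW for the current checkpoint can be deferred.
     */
    if ((buf->flags & BM_CHECKPOINT_NEEDED) &&
        PageGetLSN(page) > PrevRedoRecPtr)   /* previous redo ptr, not RedoRecPtr */
    {
        buf->flags |= BM_NEEDS_FPW;   /* remember to write the image at flush time */
        return true;                  /* caller marks the record XLR_FPW_SKIPPED */
    }
    return false;                     /* normal FPW rules apply */
}

/* At flush time (checkpointer, bgwriter, or a backend evicting the buffer) */
if (buf->flags & BM_NEEDS_FPW)
{
    XLogRecPtr lsn = XLogLogFullPageImage(buf);   /* new standalone FPW record */
    XLogFlush(lsn);                               /* WAL before data */
    buf->flags &= ~BM_NEEDS_FPW;
}
/* ... then write the page out as usual ... */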

Here's how this works out during replay:

a) You start WAL replay from the latest checkpoint's Redo-pointer.

When you see a WAL record that's been marked with XLR_FPW_SKIPPED, don't 
replay that record at all. It's OK because we know that there will be a 
separate record containing the full-page image of the page later in the 
stream.

b) You are continuing WAL replay that started from an earlier 
checkpoint, and have already reached consistency.

When you see a WAL record that's been marked with XLR_FPW_SKIPPED, 
replay it normally. It's OK, because the flag means that the page was 
modified after the earlier checkpoint already, and hence we must have 
seen a full-page image of it already. When you see one of the WAL 
records containing a separate full-page-image, ignore it.

This scheme makes the b-case behave just as if the new checkpoint was 
never started. The regular WAL records in the stream are identical to 
what they would've been if the redo-pointer pointed to the earlier 
checkpoint. And the additional FPW records are simply ignored.

In the a-case, it's not safe to replay the records marked with 
XLR_FPW_SKIPPED, because they don't contain FPWs, and you have all the 
usual torn-page hazards that come with that. However, the separate FPW 
records that come later in the stream will fix-up those pages.
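
In redo terms, the rule is roughly this (sketch only; all the names are made 
up, and "alreadyConsistent" stands for whatever mechanism tells case a and 
case b apart):

if (record->xl_info & XLR_FPW_SKIPPED)
{
    /* Case a: not yet consistent, skip; a later FPW record fixes the page. */
    if (!alreadyConsistent)
        return;
    /*
     * Case b: replay normally; an FPW was already seen after the earlier
     * checkpoint, so the page can't be torn.
     */
}
else if (IsDeferredFPWRecord(record))   /* the new standalone FPW record */
{
    /* Case b: the page is already up to date, ignore the record. */
    if (alreadyConsistent)
        return;
    /* Case a: restore the image, overwriting any torn page. */
    RestoreFullPageImage(record);
    return;
}
/* everything else replays as it does today */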


Now, I'm sure there are issues with this scheme I haven't thought about, 
but I wanted to get this written down. Note this does not reduce the 
overall WAL volume - on the contrary - but it ought to reduce the spike.

- Heikki



Re: Spreading full-page writes

From
Fujii Masao
Date:
On Mon, May 26, 2014 at 6:52 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Here's an idea I tried to explain to Andres and Simon at the pub last night,
> on how to reduce the spikes in the amount of WAL written at beginning of a
> checkpoint that full-page writes cause. I'm just writing this down for the
> sake of the archives; I'm not planning to work on this myself.
>
>
> When you are replaying a WAL record that lies between the Redo-pointer of a
> checkpoint and the checkpoint record itself, there are two possibilities:
>
> a) You started WAL replay at that checkpoint's Redo-pointer.
>
> b) You started WAL replay at some earlier checkpoint, and are already in a
> consistent state.
>
> In case b), you wouldn't need to replay any full-page images, normal
> differential WAL records would be enough. In case a), you do, and you won't
> be consistent until replaying all the WAL up to the checkpoint record.
>
> We can exploit those properties to spread out the spike. When you modify a
> page and you're about to write a WAL record, check if the page has the
> BM_CHECKPOINT_NEEDED flag set. If it does, compare the LSN of the page
> against the *previous* checkpoints redo-pointer, instead of the one's that's
> currently in-progress. If no full-page image is required based on that
> comparison, IOW if the page was modified and a full-page image was already
> written after the earlier checkpoint, write a normal WAL record without
> full-page image and set a new flag in the buffer header (BM_NEEDS_FPW). Also
> set a new flag on the WAL record, XLR_FPW_SKIPPED.
>
> When checkpointer (or any other backend that needs to evict a buffer) is
> about to flush a page from the buffer cache that has the BM_NEEDS_FPW flag
> set, write a new WAL record, containing a full-page-image of the page,
> before flushing the page.

How does this mechanism work during a base backup? Would pg_stop_backup
need to flush all buffers that have the BM_NEEDS_FPW flag set?

>
> Here's how this works out during replay:
>
> a) You start WAL replay from the latest checkpoint's Redo-pointer.
>
> When you see a WAL record that's been marked with XLR_FPW_SKIPPED, don't
> replay that record at all. It's OK because we know that there will be a
> separate record containing the full-page image of the page later in the
> stream.
>
> b) You are continuing WAL replay that started from an earlier checkpoint,
> and have already reached consistency.
>
> When you see a WAL record that's been marked with XLR_FPW_SKIPPED, replay it
> normally. It's OK, because the flag means that the page was modified after
> the earlier checkpoint already, and hence we must have seen a full-page
> image of it already. When you see one of the WAL records containing a
> separate full-page-image, ignore it.
>
> This scheme make the b-case behave just as if the new checkpoint was never
> started. The regular WAL records in the stream are identical to what they
> would've been if the redo-pointer pointed to the earlier checkpoint. And the
> additional FPW records are simply ignored.
>
> In the a-case, it's not be safe to replay the records marked with
> XLR_FPW_SKIPPED, because they don't contain FPWs, and you have all the usual
> torn-page hazards that comes with that. However, the separate FPW records
> that come later in the stream will fix-up those pages.
>
>
> Now, I'm sure there are issues with this scheme I haven't thought about, but
> I wanted to get this written down. Note this does not reduce the overall WAL
> volume - on the contrary - but it ought to reduce the spike.

ISTM that this can increase WAL volume because one data change can
generate both normal WAL and FPW. No?

Regards,

-- 
Fujii Masao



Re: Spreading full-page writes

From
Robert Haas
Date:
On May 25, 2014, at 5:52 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> Here's how this works out during replay:
>
> a) You start WAL replay from the latest checkpoint's Redo-pointer.
>
> When you see a WAL record that's been marked with XLR_FPW_SKIPPED, don't replay that record at all. It's OK because
we know that there will be a separate record containing the full-page image of the page later in the stream. 

I don't think we know that. The server might have crashed before that second record got generated.  (This appears to be
an unfixable flaw in this proposal.) 

...Robert


Re: Spreading full-page writes

From
Heikki Linnakangas
Date:
On 26 May 2014 20:16:33 EEST, Robert Haas <robertmhaas@gmail.com> wrote:
>On May 25, 2014, at 5:52 PM, Heikki Linnakangas
><hlinnakangas@vmware.com> wrote:
>> Here's how this works out during replay:
>> 
>> a) You start WAL replay from the latest checkpoint's Redo-pointer.
>> 
>> When you see a WAL record that's been marked with XLR_FPW_SKIPPED,
>don't replay that record at all. It's OK because we know that there
>will be a separate record containing the full-page image of the page
>later in the stream.
>
>I don't think we know that. The server might have crashed before that
>second record got generated.  (This appears to be an unfixable flaw in
>this proposal.)

The second record is generated before the checkpoint is finished and the checkpoint record is written.  So it will be
there.

(if you crash before the checkpoint is finished, the in-progress checkpoint is no good for recovery anyway, and won't
be used)
 

- Heikki



Re: Spreading full-page writes

From
Greg Stark
Date:

On Mon, May 26, 2014 at 1:22 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
The second record is generated before the checkpoint is finished and the checkpoint record is written.  So it will be there.

(if you crash before the checkpoint is finished, the in-progress checkpoint is no good for recovery anyway, and won't be used)

Another idea would be to have separate checkpoints for each buffer partition. You would have to start recovery from the oldest checkpoint of any of the partitions.

--
greg

Re: Spreading full-page writes

From
Robert Haas
Date:
On Mon, May 26, 2014 at 1:22 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
>>I don't think we know that. The server might have crashed before that
>>second record got generated.  (This appears to be an unfixable flaw in
>>this proposal.)
>
> The second record is generated before the checkpoint is finished and the checkpoint record is written.  So it will be
there.
>
> (if you crash before the checkpoint is finished, the in-progress checkpoint is no good for recovery anyway, and won't
be used)
 

Hmm, I see.

It's not great to have to generate WAL at buffer-eviction time,
though.  Normally, when we go to evict a buffer, the WAL is already
written.  We might have to wait for it to be flushed, but if the WAL
writer is doing its job, hopefully not.  But here we'll definitely
have to wait for the WAL flush.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Spreading full-page writes

From
Simon Riggs
Date:
On 25 May 2014 17:52, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

> Here's an idea I tried to explain to Andres and Simon at the pub last night,
> on how to reduce the spikes in the amount of WAL written at beginning of a
> checkpoint that full-page writes cause. I'm just writing this down for the
> sake of the archives; I'm not planning to work on this myself.
...

Thanks for that idea, and dinner. It looks useful.

I'll call this idea "Background FPWs"

> Now, I'm sure there are issues with this scheme I haven't thought about, but
> I wanted to get this written down. Note this does not reduce the overall WAL
> volume - on the contrary - but it ought to reduce the spike.

The requirements we were discussing were around

A) reducing WAL volume
B) reducing foreground overhead of writing FPWs - which spikes badly
after checkpoint and the overhead is paid by the user processes
themselves
C) need for FPWs during base backup

So that gives us a few approaches

* Compressing FPWs gives A
* Background FPWs gives us B
   which looks like we can combine both ideas

* Double-buffering would give us A and B, but not C
   and would be incompatible with the other two ideas

Will think some more.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Spreading full-page writes

From
Fujii Masao
Date:
On Tue, May 27, 2014 at 3:57 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 25 May 2014 17:52, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>
>> Here's an idea I tried to explain to Andres and Simon at the pub last night,
>> on how to reduce the spikes in the amount of WAL written at beginning of a
>> checkpoint that full-page writes cause. I'm just writing this down for the
>> sake of the archives; I'm not planning to work on this myself.
> ...
>
> Thanks for that idea, and dinner. It looks useful.
>
> I'll call this idea "Background FPWs"
>
>> Now, I'm sure there are issues with this scheme I haven't thought about, but
>> I wanted to get this written down. Note this does not reduce the overall WAL
>> volume - on the contrary - but it ought to reduce the spike.
>
> The requirements we were discussing were around
>
> A) reducing WAL volume
> B) reducing foreground overhead of writing FPWs - which spikes badly
> after checkpoint and the overhead is paid by the user processes
> themselves
> C) need for FPWs during base backup
>
> So that gives us a few approaches
>
> * Compressing FPWs gives A
> * Background FPWs gives us B
>    which look like we can combine both ideas
>
> * Double-buffering would give us A and B, but not C
>    and would be incompatible with other two ideas

Double-buffering would allow us to disable FPW safely, but it would make
recovery slow. So if we adopt double-buffering, I think that we would also
need to overhaul recovery.

Regards,

-- 
Fujii Masao



Re: Spreading full-page writes

From
Heikki Linnakangas
Date:
On 05/26/2014 11:15 PM, Robert Haas wrote:
> On Mon, May 26, 2014 at 1:22 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>>> I don't think we know that. The server might have crashed before that
>>> second record got generated.  (This appears to be an unfixable flaw in
>>> this proposal.)
>>
>> The second record is generated before the checkpoint is finished and the checkpoint record is written.  So it will
be there.
 
>>
>> (if you crash before the checkpoint is finished, the in-progress checkpoint is no good for recovery anyway, and
won't be used)
 
>
> Hmm, I see.
>
> It's not great to have to generate WAL at buffer-eviction time,
> though.  Normally, when we go to evict a buffer, the WAL is already
> written.  We might have to wait for it to be flushed, but if the WAL
> writer is doing its job, hopefully not.  But here we'll definitely
> have to wait for the WAL flush.

Yeah. You would want to batch the flushes somehow, instead of flushing 
the WAL for every buffer being flushed. For example, after writing the 
FPW WAL record, just continue with the checkpoint without flushing the 
buffer, and do a second pass later doing buffer flushes.
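
In pseudocode, roughly (invented names, just to illustrate the batching):

/* Pass 1: insert the deferred FPW records, no data writes yet */
for each dirty buffer marked BM_NEEDS_FPW:
    XLogLogFullPageImage(buf);        /* just XLogInsert(), no flush */

XLogFlush(GetInsertRecPtr());         /* one WAL flush for the whole batch */

/* Pass 2: write out the data pages as the checkpoint does today */
for each buffer due to be written by this checkpoint:
    SyncOneBuffer(buf);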

- Heikki



Re: Spreading full-page writes

From
Heikki Linnakangas
Date:
On 05/26/2014 02:26 PM, Greg Stark wrote:
> On Mon, May 26, 2014 at 1:22 PM, Heikki Linnakangas <hlinnakangas@vmware.com
>> wrote:
>
>> The second record is generated before the checkpoint is finished and the
>> checkpoint record is written.  So it will be there.
>>
>> (if you crash before the checkpoint is finished, the in-progress
>> checkpoint is no good for recovery anyway, and won't be used)
>
> Another idea would be to have separate checkpoints for each buffer
> partition. You would have to start recovery from the oldest checkpoint of
> any of the partitions.

Yeah. Simon suggested that when we talked about this, but I didn't 
understand how that works at the time. I think I do now. The key to 
making it work is distinguishing, when starting recovery from the latest 
checkpoint, whether a record for a given page can be replayed safely. I 
used flags on WAL records in my proposal to achieve this, but using 
buffer partitions is simpler.

For simplicity, let's imagine that we have two Redo-pointers for each 
checkpoint record: one for even-numbered pages, and another for 
odd-numbered pages. When checkpoint begins, we first update the 
Even-redo pointer to the current WAL insert location, and then flush all 
the even-numbered buffers in the buffer cache. Then we do the same for Odd.

Recovery begins at the Even-redo pointer. Replay works as normal, but 
until you reach the Odd-pointer, you refrain from replaying any changes 
to Odd-numbered pages. After reaching the odd-pointer, you replay 
everything as normal.
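
Or in rough pseudo-C (redoEven/redoOdd are invented fields in the checkpoint 
record):

/*
 * redoEven <= redoOdd, because the Even pointer is set and its buffers
 * flushed before we do the same for Odd.  Recovery starts at redoEven.
 */
startPtr = checkPoint.redoEven;

/* While replaying a record that touches block 'blkno': */
if (blkno % 2 == 1 && currentLSN < checkPoint.redoOdd)
    ;   /* skip: odd pages aren't consistent yet, their FPWs start at redoOdd */
else
    /* redo the change normally */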

Hmm, that seems actually doable...

- Heikki



Re: Spreading full-page writes

From
Greg Stark
Date:
On Tue, May 27, 2014 at 10:07 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
>
> On 05/26/2014 02:26 PM, Greg Stark wrote:
>>
>>> Another idea would be to have separate checkpoints for each buffer
>> partition. You would have to start recovery from the oldest checkpoint of
>> any of the partitions.
>
> Yeah. Simon suggested that when we talked about this, but I didn't understand how that works at the time. I think I
do now. The key to making it work is distinguishing, when starting recovery from the latest checkpoint, whether a record
for a given page can be replayed safely. I used flags on WAL records in my proposal to achieve this, but using buffer
partitions is simpler. 

Interesting. I just thought of it independently.

Incidentally you wouldn't actually want to use the buffer partitions
per se since the new server might start up with a different number of
partitions. You would want an algorithm for partitioning the block
space that xlog replay can reliably reproduce regardless of the size
of the buffer lock partition table. It might make sense to set it up
so it coincidentally ensures all the buffers being flushed are in the
same partition or maybe the reverse would be better. Probably it
doesn't actually matter.

> For simplicity, let's imagine that we have two Redo-pointers for each checkpoint record: one for even-numbered pages,
and another for odd-numbered pages. When checkpoint begins, we first update the Even-redo pointer to the current WAL
insert location, and then flush all the even-numbered buffers in the buffer cache. Then we do the same for Odd. 

Hm, I had convinced myself that the LSN on the pages would mean you
skip the replay anyway, but I think I was wrong. You would need to
keep a bitmap of which partitions are in recovery mode as you replay,
keep adding partitions until they're all in recovery mode, and then
keep going until you've seen the checkpoint record for all of them.

I'm assuming you would keep N checkpoint positions in the control
file. That also means we could double the checkpoint timeout with only a
marginal increase in the worst-case recovery time, since the worst
case would be (1 + 1/n)*timeout's worth of WAL to replay rather than
2x. The amount of time for recovery would be much more predictable.

> Recovery begins at the Even-redo pointer. Replay works as normal, but until you reach the Odd-pointer, you refrain
from replaying any changes to Odd-numbered pages. After reaching the odd-pointer, you replay everything as normal. 
>
> Hmm, that seems actually doable...



--
greg



Re: Spreading full-page writes

From
Heikki Linnakangas
Date:
On 05/27/2014 02:42 PM, Greg Stark wrote:
> On Tue, May 27, 2014 at 10:07 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>>
>> On 05/26/2014 02:26 PM, Greg Stark wrote:
>>>
>>>> Another idea would be to have separate checkpoints for each buffer
>>> partition. You would have to start recovery from the oldest checkpoint of
>>> any of the partitions.
>>
>> Yeah. Simon suggested that when we talked about this, but I didn't understand how that works at the time. I think I
do now. The key to making it work is distinguishing, when starting recovery from the latest checkpoint, whether a record
for a given page can be replayed safely. I used flags on WAL records in my proposal to achieve this, but using buffer
partitions is simpler.
 
>
> Interesting. I just thought of it independently.
>
> Incidentally you wouldn't actually want to use the buffer partitions
> per se since the new server might start up with a different number of
> partitions. You would want an algorithm for partitioning the block
> space that xlog replay can reliably reproduce regardless of the size
> of the buffer lock partition table. It might make sense to set it up
> so it coincidentally ensures all the buffers being flushed are in the
> same partition or maybe the reverse would be better. Probably it
> doesn't actually matter.

Since you will be flushing the buffers one "redo partition" at a time, 
you would want to allow the OS to merge the writes within a partition 
as much as possible. So my even-odd split would in fact be pretty bad. 
Some sort of striping, e.g. mapping each contiguous 1 MB chunk to the 
same partition, would be better.
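
The mapping just needs to be a pure function of the block number, so that 
replay can recompute it regardless of server settings. Something like this 
(sketch only):

#define REDO_STRIPE_BLOCKS  ((1024 * 1024) / BLCKSZ)    /* 1 MB stripes */

static inline int
redo_partition(BlockNumber blkno, int nparts)
{
    /* Consecutive 1 MB chunks of a relation map to the same partition. */
    return (int) ((blkno / REDO_STRIPE_BLOCKS) % nparts);
}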

> I'm assuming you would keep N checkpoint positions in the control
> file. That also means we can double the checkpoint timeout with only a
> marginal increase in the worst case recovery time. Since the worst
> case will be (1 + 1/n)*timeout's worth of wal to replay rather than
> 2*n. The amount of time for recovery would be much more predictable.

Good point.

- Heikki



Re: Spreading full-page writes

From
Simon Riggs
Date:
On 27 May 2014 03:49, Fujii Masao <masao.fujii@gmail.com> wrote:

>> So that gives us a few approaches
>>
>> * Compressing FPWs gives A
>> * Background FPWs gives us B
>>    which look like we can combine both ideas
>>
>> * Double-buffering would give us A and B, but not C
>>    and would be incompatible with other two ideas
>
> Double-buffering would allow us to disable FPW safely but which would make
> a recovery slow. So if we adopt double-buffering, I think that we would also
> need to overhaul the recovery.

Which is also true of Background FPWs

So our options are

1. Compressed FPWs only

2. Compressed FPWs plus Background FPWs plus Recovery Buffer Prefetch

3. Double Buffering plus Recovery Buffer Prefetch

IIRC Koichi had a patch for prefetch during recovery. Heikki, is that
the reason you also discussed changing the WAL record format to allow
us to identify the blocks touched by recovery more easily?

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Spreading full-page writes

From
Heikki Linnakangas
Date:
On 05/27/2014 03:18 PM, Simon Riggs wrote:
> IIRC Koichi had a patch for prefetch during recovery. Heikki, is that
> the reason you also discussed changing the WAL record format to allow
> us to identify the blocks touched by recovery more easily?

Yeah, that was one use case I had in mind for the WAL format changes. 
See http://www.postgresql.org/message-id/533D6CBF.6080203@vmware.com.

- Heikki



Re: Spreading full-page writes

From
Simon Riggs
Date:
On 27 May 2014 07:42, Greg Stark <stark@mit.edu> wrote:
> On Tue, May 27, 2014 at 10:07 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>>
>> On 05/26/2014 02:26 PM, Greg Stark wrote:
>>>
>>>> Another idea would be to have separate checkpoints for each buffer
>>> partition. You would have to start recovery from the oldest checkpoint of
>>> any of the partitions.
>>
>> Yeah. Simon suggested that when we talked about this, but I didn't understand how that works at the time. I think I
do now. The key to making it work is distinguishing, when starting recovery from the latest checkpoint, whether a record
for a given page can be replayed safely. I used flags on WAL records in my proposal to achieve this, but using buffer
partitions is simpler. 
>
> Interesting. I just thought of it independently.

Actually, I heard it from Doug Tolbert in 2005, based on how another
DBMS coped with that issue.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Spreading full-page writes

From
Jeff Janes
Date:
On Mon, May 26, 2014 at 8:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, May 26, 2014 at 1:22 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
>>I don't think we know that. The server might have crashed before that
>>second record got generated.  (This appears to be an unfixable flaw in
>>this proposal.)
>
> The second record is generated before the checkpoint is finished and the checkpoint record is written.  So it will be there.
>
> (if you crash before the checkpoint is finished, the in-progress checkpoint is no good for recovery anyway, and won't be used)

Hmm, I see.

It's not great to have to generate WAL at buffer-eviction time,
though.  Normally, when we go to evict a buffer, the WAL is already
written.  We might have to wait for it to be flushed, but if the WAL
writer is doing its job, hopefully not.  But here we'll definitely
have to wait for the WAL flush.

I'm not sure we do need to flush it.  If the checkpoint finishes, then the WAL surely got flushed as part of the process of recording the end of the checkpoint.  If the checkpoint does not finish, recovery will start from the previous checkpoint, which does contain the FPW (because if it didn't, the page would not be eligible for this treatment) and so the possibly torn page will get overwritten in full.
 
Cheers,

Jeff

Re: Spreading full-page writes

From
Amit Kapila
Date:
On Tue, May 27, 2014 at 1:19 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Tue, May 27, 2014 at 3:57 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > The requirements we were discussing were around
> >
> > A) reducing WAL volume
> > B) reducing foreground overhead of writing FPWs - which spikes badly
> > after checkpoint and the overhead is paid by the user processes
> > themselves
> > C) need for FPWs during base backup
> >
> > So that gives us a few approaches
> >
> > * Compressing FPWs gives A
> > * Background FPWs gives us B
> >    which look like we can combine both ideas
> >
> > * Double-buffering would give us A and B, but not C
> >    and would be incompatible with other two ideas
>
> Double-buffering would allow us to disable FPW safely but which would make
> a recovery slow.

Is that because, during recovery, it needs to check the contents of the
double buffer as well as the page in its original location for
consistency, or is there something else that would also slow down
recovery?

Won't DBW (double buffer write) reduce the number of pages that need
to be read from disk compared to FPW, which should compensate for the
performance degradation from any other impact?

IIUC, in the DBW mechanism we need a temporary sequential log file of
fixed size, to which data is written before it gets written to its
actual location in the tablespace.  As the temporary log file is of
fixed size, the number of pages that need to be read during recovery
should be smaller than with FPW, because with FPW we need to read all
the page images written to the WAL since the last successful
checkpoint.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Spreading full-page writes

From
Simon Riggs
Date:
On 27 May 2014 13:20, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> On 05/27/2014 03:18 PM, Simon Riggs wrote:
>>
>> IIRC Koichi had a patch for prefetch during recovery. Heikki, is that
>> the reason you also discussed changing the WAL record format to allow
>> us to identify the blocks touched by recovery more easily?
>
>
> Yeah, that was one use case I had in mind for the WAL format changes. See
> http://www.postgresql.org/message-id/533D6CBF.6080203@vmware.com.

Those proposals suggest some very big changes to the way WAL works.

Prefetch can work easily enough for most records - do we really need
that much churn?

You mentioned Btree vacuum records, but I'm planning to optimize those
another way.

Why don't we just have the prefetch code in core and forget the WAL
format changes?

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Spreading full-page writes

From
Simon Riggs
Date:
On 27 May 2014 18:18, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Mon, May 26, 2014 at 8:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Mon, May 26, 2014 at 1:22 PM, Heikki Linnakangas
>> <hlinnakangas@vmware.com> wrote:
>> >>I don't think we know that. The server might have crashed before that
>> >>second record got generated.  (This appears to be an unfixable flaw in
>> >>this proposal.)
>> >
>> > The second record is generated before the checkpoint is finished and the
>> > checkpoint record is written.  So it will be there.
>> >
>> > (if you crash before the checkpoint is finished, the in-progress
>> > checkpoint is no good for recovery anyway, and won't be used)
>>
>> Hmm, I see.
>>
>> It's not great to have to generate WAL at buffer-eviction time,
>> though.  Normally, when we go to evict a buffer, the WAL is already
>> written.  We might have to wait for it to be flushed, but if the WAL
>> writer is doing its job, hopefully not.  But here we'll definitely
>> have to wait for the WAL flush.
>
>
> I'm not sure we do need to flush it.  If the checkpoint finishes, then the
> WAL surely got flushed as part of the process of recording the end of the
> checkpoint.  If the checkpoint does not finish, recovery will start from the
> previous checkpoint, which does contain the FPW (because if it didn't, the
> page would not be eligible for this treatment) and so the possibly torn page
> will get overwritten in full.

I think Robert is correct: you would need to flush WAL before writing
the disk buffer. That is the current invariant of WAL before data.

However, we don't need to do it the simple way (FPW, flush, write
buffer); we can do it with more buffering.

So it seems like a reasonable idea to do this using a 64 buffer
BulkAccessStrategy object and flush the WAL every 64 buffers. That's
beginning to look more like double buffering though...

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Spreading full-page writes

From
Heikki Linnakangas
Date:
On 05/28/2014 09:41 AM, Simon Riggs wrote:
> On 27 May 2014 13:20, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>> On 05/27/2014 03:18 PM, Simon Riggs wrote:
>>>
>>> IIRC Koichi had a patch for prefetch during recovery. Heikki, is that
>>> the reason you also discussed changing the WAL record format to allow
>>> us to identify the blocks touched by recovery more easily?
>>
>>
>> Yeah, that was one use case I had in mind for the WAL format changes. See
>> http://www.postgresql.org/message-id/533D6CBF.6080203@vmware.com.
>
> Those proposals suggest some very big changes to the way WAL works.
>
> Prefetch can work easily enough for most records - do we really need
> that much churn?
>
> You mentioned Btree vacuum records, but I'm planning to optimize those
> another way.
>
> Why don't we just have the prefetch code in core and forget the WAL
> format changes?

Well, the prefetching was just one example of why the proposed WAL 
format changes are a good idea. The changes will make life easier for 
any external (or internal, for that matter) tool that wants to read WAL 
records. The thing that finally really got me into doing that was 
pg_rewind. For pg_rewind it's not enough to cover most records, you have 
to catch all modifications to data pages for correctness, and that's 
difficult to maintain as new WAL record types are added and old ones are 
modified in every release.

Also, the changes make WAL-logging and -replaying code easier to write, 
which reduces the potential for bugs.

- Heikki



Re: Spreading full-page writes

From
Robert Haas
Date:
On Tue, May 27, 2014 at 8:15 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Since you will be flushing the buffers one "redo partition" at a time, you
> would want to allow the OS to do merge the writes within a partition as much
> as possible. So my even-odd split would in fact be pretty bad. Some sort of
> striping, e.g. mapping each contiguous 1 MB chunk to the same partition,
> would be better.

I suspect you'd actually want to stripe by segment (1GB partition).
If you striped by 1MB partitions, there might still be writes to the
parts of the file you weren't checkpointing that would be flushed by
the fsync().  That would lead to more physical I/O overall, if those
pages were written again before you did the next half-checkpoint.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Spreading full-page writes

From
Fujii Masao
Date:
On Wed, May 28, 2014 at 1:10 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, May 27, 2014 at 1:19 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Tue, May 27, 2014 at 3:57 PM, Simon Riggs <simon@2ndquadrant.com>
>> wrote:
>> > The requirements we were discussing were around
>> >
>> > A) reducing WAL volume
>> > B) reducing foreground overhead of writing FPWs - which spikes badly
>> > after checkpoint and the overhead is paid by the user processes
>> > themselves
>> > C) need for FPWs during base backup
>> >
>> > So that gives us a few approaches
>> >
>> > * Compressing FPWs gives A
>> > * Background FPWs gives us B
>> >    which look like we can combine both ideas
>> >
>> > * Double-buffering would give us A and B, but not C
>> >    and would be incompatible with other two ideas
>>
>> Double-buffering would allow us to disable FPW safely but which would make
>> a recovery slow.
>
> Is it due to the fact that during recovery, it needs to check the
> contents of double buffer as well as the page in original location
> for consistency or there is something else also which will lead
> to slow recovery?
>
> Won't DBW (double buffer write) reduce the need for number of
> pages that needs to be read from disk as compare to FPW which
> will suffice the performance degradation due to any other impact?
>
> IIUC in DBW mechanism, we need to have a temporary sequential
> log file of fixed size which will be used to write data before the data
> gets written to its actual location in tablespace.  Now as the temporary
> log file is of fixed size, the number of pages that needs to be read
> during recovery should be less as compare to FPW because in FPW
> it needs to read all the pages written in WAL log after last successful
> checkpoint.

Hmm... maybe I'm misunderstanding how WAL replay works in the DBW case.
Imagine the case where we try to replay two WAL records for page A and
the page has not been cached in shared_buffers yet. If FPW is enabled,
the first WAL record is an FPW, and it's simply restored into shared_buffers.
The page doesn't need to be read from the disk. Then the second WAL record
will be applied.

OTOH, in the DBW case, how does this example work? I was thinking that
first we try to apply the first WAL record but find that page A doesn't
exist in shared_buffers yet. We try to read the page from the disk, check
whether its CRC is valid or not, and read the same page from the double buffer
if it's invalid. After reading the page into shared_buffers, the first WAL
record can be applied. Then the second WAL record will be applied. Is my
understanding right?

Regards,

-- 
Fujii Masao



Re: Spreading full-page writes

From
Amit Kapila
Date:
On Mon, Jun 2, 2014 at 6:04 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Wed, May 28, 2014 at 1:10 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > IIUC in DBW mechanism, we need to have a temporary sequential
> > log file of fixed size which will be used to write data before the data
> > gets written to its actual location in tablespace.  Now as the temporary
> > log file is of fixed size, the number of pages that needs to be read
> > during recovery should be less as compare to FPW because in FPW
> > it needs to read all the pages written in WAL log after last successful
> > checkpoint.
>
> Hmm... maybe I'm misunderstanding how WAL replay works in DBW case.
> Imagine the case where we try to replay two WAL records for the page A and
> the page has not been cached in shared_buffers yet. If FPW is enabled,
> the first WAL record is FPW and firstly it's just read to shared_buffers.
> The page doesn't neeed to be read from the disk. Then the second WAL record
> will be applied.
>
> OTOH, in DBW case, how does this example case work? I was thinking that
> firstly we try to apply the first WAL record but find that the page A doesn't
> exist in shared_buffers yet. We try to read the page from the disk, check
> whether its CRC is valid or not, and read the same page from double buffer
> if it's invalid. After reading the page into shared_buffers, the first WAL
> record can be applied. Then the second WAL record will be applied. Is my
> understanding right?

I think the way DBW works is that before reading WAL, it first makes
the data pages consistent.  It first checks the doublewrite buffer
contents and the pages in their original locations.  If a page is
inconsistent in the doublewrite buffer it is simply discarded; if it is
inconsistent in the tablespace, it is recovered from the doublewrite
buffer.  After reaching the end of the double buffer, it starts reading WAL.
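
Roughly (hand-wavy pseudo-C, not taken from any existing implementation):

/* Before replaying any WAL: repair torn pages from the doublewrite file. */
for each page image P in the doublewrite file:
{
    if (!checksum_ok(P))
        continue;           /* copy in the DW file is torn: the write to the
                             * real location never started, so it is intact */
    read page O from P's original location in the tablespace;
    if (!checksum_ok(O))
        write P over O;     /* data file copy is torn: restore it */
}
/* Only after this pass, start reading and applying WAL as usual. */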

So in the above example, it will read the first record from WAL
and check whether the page is already in shared_buffers; if so, apply the
WAL change, otherwise read the page into shared_buffers and then apply the
WAL. For the second record, it doesn't need to read the page.

The saving during recovery comes from the fact that in the case
of DBW, it will not read the FPI from WAL, just the 2 records
(it has to read a WAL page, but that will contain many records).
So it seems to be a net win.

Now, in the case of DBW, the extra work done (reading the double buffer,
checking its consistency against the actual pages) is bounded, because
the size of the double buffer is fixed, so its impact should be much
less than reading FPIs from WAL written after the last successful
checkpoint.

If my above understanding is right, then performance of recovery
should be better with DBW in most cases.

I think the cases where DBW might need extra care are when there are a
lot of backend evictions.  In such scenarios a backend might itself need
to write both to the double buffer and to the actual page.  That can
have more impact during bulk reads (when it has to set hint bits) and
during Vacuum, which is performed in a ring buffer.

One improvement that could be made here is to change the buffer
eviction algorithm so that it passes over a buffer that would need
to be written to the double buffer.  There could be other improvements
as well, depending on the DBW implementation.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com