Re: Spreading full-page writes - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Spreading full-page writes
Date
Msg-id CAA4eK1+o3rpboe37uPYVBC9bFWAhUE89dL_E058fP3LJ25w4Ww@mail.gmail.com
In response to Re: Spreading full-page writes  (Fujii Masao <masao.fujii@gmail.com>)
List pgsql-hackers
On Mon, Jun 2, 2014 at 6:04 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Wed, May 28, 2014 at 1:10 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > IIUC, in the DBW mechanism we need a temporary sequential
> > log file of fixed size which is used to write data before the data
> > gets written to its actual location in the tablespace.  Now as the
> > temporary log file is of fixed size, the number of pages that need
> > to be read during recovery should be less compared to FPW, because
> > with FPW it needs to read all the pages written to the WAL log after
> > the last successful checkpoint.
>
> Hmm... maybe I'm misunderstanding how WAL replay works in DBW case.
> Imagine the case where we try to replay two WAL records for the page A and
> the page has not been cached in shared_buffers yet. If FPW is enabled,
> the first WAL record is FPW and firstly it's just read to shared_buffers.
> The page doesn't need to be read from the disk. Then the second WAL record
> will be applied.
>
> OTOH, in DBW case, how does this example case work? I was thinking that
> firstly we try to apply the first WAL record but find that the page A doesn't
> exist in shared_buffers yet. We try to read the page from the disk, check
> whether its CRC is valid or not, and read the same page from double buffer
> if it's invalid. After reading the page into shared_buffers, the first WAL
> record can be applied. Then the second WAL record will be applied. Is my
> understanding right?

I think the way DBW works is that before reading WAL, it first makes
the data pages consistent.  It checks the doublewrite buffer contents
against the pages at their original locations: if a page is
inconsistent (torn) in the doublewrite buffer, it is simply discarded;
if it is inconsistent in the tablespace, it is recovered from the
doublewrite buffer.  After reaching the end of the doublewrite buffer,
recovery starts reading WAL.
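If that understanding is right, the consistency pass can be sketched
as below.  This is a toy model in Python, not actual PostgreSQL code:
recover_doublewrite, the page layout, and the CRC stand-in are all my
own illustration.

```python
import zlib

def checksum(page: bytes) -> int:
    # stand-in for a per-page checksum/CRC
    return zlib.crc32(page)

def recover_doublewrite(dbw_entries, tablespace):
    """Make data pages consistent before WAL replay begins.

    dbw_entries: list of (page_no, page_bytes, stored_crc) read from the
    doublewrite buffer.  tablespace: dict mapping page_no to
    (page_bytes, stored_crc) at the page's original location.
    Returns the set of page numbers restored from the doublewrite buffer.
    """
    restored = set()
    for page_no, dbw_page, dbw_crc in dbw_entries:
        if checksum(dbw_page) != dbw_crc:
            # torn write into the doublewrite buffer itself: the original
            # page was never overwritten, so the DBW copy is discarded
            continue
        page, crc = tablespace[page_no]
        if checksum(page) != crc:
            # torn write at the original location: recover from the DBW copy
            tablespace[page_no] = (dbw_page, dbw_crc)
            restored.add(page_no)
    return restored
```

Since the pass only ever walks the fixed-size doublewrite buffer once,
its cost is bounded regardless of how much WAL follows.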

So in the above example case, it will read the first record from WAL
and check whether the page is already in shared_buffers; if so, it
applies the WAL change, else it reads the page into shared_buffers
and then applies the WAL change.  For the second record, it doesn't
need to read the page again.
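Under that reading, the replay loop itself is unchanged; the only
difference is that a page read on a buffer miss can be trusted,
because the doublewrite pass already repaired any torn pages.  A toy
sketch (the names and callbacks are mine, not PostgreSQL's):

```python
def replay_wal(wal_records, shared_buffers, read_page, apply_record):
    """wal_records: iterable of (page_no, record).  Pages on disk are
    assumed already consistent (the doublewrite pass ran first), so a
    plain read suffices -- no full-page image is needed in the stream.
    Returns the number of disk reads performed."""
    reads = 0
    for page_no, record in wal_records:
        if page_no not in shared_buffers:
            # buffer miss: read the (already consistent) page from disk
            shared_buffers[page_no] = read_page(page_no)
            reads += 1
        # apply the change to the cached copy
        shared_buffers[page_no] = apply_record(shared_buffers[page_no], record)
    return reads
```

With two records for the same page, only the first one pays for a disk
read; the second finds the page in shared_buffers.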

The saving during recovery comes from the fact that in the case of
DBW, it does not read FPIs from WAL, rather just the two records
(it still has to read a WAL page, but that page will contain many
records).  So it seems to be a net win.

Now in the case of DBW, the extra work done (reading the double
buffer and checking its consistency against the actual pages) is
always fixed, as the size of the double buffer is fixed, so its
impact should be much less than reading the FPIs written to WAL
since the last successful checkpoint.

If my understanding above is right, then recovery performance should
be better with DBW in most cases.

I think the case where DBW might need to take care is when there are
a lot of backend evictions.  In such scenarios a backend might itself
need to write both to the double buffer and to the actual page.  That
can have more impact during bulk reads (when hint bits have to be
set) and during Vacuum, which is performed in a ring buffer.

One improvement that can be done here is to change the buffer
eviction algorithm so that it passes over buffers that still need to
be written to the double buffer.  There can be other improvements as
well, depending on the DBW implementation.
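As a rough illustration of that idea (a purely hypothetical policy,
not PostgreSQL's actual clock-sweep algorithm):

```python
def choose_victim(buffers):
    """buffers: list of dicts with 'page_no' and 'dirty' keys.
    Prefer a clean buffer, which can be dropped without paying for a
    doublewrite-buffer write plus a data-page write; fall back to the
    first dirty buffer only when no clean one exists."""
    for buf in buffers:
        if not buf["dirty"]:
            return buf
    return buffers[0]
```

Whether such a preference actually helps would depend on how often
clean buffers are available when a backend needs to evict.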

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
