Re: tackling full page writes - Mailing list pgsql-hackers

From Robert Haas
Subject Re: tackling full page writes
Msg-id BANLkTimQeo_r=mMzT-O-RykBh7Hozoq7Lw@mail.gmail.com
In response to Re: tackling full page writes  (Fujii Masao <masao.fujii@gmail.com>)
Responses Re: tackling full page writes  (Fujii Masao <masao.fujii@gmail.com>)
List pgsql-hackers
On Wed, May 25, 2011 at 10:09 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Wed, May 25, 2011 at 9:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, May 24, 2011 at 10:52 PM, Jeff Davis <pgsql@j-davis.com> wrote:
>>> On Tue, 2011-05-24 at 16:34 -0400, Robert Haas wrote:
>>>> As I think about it a bit more, we'd
>>>> need to XLOG not only the parts of the page we're actually modifying, but
>>>> any that the WAL record would need to be correct on replay.
>>>
>>> I don't understand that statement. Can you clarify?
>>
>> I'll try.  Suppose we have two WAL records A and B, with no
>> intervening checkpoint, that both modify the same page.  A reads chunk
>> 1 of that page and then modifies chunk 2.  B modifies chunk 1.  Now,
>> suppose we make A do a "partial page write" on chunk 2 only, and B do
>> the same for chunk 1.  At the point the system crashes, A and B are
>> both on disk, and the page has already been written to disk as well.
>> Replay begins from a checkpoint preceding A.
>>
>> Now, when we get to the record for A, what are we to do?  If it were a
>> full page image, we could just restore it, and everything would be
>> fine after that.  But if we replay the partial page write, we've got
>> trouble.  A will now see the state of chunk 1 as it existed after
>> the action protected by B occurred, and will presumably do the wrong
>> thing.
>
> If this is really true, full page writes would also cause a similar problem.
> No? Imagine the case where A reads page 1, then modifies page 2, and B
> modifies page 1. During recovery, A will see the state of page 1 as it existed
> after the action by B.

Yeah, but it won't matter, because the LSN interlock will prevent A
from taking any action.  If you only write parts of the page, though,
the concept of "the" LSN of the page becomes a bit murky, because you
may have different parts of the page from different points in the WAL
stream.  I believe it's possible to cope with that if we design it
carefully, but it does seem rather complex and error-prone (which is
not necessarily the best design for a recovery system, but hey).
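
To make the interlock concrete, here is a toy sketch in Python. It is
not PostgreSQL code; Page, WalRecord and redo are hypothetical names,
and the only thing it models is "skip the record if the page LSN shows
the page already reflects it".

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    lsn: int = 0                      # LSN of the last record applied to this page
    items: list = field(default_factory=list)


@dataclass
class WalRecord:
    lsn: int
    item: str


def redo(page: Page, rec: WalRecord) -> bool:
    """Apply rec unless the page already reflects it (the LSN interlock)."""
    if page.lsn >= rec.lsn:
        return False                  # page is at or past this record: do nothing
    page.items.append(rec.item)
    page.lsn = rec.lsn
    return True


# The page reached disk after both A (LSN 10) and B (LSN 20) were applied.
on_disk = Page(lsn=20, items=["a", "b"])
applied = [redo(on_disk, WalRecord(10, "a")), redo(on_disk, WalRecord(20, "b"))]
```

With a single page LSN, neither A nor B is applied again during replay,
so A never gets a chance to look at state that B changed.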

Anyway, you can either have the partial page write for A restore the
older LSN, or not.  If you do, then you have the problem as I
described it.  If you don't, then the effects of A vanish into the
ether.  Either way, it doesn't work.
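
Here is one possible reading of that dilemma as a toy Python model.
Everything in it is an assumption made for illustration: the page is a
dict with two chunks, A's partial image holds chunk 2 as it was before
A ran, and A's redo derives chunk 2 from whatever it finds in chunk 1.
None of this is how PostgreSQL lays out pages or WAL.

```python
# What chunk 2 held after A ran originally (A read chunk 1 == "old1").
EXPECTED_CHUNK2 = "derived-from-old1"

A_LSN = 10
B_LSN = 20


def replay_a(restore_old_lsn: bool) -> dict:
    """Replay only record A against the page as it exists on disk after the crash."""
    # On-disk page: both A and B were applied before the crash.
    page = {"lsn": B_LSN, "chunk1": "B-new", "chunk2": EXPECTED_CHUNK2}

    # A's partial page image: chunk 2 before A, plus the page LSN at that time.
    image = {"lsn": 5, "chunk2": "old2"}
    page["chunk2"] = image["chunk2"]
    if restore_old_lsn:
        page["lsn"] = image["lsn"]

    # LSN interlock, then A's redo, which reads chunk 1 to compute chunk 2.
    if page["lsn"] < A_LSN:
        page["chunk2"] = "derived-from-" + page["chunk1"]
        page["lsn"] = A_LSN
    return page


with_old_lsn = replay_a(restore_old_lsn=True)      # A runs, but sees B's chunk 1
without_old_lsn = replay_a(restore_old_lsn=False)  # A is skipped, chunk 2 stays old
```

In this model, restoring the older LSN leaves chunk 2 derived from B's
version of chunk 1, and not restoring it leaves chunk 2 at its pre-A
contents.  Neither matches EXPECTED_CHUNK2.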

> The replay of the WAL record for A doesn't rely on the content of chunk 1
> which B modified. So I don't think that "partial page writes" have such
> a problem.
> No?

Sorry.  WAL records today DO rely on the prior state of the page.  If
they didn't, we wouldn't need full page writes.  They don't rely on
it terribly heavily - mostly things like where pd_upper is pointing and
what the page LSN is.  But they do rely on it.
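
As a rough illustration of the pd_upper dependency, here is a toy
Python sketch.  ToyPage and redo_insert are hypothetical, and the
64-byte page is made up; the point is only that a record carrying
nothing but the tuple bytes ends up at a position decided by the
page's existing pd_upper.

```python
PAGE_SIZE = 64  # made-up size for the toy page


class ToyPage:
    def __init__(self, pd_upper: int = PAGE_SIZE, lsn: int = 0):
        self.data = bytearray(PAGE_SIZE)
        self.pd_upper = pd_upper   # end of free space; tuple data grows downward
        self.lsn = lsn


def redo_insert(page: ToyPage, lsn: int, tuple_bytes: bytes):
    """Replay an insert whose record holds only the tuple, not its position."""
    if page.lsn >= lsn:
        return None                # LSN interlock: already applied
    start = page.pd_upper - len(tuple_bytes)
    page.data[start:page.pd_upper] = tuple_bytes
    page.pd_upper = start
    page.lsn = lsn
    return start


# Same record, two different prior page states, two different results.
offset_on_empty = redo_insert(ToyPage(), 10, b"abcd")
offset_on_used = redo_insert(ToyPage(pd_upper=40), 10, b"abcd")
```

The same record lands at a different offset depending on where pd_upper
pointed beforehand, so replay is only correct if the page is in the
state the record expects.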

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

