Thread: FPI

FPI

From
Robert Haas
Date:
I was thinking about full-page writes again tonight.  I'm still
wondering about the feasibility of getting rid of full-page writes for
certain operations.  We can do this, I think, in any case where we can
convince ourselves that if the original operation, or a redo of the
original operation, leaves behind a torn page, a subsequent redo will
still DTRT.

I think that both tuple freezing (XLOG_HEAP2_FREEZE) and heap
deletions (XLOG_HEAP_DELETE) are close to having this property.  The
updates are pretty much unconditional - they certainly don't depend on
existing xmin/xmax/ctid/infomask bits being in any particular state.
But this (in the redo logic) is a problem:
       if (XLByteLE(lsn, PageGetLSN(page)))       {               UnlockReleaseBuffer(buffer);               return;
  }
 

A crash might leave the updated LSN on disk, while the rest of the
page isn't.  If this is workable at all, we'd need to at least
suppress the above behavior (I believe it would still be OK after we
reach consistency, but not before).

Tonight I realized that I think we could also make this property hold
for updates, only with respect to the block containing the *old* tuple
(which is basically getting the equivalent of a delete).  While
reducing WAL just for deletes and freezing might be judged not worth
the additional complexity, maybe being able to improve the update case
would put it over the top.

Thoughts?  Any idea how much of our WAL volume comes from FPIs vs.
everything else?  Index changes vs. heap changes?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: FPI

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I was thinking about full-page writes again tonight.  I'm still
> wondering about the feasibility of getting rid of full-page writes for
> certain operations.  We can do this, I think, in any case where we can
> convince ourselves that if the original operation, or a redo of the
> original operation, leaves behind a torn page, a subsequent redo will
> still DTRT.

> I think that both tuple freezing (XLOG_HEAP2_FREEZE) and heap
> deletions (XLOG_HEAP_DELETE) are close to having this property.

Say what?  A heap deletion compacts the page --- it will certainly fail
badly on torn-page.
        regards, tom lane


Re: FPI

From
Robert Haas
Date:
On Fri, Jan 28, 2011 at 10:13 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I was thinking about full-page writes again tonight.  I'm still
>> wondering about the feasibility of getting rid of full-page writes for
>> certain operations.  We can do this, I think, in any case where we can
>> convince ourselves that if the original operation, or a redo of the
>> original operation, leaves behind a torn page, a subsequent redo will
>> still DTRT.
>
>> I think that both tuple freezing (XLOG_HEAP2_FREEZE) and heap
>> deletions (XLOG_HEAP_DELETE) are close to having this property.
>
> Say what?  A heap deletion compacts the page --- it will certainly fail
> badly on torn-page.

What do you mean by "compacts the page"?  I would interpret that to
mean "reclaims the space formerly used by the tuple being deleted",
but it certainly can't do that.  The transaction might not commit, and
in any case the tuple will still be visible to concurrent snapshots.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: FPI

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Jan 28, 2011 at 10:13 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Say what? �A heap deletion compacts the page --- it will certainly fail
>> badly on torn-page.

> What do you mean by "compacts the page"?  I would interpret that to
> mean "reclaims the space formerly used by the tuple being deleted",
> but it certainly can't do that.

Oh, I'm sorry, I was confusing the case with vacuum cleanup.  Obviously
hadn't consumed any caffeine yet.  Nevermind ...
        regards, tom lane


Re: FPI

From
Robert Haas
Date:
On Fri, Jan 28, 2011 at 1:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Fri, Jan 28, 2011 at 10:13 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Say what?  A heap deletion compacts the page --- it will certainly fail
>>> badly on torn-page.
>
>> What do you mean by "compacts the page"?  I would interpret that to
>> mean "reclaims the space formerly used by the tuple being deleted",
>> but it certainly can't do that.
>
> Oh, I'm sorry, I was confusing the case with vacuum cleanup.  Obviously
> hadn't consumed any caffeine yet.  Nevermind ...

OK.  You had me worried for a minute there.  :-)

Any substantive comments, besides the obvious "this is not 9.1 material"?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: FPI

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Any substantive comments, besides the obvious "this is not 9.1 material"?

Now that I've absorbed a bit more caffeine, let's see if I can think
straight this time.

General principle you want to assert: any WAL entry that merely results
in setting a deterministic field to a deterministic value shouldn't need
a FPI, since it is easy to check whether the field has that value and
re-apply the update if needed.  The way this would have to work is:

1. Page LSN < WAL location: apply field update, set page LSN = WAL location.

2. Page LSN = WAL location: check if field matches, apply update if not.

3. Page LSN > WAL location: do NOT apply field update or change LSN.

Now the issue is what happens if a torn-page event causes the LSN to be
out of sync with the page contents.  If the LSN is too small (ie, the
actual page contents come from some later WAL entry) we may mistakenly
apply action 1 or 2 to bytes that don't actually represent the field we
think they do.  Now, that would be all right once we replay the later
WAL entry and replace the page data from its FPI.  But there are two
huge risks here: one being that in a PITR operation the user might tell
us to stop short of applying the later WAL entry, and the other being
that in any case we'll have an interval where the page is corrupt, which
is a problem if any hot-standby queries try to look at it.

I think we might be all right with that if we can guarantee that any such
inconsistencies only exist before the system believes that it's reached
a consistent database state, but I'm not quite sure if that's true or
not.

Hmm, no, it doesn't work, because the above argument assumes there is a
later FPI-containing WAL entry at all.  Suppose we have a sequence of
several single-field-setting WAL entries, and there is no FPI-containing
WAL entry for the page before end of WAL.  We could have a torn page
such that the LSN comes from one of the later entries but the field that
should be set from an earlier entry is old.  When we replay the earlier
entry, case 3 will apply, so we do nothing ... incorrectly.  And there
will be no FPI to fix it.

It doesn't work.
        regards, tom lane


Re: FPI

From
Robert Haas
Date:
On Fri, Jan 28, 2011 at 3:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Any substantive comments, besides the obvious "this is not 9.1 material"?
>
> Now that I've absorbed a bit more caffeine, let's see if I can think
> straight this time.
>
> General principle you want to assert: any WAL entry that merely results
> in setting a deterministic field to a deterministic value shouldn't need
> a FPI, since it is easy to check whether the field has that value and
> re-apply the update if needed.  The way this would have to work is:
>
> 1. Page LSN < WAL location: apply field update, set page LSN = WAL location.
>
> 2. Page LSN = WAL location: check if field matches, apply update if not.
>
> 3. Page LSN > WAL location: do NOT apply field update or change LSN.
>
> Now the issue is what happens if a torn-page event causes the LSN to be
> out of sync with the page contents.  If the LSN is too small (ie, the
> actual page contents come from some later WAL entry) we may mistakenly
> apply action 1 or 2 to bytes that don't actually represent the field we
> think they do.  Now, that would be all right once we replay the later
> WAL entry and replace the page data from its FPI.  But there are two
> huge risks here: one being that in a PITR operation the user might tell
> us to stop short of applying the later WAL entry, and the other being
> that in any case we'll have an interval where the page is corrupt, which
> is a problem if any hot-standby queries try to look at it.
>
> I think we might be all right with that if we can guarantee that any such
> inconsistencies only exist before the system believes that it's reached
> a consistent database state, but I'm not quite sure if that's true or
> not.

I wasn't too sure either, at first, but I think it must be true.  If
it were possible to reach consistency while there were still torn
pages on disk, then we could enter normal running with those pages
still on disk by stopping recovery at that point.  And clearly that's
not going to fly unless we're talking about something like hint bits,
where nothing's actually busted if we only get half the update.

> Hmm, no, it doesn't work, because the above argument assumes there is a
> later FPI-containing WAL entry at all.  Suppose we have a sequence of
> several single-field-setting WAL entries, and there is no FPI-containing
> WAL entry for the page before end of WAL.  We could have a torn page
> such that the LSN comes from one of the later entries but the field that
> should be set from an earlier entry is old.  When we replay the earlier
> entry, case 3 will apply, so we do nothing ... incorrectly.  And there
> will be no FPI to fix it.

What happens if we (a) keep the current rule after reaching
consistency and (b) apply any such updates *unconditionally* - that
is, without reference to the LSN - prior to reaching consistency?
Under that rule, if we encounter an FPI before reaching consistency,
we're OK.  So let's suppose we don't.  No matter how many times we
replay any initial prefix of any such updates between the redo pointer
and the point at which we reach consistency, the state of the page
when we finally reach consistency will be identical.  But we could get
hosed if replay progressed *past* the minimum recovery point and then
started over at the previous redo pointer.  If we forced an immediate
restartpoint on reaching consistency, that seems like it might prevent
that scenario.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: FPI

From
Robert Haas
Date:
On Fri, Jan 28, 2011 at 3:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> What happens if we (a) keep the current rule after reaching
> consistency and (b) apply any such updates *unconditionally* - that
> is, without reference to the LSN - prior to reaching consistency?
> Under that rule, if we encounter an FPI before reaching consistency,
> we're OK.  So let's suppose we don't.  No matter how many times we
> replay any initial prefix of any such updates between the redo pointer
> and the point at which we reach consistency, the state of the page
> when we finally reach consistency will be identical.  But we could get
> hosed if replay progressed *past* the minimum recovery point and then
> started over at the previous redo pointer.  If we forced an immediate
> restartpoint on reaching consistency, that seems like it might prevent
> that scenario.

Actually, I'm wrong, and this doesn't work at all.  At the time of the
crash, there could already be pages on disk with LSNs greater than the
minimum recovery point.  Duh.

It was such a good idea in my head...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: FPI

From
Greg Stark
Date:
On Fri, Jan 28, 2011 at 8:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> 3. Page LSN > WAL location: do NOT apply field update or change LSN.
>

I don't think this works. There could be multiple writes to a page for
different records before the crash occurs. The LSN could be far in the
future and yet still have a torn page missing the update.


Another thing to consider is that there's still a plan on the table to
implement block checksums. Does any of this block that?

Or do checksums make it *easier* to implement any of this? You can
check the checksum and if it's valid assume there isn't a torn page.
If the LSN >= current you can skip the log record. If the checksum is
invalid you could try replaying the log entry and if it makes it valid
then you're golden. If not then you could continue and hopefully there
will be more unconditional records and eventually the block will
become consistent or a FP write will come along later.

Just for reference the Oracle solution is to ignore the problem but
provide recovery tools to recover an individual block. You go to the
last consistent backup, pull the old version of the block from there,
then apply the logs from that point forward replaying any records for
that block.



-- 
greg


Re: FPI

From
Robert Haas
Date:
On Mon, Jan 31, 2011 at 10:01 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jan 28, 2011 at 3:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> What happens if we (a) keep the current rule after reaching
>> consistency and (b) apply any such updates *unconditionally* - that
>> is, without reference to the LSN - prior to reaching consistency?
>> Under that rule, if we encounter an FPI before reaching consistency,
>> we're OK.  So let's suppose we don't.  No matter how many times we
>> replay any initial prefix of any such updates between the redo pointer
>> and the point at which we reach consistency, the state of the page
>> when we finally reach consistency will be identical.  But we could get
>> hosed if replay progressed *past* the minimum recovery point and then
>> started over at the previous redo pointer.  If we forced an immediate
>> restartpoint on reaching consistency, that seems like it might prevent
>> that scenario.
>
> Actually, I'm wrong, and this doesn't work at all.  At the time of the
> crash, there could already be pages on disk with LSNs greater than the
> minimum recovery point.  Duh.
>
> It was such a good idea in my head...

I should mention that most of this idea was Heikki's, originally.
Except for the crappy parts that don't work - those are all me.  But
I'm back to thinking this can work.  Heikki pointed out to me on IM
today that in crash recovery, we always replay to end-of-WAL before
opening for connections, and for Hot Standby every block we write
advances the minimum recovery point to its LSN.  This implies that if
we're accepting connections (either regular or Hot Standby) or at a
valid stopping point for PITR, there are no unreplayed WAL records
whose changes are reflected in blocks on disk.

So I'm back to proposing that we just apply FPI-free WAL records
unconditionally, without regard to the LSN.  This could potentially
corrupt the page, of course.  Consider delete (no FPI) - vacuum (with
FPI) - crash, leaving the vacuum page half on disk.  Now the replay of
the delete is probably going to do the wrong thing, because the page
is torn.  But it doesn't matter, because the vacuum's FPI will
overwrite the page anyway, and whatever stupid thing the delete replay
did will become irrelevant - BEFORE we can begin processing any
queries.  On the other hand, if the delete record *isn't* followed by
an FPI, but just, by, say, a bunch more deletes, then it should all
Just Work (TM).  As long as the page header (excluding LSN and TLI,
which we're ignoring by stipulation) and item pointer list are intact,
we can redo those deletes and clean things up.  And if they're not
intact, then we must've done something that emits an FPI, and so any
temporary page corruption will get overwritten when we get to that
point in the WAL stream...

That is a bit ugly, though, because it means the XLOG replay of
FPI-free records would have to be prepared to just punt if they
encounter any sort of corruption, in the sure hope that any such
corruption must imply the presence of a future FPI that will be
replayed - since if there is no such future FPI, it should be
impossible for the page to be corrupted in the first place.  But that
might reduce our chances of being able to detect real corruption.

Heikki also came up with another idea that might be worth exploring:
at the point when we currently emit FPIs, emit an image of just the
part of the page that precedes pd_lower - the page header and item
IDs.  To make this work, we'd have to make a rule that redo isn't
allowed to rely for correctness on any bits following the pd_lower
boundary - it can write those bits, but it can't read them.  But most
of the XLOG_HEAP records follow that rule already - we look at the
item pointers to figure out where we're putting a new tuple or to
locate an existing tuple and unconditionally overwrite some of its
bits.  The obvious exception is XLOG_HEAP2_CLEAN, emitted by VACUUM,
which would probably need to just log the entire page.  Also, we'd
again need to apply records unconditionally, without reference to the
page LSN, until we reached the minimum recovery point.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: FPI

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> So I'm back to proposing that we just apply FPI-free WAL records
> unconditionally, without regard to the LSN.  This could potentially
> corrupt the page, of course.

Yes.  So you're still assuming that there will be a later FPI-containing
WAL record to fix up the mess you created.  What if there isn't?
        regards, tom lane


Re: FPI

From
Robert Haas
Date:
On Tue, Feb 1, 2011 at 12:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> So I'm back to proposing that we just apply FPI-free WAL records
>> unconditionally, without regard to the LSN.  This could potentially
>> corrupt the page, of course.
>
> Yes.  So you're still assuming that there will be a later FPI-containing
> WAL record to fix up the mess you created.  What if there isn't?

In that case, the page shouldn't be corrupted.  The possibility of
corruption comes from the fact that a future WAL record might
rearrange the page contents so that the current WAL record is no
longer applying to the set of tuples it expects to be seeing.  But any
such action would necessarily induce an FPI.  If there is no such
action, then how can the page get into a state where replaying a heap
delete will corrupt it?  For that to happen, the item pointer list has
to have changed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company