Re: right sibling is not next child - Mailing list pgsql-bugs

From Tom Lane
Subject Re: right sibling is not next child
Msg-id 6646.1144891859@sss.pgh.pa.us
In response to Re: right sibling is not next child  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
I wrote:
> Does that index contain any sensitive data, and if not could I trouble
> you for a copy?  I'm still not clear on the mechanism by which the
> indexes got corrupted like this.

Oh, never mind ... I've sussed it.

nbtxlog.c's forget_matching_split() assumes it can look into the page
that was just updated to get the block number associated with a non-leaf
insertion.  This is OK *only if the page has exactly its state at the
time of the WAL record*.  However, btree_xlog_insert() is coded to do
nothing if the page has an LSN larger than the WAL record's LSN --- that
is, if the page reflects a state *later than* this insertion.  So if the
page is newer than that --- say, there were some subsequent insertions at
earlier positions in the page --- forget_matching_split() would pick up
the wrong downlink and hence fail to erase the pending split it should
have erased.
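
The failure mode described above can be reproduced with a toy model. This is only a sketch under stated assumptions: `Page`, `InsertRecord`, `redo_insert`, and both `forget_*` helpers are hypothetical stand-ins written for illustration, not the real structures or functions in nbtxlog.c. In particular, `forget_by_record` is just one possible safer approach and is not claimed to be the fix that went into HEAD. The on-disk page is assumed to have been flushed after both insertions, so both WAL records are skipped during replay.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    lsn: int
    downlinks: list = field(default_factory=list)  # child block numbers, in page order


@dataclass
class InsertRecord:
    lsn: int
    offset: int    # slot in the parent page where the downlink was inserted
    downlink: int  # child block number (toy field, assumed to be in the record)


def redo_insert(page, rec):
    """Replay one insertion, skipping it if the page is already at or past it."""
    if page.lsn >= rec.lsn:
        return
    page.downlinks.insert(rec.offset, rec.downlink)
    page.lsn = rec.lsn


def forget_by_page_lookup(page, rec, pending):
    """Flawed variant: read the downlink back out of the page."""
    pending.discard(page.downlinks[rec.offset])


def forget_by_record(page, rec, pending):
    """Hypothetical safer variant: trust the record, not the page."""
    pending.discard(rec.downlink)


def replay(page, records, pending, forget):
    for rec in records:
        redo_insert(page, rec)
        forget(page, rec, pending)
    return pending


# Downlink 7 goes in at slot 1, then downlink 9 at the earlier slot 0.
records = [InsertRecord(lsn=10, offset=1, downlink=7),
           InsertRecord(lsn=20, offset=0, downlink=9)]


def stale_wal_scenario(forget):
    # The page on disk already reflects both insertions (LSN 20).
    page = Page(lsn=20, downlinks=[9, 3, 7, 5])
    return replay(page, records, {7, 9}, forget)


print(stale_wal_scenario(forget_by_page_lookup))  # {7}: split for block 7 never erased
print(stale_wal_scenario(forget_by_record))       # set()
```

In the flawed run, replaying the first record looks at slot 1 of the newer page and finds block 3 rather than block 7, so the pending split for block 7 is left behind.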

I believe this bug stays latent as long as full_page_writes = on,
because in that situation the first touch of any index page after a
checkpoint will rewrite the whole page, and so we'll never be looking
at an index page state newer than the WAL record.  That explains why
no one has tripped over it before.
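
A small simulation may make the full_page_writes argument concrete. It is a sketch, not PostgreSQL code: the `page_image` field and the unconditional restore in `redo_insert` are simplifying assumptions about how a full-page image behaves during replay, and all names are hypothetical. The point is only that once the first record after the checkpoint restores the whole page, every later lookup sees the page exactly as it was at that record.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Page:
    lsn: int
    downlinks: list = field(default_factory=list)  # child block numbers, in page order


@dataclass
class InsertRecord:
    lsn: int
    offset: int
    downlink: int
    page_image: Optional[list] = None  # page contents as of this record, if logged


def redo_insert(page, rec):
    if rec.page_image is not None:
        # Assumption: a logged full-page image always overwrites the page.
        page.downlinks = list(rec.page_image)
        page.lsn = rec.lsn
    elif page.lsn < rec.lsn:
        page.downlinks.insert(rec.offset, rec.downlink)
        page.lsn = rec.lsn


def replay_with_page_lookup(page, records, pending):
    """Replay records, erasing pending splits by reading the page itself."""
    for rec in records:
        redo_insert(page, rec)
        pending.discard(page.downlinks[rec.offset])
    return pending


def make_records(full_page_writes):
    # Only the first touch after the checkpoint carries a full-page image.
    first_image = [3, 7, 5] if full_page_writes else None
    return [InsertRecord(10, 1, 7, page_image=first_image),
            InsertRecord(20, 0, 9)]


def on_disk_page():
    # Page was flushed after both insertions, so it is newer than either record.
    return Page(lsn=20, downlinks=[9, 3, 7, 5])


print(replay_with_page_lookup(on_disk_page(), make_records(True), {7, 9}))   # set()
print(replay_with_page_lookup(on_disk_page(), make_records(False), {7, 9}))  # {7}
```

With the image present, the page is rolled back to its state at LSN 10 and the second record is applied normally, so the page lookup finds the right downlink each time.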

The particular case we are looking at in Panel_pkey seems to require
some additional assumptions to explain the state of the index, but
I've got no doubt this is the core of the problem.

Since we're not going to support full_page_writes = off in 8.1.*,
there's no need for a back-patched fix, but I'll see about making it
safer in HEAD.

            regards, tom lane
