Re: Failure while inserting parent tuple to B-tree is not fun - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Failure while inserting parent tuple to B-tree is not fun
Date
Msg-id 20131022183442.GG7435@awork2.anarazel.de
In response to Re: Failure while inserting parent tuple to B-tree is not fun  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: Failure while inserting parent tuple to B-tree is not fun  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On 2013-10-22 21:29:13 +0300, Heikki Linnakangas wrote:
> On 22.10.2013 21:25, Andres Freund wrote:
> >On 2013-10-22 19:55:09 +0300, Heikki Linnakangas wrote:
> >>Splitting a B-tree page is a two-stage process: First, the page is split,
> >>and then a downlink for the new right page is inserted into the parent
> >>(which might recurse to split the parent page, too). What happens if
> >>inserting the downlink fails for some reason? I tried that out, and it turns
> >>out that it's not nice.
> >>
> >>I used this to cause a failure:
> >>
> >>>--- a/src/backend/access/nbtree/nbtinsert.c
> >>>+++ b/src/backend/access/nbtree/nbtinsert.c
> >>>@@ -1669,6 +1669,8 @@ _bt_insert_parent(Relation rel,
> >>>            _bt_relbuf(rel, pbuf);
> >>>        }
> >>>
> >>>+        elog(ERROR, "fail!");
> >>>+
> >>>        /* get high key from left page == lowest key on new right page */
> >>>        ritem = (IndexTuple) PageGetItem(page,
> >>>                                         PageGetItemId(page, P_HIKEY));
> >>
> >>postgres=# create table foo (i int4 primary key);
> >>CREATE TABLE
> >>postgres=# insert into foo select generate_series(1, 10000);
> >>ERROR:  fail!
> >>
> >>That's not surprising. But when I removed that elog again and restarted the
> >>server, I still can't insert. The index is permanently broken:
> >>
> >>postgres=# insert into foo select generate_series(1, 10000);
> >>ERROR:  failed to re-find parent key in index "foo_pkey" for split pages 4/5
> >>
> >>In real life, you would get a failure like this e.g. if you run out of memory
> >>or disk space while inserting the downlink to the parent. Although rare in
> >>practice, it's no fun if it happens.
> >
> >Why doesn't the incomplete split mechanism prevent this? Because we do
> >not delay checkpoints on the primary and a checkpoint happened just
> >before your elog(ERROR) above?
> 
> Because there's no recovery involved. The failure I injected (or an
> out-of-memory or out-of-disk-space in the real world) doesn't cause a PANIC,
> just an ERROR that rolls back the current transaction, nothing more.
> 
> We could put a critical section around the whole recursion that inserts the
> downlinks, so that you would get a PANIC and the incomplete split mechanism
> would fix it at recovery. But that would hardly be an improvement.

You were talking about restarting the server, that's why I assumed
recovery had been involved... But you were just talking about removing
the elog() again.

For me this clearly *has* to be in a critical section with the current
code. I had always assumed all multi-part actions would be.
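
To make the difference concrete, here is a minimal toy sketch, not PostgreSQL
source. All names in it (ToyIndex, toy_split, toy_crash_recovery and so on) are
hypothetical and only model the behaviour discussed above: an error raised
inside a critical section is escalated to a PANIC, and the recovery that
follows a PANIC gets the chance to finish the incomplete split, whereas a plain
ERROR just rolls back and leaves the split half-done.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: none of these names are PostgreSQL APIs. */
typedef enum { SPLIT_OK, SPLIT_ERROR, SPLIT_PANIC } SplitOutcome;

typedef struct {
    int right_page_exists;   /* stage 1 done: the page was split */
    int downlink_in_parent;  /* stage 2 done: parent points at the right page */
} ToyIndex;

/* Models a nesting counter for critical sections. */
static int toy_crit_depth = 0;

/* An error raised while inside a critical section is escalated to a panic. */
static SplitOutcome toy_raise_error(void)
{
    return toy_crit_depth > 0 ? SPLIT_PANIC : SPLIT_ERROR;
}

/*
 * Two-stage split. fail_downlink injects a failure in stage 2;
 * use_crit wraps stage 2 in the (modelled) critical section.
 */
static SplitOutcome toy_split(ToyIndex *idx, int fail_downlink, int use_crit)
{
    SplitOutcome result = SPLIT_OK;

    assert(idx != NULL);

    idx->right_page_exists = 1;          /* stage 1: split the page */

    if (use_crit)
        toy_crit_depth++;
    if (fail_downlink)
        result = toy_raise_error();      /* stage 2 fails */
    else
        idx->downlink_in_parent = 1;     /* stage 2: insert the downlink */
    if (use_crit)
        toy_crit_depth--;

    return result;
}

/* Models recovery after a panic: any incomplete split gets finished. */
static void toy_crash_recovery(ToyIndex *idx)
{
    if (idx->right_page_exists && !idx->downlink_in_parent)
        idx->downlink_in_parent = 1;
}

/* The index is consistent when both stages happened or neither did. */
static int toy_is_consistent(const ToyIndex *idx)
{
    return idx->right_page_exists == idx->downlink_in_parent;
}
```

Calling toy_split() with fail_downlink set and use_crit clear returns
SPLIT_ERROR and leaves the toy index inconsistent, with nothing to repair it
afterwards. With use_crit set, the same failure returns SPLIT_PANIC, and
toy_crash_recovery() brings the index back to a consistent state. The sketch
says nothing about the cost of a PANIC for every other session, which is the
objection raised in the quoted text.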

Do you foresee the fix that ignores missing downlinks being
back-patchable? FWIW, I think I might have seen real-world cases of
this.

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


