Re: infinite loop in _bt_getstackbuf - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: infinite loop in _bt_getstackbuf
Msg-id CAM3SWZT7dCes=uOA3NAHYBA1kth=b4pXkszNLMPVtNAAYUp_wg@mail.gmail.com
In response to infinite loop in _bt_getstackbuf  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: infinite loop in _bt_getstackbuf
List pgsql-hackers
On Thu, Oct 30, 2014 at 10:46 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> (9.3.5 problem report)

I think I saw a similar issue on a 9.3.5 instance that was affected
by the "in pg_upgrade, remove pg_multixact files left behind by
initdb" issue (I ran the remediation recommended in the 9.3.5 release
notes). Multiple anti-wraparound vacuums were stuck following a PITR.
I resolved this (as far as I can tell) by killing the autovacuum
workers, and manually running VACUUM FREEZE. I have yet to do any root
cause analysis, but I think I could reproduce the problem.

> The fundamental structure of that function is an infinite loop.  We
> break out of that loop when BTEntrySame(item, &stack->bts_btentry) or
> P_RIGHTMOST(opaque) and I'm sure that it's correct to think that, in
> theory, one of those things will eventually happen.

Not in theory - only in practice. L&Y specifically state:

"We wish to point out here that our algorithms do not prevent the
possibility of livelock (where one process runs indefinitely). This
can happen if a process never terminates because it keeps having to
follow link pointers created by other processes. This might happen in
the case of a process being run on a (relatively) very slow processor
in a multiprocessor system".

> But the index
> could be corrupted, most obviously by having a page where
> opaque->btpo_next points back to the current block number.  If that
> happens, you need an immediate shutdown (or some clever gdb hackery)
> to terminate the VACUUM.  That's unfortunate and unnecessary.

Merlin reported a bug that looked exactly like this. Hardware failure
may now explain the problem.

> It also looks like something we can fix, at a minimum by adding a
> CHECK_FOR_INTERRUPTS() at the top of that loop, or in some function
> that it calls, like _bt_getbuf(), so that if it goes into an infinite
> loop, it can at least be killed.

I think that it might be a good idea to detect circular _bt_moveright()
moves (the direct offender in Merlin's case, which has very similar
logic to your _bt_getstackbuf() problem case). I'm pretty
sure that it's exceptional for there to be more than 2 or 3 retries in
_bt_moveright(). It would probably be fine to consider the possibility
that we'll never finish once we get past 5 retries or something like
that. We'd then start keeping track of blocks visited, and raise an
error when a page was visited a second time.
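
To make the idea concrete, here is a rough, standalone sketch of that
scheme. It is not nbtree code: the right-links are modelled as a plain
array, and every name in it (walk_right, FREE_RETRIES, MAX_TRACKED,
NO_BLOCK, the WalkResult codes) is made up for illustration. The walk
also simply stops at the rightmost block, where the real functions stop
once they find the page they are looking for. The threshold of 5 is the
"5 retries or something like that" guess from above, not a tuned value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t BlockNo;

#define NO_BLOCK      UINT32_MAX  /* hypothetical "no right sibling" marker */
#define FREE_RETRIES  5           /* moves allowed before we start tracking */
#define MAX_TRACKED   1024        /* arbitrary cap on remembered blocks */

typedef enum
{
    WALK_OK,        /* reached a block with no right sibling */
    WALK_CYCLE,     /* a tracked block was visited a second time */
    WALK_BAD_LINK,  /* right-link points outside the toy "index" */
    WALK_GAVE_UP    /* ran out of tracking space without an answer */
} WalkResult;

/*
 * Follow right-links starting at 'start'.  next_block[i] is the
 * right-link of block i (a toy stand-in for btpo_next).  The first
 * FREE_RETRIES moves cost nothing extra; after that, each block we land
 * on is remembered, and landing on a remembered block is reported as a
 * cycle.  *stopped_at is set to the block we were on when we returned.
 */
WalkResult
walk_right(const BlockNo *next_block, size_t nblocks,
           BlockNo start, BlockNo *stopped_at)
{
    BlockNo seen[MAX_TRACKED];
    size_t  nseen = 0;
    int     retries = 0;
    BlockNo cur = start;

    assert(start < nblocks);

    for (;;)
    {
        BlockNo next;

        *stopped_at = cur;

        if (retries >= FREE_RETRIES)
        {
            /* Exceptional path only, so a linear scan is good enough. */
            for (size_t i = 0; i < nseen; i++)
                if (seen[i] == cur)
                    return WALK_CYCLE;
            if (nseen == MAX_TRACKED)
                return WALK_GAVE_UP;
            seen[nseen++] = cur;
        }

        next = next_block[cur];
        if (next == NO_BLOCK)
            return WALK_OK;
        if (next >= nblocks)
            return WALK_BAD_LINK;

        cur = next;
        retries++;
    }
}
```

With right-links {1, 2, NO_BLOCK} the walk from block 0 ends normally on
block 2 without ever touching the tracking array. With {1, 2, 2} (block
2 pointing back at itself, as in the corruption described upthread) it
reports WALK_CYCLE on block 2 once tracking kicks in. In a real patch
the cycle case would presumably become an ereport(ERROR, ...) rather
than a return code.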

-- 
Peter Geoghegan


