infinite loop in _bt_getstackbuf - Mailing list pgsql-hackers

From Robert Haas
Subject infinite loop in _bt_getstackbuf
Msg-id CA+TgmoZzd1MB3qqMeJiUXM569JySqYd_uJ9KiBByy6w0iMUrXg@mail.gmail.com
List pgsql-hackers
A colleague at EnterpriseDB today ran into a situation on PostgreSQL
9.3.5 where the server went into an infinite loop while attempting a
VACUUM FREEZE; it couldn't escape _bt_getstackbuf(), and it couldn't
be killed with ^C.  I think we should add a check for interrupts to
that loop somewhere, and possibly make some attempt to notice if we've
been iterating for longer than, say, the lifetime of the universe
until now.

The fundamental structure of that function is an infinite loop.  We
break out of that loop when BTEntrySame(item, &stack->bts_btentry) or
P_RIGHTMOST(opaque) and I'm sure that it's correct to think that, in
theory, one of those things will eventually happen.  But the index
could be corrupted, most obviously by having a page where
opaque->btpo_next points back to the current block number.  If that
happens, you need an immediate shutdown (or some clever gdb hackery)
to terminate the VACUUM.  That's unfortunate and unnecessary.

It also looks like something we can fix, at a minimum by adding a
CHECK_FOR_INTERRUPTS() at the top of that loop, or in some function
that it calls, like _bt_getbuf(), so that if it goes into an infinite
loop, it can at least be killed.  We could also consider adding a check
at the bottom of the loop, just before setting blkno =
opaque->btpo_next, that those values are unequal.  If they are equal,
elog().  Clearly it's possible to have a cycle of length >1, and such
a check wouldn't catch that, but it might still be worth checking for
the trivial case.  Or, we could try to put an upper bound on the
number of iterations that are reasonable and error out if we exceed
that value.  That might be tricky, though; it's not obvious to me that
there's any comfortably small upper bound.
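
To make the three options concrete, here is a toy model in plain C of a
walk along right-sibling links with all three guards in place.  It is
only a sketch and is not the real nbtree code: ToyPage, TOY_NONE,
toy_interrupt_pending, toy_walk_right() and the WalkResult codes are
hypothetical names invented for illustration.  The flag stands in for
CHECK_FOR_INTERRUPTS(), and the return codes stand in for elog().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sentinel: "this page has no right sibling". */
#define TOY_NONE UINT32_MAX

/* Hypothetical stand-in for an index page: one entry plus a right-link. */
typedef struct ToyPage
{
    uint32_t next;      /* block number of right sibling, or TOY_NONE */
    uint32_t downlink;  /* the single entry stored on this toy page */
} ToyPage;

typedef enum WalkResult
{
    WALK_FOUND,         /* entry located; the normal exit */
    WALK_NOT_FOUND,     /* reached the rightmost page; the other normal exit */
    WALK_SELF_LINK,     /* right-link points back to the current block */
    WALK_BAD_LINK,      /* right-link points outside the toy "index" */
    WALK_TOO_LONG,      /* exceeded the caller-supplied iteration bound */
    WALK_INTERRUPTED    /* the pending-interrupt flag was set */
} WalkResult;

/* Hypothetical flag standing in for a pending query cancel. */
static volatile int toy_interrupt_pending = 0;

static WalkResult
toy_walk_right(const ToyPage *pages, uint32_t npages,
               uint32_t start, uint32_t target, uint32_t max_steps)
{
    uint32_t blkno = start;
    uint32_t steps = 0;

    assert(start < npages);

    for (;;)
    {
        const ToyPage *page;

        /* Guard 1: stay killable even if the links form a cycle. */
        if (toy_interrupt_pending)
            return WALK_INTERRUPTED;

        /* Guard 3: give up after an arbitrary upper bound on iterations. */
        if (steps++ >= max_steps)
            return WALK_TOO_LONG;

        page = &pages[blkno];

        /* The two normal exits of the loop. */
        if (page->downlink == target)
            return WALK_FOUND;
        if (page->next == TOY_NONE)
            return WALK_NOT_FOUND;

        /* Guard 2: catch the trivial cycle of length 1. */
        if (page->next == blkno)
            return WALK_SELF_LINK;
        if (page->next >= npages)
            return WALK_BAD_LINK;

        blkno = page->next;
    }
}
```

In this model a page whose right-link points at itself is reported at
once, while a two-page cycle gets past the self-link check and is only
stopped by the iteration bound or by the interrupt flag.  The bound is
supplied by the caller here precisely because it is unclear what a safe
value would be in a real index.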

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


