Thread: Corrupt index stopping autovacuum system wide

Corrupt index stopping autovacuum system wide

From
Aaron Pelz
Date:
Hello,

Over the weekend we noticed that the age of our oldest transaction IDs had been continuously increasing - we got an alert when it passed 1B - and, looking at pg_stat_user_tables, we believe that no autovacuums had run for a period of about a month. We had not changed any autovacuum tuning parameters over that time period, and many tables were far over the threshold for needing an autovacuum.
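
For reference, the per-database XID age can be checked with something along these lines:

-- age of the oldest unfrozen transaction ID in each database
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;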

When investigating I located the table with the oldest transaction ID with:

SELECT c.oid::regclass AS table_name,
       greatest(age(c.relfrozenxid), age(t.relfrozenxid)) AS age,
       pg_size_pretty(pg_table_size(c.oid)) AS table_size,
       c.relkind,
       c.relpersistence
FROM pg_class c
LEFT JOIN pg_class t ON c.reltoastrelid = t.oid
WHERE c.relkind IN ('r', 't', 'm')
ORDER BY age DESC
LIMIT 40;

I vacuumed that table manually with `vacuum freeze verbose table_xx` and got this error:

INFO:  aggressively vacuuming "public.table_xx"
INFO:  scanned index "table_xx_pkey" to remove 168443 row versions
DETAIL:  CPU: user: 0.13 s, system: 0.00 s, elapsed: 0.13 s
INFO:  scanned index "table_xx_col_id" to remove 168443 row versions
DETAIL:  CPU: user: 0.16 s, system: 0.00 s, elapsed: 0.16 s
ERROR:  failed to re-find parent key in index "table_xx_col_idx" for deletion target page 217

I replaced the index (created a new one concurrently, dropped the old one concurrently), vacuumed the table, and immediately autovacuums started across the system and our XID ages started falling. To me it looks like a *single* corrupt index held up autovacuums across our entire server, even in other databases on the same server. Am I interpreting this correctly? Would love guidance on diagnosing this type of thing and strategies for preventing it.
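
In case it helps anyone hitting the same thing, the replacement pattern was essentially the following (the indexed column name here is just a placeholder):

-- build a fresh copy of the index without blocking writes
CREATE INDEX CONCURRENTLY table_xx_col_idx_new ON table_xx (some_col);
-- drop the corrupt index without taking an exclusive lock on the table
DROP INDEX CONCURRENTLY table_xx_col_idx;
-- optionally take over the old name
ALTER INDEX table_xx_col_idx_new RENAME TO table_xx_col_idx;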

Thanks,
Aaron

Re: Corrupt index stopping autovacuum system wide

From
Peter Geoghegan
Date:
On Wed, Jul 17, 2019 at 8:54 AM Aaron Pelz <aaronepelz@gmail.com> wrote:
> To me it looks like a *single* corrupt index held up autovacuums across our entire server, even in other
> databases on the same server. Am I interpreting this correctly?

Yes -- that is correct.

What PostgreSQL version are you on? Was this an INCLUDE index on PostgreSQL 11?

> Would love guidance on diagnosing this type of thing and strategies for preventing it.

It's possible that amcheck would have given you an accurate diagnosis
of the problem -- especially if you used bt_index_parent_check():

https://www.postgresql.org/docs/current/amcheck.html
-- 
Peter Geoghegan



Re: Corrupt index stopping autovacuum system wide

From
Aaron Pelz
Date:
> What PostgreSQL version are you on? Was this an INCLUDE index on PostgreSQL 11?

On 11, and no, it was just a normal btree.

> It's possible that amcheck would have given you an accurate diagnosis
> of the problem -- especially if you used bt_index_parent_check():

I'll look into this, seems helpful. Thanks!

On Wed, Jul 17, 2019 at 12:21 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Jul 17, 2019 at 8:54 AM Aaron Pelz <aaronepelz@gmail.com> wrote:
> To me it looks like a *single* corrupt index held up autovacuums across our entire server, even in other databases on the same server. Am I interpreting this correctly?

Yes -- that is correct.

What PostgreSQL version are you on? Was this an INCLUDE index on PostgreSQL 11?

> Would love guidance on diagnosing this type of thing and strategies for preventing it.

It's possible that amcheck would have given you an accurate diagnosis
of the problem -- especially if you used bt_index_parent_check():

https://www.postgresql.org/docs/current/amcheck.html
--
Peter Geoghegan

Re: Corrupt index stopping autovacuum system wide

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> On Wed, Jul 17, 2019 at 8:54 AM Aaron Pelz <aaronepelz@gmail.com> wrote:
>> To me it looks like a *single* corrupt index held up autovacuums across our entire server, even in other
>> databases on the same server. Am I interpreting this correctly?

> Yes -- that is correct.

It looks that way, but how would a broken non-shared index have held up
autovacuuming in other databases?  Maybe, as this one's xmin horizon
got further and further behind, the launcher eventually stopped
considering launching workers into any other databases?  That seems
like a bad thing; it's postponing work that will need to be done
eventually.

            regards, tom lane



Re: Corrupt index stopping autovacuum system wide

From
Peter Geoghegan
Date:
On Wed, Jul 17, 2019 at 9:57 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It looks that way, but how would a broken non-shared index have held up
> autovacuuming in other databases?  Maybe, as this one's xmin horizon
> got further and further behind, the launcher eventually stopped
> considering launching workers into any other databases?  That seems
> like a bad thing; it's postponing work that will need to be done
> eventually.

I don't know exactly how the launcher would behave offhand, but it's
clear that not being able to VACUUM one table in one database (because
it has a corrupt index) ultimately risks the availability of every
database in the cluster. Many installations receive little to no
supervision, so it may just be a matter of time there. That is
certainly a bad thing.

-- 
Peter Geoghegan



Re: Corrupt index stopping autovacuum system wide

From
Peter Geoghegan
Date:
On Wed, Jul 17, 2019 at 9:21 AM Peter Geoghegan <pg@bowt.ie> wrote:
> It's possible that amcheck would have given you an accurate diagnosis
> of the problem -- especially if you used bt_index_parent_check():
>
> https://www.postgresql.org/docs/current/amcheck.html

BTW, be sure to use the 'heapallindexed' option with
bt_index_parent_check() to detect missing downlinks, which is exactly
the problem that VACUUM complained about. Hopefully the problem will
be limited to the single index that you've already REINDEXed. In
theory the same problem could be hiding in other indexes, though I
don't consider that particularly likely.

Note that bt_index_parent_check() requires a lock on tables that
effectively blocks writes, but not reads, so verification may require
planning or coordination. bt_index_check() doesn't have any of these
problems, but also won't detect missing downlinks specifically.
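
Something along these lines runs both checks (substitute whichever index you want to verify):

CREATE EXTENSION IF NOT EXISTS amcheck;

-- Lighter check: only takes an AccessShareLock, but cannot detect
-- missing downlinks; the second argument enables heapallindexed
-- verification.
SELECT bt_index_check('table_xx_col_idx'::regclass, true);

-- Thorough check: takes a ShareLock (blocks writes, not reads) and
-- verifies parent/child relationships, so it can catch missing
-- downlinks like the one VACUUM complained about.
SELECT bt_index_parent_check('table_xx_col_idx'::regclass, true);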

-- 
Peter Geoghegan



Re: Corrupt index stopping autovacuum system wide

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> On Wed, Jul 17, 2019 at 9:57 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> It looks that way, but how would a broken non-shared index have held up
>> autovacuuming in other databases?  Maybe, as this one's xmin horizon
>> got further and further behind, the launcher eventually stopped
>> considering launching workers into any other databases?  That seems
>> like a bad thing; it's postponing work that will need to be done
>> eventually.

> I don't know exactly how the launcher would behave offhand, but it's
> clear that not being able to VACUUM one table in one database (because
> it has a corrupt index) ultimately risks the availability of every
> database in the cluster. Many installations receive little to no
> supervision, so it may just be a matter of time there. That is
> certainly a bad thing.

Right, you're eventually going to get to a forced shutdown if vacuum never
succeeds on one table; no question that that's bad.  My concern here is
that if we have blinders on to the extent of only processing that one
table or DB, we're unnecessarily allowing bloat to occur in other tables,
and causing that missed vacuuming work to pile up so that there's more of
it to be done once the breakage is cleared.  If the DBA doesn't notice the
problem until getting into a forced shutdown, that is going to extend his
outage time --- and, in a really bad worst case, maybe make the difference
between being able to recover at all and not.

            regards, tom lane



Re: Corrupt index stopping autovacuum system wide

From
Peter Geoghegan
Date:
On Wed, Jul 17, 2019 at 10:27 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Right, you're eventually going to get to a forced shutdown if vacuum never
> succeeds on one table; no question that that's bad.

It occurs to me that we use operator class/insertion scankey
comparisons within page deletion, to relocate a leaf page that looks
like a candidate for deletion. Despite this, README.hot claims:

"Standard vacuuming scans the indexes to ensure all such index entries
are removed, amortizing the index scan cost across as many dead tuples
as possible; this approach does not scale down well to the case of
reclaiming just a few tuples.  In principle one could recompute the
index keys and do standard index searches to find the index entries,
but this is risky in the presence of possibly-buggy user-defined
functions in functional indexes.  An allegedly immutable function that
in fact is not immutable might prevent us from re-finding an index
entry"

That probably wasn't the problem in Aaron's case, but it is worth
considering as a possibility.

> My concern here is
> that if we have blinders on to the extent of only processing that one
> table or DB, we're unnecessarily allowing bloat to occur in other tables,
> and causing that missed vacuuming work to pile up so that there's more of
> it to be done once the breakage is cleared.  If the DBA doesn't notice the
> problem until getting into a forced shutdown, that is going to extend his
> outage time --- and, in a really bad worst case, maybe make the difference
> between being able to recover at all and not.

The comment about "...any db at risk of Xid wraparound..." within
do_start_worker() hints at such a problem.

Maybe nbtree VACUUM should do something more aggressive than give up
when there is a "failed to re-find parent key" or similar condition.
Perhaps it would make more sense to make the index inactive (for some
value of "inactive") instead of just complaining. That might be the
least worst option, all things considered.

--
Peter Geoghegan



Re: Corrupt index stopping autovacuum system wide

From
Alvaro Herrera
Date:
On 2019-Jul-17, Peter Geoghegan wrote:

> Maybe nbtree VACUUM should do something more aggressive than give up
> when there is a "failed to re-find parent key" or similar condition.
> Perhaps it would make more sense to make the index inactive (for some
> value of "inactive") instead of just complaining. That might be the
> least worst option, all things considered.

Maybe we can mark an index as unvacuumable in some way?  As far as I
understand, all queries using that index work, as do index updates; it's
just vacuuming that fails.  If we mark the index as unvacuumable, then
vacuum just skips it (and does not run phase 3 for that table), and
things can proceed; the table's age can still be advanced.  Obviously
it'll result in more bloat than under normal conditions, but it shouldn't
cause the whole cluster to go down.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Corrupt index stopping autovacuum system wide

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2019-Jul-17, Peter Geoghegan wrote:
>> Maybe nbtree VACUUM should do something more aggressive than give up
>> when there is a "failed to re-find parent key" or similar condition.
>> Perhaps it would make more sense to make the index inactive (for some
>> value of "inactive") instead of just complaining. That might be the
>> least worst option, all things considered.

> Maybe we can mark an index as unvacuumable in some way?  As far as I
> understand, all queries using that index work, as do index updates; it's
> just vacuuming that fails.  If we mark the index as unvacuumable, then
> vacuum just skips it (and does not run phase 3 for that table), and
> things can proceed; the table's age can still be advanced.  Obviously
> it'll result in more bloat than in normal condition, but it shouldn't
> cause the whole cluster to go down.

If an index is corrupt enough to break vacuum, I think it takes a rather
large leap of faith to believe that it's not going to cause problems for
inserts or searches.  I'd go with just marking the index broken and
insisting that it be REINDEX'd before we touch it again.

Not sure how we make that happen automatically though, or if we even
should.
(a) once the transaction's failed, you can't go making catalog updates; 
(b) even when you know the transaction's failed, blaming it on a
particular index seems a bit chancy; 
(c) automatically disabling constraint indexes seems less than desirable.

            regards, tom lane



Re: Corrupt index stopping autovacuum system wide

From
Alvaro Herrera
Date:
On 2019-Jul-17, Tom Lane wrote:

> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > On 2019-Jul-17, Peter Geoghegan wrote:
> >> Maybe nbtree VACUUM should do something more aggressive than give up
> >> when there is a "failed to re-find parent key" or similar condition.
> >> Perhaps it would make more sense to make the index inactive (for some
> >> value of "inactive") instead of just complaining. That might be the
> >> least worst option, all things considered.
> 
> > Maybe we can mark an index as unvacuumable in some way?  As far as I
> > understand, all queries using that index work, as do index updates; it's
> > just vacuuming that fails.  If we mark the index as unvacuumable, then
> > vacuum just skips it (and does not run phase 3 for that table), and
> > things can proceed; the table's age can still be advanced.  Obviously
> > it'll result in more bloat than under normal conditions, but it shouldn't
> > cause the whole cluster to go down.
> 
> If an index is corrupt enough to break vacuum, I think it takes a rather
> large leap of faith to believe that it's not going to cause problems for
> inserts or searches.

Maybe, but it's what happened in the reported case.  (Note Aaron was
careful to do the index replacement concurrently -- he wouldn't have
done that if the table wasn't in active use.)

> I'd go with just marking the index broken and
> insisting that it be REINDEX'd before we touch it again.

This might make things worse operationally, though.  If searches aren't
failing but vacuum is, we'd break a production system that currently
works.

> (a) once the transaction's failed, you can't go making catalog updates; 

Maybe we can defer the actual update to some other transaction -- say
register an autovacuum work-item, which can be executed separately.

> (b) even when you know the transaction's failed, blaming it on a
> particular index seems a bit chancy; 

Well, vacuum knows what index is being processed.  Maybe you're thinking
that autovac can get an out-of-memory condition or something like that;
perhaps we can apply the above only when an ERRCODE_DATA_CORRUPTED
condition is reported (and make sure all such conditions do that.  As
far as I remember we have a patch for this particular error to be
reported as such.)

> (c) automatically disabling constraint indexes seems less than desirable.

Disabling them for writes, yeah.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Corrupt index stopping autovacuum system wide

From
Peter Geoghegan
Date:
On Wed, Jul 17, 2019 at 11:43 AM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> This might make things worse operationally, though.  If searches aren't
> failing but vacuum is, we'd break a production system that currently
> works.

If searches aren't failing but VACUUM is, then that's probably down
to dumb luck. The user's luck could change at any time (actually, it's
quite possible that the index is already giving wrong answers without
anybody realizing). That's not always true, of course -- you could
have an OOM condition in VACUUM, where it really does make sense to
retry. But it should be true for the category of errors where we'd
behave more aggressively than just giving up, such as the "failed to
re-find parent key" error Aaron noticed.

> Well, vacuum knows what index is being processed.  Maybe you're thinking
> that autovac can get an out-of-memory condition or something like that;
> perhaps we can limit the above only when an ERRCODE_DATA_CORRUPTED
> condition is reported (and make sure all such conditions do that.  As
> far as I remember we have a patch for this particular error to be
> reported as such.)

I don't think that it would be that hard to identify errors that
nbtree VACUUM could throw that clearly indicate corruption, without
any hope of the problem self-correcting without the DBA running a
REINDEX. There will be a small amount of gray area, perhaps, but
probably not enough to matter.

> > (c) automatically disabling constraint indexes seems less than desirable.
>
> Disabling them for writes, yeah.

I think that it's fair to say that all bets are off once you see the
"failed to re-find parent key" error, or any other such error that
indicates corruption. Admittedly it isn't 100% clear that disabling
constraint enforcement to unblock autovacuum for the whole cluster is
better than any available alternative; it's really hard to reason
about things when we already know that the database has corruption.

I think that it's okay that almost anything can break when somebody
creates an index on a non-immutable expression (including VACUUM),
provided that all the problems only affect the table with the broken
index. OTOH, taking down the entire Postgres cluster as an indirect
consequence of one person's ill-considered CREATE INDEX really sucks.
That distinction seems important to me.

-- 
Peter Geoghegan



Re: Corrupt index stopping autovacuum system wide

From
Peter Geoghegan
Date:
On Wed, Jul 17, 2019 at 10:27 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > It's possible that amcheck would have given you an accurate diagnosis
> > of the problem -- especially if you used bt_index_parent_check():
> >
> > https://www.postgresql.org/docs/current/amcheck.html
>
> BTW, be sure to use the 'heapallindexed' option with
> bt_index_parent_check() to detect missing downlinks, which is exactly
> the problem that VACUUM complained about.

Can you tell us more about this index? Can you share its definition
(i.e. what does \d show in psql)?

Is it an expression index, or a partial index? A composite? What
datatypes are indexed?

Thanks
--
Peter Geoghegan



Re: Corrupt index stopping autovacuum system wide

From
Andres Freund
Date:
Hi,

On 2019-07-17 13:27:23 -0400, Tom Lane wrote:
> My concern here is that if we have blinders on to the extent of only
> processing that one table or DB, we're unnecessarily allowing bloat to
> occur in other tables, and causing that missed vacuuming work to pile
> up so that there's more of it to be done once the breakage is cleared.

That's a pretty common problem in the real world, with or without problems
like corrupted indexes. Autovacuum's scheduling is just too simplistic
to avoid it.

Common problems:

- If one table in one database has a relfrozenxid older than
  autovacuum_freeze_max_age, the autovacuum launcher continuously throws
  more workers at that database, even though there's a lot of work to be
  done in other databases.

  That's because do_start_worker() ignores the "adl_next_worker"
  mechanism, and *always* starts more workers for the database with the
  oldest datfrozenxid (same with datminmxid), and because we explicitly
  don't start workers for other databases ("/* ignore not-at-risk DBs
  */").

  That's especially bad if that database has a long-running transaction
  preventing cleanup from happening - the vacuuming will not be able to
  advance the horizon, and all other databases are starved.


- If there's a table "early in pg_class", and vacuum on it fails, it
  will prevent vacuuming other tables. It's pretty common for vacuum to
  fail often for a table, e.g. because it's sometimes exclusively locked,
  which then causes autovacuum to kill itself.  There's absolutely no
  mechanism for autovacuum workers to skip over that table for a while,
  leading to all other tables in the database not being vacuumed, unless
  there happens to be a second worker in the database while the first
  worker's vacuum hasn't failed yet.

  This obviously also applies to the corrupted index case.

  The 'kill itself' behaviour is exacerbated by lazy_truncate_heap()'s
  exclusive lock - which will obviously trigger other backends to send
  cancellation requests. There's unfortunately a number of places where
  that leads us to just throw away all the work done, and not update
  pg_class.relfrozenxid/datfrozenxid.


- Anti-wraparound vacuums are more impactful (don't cancel themselves
  upon lock conflicts, cause more IO, wait for cleanup locks), often
  emit scary messages ("oldest xmin is far in the past").  But we don't
  have *any* mechanism that avoids them for very common scenarios.

  E.g. for insert-mostly workloads, there'll never be enough dead tuples
  to cause a vacuum to happen before autovacuum_freeze_max_age is
  reached. That prevents vacuum_freeze_table_age from pre-empting the
  need to do an anti-wraparound vacuum, by increasing the xid horizon
  (a sketch of the manual workaround this currently forces is included
  after this list).

  We literally call anti-wraparound autovacuums "emergency" in the code,
  yet they're an almost inevitable part of running postgres.


- There's no meaningful in-database prioritization, which means that
  autovacuum workers might happily vacuum a table that's just a bit over
  the thresholds, even though there are much worse tables around.

  Especially on busy and large databases that can lead to workers
  started for anti-wraparound reasons effectively never getting to
  the tables above autovacuum_freeze_max_age, because tables earlier
  in pg_class are modified heavily enough that they have dead tuples
  above the thresholds by the time vacuum finishes.  To get to the
  table needing the anti-wraparound vacuum, a single worker needs to
  go through all the tables preceding it in pg_class that need
  vacuuming (only skipping over ones that are concurrently vacuumed by
  somebody else, but not ones that have *recently* been vacuumed).

  I kinda forgot how bad this one was until looking at the code again.

And there are plenty more.
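
To illustrate the insert-mostly point above: today one basically has to
find such tables by hand and freeze them on one's own schedule, with
something like the following (the thresholds are just examples):

SELECT s.relid::regclass AS table_name,
       age(c.relfrozenxid) AS xid_age,
       s.n_dead_tup
FROM pg_stat_user_tables s
JOIN pg_class c ON c.oid = s.relid
WHERE age(c.relfrozenxid) > 150000000  -- approaching the 200M default
  AND s.n_dead_tup < 1000              -- insert-mostly, so autovacuum won't trigger
ORDER BY xid_age DESC;

followed by a manual VACUUM (FREEZE) of whatever that returns.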


My impression is that these are really hard to fix unless we develop a
new scheduling approach. And that scheduling approach probably needs to
be more stateful than the current code.

IMO these are the main concerns for how work needs to be distributed:

- work in different databases needs to be scheduled in a smarter way; in
  particular, anti-wraparound cannot simply cause only the database with
  the oldest datfrozenxid to be vacuumed until the wraparound
  "danger" is over (especially not if there are other databases needing
  to be anti-wrap vacuumed)

- tables within a database need to be prioritized in a smarter way, so
  tables with a lot of bloat get vacuumed before ones with a lot less
  bloat, and similarly tables with the oldest horizon need to be
  vacuumed before ones with newer horizons, even if all of the tables
  are above the autovacuum thresholds.

- tables need to be prioritized across databases, to avoid problems like
  one currently vacuumed table causing unnecessary anti-wraparound
  workers to be launched for a database where they can't perform any
  work.

- there needs to be a separate rung between normal autovacuums and
  anti-wraparound vacuums. The in-between level does 'aggressive'
  vacuuming (so it freezes), even for tables with just inserts, but
  stays interruptible, and doesn't have a scary 'anti-wraparound' name.

- Repeated failures on one table must not starve all work for other
  tables in a database.

- errors vacuuming one table must not keep vacuuming of tables with a
  lower priority from happening until that error is fixed.



To me that means that we need prioritization across databases, and
between tables, and probably by multiple criteria. I suspect there need
to be multiple criteria for how urgent vacuuming is, and autovacuum ought to
try to make progress on all of them.


I suspect the way to implement that would be to have the avlauncher start
workers for all databases independently of whether there is work to be
done (but on a slower schedule than autovacuum_naptime). Those workers
would then put the tables needing vacuuming in their database into a
global prioritized list, displacing entries that are less important once
the list is full.  Proper autovacuuming work would then be done strictly
in priority order.

To keep one priority criterion (e.g. age(relfrozenxid)) from preventing
any work being done on other criteria (e.g. number of dead tuples), I
suspect we ought to round-robin between the different criteria.  E.g.
have one list of tables needing vacuum ordered by age(relfrozenxid), one
by %dead-tuples, and one by last-failure time, and have workers try to
make progress on all.

If that sounds similar to what is being argued about around UNDO
processing in workers, that'd not be an accident.




A lesser, but still pretty pressing, concern is that we currently have
no system that maintains the visibility map readily for insert-mostly
tables. That means that a) index-only scans aren't going to happen for
such tables unless they're manually vacuumed, and b) the anti-wraparound
vacuum that eventually is going to happen is going to be extremely
expensive/slow.
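
To see how bad that currently is for a given table, the pg_visibility
extension can report the state of its visibility map (the table name
below is a placeholder):

CREATE EXTENSION IF NOT EXISTS pg_visibility;

-- number of heap pages currently marked all-visible and all-frozen;
-- low counts relative to the table size mean index-only scans won't
-- help, and the eventual anti-wraparound vacuum has a lot left to do
SELECT * FROM pg_visibility_map_summary('insert_mostly_table'::regclass);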


Regards,

Andres



Re: Corrupt index stopping autovacuum system wide

From
shauncutts@factfiber.com
Date:
On 17.07.2019 18:14, Andres Freund wrote:
>
>
> To me that means that we need prioritization across databases, and
> between tables, and probably by multiple criteria. I suspect there 
> need
> to be multiple criteria how urgent vacuuming is, and autovacuum ought 
> to
> try to make progress on all of them.
>

I'm not a contributor, and don't know the code base, so I don't normally
comment. But it occurs to me that in the present case there may be a
short-term workaround for the current problem: could autovacuum visit
tables in a random order? Then at least all tables would get vacuumed
eventually (in the current case). If, later, more complex priorities are
implemented, they could become probabilistic weights.

Best,

-- Shaun Cutts



Re: Corrupt index stopping autovacuum system wide

From
Aaron Pelz
Date:
> Can you tell us more about this index? Can you share its definition
> (i.e. what does \d show in psql)?

> Is it an expression index, or a partial index? A composite? What
> datatypes are indexed?

It's a simple btree index on a geometry(Point,4326) column - not an expression index, not partial, not composite.
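
So the shape is essentially (column name is a placeholder):

CREATE INDEX table_xx_col_idx ON table_xx USING btree (geom_col);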

On Wed, Jul 17, 2019 at 3:58 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Jul 17, 2019 at 10:27 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > It's possible that amcheck would have given you an accurate diagnosis
> > of the problem -- especially if you used bt_index_parent_check():
> >
> > https://www.postgresql.org/docs/current/amcheck.html
>
> BTW, be sure to use the 'heapallindexed' option with
> bt_index_parent_check() to detect missing downlinks, which is exactly
> the problem that VACUUM complained about.

Can you tell us more about this index? Can you share its definition
(i.e. what does \d show in psql)?

Is it an expression index, or a partial index? A composite? What
datatypes are indexed?

Thanks
--
Peter Geoghegan

Re: Corrupt index stopping autovacuum system wide

From
Peter Geoghegan
Date:
On Thu, Jul 18, 2019 at 9:06 AM Aaron Pelz <aaronepelz@gmail.com> wrote:
> It's a simple btree expression on a geometry(Point,4326) , no expression no partial no composite.

The cause of the corruption may be a bug in a Postgis B-Tree operator
class. I reported a bug in the Geography type that could lead to
corrupt B-Tree indexes (not Geometry):

https://trac.osgeo.org/postgis/ticket/3841

Though I also see what could be a comparable bug in Geometry:

https://trac.osgeo.org/postgis/ticket/3777

These bugs are from about 3 years ago. If I'm right you should be able
to isolate the bug using amcheck.

-- 
Peter Geoghegan