Re: 9.4 btree index corruption - Mailing list pgsql-hackers

From: Heikki Linnakangas
Subject: Re: 9.4 btree index corruption
Msg-id: 5381FB3D.9010605@vmware.com
In response to: 9.4 btree index corruption (Jeff Janes <jeff.janes@gmail.com>)
List: pgsql-hackers

On 05/21/2014 10:22 PM, Jeff Janes wrote:
> Testing partial-write crash-recovery in 9.4 (e12d7320ca494fd05134847e30)
> with foreign keys, I found some btree index corruption.
>
> 28807 VACUUM 2014-05-21 15:33:46.878 PDT:ERROR:  right sibling 4044 of
> block 460 is not next child 23513 of block 1264 in index "foo_p_id_idx"
> 28807 VACUUM 2014-05-21 15:33:46.878 PDT:STATEMENT:  VACUUM;
>
> It took ~8 hours on 8 cores to encounter this problem.  This is a single
> occurrence, it has not yet been reproduced.
>
> I don't know whether the partial-write, crash-recovery, or foreign-key
> parts of this test are important--it could be a more generic problem
> that only happened to be observed here.  Nor do I know yet whether it
> occurs in 9_3_STABLE.
>
> Below is the testing harness and the data directory (massively bloated at
> 3.7GB once uncompressed).  It is currently in wrap-around shutdown, but
> that is the effect of persistent vacuum failures, not the cause of them.
>   You can restart the data directory and it will repeat the above sibling
> error once autovac kicks in.  I don't know if the bloat is due to the
> vacuum failure or if it was already in process before the failures started.
>   I've cranked up the logging on that front for future efforts.
>
> I'm using some fast-forward code on the xid consumption so that freezing
> occurs more often, and some people have expressed reservations that the
> code might be imperfect, and I can't rule that out as the cause (but I've
> never traced any other problems back to that code).  But it did make it
> through 4 complete wraps before this problem was encountered, so if that is
> the problem it must be probabilistic rather than deterministic.
>
> https://drive.google.com/folderview?id=0Bzqrh1SO9FcENWd6ZXlwVWpxU0E&usp=sharing

I downloaded the data directory and investigated. I got this message
when I started it up:

20392  2014-05-25 05:51:37.835 PDT:ERROR:  right sibling 4044 of block
460 is not next child 23513 of block 86458 in index "foo_p_id_idx"
20392  2014-05-25 05:51:37.835 PDT:CONTEXT:  automatic vacuum of table
"jjanes.public.foo"

Interestingly, it's complaining about parent page 86458, while yours
claimed it was 1264. I don't know why; perhaps a huge number of
insertions happened after that error, causing the parent level pages to
be split, moving the downlinks it complains about to the right. Did you
continue running the test after that error occurred?

This is what the tree looks like around those pages:

Level 1:
+-------------+     +-------------+     +-------------+
| Blk 1264    |     | Blk 160180  |     | Blk 86458   |
|             |     |             |     |             |
| Downlinks:  | <-> | Downlinks:  | <-> | Downlinks:  |
| ...         |     | ...         |     |     1269    |
|             |     |             |     |      460    |
|             |     |             |     |    23513    |
+-------------+     +-------------+     +-------------+


Leaf level:
+-------------+     +---------+     +-----------+     +-------------+
| Blk 1269    |     | Blk 460 |     | Blk 4044  |     | Blk 23513   |
|             | <-> |         | <-> | HALF_DEAD | <-> |             |
| (314 items) |     | (empty) |     |           |     | (212 items) |
+-------------+     +---------+     +-----------+     +-------------+

Leaf block 4044 is marked as half-dead, and there is no downlink to it
on the parent block, 86458. That is normal when a vacuum is interrupted
during page deletion (in 9.4); the next vacuum ought to continue it.
However, because block 460 is empty, the next vacuum that comes along
will try to start the page deletion process on that block. Page deletion
does a cross-check at the parent level, checking that the next downlink
on the parent page points to the right sibling of the page we're about
to delete. However, because the right sibling is already marked as
half-dead, and the downlink pointing to it has already been removed, the
cross-check fails.
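The failing cross-check can be sketched as a toy model (Python, not
PostgreSQL's actual C code; the block numbers come from the diagrams
above, and check_deletable() is an invented name):

```python
# Toy model of the vacuum cross-check described above -- not PostgreSQL
# code.  Block numbers mirror the tree diagrams in this mail.

right_sibling = {1269: 460, 460: 4044, 4044: 23513}  # leaf-level chain
parent_downlinks = [1269, 460, 23513]  # downlink to half-dead 4044 is gone

def check_deletable(leaf):
    """Before deleting `leaf`, verify that the downlink following it on
    the parent page points to its right sibling."""
    i = parent_downlinks.index(leaf)
    expected = parent_downlinks[i + 1]
    actual = right_sibling[leaf]
    if actual != expected:
        raise ValueError("right sibling %d of block %d is not next child %d"
                         % (actual, leaf, expected))

try:
    check_deletable(460)  # empty leaf 460; right sibling 4044 is half-dead
except ValueError as e:
    print(e)  # right sibling 4044 of block 460 is not next child 23513
```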

So, the sanity check is a bit overzealous. It's normal that the right
sibling has no downlink, if it's marked as half-dead. The trivial fix is
to just remove the cross-check, but I'd rather not do that because,
generally speaking, more sanity checks are good.

I think we could fetch the right sibling page to see if it's marked as
half-dead, and if it is, just give up on deleting the current page. This
vacuum will eventually get to the right sibling page and finish the
deletion, and the next vacuum will then be able to delete the empty
page. Or we could finish the deletion of the right page immediately, but
I'd like to keep the rarely-occurring corner-case codepaths to the minimum.
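That proposal can be sketched by extending the toy model: before
complaining, peek at the right sibling and, if it is half-dead, give up
on this page for now (again Python pseudologic with invented names, not
the actual patch, which would live in the nbtree C code):

```python
# Toy model of the proposed fix -- not the actual patch.  Data mirrors
# the tree diagrams above; try_delete() is an invented name.

HALF_DEAD = {4044}
right_sibling = {1269: 460, 460: 4044, 4044: 23513}
parent_downlinks = [1269, 460, 23513]

def try_delete(leaf):
    """Return True if `leaf` passes the parent-level cross-check and may
    be deleted; return False (give up for now) if its right sibling is
    half-dead, since that downlink is legitimately gone already."""
    sib = right_sibling[leaf]
    if sib in HALF_DEAD:
        return False  # a later vacuum finishes 4044; then 460 can go
    i = parent_downlinks.index(leaf)
    return parent_downlinks[i + 1] == sib

print(try_delete(460))   # False: right sibling is half-dead, skip it
print(try_delete(1269))  # True: cross-check passes
```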


With the attached patch, I was able to get past that error, but when
VACUUM reaches the end, I got this:

jjanes=# vacuum foo;
ERROR:  database is not accepting commands to avoid wraparound data loss
in database "jjanes"
HINT:  Stop the postmaster and vacuum that database in single-user mode.
You might also need to commit or roll back old prepared transactions.

That seems like an unrelated issue, I'll start a new thread about that.

Many thanks for the testing, Jeff!

- Heikki

