Re: 9.4 btree index corruption - Mailing list pgsql-hackers
From: Heikki Linnakangas
Subject: Re: 9.4 btree index corruption
Date:
Msg-id: 5381FB3D.9010605@vmware.com
In response to: 9.4 btree index corruption (Jeff Janes <jeff.janes@gmail.com>)
Responses: Re: 9.4 btree index corruption
List: pgsql-hackers
On 05/21/2014 10:22 PM, Jeff Janes wrote:
> Testing partial-write crash-recovery in 9.4 (e12d7320ca494fd05134847e30)
> with foreign keys, I found some btree index corruption.
>
> 28807 VACUUM 2014-05-21 15:33:46.878 PDT:ERROR: right sibling 4044 of
> block 460 is not next child 23513 of block 1264 in index "foo_p_id_idx"
> 28807 VACUUM 2014-05-21 15:33:46.878 PDT:STATEMENT: VACUUM;
>
> It took ~8 hours on 8 cores to encounter this problem. This is a single
> occurrence; it has not yet been reproduced.
>
> I don't know whether the partial-write, crash-recovery, or foreign-key
> parts of this test are important -- it could be a more generic problem
> that only happened to be observed here. Nor do I know yet if it occurs in
> 9_3_STABLE.
>
> Below are the testing harness and the data directory (massively bloated at
> 3.7GB once uncompressed). It is currently in wrap-around shutdown, but
> that is the effect of the persistent vacuum failures, not the cause of them.
> You can restart the data directory and it will repeat the above sibling
> error once autovac kicks in. I don't know if the bloat is due to the
> vacuum failure or if it was already in progress before the failures started.
> I've cranked up the logging on that front for future efforts.
>
> I'm using some fast-forward code on the xid consumption so that freezing
> occurs more often, and some people have expressed reservations that the
> code might be imperfect, and I can't rule that out as the cause (but I've
> never traced any other problems back to that code). But it did make it
> through 4 complete wraps before this problem was encountered, so if that is
> the problem it must be probabilistic rather than deterministic.
>
> https://drive.google.com/folderview?id=0Bzqrh1SO9FcENWd6ZXlwVWpxU0E&usp=sharing

I downloaded the data directory and investigated. I got this message when I
started it up:

20392 2014-05-25 05:51:37.835 PDT:ERROR: right sibling 4044 of block 460 is not next child 23513 of block 86458 in index "foo_p_id_idx"
20392 2014-05-25 05:51:37.835 PDT:CONTEXT: automatic vacuum of table "jjanes.public.foo"

Interestingly, it's complaining about parent page 86458, while yours claimed
it was 1264. I don't know why; perhaps a huge number of insertions happened
after that error, causing the parent-level pages to be split and moving the
downlinks it complains about to the right. Did you continue running the test
after that error occurred?

This is what the tree looks like around those pages:

Level 1:

+-------------+     +-------------+     +-------------+
| Blk 1264    |     | Blk 160180  |     | Blk 86458   |
|             |     |             |     |             |
| Downlinks:  | <-> | Downlinks:  | <-> | Downlinks:  |
| ...         |     | ...         |     | 1269        |
|             |     |             |     | 460         |
|             |     |             |     | 23513       |
+-------------+     +-------------+     +-------------+

Leaf level:

+-------------+     +---------+     +-----------+     +-------------+
| Blk 1269    |     | Blk 460 |     | Blk 4044  |     | Blk 23513   |
|             | <-> |         | <-> | HALF_DEAD | <-> |             |
| (314 items) |     | (empty) |     |           |     | (212 items) |
+-------------+     +---------+     +-----------+     +-------------+

Leaf block 4044 is marked as half-dead, and there is no downlink to it on the
parent block, 86458. That is normal when a vacuum is interrupted during page
deletion (in 9.4); the next vacuum ought to continue it.

However, because block 460 is empty, the next vacuum that comes along will
try to start the page deletion process on that block. Page deletion does a
cross-check at the parent level, checking that the next downlink on the
parent page points to the right sibling of the page we're about to delete.
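To make that cross-check concrete, here is a simplified, self-contained sketch
of the idea. The struct and function names are invented for illustration, and
the parent page is reduced to a plain array of downlinks; this is not the
actual nbtree code:

    #include <stdbool.h>
    #include <stdio.h>

    typedef unsigned int BlockNumber;

    /*
     * Hypothetical, simplified model of an internal (parent) page: just the
     * ordered child block numbers that its downlinks point to.
     */
    typedef struct ParentPage
    {
        int         ndownlinks;
        BlockNumber downlinks[8];
    } ParentPage;

    /*
     * Simplified version of the cross-check described above: after locating
     * the downlink for the page we are about to delete ("target"), the next
     * downlink on the parent page must point to the target's right sibling.
     */
    static bool
    right_sibling_is_next_child(const ParentPage *parent, BlockNumber parentblk,
                                BlockNumber target, BlockNumber rightsib)
    {
        int         i;

        for (i = 0; i < parent->ndownlinks; i++)
        {
            if (parent->downlinks[i] != target)
                continue;

            /* Found the target's downlink; the next one must be rightsib. */
            if (i + 1 < parent->ndownlinks && parent->downlinks[i + 1] == rightsib)
                return true;

            fprintf(stderr,
                    "ERROR: right sibling %u of block %u is not next child %u of block %u\n",
                    rightsib, target,
                    (i + 1 < parent->ndownlinks) ? parent->downlinks[i + 1] : 0,
                    parentblk);
            return false;
        }
        return false;
    }

    int
    main(void)
    {
        /*
         * Parent block 86458 from the diagram above: downlinks 1269, 460 and
         * 23513, with no downlink for the half-dead leaf 4044.
         */
        ParentPage  parent = {3, {1269, 460, 23513}};

        /*
         * Vacuum wants to delete the empty leaf 460, whose right sibling on
         * the leaf level is the half-dead page 4044.  The next downlink after
         * 460 is 23513, so the cross-check fails even though the tree is
         * structurally fine.
         */
        right_sibling_is_next_child(&parent, 86458, 460, 4044);
        return 0;
    }

Run as-is, this prints the same kind of mismatch seen in the log above,
because the downlink for the half-dead page 4044 has already been removed
from block 86458.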
However, because the right sibling is already marked as half-dead, and the
downlink pointing to it has already been removed, the cross-check fails.

So the sanity check is a bit overzealous: it's normal for the right sibling
to have no downlink if it's marked as half-dead. The trivial fix would be to
just remove the cross-check, but I'd rather not do that because, generally
speaking, more sanity checks are good. I think we could fetch the right
sibling page to see if it's marked as half-dead, and if it is, just give up
on deleting the current page. This vacuum will eventually get to the right
sibling page and finish the deletion, and the next vacuum will then be able
to delete the empty page. Or we could finish the deletion of the right page
immediately, but I'd like to keep the rarely-occurring corner-case codepaths
to a minimum.

With the attached patch, I was able to get past that error, but when VACUUM
reached the end, I got this:

jjanes=# vacuum foo;
ERROR: database is not accepting commands to avoid wraparound data loss in database "jjanes"
HINT: Stop the postmaster and vacuum that database in single-user mode.
You might also need to commit or roll back old prepared transactions.

That seems like an unrelated issue; I'll start a new thread about it.

Many thanks for the testing, Jeff!

- Heikki
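For illustration, here is a minimal sketch of the relaxation proposed above,
in the same simplified style as the earlier snippet (made-up names, not the
attached patch): before complaining about the mismatch, look at the right
sibling itself, and if it is already half-dead, quietly skip deleting the
current page and leave it for a later vacuum.

    #include <stdbool.h>
    #include <stdio.h>

    typedef unsigned int BlockNumber;

    /* Hypothetical, simplified view of a leaf page: only the flag we need. */
    typedef struct LeafPage
    {
        BlockNumber blkno;
        bool        half_dead;
    } LeafPage;

    /*
     * Sketch of the proposed behaviour: if the right sibling is half-dead,
     * its downlink has already been removed, so the parent-level cross-check
     * is bound to fail.  Instead of raising an error, give up on deleting
     * this page; a later vacuum will finish deleting the half-dead sibling,
     * and the vacuum after that can delete this (empty) page.
     */
    static bool
    try_delete_page(BlockNumber target, const LeafPage *rightsib)
    {
        if (rightsib->half_dead)
        {
            fprintf(stderr,
                    "DEBUG: not deleting block %u: right sibling %u is half-dead\n",
                    target, rightsib->blkno);
            return false;       /* skip quietly, instead of erroring out */
        }

        /* ...otherwise perform the parent-level cross-check and deletion... */
        return true;
    }

    int
    main(void)
    {
        LeafPage    sib = {4044, true}; /* the half-dead leaf from the diagram */

        try_delete_page(460, &sib);     /* vacuum deleting the empty leaf 460 */
        return 0;
    }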
Attachment