Re: Hanging backends and possible index corruption - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: Hanging backends and possible index corruption |
Date | |
Msg-id | 20130125152816.GH4289@awork2.anarazel.de Whole thread Raw |
In response to | Hanging backends and possible index corruption (Bernd Helmle <mailings@oopsware.de>) |
Responses |
Re: Hanging backends and possible index corruption
|
List | pgsql-hackers |
On 2013-01-25 16:24:52 +0100, Bernd Helmle wrote: > We are currently analyzing an issue at one of our customers PostgreSQL > database. > > The current version is 9.1.6 (update to 9.1.7 is scheduled for next monday, > no downtime possible before). It runs on POWER7 (pSeries 740) on an RHEL6.3 > 64-bit LPAR. The packages are built from PGDG SVN sources, no special tweaks > added. We saw no hardware related errors on this machine, nor any crashes. > > What currently happens on this machine are hanging statements (SELECTs and > INSERTs occasionally) with 100% CPU. After some investigation it turned out > that the corresponding backends are seeking within an index file over and > over again in a loop. Looking into the hanging queries i've recognized > certain keys which seems to have the problem, other keys used in the WHERE > condition run smoothly. Turning off index and bitmap index scans caused the > suspicious keys to return results, too. > > So i've saved the index file (normal BTree index with a single bigint > column), did a REINDEX and the problem was gone. Looking at the index file > with pg_filedump and pgbtreecheck from Alvaro gave me the following output: > > pgbtreecheck gives warnings about pages' parents and then loops visiting the > same pages over and over again: > > NOTICE: fast root: block 290 at level 2 > NOTICE: setting page parents > WARNING: block 12709 already had a parent (8840); new parent is 12177 > WARNING: block 12710 already had a parent (12439); new parent is 10835 > NOTICE: done setting parents > NOTICE: Checking forward scan of level 0, starting at block 1 > > -- loop starts > > WARNING: right sibling 12710 does not point left to me (11680); points to > 10924 instead > > Looking into the relevant pages and their prev and next pointers give me the > following: > > pg_filedump -i -R 11680 ~/tmp/100252789 | grep Blocks > Blocks: Previous (12651) Next (12710) Level (0) CycleId (0) > > pg_filedump -i -R 10924 ~/tmp/100252789 | grep Blocks > Blocks: Previous (10923) Next (12710) Level (0) CycleId (0) > > pg_filedump -i -R 12710 ~/tmp/100252789 | grep Blocks > Blocks: Previous (10924) Next (10925) Level (0) CycleId (0) > > pg_filedump -i -R 10925 ~/tmp/100252789 | grep Blocks > Blocks: Previous (12710) Next (10926) Level (0) CycleId (0) > > pg_filedump -i -R 12709 ~/tmp/100252789 | grep Blocks > Blocks: Previous (8849) Next (8850) Level (0) CycleId (0) > > pg_filedump -i -R 8840 ~/tmp/100252789 | grep Blocks > Blocks: Previous (8555) Next (9125) Level (1) CycleId (0) > > pg_filedump -i -R 12439 ~/tmp/100252789 | grep Blocks > Blocks: Previous (11405) Next (11690) Level (1) CycleId (0) > > $ pg_filedump -i -R 12177 ~/tmp/100252789 | grep Blocks > Blocks: Previous (11690) Next (0) Level (1) CycleId (0) > > $ pg_filedump -i -R 10835 ~/tmp/100252789 | grep Blocks > Blocks: Previous (10550) Next (11120) Level (1) CycleId (0) > > This symptom happened three times in the last couple of weeks now. Looking > at the numbers doesn't give me the impression that some flaky hardware could > be involved. What else can we do to track down this problem, any > suggestions? Did you reindex after upgrading to 9.1.6? Did you ever have any crashes or failovers before upgrading to 9.1.6? I have seen pretty similar symptoms caused by "Fix persistence marking of shared buffers during WAL replay" in 9.1.6. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
pgsql-hackers by date: