Thread: Suggestions for post-mortem...
We just had a DB die quite nastily, and have no clear idea why.

Looking in the system logs shows nothing out of the ordinary, and looking in the db logs shows a few odd records:

2006-01-25 12:25:31 EST [mail,5017]: ERROR: failed to fetch new tuple for AFTER trigger
2006-01-25 12:26:01 EST [mail,93689]: WARNING: index "XXXX_pkey" contains 1416 row versions, but table contains 1410 row versions
2006-01-25 12:26:01 EST [mail,93689]: HINT: Rebuild the index with REINDEX.
2006-01-25 12:26:01 EST [mail,93689]: WARNING: index "YYYY" contains 1416 row versions, but table contains 1410 row versions

...repeated several times for several indexes of the same table.

These messages occurred almost immediately before we noticed the dead state of the DB. Over an hour before these messages there was a deadlock, but that's not too worrying -- the DB was still OK at that point.

After the above messages, about 80 rows were missing from the table, and a REINDEX did not restore them (not really surprising).

The table in question has only a small number of rows (1400-ish), but gets updated up to 5-10 times per second.

Thankfully, we had replication in place and just failed over, but we'd like to try to understand what happened to the old DB.

Any suggestions on where to start? Or on what the first error might signify? Or on what to put in place to catch more details next time?

It's been running fine for several months (until now) using PG 8.0.3.
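[For readers following the thread: the HINT above refers to PostgreSQL's REINDEX command. As a sketch of the commands involved -- assuming, purely for illustration, that the redacted table is named XXXX to match its primary-key index:]

```sql
-- Rebuild one suspect index named in the WARNING:
REINDEX INDEX "XXXX_pkey";

-- Or rebuild every index on the affected table in one go:
REINDEX TABLE "XXXX";

-- VACUUM VERBOSE re-emits the index/table row-version comparison seen
-- in the log, so it can confirm whether the counts still disagree:
VACUUM VERBOSE "XXXX";
SELECT count(*) FROM "XXXX";
```

Note that, as the original poster found, REINDEX rebuilds the index from the heap -- it cannot resurrect heap rows that are already gone.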
Philip Warner wrote:
> We just had a DB die quite nastily, and have no clear idea why.
>
> Looking in the system logs shows nothing out of the ordinary, and
> looking in the db logs shows a few odd records:
>
> 2006-01-25 12:25:31 EST [mail,5017]: ERROR: failed to fetch new tuple for AFTER trigger
> 2006-01-25 12:26:01 EST [mail,93689]: WARNING: index "XXXX_pkey" contains 1416 row versions, but table contains 1410 row versions
> 2006-01-25 12:26:01 EST [mail,93689]: HINT: Rebuild the index with REINDEX.
> 2006-01-25 12:26:01 EST [mail,93689]: WARNING: index "YYYY" contains 1416 row versions, but table contains 1410 row versions

Seems like a rather severe bug -- maybe a race condition somewhere. Note that the trigger problem occurs 30 seconds before the VACUUM warning shows up, and for such a small table I doubt a vacuum could take that long. Can you confirm how long the vacuum takes to run?

Is this problem isolated to this one table, or does it manifest somewhere else? Do you have other errors that may indicate a hardware problem?

It would be interesting to see where the extra index entries point. To do this we would need to read the complete pg_filedump of the index, however ... and if you REINDEXed it, the evidence is already gone.

-- 
Alvaro Herrera                     http://www.advogato.org/person/alvherre
"The PostgreSQL hackers have what I call a 'NASA space shot' mentality.
Quite refreshing in a world of 'weekend drag racer' developers."  (Scott Marlowe)
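[For readers following the thread: pg_filedump is a separate utility that prints the raw contents of PostgreSQL heap and index files. A hedged sketch of how one might capture the evidence Alvaro is asking for -- the database name, OIDs, and paths below are placeholders, and exact flags may vary between pg_filedump releases:]

```shell
# Find the on-disk file backing the suspect index (run against the
# still-running or snapshotted cluster; relfilenode names the file):
psql -d mail -c "SELECT relfilenode FROM pg_class WHERE relname = 'XXXX_pkey';"

# Dump the index pages with item details.  Each leaf item carries the
# heap TID (block, offset) it points at; entries whose heap pages come
# back empty or zeroed are the orphans worth examining.
pg_filedump -i -f $PGDATA/base/<dboid>/<relfilenode> > index_dump.txt
```

The key point, as Alvaro notes, is to take such a dump *before* running REINDEX, since the rebuild discards the dangling entries.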
Philip Warner <pjw@rhyme.com.au> writes:
> We just had a DB die quite nastily, and have no clear idea why.

Define "die quite nastily" ... you haven't really said what went wrong.

> 2006-01-25 12:25:31 EST [mail,5017]: ERROR: failed to fetch new tuple for AFTER trigger
> 2006-01-25 12:26:01 EST [mail,93689]: WARNING: index "XXXX_pkey" contains 1416 row versions, but table contains 1410 row versions
> 2006-01-25 12:26:01 EST [mail,93689]: HINT: Rebuild the index with REINDEX.
> 2006-01-25 12:26:01 EST [mail,93689]: WARNING: index "YYYY" contains 1416 row versions, but table contains 1410 row versions
> ...repeated several times for several indexes of the same table.

These could all be manifestations of the ReadBuffer bug fixed in 8.0.6. Tickling that bug would result in zeroing out a recently-added table page, which would result in (a) more index entries than table entries, and (b) possible bleating from other processes wondering where their freshly inserted tuples went.

Seems to square with the facts presented, especially if the indexes all agree on how many tuples there should be --- that makes it pretty unlikely that index corruption is the cause.

			regards, tom lane
Tom Lane wrote:
> Define "die quite nastily" ... you haven't really said what went wrong.

We lost data from this table, and ended up with transactions rolled back that were in fact committed (i.e. the data was in the DB, but the program code thought the transactions had been rolled back). The end result was that we deemed the database to be in an unknown, uncorrectable and unstable state.

> These could all be manifestations of the ReadBuffer bug fixed in 8.0.6.
> Tickling that bug would result in zeroing out a recently-added table page,
> which would result in (a) more index entries than table entries, and
> (b) possible bleating from other processes wondering where their freshly
> inserted tuples went.

This sounds consistent; I'd guess the 80 missing records correspond to the 80 most recently updated. Not sure about the missing user -- I'll see what I can find.

Thanks for the help -- we now have a probable cause, and a way forward.

Alvaro Herrera wrote:
> Can you confirm how long does the vacuum take to run?

Usually very quick; it runs every minute -- so the problem occurred between two vacuums.

> Is this problem isolated to this one table, or does it manifest somewhere
> else? Do you have other errors that may indicate a hardware problem?

No.