Re: hung backends stuck in spinlock heavy endless loop - Mailing list pgsql-hackers
| From | Merlin Moncure |
|---|---|
| Subject | Re: hung backends stuck in spinlock heavy endless loop |
| Date | |
| Msg-id | CAHyXU0zj5wedt0eXo3jMjKshQV-mUhLdcOEuOaD9caZ4OucPFA@mail.gmail.com |
| In response to | Re: hung backends stuck in spinlock heavy endless loop (Andres Freund <andres@2ndquadrant.com>) |
| Responses | Re: hung backends stuck in spinlock heavy endless loop |
| List | pgsql-hackers |
On Thu, Jan 15, 2015 at 1:15 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Hi,
>
>> The plot thickens! I looped the test, still stock 9.4 as of this time,
>> and went to lunch. When I came back, the database was in recovery
>> mode. Here is the rough sequence of events.
>
> Whoa. That looks scary. Did you see (some of) those errors before? Most
> of them should have been emitted independently of being built with
> assertions.

Nope.

>> 1) REINDEXED pg_class (wanted a clean slate for full reproduction)
>> 2) before too long (unfortunately did not configure timestamps in the
>> log) started seeing:
>>
>> ERROR:  root page 3 of index "pg_class_oid_index" has level 0,
>> expected 1 at character 8
>> QUERY:  UPDATE CDSRunTable SET
>>           Finished = clock_timestamp(),
>>           DidSucceed = _DidSucceed,
>>           ErrorMessage = _ErrorMessage
>>         WHERE CDSRunTableId = _CdsRunTableId
>> CONTEXT:  PL/pgSQL function cdsreconcileruntable(bigint) line 82 at
>> SQL statement
>> STATEMENT:  SELECT CDSReconcileRunTable(2020)
>>
>> ...and again with a FATAL:
>> FATAL:  root page 3 of index "pg_class_oid_index" has level 0, expected 1
>> ERROR:  root page 3 of index "pg_class_oid_index" has level 0, expected 1
>> CONTEXT:  SQL statement "UPDATE CDSRunTable SET
>>           Finished = clock_timestamp(),
>>           DidSucceed = _DidSucceed,
>>           ErrorMessage = _ErrorMessage
>>         WHERE CDSRunTableId = _CdsRunTableId"
>>         PL/pgSQL function cdsreconcileruntable(bigint) line 82 at SQL statement
>
> Just to be absolutely clear, those happened *before* the following
> errors? And there were no 'during exception cleanup' like errors before?

Correct. Unfortunately, I wasn't logging times, so I'm not 100% sure
whether the 'root page' error happened during the event or sometime
before it. I'm addressing that with the log prefix.

> Were there any 'LOG: Handling deadlock detected on CdsRunTableId' log
> entries before? It's hard to know from here, but the 'during exception
> cleanup' indicates a problem in abort handling. Were there any deadlock
> detected errors closeby?
>
> You're catching deadlock errors in a subtransaction. Hm.

Yup, exactly one:

WARNING:  LOG: Handling deadlock detected on CdsRunTableId 1626

Based on the id, that is proven to be before the crashing started,
although at least one manual server restart happened in the interim.

Since it's possible the database is a loss, do you see any value in
bootstrapping it again with checksums turned on? One point of note is
that this is a brand spanking new SSD; maybe we need to rule out
hardware-based corruption?

> Alternatively were there any 'LOG: CdsRunTableId %s Failed' messages?
> If so, what was the cause?

There was one, but unfortunately, since the error was trapped, the
relevant detail to catch the other side of the deadlock is missed.
Barring this analysis, I'm inclined to remove that guard: it was mostly
there to deal with what turned out to be some bugs in the code
(specifically an unconstrained update on CDSTable).

> * Do you also use lock_timeout/statement_timeout?

Not on this database -- this is a coding environment and not set up
with standard configuration.

> Were any processes killed at that time?

Later on, yes, by manual restart, and the query was cancelled first, to wit:

WARNING:  LOG: Handling deadlock detected on CdsRunTableId 1626
ERROR:  canceling statement due to user request
STATEMENT:  SELECT CDSReconcileRunTable(1626)

followed by restart -m fast

merlin
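A note on the pattern under discussion: trapping deadlock_detected inside a PL/pgSQL BEGIN ... EXCEPTION block runs that block in a subtransaction, and swallowing the error there hides the server's deadlock DETAIL, which names both processes involved. (For the missing timestamps, a log_line_prefix such as '%m [%p] ' adds millisecond timestamps and the backend PID to each log line.) The sketch below is a guess at what such a guard might look like, reusing only the names visible in the log excerpts (CDSReconcileRunTable, CDSRunTable, CdsRunTableId, Finished, DidSucceed); the real function body is not part of this thread, so treat everything else as an assumption.

    -- Minimal sketch (assumed, not the actual function) of an exception
    -- guard that traps deadlocks, matching the WARNING text in the logs.
    CREATE OR REPLACE FUNCTION CDSReconcileRunTable(_CdsRunTableId bigint)
    RETURNS void
    LANGUAGE plpgsql AS
    $$
    BEGIN
      BEGIN
        -- ... reconciliation work would go here ...
        UPDATE CDSRunTable
           SET Finished = clock_timestamp(),
               DidSucceed = true
         WHERE CDSRunTableId = _CdsRunTableId;
      EXCEPTION
        WHEN deadlock_detected THEN
          -- The error is trapped here, so the other side of the deadlock
          -- is never reported; removing this handler lets the ERROR (and
          -- its DETAIL naming both processes) reach the server log.
          RAISE WARNING 'LOG: Handling deadlock detected on CdsRunTableId %',
                        _CdsRunTableId;
      END;
    END;
    $$;

If the guard is dropped, the deadlock surfaces as ERROR 40P01 with a DETAIL line identifying the waiting processes, which is exactly the information missing from the report above.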