Re: hung backends stuck in spinlock heavy endless loop - Mailing list pgsql-hackers

From Merlin Moncure
Subject Re: hung backends stuck in spinlock heavy endless loop
Date
Msg-id CAHyXU0zj5wedt0eXo3jMjKshQV-mUhLdcOEuOaD9caZ4OucPFA@mail.gmail.com
Whole thread Raw
In response to Re: hung backends stuck in spinlock heavy endless loop  (Andres Freund <andres@2ndquadrant.com>)
Responses Re: hung backends stuck in spinlock heavy endless loop
List pgsql-hackers
On Thu, Jan 15, 2015 at 1:15 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Hi,
>
>> The plot thickens!  I looped the test, still stock 9.4 as of this time
>> and went to lunch. When I came back, the database was in recovery
>> mode.  Here is the rough sequence of events.
>>
>
> Whoa. That looks scary. Did you see (some of) those errors before? Most
> of them should have been emitted independently of being built with
> assertions.

Nope.

>> 1) REINDEXED pg_class (wanted clean slate for full reproduction)
>> 2) before too long (unfortunately did not configure timestamps in the
>> log) starting seeing:
>>
>> ERROR:  root page 3 of index "pg_class_oid_index" has level 0,
>> expected 1 at character 8
>> QUERY:  UPDATE CDSRunTable SET
>>           Finished = clock_timestamp(),
>>           DidSucceed = _DidSucceed,
>>           ErrorMessage = _ErrorMessage
>>         WHERE CDSRunTableId = _CdsRunTableId
>> CONTEXT:  PL/pgSQL function cdsreconcileruntable(bigint) line 82 at
>> SQL statement
>> STATEMENT:  SELECT CDSReconcileRunTable(2020)
>>
>> ..and again with a FATAL
>> FATAL:  root page 3 of index "pg_class_oid_index" has level 0, expected 1
>> ERROR:  root page 3 of index "pg_class_oid_index" has level 0, expected 1
>> CONTEXT:  SQL statement "UPDATE CDSRunTable SET
>>           Finished = clock_timestamp(),
>>           DidSucceed = _DidSucceed,
>>           ErrorMessage = _ErrorMessage
>>         WHERE CDSRunTableId = _CdsRunTableId"
>>     PL/pgSQL function cdsreconcileruntable(bigint) line 82 at SQL statement
>
> Just to be absolutely clear, those happened *before* the following
> errors? And there were no 'during exception cleanup' like errors before?

correct.  unfortunately, I wasn't logging times so I'm not 100% sure
if the 'root page' error happened during the event or sometime before
it.  I'm addressing that with the log prefix.

> Were there any 'LOG: Handling deadlock detected on CdsRunTableId' log
> entries before? It's hard to know from here, but the 'during exception
> cleanup' indicates a problem in abort handling.  Were there any deadlock
> detected errors closeby?
>
> You're catching deadlock errors in a subtransaction. Hm.

yup, exactly 1:
WARNING:  LOG: Handling deadlock detected on CdsRunTableId 1626

Based on the id that is proven to be before the crashing started,
although at least one manual server restart happened in the interim.

Since it's possible the database is a loss, do you see any value in
bootstrappinng it again with checksums turned on?  One point of note
is that this is a brand spanking new SSD, maybe we nee to rule out
hardware based corruption?

> lternatively were there any 'LOG: CdsRunTableId %s Failed' messages? If
so, what was the cause?

There was one, but unfortunately since the error was trapped the
relevant detail to catch the other side of the deadlock is missed.
Barring this analysis, I'm inclined to remove that guard: it was
mostly there to deal with what turned out to be some bugs in the code
(specifically nonconstrained update on CDSTable).

> * Do you also use lock_timeout/statement_timeout?
not on this database -- this is a coding environment and not set up
with standard configuration

>  were any processes killed at that time?
later on, yes, by manual restart, and the query was cancelled first: to wit:

WARNING:  LOG: Handling deadlock detected on CdsRunTableId 1626
ERROR:  canceling statement due to user request
STATEMENT:  SELECT CDSReconcileRunTable(1626)

followed by restart -m fast

merlin



pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: [BUGS] BUG #12070: hstore extension: hstore_to_json_loose produces invalid JSON
Next
From: Robert Haas
Date:
Subject: Re: advance local xmin more aggressively