Re: hung backends stuck in spinlock heavy endless loop - Mailing list pgsql-hackers

From Merlin Moncure
Subject Re: hung backends stuck in spinlock heavy endless loop
Date
Msg-id CAHyXU0x7MPmW1v1kqB5Trb_z0no5w5QpK7_qFo0CYvNngyYsbA@mail.gmail.com
In response to Re: hung backends stuck in spinlock heavy endless loop  (Peter Geoghegan <pg@heroku.com>)
List pgsql-hackers
On Fri, Jan 16, 2015 at 5:20 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Fri, Jan 16, 2015 at 10:33 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> ISTM the next step is to bisect the problem down over the weekend in
>> order to narrow the search.  If that doesn't turn up anything
>> productive I'll look into taking other steps.
>
> That might be the quickest way to do it, provided you can isolate the
> bug fairly reliably. It might be a bit tricky to write a shell script
> that assumes a certain amount of time having passed without the bug
> tripping indicates that it doesn't exist, and have that work
> consistently. I'm slightly concerned that you'll hit other bugs that
> have since been fixed, given the large number of possible symptoms
> here.

Quick update: not done yet, but I'm making consistent progress, with
several false starts.  (For example, I had a .conf problem with the
new dynamic shared memory setting, and git merrily bisected down to
the introduction of the feature.)
I have to triple check everything :(.  The problem is generally
reproducible, but I get false negatives that throw off the bisection.
I estimate that early next week I'll have it narrowed down
significantly, if not to the exact offending revision.
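
To drive "git bisect run", the helper I mean is along these lines (a
minimal sketch in Python; the log path, the workload driver, and the
retry count are placeholders, not my actual setup -- it assumes the
workload script truncates the server log before each run):

#!/usr/bin/env python
# Sketch of a "git bisect run" helper: build the revision, run the
# workload a few times, and grep the server log for checksum warnings.
import re, subprocess, sys

LOG = "/var/lib/pgsql/data/pg_log/postgresql.log"  # placeholder path
WARNING = re.compile(r"page verification failed")
RETRIES = 3   # repeat the workload to cut down on false negatives

def sh(cmd):
    return subprocess.call(cmd, shell=True)

if sh("make -s install") != 0:
    sys.exit(125)            # unbuildable revision: tell bisect to skip it

for attempt in range(RETRIES):
    sh("./run_workload.sh")  # placeholder: restart cluster, run the load
    with open(LOG) as f:
        if any(WARNING.search(line) for line in f):
            sys.exit(1)      # warning seen: mark this revision bad
sys.exit(0)                  # clean after RETRIES runs: mark it good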

So far, the 'nasty' damage seems to generally, if not always, follow
a checksum failure, and the calculated and expected checksums are
always numerically adjacent.  For example:

[cds2 12707 2015-01-22 12:51:11.032 CST 2754]WARNING:  page verification failed, calculated checksum 9465 but expected 9477 at character 20
[cds2 21202 2015-01-22 13:10:18.172 CST 3196]WARNING:  page verification failed, calculated checksum 61889 but expected 61903 at character 20
[cds2 29153 2015-01-22 14:49:04.831 CST 4803]WARNING:  page verification failed, calculated checksum 27311 but expected 27316
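
By "numerically adjacent" I mean the calculated and expected values
differ by only a handful: 12, 14, and 5 in the three warnings above.
Pulling the deltas out of a log is trivial (sketch; the log path is a
placeholder):

import re

# match the WARNING lines quoted above and print |calculated - expected|
pat = re.compile(r"calculated checksum (\d+) but expected (\d+)")
for line in open("postgresql.log"):   # placeholder log path
    m = pat.search(line)
    if m:
        calc, expected = map(int, m.groups())
        print(abs(calc - expected))   # 12, 14, 5 for the lines above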

I'm not up on the intricacies of our checksum algorithm, but this is
making me suspicious that we are looking at an improperly flipped
visibility bit via some obscure problem -- almost certainly with
vacuum playing a role.  This fits the profile of catastrophic damage
that masquerades as numerous other problems.  Or, perhaps, something
is flipping what it thinks is a visibility bit but on the wrong page.
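
If it is a stray all-visible flip, one offline test would be to take
a page that fails verification, toggle PD_ALL_VISIBLE, and see whether
the stored checksum matches again.  A sketch of the bit-twiddling
(header offsets per bufpage.h; fnv16 below is a simplified stand-in,
not the parallelized FNV-1a in checksum_impl.h, so a real test would
recompute with pg_checksum_page; assumes little-endian and one 8kB
block dumped to a file):

import struct

PD_ALL_VISIBLE = 0x0004   # pd_flags bit, per bufpage.h

def fnv16(page):
    # simplified stand-in checksum, NOT PostgreSQL's real algorithm
    h = 2166136261
    for b in bytearray(page):
        h = ((h ^ b) * 16777619) & 0xffffffff
    return (h % 65535) + 1

def checksum_ignoring_stored(page):
    # the checksum is computed with pd_checksum (bytes 8-9) zeroed
    return fnv16(page[:8] + b"\x00\x00" + page[10:])

def toggle_all_visible(page):
    flags, = struct.unpack_from("<H", page, 10)   # pd_flags at offset 10
    out = bytearray(page)
    struct.pack_into("<H", out, 10, flags ^ PD_ALL_VISIBLE)
    return bytes(out)

page = open("suspect_block.page", "rb").read(8192)  # placeholder dump
stored, = struct.unpack_from("<H", page, 8)         # pd_checksum
print("as-is  :", checksum_ignoring_stored(page), "stored:", stored)
print("flipped:", checksum_ignoring_stored(toggle_all_visible(page)))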

I haven't categorically ruled out pl/sh yet; that's something to
keep in mind.

In the 'plus' category, aside from flushing out this issue, I've had
zero runtime problems so far beyond the main problem itself; bisection
(at least on the 'bad' side) has been reliably driven by simply
counting the number of warnings/errors/etc. in the log.  That's
really impressive.

merlin


