Strange deadlock error last night

From
"Scott Whitney"
Date:
I'm about to begin researching this, but I thought I'd throw this out there
and see if there are any quick responses.

Last night, I got this:

Jan 13 03:31:28 db01 postgres[23537]: [140-2] DETAIL:  Process 23537 waits
for AccessShareLock on relation 1260 of database 0; blocked by process
22228.
Jan 13 03:31:28 db01 postgres[23537]: [140-3]   Process 22228 waits for
AccessShareLock on relation 1262 of database 0; blocked by process 6816.
Jan 13 03:31:28 db01 postgres[23537]: [140-4]   Process 6816 waits for
AccessShareLock on relation 1260 of database 0; blocked by process 14624.
Jan 13 03:31:28 db01 postgres[23537]: [140-5]   Process 14624 waits for
AccessExclusiveLock on relation 1260 of database 0; blocked by process
23537.
Jan 13 03:31:28 db01 postgres[14624]: [243-1] ERROR:  deadlock detected
Jan 13 03:31:28 db01 postgres[14624]: [243-2] DETAIL:  Process 14624 waits
for AccessExclusiveLock on relation 1260 of database 0; blocked by process
22228.
Jan 13 03:31:28 db01 postgres[14624]: [243-3]   Process 22228 waits for
AccessShareLock on relation 1262 of database 0; blocked by process 6816.
Jan 13 03:31:28 db01 postgres[14624]: [243-4]   Process 6816 waits for
AccessShareLock on relation 1260 of database 0; blocked by process 14624.
Jan 13 03:31:28 db01 postgres[6816]: [54-1] ERROR:  deadlock detected
Jan 13 03:31:29 db01 postgres[6816]: [54-2] DETAIL:  Process 6816 waits for
AccessShareLock on relation 1260 of database 0; blocked by process 22228.
Jan 13 03:31:29 db01 postgres[6816]: [54-3]     Process 22228 waits for
AccessShareLock on relation 1262 of database 0; blocked by process 6816.

Here's what the procs were doing at the time, according to the logs:

** Jan 12 03:32:08 db01 postgres[22228]: [279-1] WARNING:  skipping
"pg_database" --- only table or database owner can vacuum it
** Jan 12 03:33:42 db01 postgres[14624]: [242-1] WARNING:  skipping
"pg_database" --- only table or database owner can vacuum it
Jan 13 03:30:36 db01 postgres[23537]: [139-1] WARNING:  skipping "pg_authid"
--- only table or database owner can vacuum it
Jan 13 03:31:25 db01 postgres[6816]: [53-1] WARNING:  skipping
"pg_auth_members" --- only table or database owner can vacuum it

** Note that the entries for procs 14624 and 22228 are from the previous day,
not from the time of the deadlock. I have no record of any pg logging from
those processes other than what you see above. That, unfortunately, doesn't
necessarily tell me anything.

It ended up locking up about 250 customer databases until I restarted the
postmaster. This is version 8.1.4. Upgrading right now (even to a minor rev)
is not really an option. This box has been up and running for 306 days. This
postgres level has been installed since... err... well... at least Aug 9, 2006,
based on some dates in the directories.

I can't determine what "database 0" is. It's not a datid. I'm assuming it's
a relative number based on what was going on at the time. We're running
auto-vacuum, as we have been for years. A full vacuum happens every Saturday
night.
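
For reference, a quick check along these lines comes back empty, which is
why I say it's not a datid:

    select oid, datname from pg_database where oid = 0;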

Any clever ideas? Requests for more information?

Thanks in advance!

Scott Whitney
Journyx, Inc.


Re: Strange deadlock error last night

From
Tom Lane
Date:
"Scott Whitney" <swhitney@journyx.com> writes:
> Last night, I got this:

> Jan 13 03:31:28 db01 postgres[23537]: [140-2] DETAIL:  Process 23537 waits
> for AccessShareLock on relation 1260 of database 0; blocked by process
> 22228.
> Jan 13 03:31:28 db01 postgres[23537]: [140-3]   Process 22228 waits for
> AccessShareLock on relation 1262 of database 0; blocked by process 6816.
> Jan 13 03:31:28 db01 postgres[23537]: [140-4]   Process 6816 waits for
> AccessShareLock on relation 1260 of database 0; blocked by process 14624.
> Jan 13 03:31:28 db01 postgres[23537]: [140-5]   Process 14624 waits for
> AccessExclusiveLock on relation 1260 of database 0; blocked by process
> 23537.
> Jan 13 03:31:28 db01 postgres[14624]: [243-1] ERROR:  deadlock detected

Well, the problem is evidently that guy trying to get exclusive lock on
relation 1260, which is pg_authid (a quick way to check that is "select
1260::regclass").  (Note: the "database 0" just means it's a system catalog
that's shared across all databases.)  pg_authid is touched often enough
that trying to ex-lock it in an active database is just asking for
trouble.
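
For anyone following along, either of these should translate the OIDs from
the deadlock detail into names (run from any database in the cluster, since
the shared catalogs show up in every database's pg_class):

    select 1260::regclass, 1262::regclass;   -- pg_authid, pg_database

    select oid, relname, relisshared
      from pg_class
     where oid in (1260, 1262);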

Is it possible that that process was trying to run VACUUM FULL or
REINDEX on pg_authid?  I can't think of any other likely-sounding
explanation for something trying to take an exclusive lock on that
catalog.  The best solution would be "don't do that" ...
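
If it happens again before you find the culprit, a join of pg_locks to
pg_stat_activity should show which session is asking for the exclusive
lock.  A sketch only; this assumes 8.1's column names (procpid,
current_query) and that stats_command_string is on so current_query is
populated:

    select l.pid, l.mode, l.granted, a.usename, a.current_query
      from pg_locks l
      join pg_stat_activity a on a.procpid = l.pid
     where l.relation in (1260, 1262);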

> It ended up locking up about 250 customer databases until I restarted the
> postmaster. This is version 8.1.4. Upgrading right now (even to a minor rev)
> is not really an option.

Not related to the immediate problem, but: you really need to schedule
5 minutes' downtime so you can update to 8.1.latest.  Reasons can be
found here:
http://www.postgresql.org/docs/8.1/static/release.html

            regards, tom lane

Re: Strange deadlock error last night

From
Alvaro Herrera
Date:
Tom Lane wrote:
> "Scott Whitney" <swhitney@journyx.com> writes:

> > It ended up locking up about 250 customer databases until I restarted the
> > postmaster. This is version 8.1.4. Upgrading right now (even to a minor rev)
> > is not really an option.
>
> Not related to the immediate problem, but: you really need to schedule
> 5 minutes' downtime so you can update to 8.1.latest.  Reasons can be
> found here:
> http://www.postgresql.org/docs/8.1/static/release.html

Particularly so if you run autovacuum, because otherwise you'll be hit
by a bug, fixed in 8.1.6, where template0 ends up unfrozen (thawed?).
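
In the meantime, a rough way to keep an eye on how close each database
(template0 included) is getting, assuming 8.1 still tracks this
per-database in pg_database.datfrozenxid:

    select datname, datallowconn, age(datfrozenxid) as xid_age
      from pg_database
     order by age(datfrozenxid) desc;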

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Strange deadlock error last night

From
"Scott Marlowe"
Date:
On Tue, Jan 13, 2009 at 10:37 AM, Scott Whitney <swhitney@journyx.com> wrote:

> It ended up locking up about 250 customer databases until I restarted the
> postmaster. This is version 8.1.4. Upgrading right now (even to a minor rev)
> is not really an option. This box has been up and running for 306 days. This
> postgres level has been installed since... err... well... at least Aug 9, 2006,
> based on some dates in the directories.

You need to ask yourself how much downtime you can afford.  The 2 or 3
minutes every few months to go from 8.1.x to 8.1.x+1, or the half a
day of downtime when some horrendous bug takes down the whole site
because you didn't update it.  Seriously, that unfrozen template0 bug
that Alvaro mentioned is one of those kinds of bugs.

Nothing like your db going down in the middle of the day with an error
message saying it's shutting down to prevent transaction ID wraparound data
loss and that you need to run VACUUM on all your databases in single-user
mode.

If you can't set aside a minute or two at 0200 hrs, then don't be
surprised when you get one of those failures.

Re: Strange deadlock error last night

From
"Scott Whitney"
Date:
Thanks for all the information, guys. I think Tom was right. Our application
was doing a couple of full vacs at the same time. It's weird that we didn't
run into this in the past.
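
One way to avoid touching the shared catalogs at all would be to stop
doing database-wide full vacuums and instead generate per-table commands
for just our own schemas.  Roughly (a sketch, and it ignores quoting of
unusual table names):

    select 'VACUUM FULL ANALYZE ' || schemaname || '.' || tablename || ';'
      from pg_tables
     where schemaname not in ('pg_catalog', 'information_schema');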

You're all absolutely right about the upgrading, but in our environment,
it's not 2-3 minutes. It's 2-3 weeks. I've got to fully vet the app on the
platform internally with full test plans, etc., even for the most minor
upgrades; corp policy.

Right now, my effort is in going to the latest stable branch. Moving
forward, I will use these notes to get the company to revisit the minor
upgrade policy, though.

After all, when I _do_ get hit by one of those bugs, I _will_ be asked why
we weren't upgraded. *sigh*

Re: Strange deadlock error last night

From
"Scott Marlowe"
Date:
On Tue, Jan 13, 2009 at 3:35 PM, Scott Whitney <swhitney@journyx.com> wrote:
> Thanks for all the information, guys. I think Tom was right. Our application
> was doing a couple of full vacs at the same time. It's weird that we didn't
> run into this in the past.
>
> You're all absolutely right about the upgrading, but in our environment,
> it's not 2-3 minutes. It's 2-3 weeks. I've got to fully vet the app on the
> platform internally with full test plans, etc, even for the most minor
> upgrades; corp policy.

That's preparation time versus execution time.  While it may take 2-3
weeks to get the testing done, once done, the actual down time is
minimal.  I would suggest that your current testing methodology is
horribly flawed.  I know it was foisted on you by outside forces, so
it's not really "your" idea.  But you know what I mean.  There's the
cost of possible data loss / down time due to unforeseen problems
caused by an update, and there's the cost of possible data loss / down
time due to known flaws in your current code.  If your testing
methodology were strict enough, it would have found the flaws that
were fixed in subsequent releases, but it hasn't.  Meanwhile, you spend
weeks testing for possible new bugs while running known-broken code
with bugs that cause real data loss and down time.

There's a reasonable cost / benefit analysis to be done here, and right now
someone's set the cost far too high for far too few benefits.  I've been
where you are, and I know how much it sucks.