Thread: Strange deadlock error last night
I'm about to begin researching this, but I thought I'd throw this out there
and see if there are any quick responses. Last night, I got this:

Jan 13 03:31:28 db01 postgres[23537]: [140-2] DETAIL: Process 23537 waits for AccessShareLock on relation 1260 of database 0; blocked by process 22228.
Jan 13 03:31:28 db01 postgres[23537]: [140-3] Process 22228 waits for AccessShareLock on relation 1262 of database 0; blocked by process 6816.
Jan 13 03:31:28 db01 postgres[23537]: [140-4] Process 6816 waits for AccessShareLock on relation 1260 of database 0; blocked by process 14624.
Jan 13 03:31:28 db01 postgres[23537]: [140-5] Process 14624 waits for AccessExclusiveLock on relation 1260 of database 0; blocked by process 23537.
Jan 13 03:31:28 db01 postgres[14624]: [243-1] ERROR: deadlock detected
Jan 13 03:31:28 db01 postgres[14624]: [243-2] DETAIL: Process 14624 waits for AccessExclusiveLock on relation 1260 of database 0; blocked by process 22228.
Jan 13 03:31:28 db01 postgres[14624]: [243-3] Process 22228 waits for AccessShareLock on relation 1262 of database 0; blocked by process 6816.
Jan 13 03:31:28 db01 postgres[14624]: [243-4] Process 6816 waits for AccessShareLock on relation 1260 of database 0; blocked by process 14624.
Jan 13 03:31:28 db01 postgres[6816]: [54-1] ERROR: deadlock detected
Jan 13 03:31:29 db01 postgres[6816]: [54-2] DETAIL: Process 6816 waits for AccessShareLock on relation 1260 of database 0; blocked by process 22228.
Jan 13 03:31:29 db01 postgres[6816]: [54-3] Process 22228 waits for AccessShareLock on relation 1262 of database 0; blocked by process 6816.

Here's what the procs were doing at the time, according to the logs:

** Jan 12 03:32:08 db01 postgres[22228]: [279-1] WARNING: skipping "pg_database" --- only table or database owner can vacuum it
** Jan 12 03:33:42 db01 postgres[14624]: [242-1] WARNING: skipping "pg_database" --- only table or database owner can vacuum it
Jan 13 03:30:36 db01 postgres[23537]: [139-1] WARNING: skipping "pg_authid" --- only table or database owner can vacuum it
Jan 13 03:31:25 db01 postgres[6816]: [53-1] WARNING: skipping "pg_auth_members" --- only table or database owner can vacuum it

** Note that procs 14624 and 22228 do not appear on the same day or at the
same time. I have no record of any pg logging from those processes other
than what you see above. That, unfortunately, doesn't necessarily tell me
anything.

It ended up locking up about 250 customer databases until I restarted the
postmaster. This is version 8.1.4. Upgrading right now (even to a minor rev)
is not really an option. This box has been up and running for 306 days. This
postgres level has been installed for..err...well...at least Aug 9, 2006,
based on some dates in the directories.

I can't determine what "database 0" is. It's not a datid. I'm assuming it's
a relative number based on what was going on at the time.

We're running autovacuum, as we have been for years. A full vacuum happens
every Saturday night.

Any clever ideas? Requests for more information? Thanks in advance!

Scott Whitney
Journyx, Inc.
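For what it's worth, next time this happens a query along these lines should
show who is waiting on whom while the pile-up is still in progress. This is
only a sketch against the 8.1-era catalogs: it assumes the backend PID column
is still named procpid, and current_query is only populated when
stats_command_string is on.

    -- Show ungranted lock requests together with what each backend is running.
    SELECT l.pid,
           l.relation::regclass AS relation,
           l.mode,
           l.granted,
           a.datname,
           a.current_query
      FROM pg_locks l
      JOIN pg_stat_activity a ON a.procpid = l.pid
     WHERE NOT l.granted;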
"Scott Whitney" <swhitney@journyx.com> writes: > Last night, I got this: > Jan 13 03:31:28 db01 postgres[23537]: [140-2] DETAIL: Process 23537 waits > for AccessShareLock on relation 1260 of database 0; blocked by process > 22228. > Jan 13 03:31:28 db01 postgres[23537]: [140-3] Process 22228 waits for > AccessShareLock on relation 1262 of database 0; blocked by process 6816. > Jan 13 03:31:28 db01 postgres[23537]: [140-4] Process 6816 waits for > AccessShareLock on relation 1260 of database 0; blocked by process 14624. > Jan 13 03:31:28 db01 postgres[23537]: [140-5] Process 14624 waits for > AccessExclusiveLock on relation 1260 of database 0; blocked by process > 23537. > Jan 13 03:31:28 db01 postgres[14624]: [243-1] ERROR: deadlock detected Well, the problem is evidently that guy trying to get exclusive lock on relation 1260, which is pg_authid (a quick way to check that is "select 1260::regclass"). (Note: the "database 0" just means it's a system catalog that's shared across all databases.) pg_authid is touched often enough that trying to ex-lock it in an active database is just asking for trouble. Is it possible that that process was trying to run VACUUM FULL or REINDEX on pg_authid? I can't think of any other likely-sounding explanation for something trying to take an exclusive lock on that catalog. The best solution would be "don't do that" ... > It ended up locking up about 250 customer databases until I restarted the > postmaster. This is version 8.1.4. Upgrading right now (even to a minor rev) > is not really an option. Not related to the immediate problem, but: you really need to schedule 5 minutes' downtime so you can update to 8.1.latest. Reasons can be found here: http://www.postgresql.org/docs/8.1/static/release.html regards, tom lane
Tom Lane wrote:
> "Scott Whitney" <swhitney@journyx.com> writes:
> > It ended up locking up about 250 customer databases until I restarted the
> > postmaster. This is version 8.1.4. Upgrading right now (even to a minor rev)
> > is not really an option.
>
> Not related to the immediate problem, but: you really need to schedule
> 5 minutes' downtime so you can update to 8.1.latest. Reasons can be
> found here:
> http://www.postgresql.org/docs/8.1/static/release.html

Particularly so if you run autovacuum, because otherwise you'll be hit by a
bug fixed in 8.1.6 about template0 being unfrozen (thawed?).

--
Alvaro Herrera                               http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
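A quick way to see how close any database (template0 included) is getting to
the wraparound horizon is to look at the age of its frozen xid. A sketch,
assuming the 8.1 catalogs still expose datfrozenxid in pg_database:

    -- Large and steadily growing numbers for template0 would be a sign of
    -- the bug Alvaro describes.
    SELECT datname, age(datfrozenxid) AS xid_age
      FROM pg_database
     ORDER BY age(datfrozenxid) DESC;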
On Tue, Jan 13, 2009 at 10:37 AM, Scott Whitney <swhitney@journyx.com> wrote:
> It ended up locking up about 250 customer databases until I restarted the
> postmaster. This is version 8.1.4. Upgrading right now (even to a minor rev)
> is not really an option. This box has been up and running for 306 days. This
> postgres level has been installed for..err...well...at least Aug 9, 2006,
> based on some dates in the directories.

You need to ask yourself how much downtime you can afford: the 2 or 3
minutes every few months to go from 8.1.x to 8.1.x+1, or the half a day of
downtime when some horrendous bug takes down the whole site because you
didn't update it.

Seriously, that unfrozen template0 bug that Alvaro mentioned is one of those
kinds of bugs. There's nothing like your db going down in the middle of the
day with an error message saying it's shutting down to prevent transaction ID
wraparound data loss, please run vacuum on all your databases in single-user
mode. If you can't set aside a minute or two at 0200 hrs, then don't be
surprised when you get one of those failures.
Thanks for all the information, guys. I think Tom was right. Our application
was doing a couple of full vacs at the same time. It's weird that we didn't
run into this in the past.

You're all absolutely right about the upgrading, but in our environment it's
not 2-3 minutes, it's 2-3 weeks. I've got to fully vet the app on the
platform internally with full test plans, etc., even for the most minor
upgrades; corp policy. Right now, my effort is going into moving to the
latest stable branch. Moving forward, I will use these notes to get the
company to revisit the minor-upgrade policy, though. After all, when I _do_
get hit by one of those bugs, I _will_ be asked why we weren't upgraded.
*sigh*
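One way to keep the concurrent-full-vacuum collision from recurring, if the
weekly job can be made to name its targets, is to run the maintenance one
database at a time and list the user tables explicitly, so that two sessions
never both go after an exclusive lock on the shared catalogs. A sketch only;
the table names below are placeholders, not anything from the actual schema:

    -- Weekly maintenance, run against one customer database at a time.
    -- Naming tables explicitly keeps the job away from pg_authid and the
    -- other shared catalogs.
    VACUUM FULL ANALYZE public.timesheet_entries;
    VACUUM FULL ANALYZE public.projects;
    VACUUM FULL ANALYZE public.users;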
On Tue, Jan 13, 2009 at 3:35 PM, Scott Whitney <swhitney@journyx.com> wrote:
> Thanks for all the information, guys. I think Tom was right. Our application
> was doing a couple of full vacs at the same time. It's weird that we didn't
> run into this in the past.
>
> You're all absolutely right about the upgrading, but in our environment,
> it's not 2-3 minutes. It's 2-3 weeks. I've got to fully vet the app on the
> platform internally with full test plans, etc, even for the most minor
> upgrades; corp policy.

That's preparation time versus execution time. While it may take 2-3 weeks to
get the testing done, once it's done, the actual downtime is minimal.

I would suggest that your current testing methodology is horribly flawed. I
know it was foisted on you by outside forces, so it's not really "your" idea.
But you know what I mean. There's the cost of possible data loss / downtime
due to unforeseen problems caused by an update, and there's the cost of
possible data loss / downtime due to known flaws in your current code. If
your testing methodology were strict enough, it would have found the flaws
that were fixed in subsequent releases, but it hasn't. Meantime, you spend
weeks testing for possible new bugs while running known-broken code with real
data-loss / downtime-inducing bugs.

There's a reasonable cost / benefits analysis to be done here, and right now
someone's set the cost far too high for far too few benefits. Been where you
are, and I know how much it sucks.