Thread: Deadlock situation?
I have an app that we were load testing - it maintains a pool of connections to PostgreSQL 8.3.1.

Suddenly, after running fine for weeks, the app hung - unable to get a free connection from the pool.

"select * from pg_stat_activity;" shows me that most of my connections are in a COMMIT phase:

 03:05:37.73064-05 | 2008-04-24 03:05:38.419796-05 | 2008-04-24 02:11:53.908518-05 | 127.0.0.1 | 53807
 16385 | ispaaa |  953 | 16384 | pslogin | COMMIT | f | 2008-04-24

While some are in a SELECT:

 16385 | ispaaa | 1181 | 16384 | pslogin | select dynamichos0_.ethernetmacaddr as ethernet1_0_, dynamichos0_.ipaddr as ipaddr0_, dynamichos0_.cpemac as cpemac0_, dynamichos0_.regtime as regtime0_, dynamichos0_.leasetime as leasetime0_, dynamichos0_.last_updated as last5_0_ from iphost dynamichos0_, cpe cpe1_ where dynamichos0_.cpemac=cpe1_.cpemac and 1=1 and dynamichos0_.ethernetmacaddr=$1 and dynamichos0_.cpemac=$2 and cpe1_.regBaseId=$3 and dynamichos0_.ipaddr<>$4 | f | 2008-04-24 03:05:37.734041-05 | 2008-04-24 03:05:38.405854-05 | 2008-04-24 02:41:54.413337-05 | 127.0.0.1 | 55363

Perhaps VACUUM had something to do with it?:

 16385 | ispaaa | 8956 | 16384 | pslogin | delete from iphost where leasetime<$1 | f | 2008-04-24 18:43:29.920069-05 | 2008-04-24 18:43:30.116435-05 | 2008-04-24 18:41:59.071032-05 | 127.0.0.1 | 49069
 16385 | ispaaa | 1618 |    10 | postgres | autovacuum: VACUUM public.iphost | f | 2008-04-24 03:05:13.212436-05 | 2008-04-24 03:05:13.212436-05 | 2008-04-24 03:05:12.526611-05 | |

Where should I begin to look for the source of this problem?

Thanks for any info,

Dan
On Apr 29, 2008, at 4:54 PM, Dan Armbrust wrote:

> Suddenly, after running fine for weeks, the app hung - unable to get a
> free connection from the pool.
>
> Where should I begin to look for the source of this problem?

Well, you can look in pg_locks to see if there are outstanding locks waiting on already granted, conflicting locks. This isn't a deadlock situation, though; Postgres will detect those and kill one of the offending processes so that the others can finish (it'll leave a log message about it, too).

My guess is that you've got some long-running write/DDL query that's got a heavy lock on iphost, or you have a LOT of queries that need heavy locks hitting the table at once. How large is iphost? How many of those deletes have you got going on it? Do you also have concurrent updates running against it? Do you have any DDL queries running against it (alter tables, index builds/drops, etc...)?

Erik Jones

DBA | Emma®
erik@myemma.com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com
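A rough sketch of the pg_locks check Erik describes - a hypothetical query, assuming the 8.3 catalogs (where pg_stat_activity identifies backends by procpid):

-- Show ungranted lock requests alongside sessions that already hold a
-- lock on the same relation, plus what those holders are running.
SELECT waiting.pid                 AS waiting_pid,
       waiting.mode                AS wanted_mode,
       waiting.relation::regclass  AS relation,
       holder.pid                  AS holding_pid,
       holder.mode                 AS held_mode,
       act.current_query           AS holder_query
  FROM pg_locks waiting
  JOIN pg_locks holder
    ON holder.relation = waiting.relation
   AND holder.granted
   AND holder.pid <> waiting.pid
  JOIN pg_stat_activity act
    ON act.procpid = holder.pid
 WHERE NOT waiting.granted;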
Semantics of deadlock aside, I seem to be deadlocked, yet Postgres didn't detect any deadlock situation.

There are no DDL queries running. Just a lot of updates, and some inserts and deletes.

I had to restart Postgres to let things recover - so I can't gather any more info right now - but if/when it happens again, I'd like to know what else to gather.

Looking at the timestamps, the transaction start timestamp of this autovacuum query is the oldest one:

autovacuum: VACUUM public.iphost    2008-04-24 03:05:13.212436-05

Then, between 03:05:37 and 03:05:38, nearly every other connection came to a halt. A few connections came to a halt several hours later. I'm baffled, because this autovacuum query seems to have locked the entire database.

I also don't know exactly what the "waiting" column means in the output - every connection has it set to "f". Does a "t" in that column mean that the backend is waiting on a lock?

Thanks,
Dan
Dan Armbrust wrote:
> "select * from pg_stat_activity;" shows me that most of my connections
> are in a COMMIT phase:

Do you have deferred constraints? Maybe some of them are missing indexes.

-- 
Alvaro Herrera                       http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
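A quick, hypothetical way to check Alvaro's question against the 8.3 catalogs - list constraints declared DEFERRABLE and whether they default to being checked at COMMIT:

-- condeferrable: constraint may be deferred; condeferred: INITIALLY DEFERRED,
-- i.e. checked at COMMIT time rather than per statement.
SELECT conname,
       conrelid::regclass AS table_name,
       condeferrable,
       condeferred
  FROM pg_constraint
 WHERE condeferrable;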
"Dan Armbrust" <daniel.armbrust.list@gmail.com> writes: > I had to restart postgres to let things recover - so I can't gather > any more info right now - but if/when it happens again, I'd like to > know what else to gather. Well, there went the evidence :-( ... but what exactly did you have to do to shut it down? I'm wondering whether the backends responded to SIGINT or SIGTERM. Next time, it'd be good to confirm (with top or vmstat or similar) whether the backends are actually idle or are eating CPU or I/O. Also try strace'ing a few of them; the pattern of kernel calls if any would be revealing. The lack of deadlock reports or 't' values in pg_stat_activity.waiting says that you weren't blocking on heavyweight locks. It's not impossible that there was a deadlock at the LWLock level, though. What sort of indexes are there on this table? Teodor just fixed an issue in GIN indexes that involved taking an unreasonable number of LWLocks, and if that code wasn't exposing itself to deadlock risks I'd be pretty surprised. regards, tom lane
> Well, there went the evidence :-( ... but what exactly did you have
> to do to shut it down? I'm wondering whether the backends responded
> to SIGINT or SIGTERM.

Sorry :(

First, I tried issuing a kill to the pid of the vacuum process. It didn't respond to kill. Then, I tried a kill -9. It responded to that :)

However, Postgres got very unhappy at this point - I tried to log in to issue another query against the pg_stat_activity table, but every command I issued at this point simply hung.

In this case, Postgres had been started in the foreground on a terminal, so I went to that terminal and did a ctrl-c. Eventually, Postgres stopped, but the terminal wouldn't respond either - and I had to close it.

> Next time, it'd be good to confirm (with top or vmstat or similar)
> whether the backends are actually idle or are eating CPU or I/O.

I didn't notice any high load on the processor - the server seemed idle. But I didn't look real close at the individual postgres processes.

> Also try strace'ing a few of them; the pattern of kernel calls if
> any would be revealing.
>
> The lack of deadlock reports or 't' values in pg_stat_activity.waiting
> says that you weren't blocking on heavyweight locks. It's not
> impossible that there was a deadlock at the LWLock level, though.
>
> What sort of indexes are there on this table?

Here is the DDL for the table where vacuum was trying to run:

CREATE TABLE iphost
(
  ethernetmacaddr char(17) NOT NULL,
  cpemac char(11) NOT NULL,
  ipaddr varchar(15) NOT NULL,
  regtime timestamp NOT NULL,
  leasetime timestamp,
  last_updated timestamp NOT NULL DEFAULT now(),
  CONSTRAINT pk_iphost PRIMARY KEY (ethernetmacaddr, ipaddr),
  CONSTRAINT fk_iphost_cpe FOREIGN KEY (cpemac)
      REFERENCES cpe (cpemac) MATCH SIMPLE
      ON UPDATE CASCADE ON DELETE CASCADE
)
WITHOUT OIDS;
ALTER TABLE iphost OWNER TO pslogin;

CREATE INDEX ix_iphost_cpemac ON iphost USING btree (cpemac);
CREATE INDEX ix_iphost_ipaddr ON iphost USING btree (ipaddr);

> Teodor just fixed
> an issue in GIN indexes that involved taking an unreasonable number of
> LWLocks, and if that code wasn't exposing itself to deadlock risks
> I'd be pretty surprised.

Thanks. If/when I can recreate this, I'll try to gather more info.

Dan
"Dan Armbrust" <daniel.armbrust.list@gmail.com> writes: >> What sort of indexes are there on this table? > [ just btrees ] Okay, that lets the GIN stuff off the hook. I wonder whether you saw some variant of the deadlock reported in bug #3883 --- see discussion here: http://archives.postgresql.org/pgsql-hackers/2008-01/msg00873.php The only known way of getting into that problem is fixed in 8.3, but a potential for silent deadlock is definitely still there. I don't however have any idea how such a deadlock could occur without at least some processes showing as 'waiting' in pg_stat_activity. Do you have any idea what the transactions that were blocked at COMMIT had been doing? Are there any foreign key constraints leading *to* this table? regards, tom lane
> I wonder whether you saw some variant of the deadlock reported in bug
> #3883 --- see discussion here:
> http://archives.postgresql.org/pgsql-hackers/2008-01/msg00873.php
> The only known way of getting into that problem is fixed in 8.3,
> but a potential for silent deadlock is definitely still there.
> I don't however have any idea how such a deadlock could occur without
> at least some processes showing as 'waiting' in pg_stat_activity.
>
> Do you have any idea what the transactions that were blocked at
> COMMIT had been doing?

Should have just been selects and updates, and perhaps some inserts and deletes - nothing too out of the ordinary or complex. If I can get this to happen consistently, I'll set up a logger to capture all of the SQL statements leading up to the freeze.

> Are there any foreign key constraints leading *to* this table?

Nope.

Unfortunately, I had been running for a couple of weeks steady before I got the error - so reproducing it may not be quick. But I'm going to set up a couple of parallel tests on more systems, and see if I can happen into it again sooner.

Oh, and I don't think it should have anything to do with it, but I did discover that fsync was turned off on this postgres instance. Someone else didn't clean up after a different test, and I forgot to change it back.

Thanks,
Dan
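An illustrative sketch of server-side statement logging that would capture the SQL leading up to a freeze - the settings below are examples for postgresql.conf, and SHOW confirms what the running server actually uses:

-- In postgresql.conf, before the next test run (then reload/restart):
--   log_statement = 'all'
--   log_line_prefix = '%m %p '   -- timestamp and backend pid
-- Confirm the live settings, including the accidentally disabled fsync:
SHOW log_statement;
SHOW fsync;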
> In this case, Postgres had been started in the foreground on a
> terminal, so I went to that terminal and did a ctrl-c. Eventually,
> Postgres stopped, but the terminal wouldn't respond either - and I had
> to close it.

Just out of curiosity, could you maybe have XOFF'd the terminal? I've had cases in the past where Postgres got unhappy and froze when it couldn't write logging information to stderr. (Though, granted, in the cases I hit, the server's stderr was redirected to a pipe that was full, so it's not 100% analogous.)

-Dan
On Wed, Apr 30, 2008 at 12:36 PM,  <dan@sidhe.org> wrote:
> Just out of curiosity, could you maybe have XOFF'd the terminal? I've had
> cases in the past where Postgres got unhappy and froze when it couldn't
> write logging information to stderr.

It's certainly possible that something "interesting" happened to the terminal. We do have an issue with a KVM switch on this machine that sometimes causes all sorts of unpredictable random garbage input to hit the screen when you switch to the system.

If I can't reproduce the problem (which so far I have not been able to), I'll probably let it go, perhaps naively believing that some terminal issue set off the chain of events.

Thanks,
Dan
On Thu, May 1, 2008 at 9:38 AM, Dan Armbrust
<daniel.armbrust.list@gmail.com> wrote:
> It's certainly possible that something "interesting" happened to the
> terminal. We do have an issue with a KVM switch on this machine that
> sometimes causes all sorts of unpredictable random garbage input to
> hit the screen when you switch to the system.

At a place I worked 5 to 10 years or so ago, I specifically did NOT allow them to hook up the KVM switch to my primary servers for just this reason, plus it kept the junior super administrators from doing stupid things like hitting CTRL-ALT-DELETE on one of my servers. On the very, very rare occasion I needed to actually sit in front of the server, I'd hook up my one KVM connector and do what I had to, then I'd disconnect it and do everything else remotely.

Admittedly, it wasn't the KVM that was at fault there, but it sure did cause some heartache when someone sat down and hit CTRL-ALT-DEL to log into Windows without looking at the screen and seeing that the last machine it was on was a Red Hat box...