Thread: Deadlock situation?

Deadlock situation?

From: "Dan Armbrust"
Date:
I have an app that we were load testing - it maintains a pool of
connections to PostgreSQL - 8.3.1


Suddenly, after running fine for weeks, the app hung - unable to get a
free connection from the pool.


"select * from pg_stat_activity;" shows me that most of my connections
are in a COMMIT phase:

 16385 | ispaaa  |     953 |    16384 | pslogin  | COMMIT       | f       |
 2008-04-24 03:05:37.73064-05  | 2008-04-24 03:05:38.419796-05 |
 2008-04-24 02:11:53.908518-05 | 127.0.0.1   |       53807


While some are in a SELECT:

 16385 | ispaaa  |    1181 |    16384 | pslogin  | select
dynamichos0_.ethernetmacaddr as ethernet1_0_, dynamichos0_.ipaddr as
ipaddr0_, dynamichos0_.cpemac as cpemac0_, dynamichos0_.regtime as
regtime0_, dynamichos0_.leasetime as leasetime0_,
dynamichos0_.last_updated as last5_0_ from iphost dynamichos0_, cpe
cpe1_ where  dynamichos0_.cpemac=cpe1_.cpemac and 1=1 and
dynamichos0_.ethernetmacaddr=$1 and dynamichos0_.cpemac=$2 and
cpe1_.regBaseId=$3 and dynamichos0_.ipaddr<>$4    | f       |
2008-04-24 03:05:37.734041-05 | 2008-04-24 03:05:38.405854-05 |
2008-04-24 02:41:54.413337-05 | 127.0.0.1   |       55363


Perhaps VACUUM had something to do with it?:

 16385 | ispaaa  |    8956 |    16384 | pslogin  | delete from iphost
where leasetime<$1      | f       | 2008-04-24 18:43:29.920069-05 |
2008-04-24 18:43:30.116435-05 | 2008-04-24 18:41:59.071032-05 |
127.0.0.1   |       49069

 16385 | ispaaa  |    1618 |       10 | postgres | autovacuum:
VACUUM public.iphost      | f       | 2008-04-24 03:05:13.212436-05 |
2008-04-24 03:05:13.212436-05 | 2008-04-24 03:05:12.526611-05 |
     |


Where should I begin to look for the source of this problem?

Thanks for any info,

Dan

Re: Deadlock situation?

From: Erik Jones
Date:
On Apr 29, 2008, at 4:54 PM, Dan Armbrust wrote:

> I have an app that we were load testing - it maintains a pool of
> connections to PostgreSQL - 8.3.1
>
>
> Suddenly, after running fine for weeks, the app hung - unable to get a
> free connection from the pool.
>
>
> "select * from pg_stat_activity;" shows me that most of my connections
> are in a COMMIT phase:
>
> 16385 | ispaaa  |     953 |    16384 | pslogin  | COMMIT       | f       |
> 2008-04-24 03:05:37.73064-05  | 2008-04-24 03:05:38.419796-05 |
> 2008-04-24 02:11:53.908518-05 | 127.0.0.1   |       53807
>
>
> While some are in a SELECT:
>
> 16385 | ispaaa  |    1181 |    16384 | pslogin  | select
> dynamichos0_.ethernetmacaddr as ethernet1_0_, dynamichos0_.ipaddr as
> ipaddr0_, dynamichos0_.cpemac as cpemac0_, dynamichos0_.regtime as
> regtime0_, dynamichos0_.leasetime as leasetime0_,
> dynamichos0_.last_updated as last5_0_ from iphost dynamichos0_, cpe
> cpe1_ where  dynamichos0_.cpemac=cpe1_.cpemac and 1=1 and
> dynamichos0_.ethernetmacaddr=$1 and dynamichos0_.cpemac=$2 and
> cpe1_.regBaseId=$3 and dynamichos0_.ipaddr<>$4    | f       |
> 2008-04-24 03:05:37.734041-05 | 2008-04-24 03:05:38.405854-05 |
> 2008-04-24 02:41:54.413337-05 | 127.0.0.1   |       55363
>
>
> Perhaps VACUUM had something to do with it?:
>
> 16385 | ispaaa  |    8956 |    16384 | pslogin  | delete from iphost
> where leasetime<$1      | f       | 2008-04-24 18:43:29.920069-05 |
> 2008-04-24 18:43:30.116435-05 | 2008-04-24 18:41:59.071032-05 |
> 127.0.0.1   |       49069
>
> 16385 | ispaaa  |    1618 |       10 | postgres | autovacuum:
> VACUUM public.iphost      | f       | 2008-04-24 03:05:13.212436-05 |
> 2008-04-24 03:05:13.212436-05 | 2008-04-24 03:05:12.526611-05 |
>     |
>
>
> Where should I begin to look for the source of this problem?
>
> Thanks for any info,

Well, you can look in pg_locks to see if there are outstanding locks
waiting on already granted conflicting locks.  This isn't a deadlock
situation, though; Postgres will detect those and kill one of the
offending processes so that the others can finish (it'll leave a log
message about it, too).  My guess is that you've got some long-running
write/DDL query that's got a heavy lock on iphost, or you have a LOT of
queries that need heavy locks hitting the table at once.  How large is
iphost?  How many of those deletes have you got going on it?  Do you
also have concurrent updates running against it?  Do you have any ddl
queries running against it (alter tables, index builds/drops, etc...)?
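
For example, a rough starting point (8.3 column names -- note that
pg_stat_activity uses procpid, not pid) might be something like:

  -- pair each lock entry with that backend's current query;
  -- with ORDER BY granted, the waiting (granted = f) rows sort first
  SELECT l.pid, l.locktype, l.relation::regclass AS relation,
         l.mode, l.granted, a.current_query
  FROM   pg_locks l
         JOIN pg_stat_activity a ON a.procpid = l.pid
  ORDER  BY l.granted, l.relation, l.pid;

Any row with granted = f is queued behind a conflicting lock held by
one of the granted rows on the same object.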

Erik Jones

DBA | Emma®
erik@myemma.com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com




Re: Deadlock situation?

From: "Dan Armbrust"
Date:
Semantics of deadlock aside, I seem to be deadlocked, yet postgres
didn't detect any deadlock situation.

There are no DDL queries running.  Just a lot of updates, and some
inserts and deletes.

I had to restart postgres to let things recover - so I can't gather
any more info right now - but if/when it happens again, I'd like to
know what else to gather.

Looking at the time stamps, the transaction start timestamp of this
autovacuum query is the oldest one:

 autovacuum: VACUUM public.iphost   2008-04-24 03:05:13.212436-05 |

Then, between 03:05:37 and 03:05:38, nearly every other connection
came to a halt.

A few connections came to a halt several hours later.

I'm baffled, because this autovacuum query seems to have locked the
entire database.

I also don't know what the "waiting" column means in the output - but
they all have the flag of "f".  Does that column mean that the backend
is waiting on a lock - t or f?

Thanks,

Dan

Re: Deadlock situation?

From: Alvaro Herrera
Date:
Dan Armbrust escribió:

> "select * from pg_stat_activity;" shows me that most of my connections
> are in a COMMIT phase:
>
>  16385 | ispaaa  |     953 |    16384 | pslogin  | COMMIT       | f       |
>  2008-04-24 03:05:37.73064-05  | 2008-04-24 03:05:38.419796-05 |
>  2008-04-24 02:11:53.908518-05 | 127.0.0.1   |       53807

Do you have deferred constraints?  Maybe some of them are missing
indexes.
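
(If you want to check, deferrable/deferred FK constraints can be listed
straight from the catalogs -- contype, condeferrable and condeferred are
standard pg_constraint columns:

  SELECT conname,
         conrelid::regclass  AS referencing_table,
         confrelid::regclass AS referenced_table,
         condeferrable, condeferred
  FROM   pg_constraint
  WHERE  contype = 'f'
  AND    (condeferrable OR condeferred);

Missing indexes on the referencing columns mostly hurt when rows in the
referenced table are updated or deleted, since each FK check then has to
scan the referencing table.)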

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Deadlock situation?

From: Tom Lane
Date:
"Dan Armbrust" <daniel.armbrust.list@gmail.com> writes:
> I had to restart postgres to let things recover - so I can't gather
> any more info right now - but if/when it happens again, I'd like to
> know what else to gather.

Well, there went the evidence :-( ... but what exactly did you have
to do to shut it down?  I'm wondering whether the backends responded
to SIGINT or SIGTERM.

Next time, it'd be good to confirm (with top or vmstat or similar)
whether the backends are actually idle or are eating CPU or I/O.
Also try strace'ing a few of them; the pattern of kernel calls if
any would be revealing.

The lack of deadlock reports or 't' values in pg_stat_activity.waiting
says that you weren't blocking on heavyweight locks.  It's not
impossible that there was a deadlock at the LWLock level, though.
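
(For reference, any backends blocked on a heavyweight lock can be
listed with 8.3's column names like so; an empty result matches what
you saw:

  SELECT procpid, usename, waiting, xact_start, current_query
  FROM   pg_stat_activity
  WHERE  waiting;
)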

What sort of indexes are there on this table?  Teodor just fixed
an issue in GIN indexes that involved taking an unreasonable number of
LWLocks, and if that code wasn't exposing itself to deadlock risks
I'd be pretty surprised.

            regards, tom lane

Re: Deadlock situation?

From: "Dan Armbrust"
Date:
>  Well, there went the evidence :-( ... but what exactly did you have
>  to do to shut it down?  I'm wondering whether the backends responded
>  to SIGINT or SIGTERM.

Sorry :(

First, I tried issuing a kill to the pid of the vacuum process.  It
didn't respond to kill.
Then, I tried a kill -9.  It responded to that :)

However, postgres got very unhappy at this point - I tried to log in
to issue another query against the pg_stat_activity table, but every
command I issued at this point simply hung.

In this case, Postgres had been started in the foreground on a
terminal, so I went to that terminal, and did a ctrl-c.  Eventually,
postgres stopped, but the terminal wouldn't respond either - and I had
to close it.

>
>  Next time, it'd be good to confirm (with top or vmstat or similar)
>  whether the backends are actually idle or are eating CPU or I/O.

I didn't notice any high load on the processor - the server seemed
idle.  But I didn't look real close at the individual postgres
processes.

>  Also try strace'ing a few of them; the pattern of kernel calls if
>  any would be revealing.
>
>  The lack of deadlock reports or 't' values in pg_stat_activity.waiting
>  says that you weren't blocking on heavyweight locks.  It's not
>  impossible that there was a deadlock at the LWLock level, though.
>
>  What sort of indexes are there on this table?

Here is the DDL for the table where vacuum was trying to run:

CREATE TABLE iphost
(
  ethernetmacaddr char(17) NOT NULL,
  cpemac char(11) NOT NULL,
  ipaddr varchar(15) NOT NULL,
  regtime timestamp NOT NULL,
  leasetime timestamp,
  last_updated timestamp NOT NULL DEFAULT now(),
  CONSTRAINT pk_iphost PRIMARY KEY (ethernetmacaddr, ipaddr),
  CONSTRAINT fk_iphost_cpe FOREIGN KEY (cpemac)
      REFERENCES cpe (cpemac) MATCH SIMPLE
      ON UPDATE CASCADE ON DELETE CASCADE
)
WITHOUT OIDS;
ALTER TABLE iphost OWNER TO pslogin;

CREATE INDEX ix_iphost_cpemac
  ON iphost
  USING btree
  (cpemac);

CREATE INDEX ix_iphost_ipaddr
  ON iphost
  USING btree
  (ipaddr);
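
Those are all plain btree indexes (plus the pk_iphost primary key).  If
it helps, a generic catalog query like this should confirm the access
method of every index on the table:

  SELECT c.relname AS index_name, am.amname AS index_type
  FROM   pg_class t
         JOIN pg_index i  ON i.indrelid = t.oid
         JOIN pg_class c  ON c.oid = i.indexrelid
         JOIN pg_am    am ON am.oid = c.relam
  WHERE  t.relname = 'iphost';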


> Teodor just fixed
>  an issue in GIN indexes that involved taking an unreasonable number of
>  LWLocks, and if that code wasn't exposing itself to deadlock risks
>  I'd be pretty surprised.
>
>                         regards, tom lane


Thanks.  If/when I can recreate this, I'll try to gather more info.

Dan

Re: Deadlock situation?

From: Tom Lane
Date:
"Dan Armbrust" <daniel.armbrust.list@gmail.com> writes:
>> What sort of indexes are there on this table?
> [ just btrees ]

Okay, that lets the GIN stuff off the hook.

I wonder whether you saw some variant of the deadlock reported in bug
#3883 --- see discussion here:
http://archives.postgresql.org/pgsql-hackers/2008-01/msg00873.php
The only known way of getting into that problem is fixed in 8.3,
but a potential for silent deadlock is definitely still there.
I don't however have any idea how such a deadlock could occur without
at least some processes showing as 'waiting' in pg_stat_activity.

Do you have any idea what the transactions that were blocked at
COMMIT had been doing?

Are there any foreign key constraints leading *to* this table?

            regards, tom lane

Re: Deadlock situation?

From: "Dan Armbrust"
Date:
>  I wonder whether you saw some variant of the deadlock reported in bug
>  #3883 --- see discussion here:
>  http://archives.postgresql.org/pgsql-hackers/2008-01/msg00873.php
>  The only known way of getting into that problem is fixed in 8.3,
>  but a potential for silent deadlock is definitely still there.
>  I don't however have any idea how such a deadlock could occur without
>  at least some processes showing as 'waiting' in pg_stat_activity.
>
>  Do you have any idea what the transactions that were blocked at
>  COMMIT had been doing?

Should have just been selects and updates, and perhaps some inserts
and deletes - nothing too out of the ordinary or complex.

If I can get this to happen consistently, I'll set up a logger to
capture all of the SQL statements leading up to the freeze.
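
Presumably the simplest capture is plain statement logging in
postgresql.conf, something along these lines (all of these settings
exist in 8.3; the prefix format is just a suggestion):

  log_statement = 'all'
  log_line_prefix = '%m %p %u %d '   # timestamp, pid, user, database
  log_lock_waits = on                # report waits longer than deadlock_timeout

with log_lock_waits hopefully flagging whatever the backends end up
stuck behind next time.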

>
>  Are there any foreign key constraints leading *to* this table?

Nope.

Unfortunately, I had been running for a couple of weeks steady before
I got the error - so reproducing it may not be quick.  But I'm going
to set up a couple of parallel tests on more systems, and see if I can
happen into it again sooner.

Oh, and I don't think it should have anything to do with it, but I
did discover that fsync was turned off on this postgres instance.
Someone else didn't clean up after a different test, and I forgot to
change it back.

Thanks,

Dan

Re: Deadlock situation?

From: dan@sidhe.org
Date:
> In this case, Postgres had been started in the foreground on a
> terminal, so I went to that terminal, and did a ctrl-c.  Eventually,
> postgres stopped, but the terminal wouldn't respond either - and I had
> to close it.

Just out of curiosity, could you maybe have XOFF'd the terminal? I've had
cases in the past where Postgres got unhappy and froze when it couldn't
write logging information to stderr. (Though, granted, in the cases I hit
the server's stderr was redirected to a pipe that was full, so it's not
100% analogous)

-Dan

Re: Deadlock situation?

From: "Dan Armbrust"
Date:
On Wed, Apr 30, 2008 at 12:36 PM,  <dan@sidhe.org> wrote:
> > In this case, Postgres had been started in the foreground on a
>  > terminal, so I went to that terminal, and did a ctrl-c.  Eventually,
>  > postgres stopped, but the terminal wouldn't respond either - and I had
>  > to close it.
>
>  Just out of curiosity, could you maybe have XOFF'd the terminal? I've had
>  cases in the past where Postgres got unhappy and froze when it couldn't
>  write logging information to stderr. (Though, granted, in the cases I hit
>  the server's stderr was redirected to a pipe that was full, so it's not
>  100% analogous)

It's certainly possible that something "interesting" happened to the
terminal.  We do have an issue with a KVM switch on this machine that
sometimes causes all sorts of unpredictable random garbage input to
hit the screen when you switch to the system.

If I can't reproduce the problem (which so far I have not been able
to), I'll probably let it go, perhaps naively believing that some
terminal issue set off the chain of events.

Thanks,

Dan

Re: Deadlock situation?

From: "Scott Marlowe"
Date:
On Thu, May 1, 2008 at 9:38 AM, Dan Armbrust
<daniel.armbrust.list@gmail.com> wrote:
> On Wed, Apr 30, 2008 at 12:36 PM,  <dan@sidhe.org> wrote:
>  > > In this case, Postgres had been started in the foreground on a
>  >  > terminal, so I went to that terminal, and did a ctrl-c.  Eventually,
>  >  > postgres stopped, but the terminal wouldn't respond either - and I had
>  >  > to close it.
>  >
>  >  Just out of curiosity, could you maybe have XOFF'd the terminal? I've had
>  >  cases in the past where Postgres got unhappy and froze when it couldn't
>  >  write logging information to stderr. (Though, granted, in the cases I hit
>  >  the server's stderr was redirected to a pipe that was full, so it's not
>  >  100% analogous)
>
>  It's certainly possible that something "interesting" happened to the
>  terminal.  We do have an issue with a KVM switch on this machine that
>  sometimes causes all sorts of unpredictable random garbage input to
>  hit the screen when you switch to the system.

At a place I worked 5 to 10 years or so ago, I specifically did NOT
allow them to hook up the KVM switch to my primary servers for just
this reason, plus it kept the junior super administrators from doing
stupid things like hitting CTRL-ALT-DELETE on one of my servers.  On
the very very rare occasion I needed to actually sit in front of the
server I'd hook up my one kvm connector and do what I had to, then I'd
disconnect it and do everything else remotely.

Admittedly, it wasn't the KVM that was at fault there, but it sure did
cause some heartache when someone sat down and hit CTRL-ALT-DEL to
log into Windows without looking at the screen and seeing that the
last machine it was on was a Red Hat box...