Thread: How to debug a locked backend ?

How to debug a locked backend ?

From
Csaba Nagy
Date:
Hi all,

I have right now a locked backend, similar to what I had a few days ago.
It won't answer to pg_cancel_backend.
Is there a way to diagnoze what it is actually doing ?

If you answer, please do it quickly as in max 15 minutes I'll kill -9
...

Thanks,
Csaba.



Re: How to debug a locked backend ?

From
Martijn van Oosterhout
Date:
Attach with gdb:

gdb -p [pid]

You may need to specify the the executable on some platforms.

On Fri, Nov 18, 2005 at 12:13:00PM +0100, Csaba Nagy wrote:
> Hi all,
>
> I have right now a locked backend, similar to what I had a few days ago.
> It won't answer to pg_cancel_backend.
> Is there a way to diagnoze what it is actually doing ?
>
> If you answer, please do it quickly as in max 15 minutes I'll kill -9
> ...
>
> Thanks,
> Csaba.
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings

--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment

Re: How to debug a locked backend ?

From
"A.j. Langereis"
Date:
First of all: "TIP 2: Don't 'kill -9' the postmaster"

On topic: What do you mean with "locked"? Is it using 100% CPU? Or does it
give any error messages? If so, what are they?
If there is a deadlock situation, Postmaster should detect it and will then
cancel the query.

Yours,

Aarjan
----- Original Message -----
From: "Csaba Nagy" <nagy@ecircle-ag.com>
To: "Postgres general mailing list" <pgsql-general@postgresql.org>
Sent: Friday, November 18, 2005 12:13 PM
Subject: [GENERAL] How to debug a locked backend ?


> Hi all,
>
> I have right now a locked backend, similar to what I had a few days ago.
> It won't answer to pg_cancel_backend.
> Is there a way to diagnoze what it is actually doing ?
>
> If you answer, please do it quickly as in max 15 minutes I'll kill -9
> ...
>
> Thanks,
> Csaba.
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings
>
>



Re: How to debug a locked backend ?

From
Richard Huxton
Date:
Csaba Nagy wrote:
> Hi all,
>
> I have right now a locked backend, similar to what I had a few days ago.
> It won't answer to pg_cancel_backend.
> Is there a way to diagnoze what it is actually doing ?
>
> If you answer, please do it quickly as in max 15 minutes I'll kill -9

Have you tried "strace -p <pid>" to see what calls it's making?

--
   Richard Huxton
   Archonet Ltd

Re: How to debug a locked backend ?

From
Tom Lane
Date:
Csaba Nagy <nagy@ecircle-ag.com> writes:
> I have right now a locked backend, similar to what I had a few days ago.
> It won't answer to pg_cancel_backend.
> Is there a way to diagnoze what it is actually doing ?

> If you answer, please do it quickly as in max 15 minutes I'll kill -9
> ...

This post was pretty much a complete waste of bandwidth then, seeing
that pggeneral never has a turnaround time as short as 15 min (it seems
to be more like 2 hours today :-().

However, if it happens again, please see what you can learn by attaching
to the stuck backend with gdb and getting a stack trace.

            regards, tom lane

Re: How to debug a locked backend ?

From
Csaba Nagy
Date:
Richar, Martijn,

Thanks for answering, but I had to kill the process in the meantime. I
tried kill -11 in the hope it will produce a core dump at least, but it
either didn't dump core or I don't know where to look for it as I can't
find it.

In any case, this is the second time I experience such a lock-up this
week, so I will definitely need to find out what's going on.

I would exclude hardware failure, as it happened exactly with the same
process, involving exactly the same queries/table and the same failure
symptoms, which is not characteristic for hardware failures (that should
be more random).

So, in order to find out what's going on, what should I do if it happens
again ? Use gdb, and do what ?
Strace is a good idea, I'll do that too if there is a next time.

In any case, this time I have more details of the locks and what was
still running. I'll try to make a test case out of it, maybe I'll catch
the error on a development machine...

Cheers,
Csaba.

On Fri, 2005-11-18 at 14:23, Richard Huxton wrote:
> Csaba Nagy wrote:
> > Hi all,
> >
> > I have right now a locked backend, similar to what I had a few days ago.
> > It won't answer to pg_cancel_backend.
> > Is there a way to diagnoze what it is actually doing ?
> >
> > If you answer, please do it quickly as in max 15 minutes I'll kill -9
>
> Have you tried "strace -p <pid>" to see what calls it's making?
>
> --
>    Richard Huxton
>    Archonet Ltd


Re: How to debug a locked backend ?

From
Csaba Nagy
Date:
> Well, I've had time to read your previous message too.
>
> The first time you seem to imply the machine slowed down across all
> processes - ssh etc. Was that the case this time?

OK, the slowdown/crash was a different problem, which might have been
caused by a "too many files open" problem combined with a huge load.
There was no evidence that the server actually crashed, but we needed to
reboot it because it became completely non-responsive. My current
problem is this:

http://archives.postgresql.org/pgsql-general/2005-11/msg00828.php

This current situation is different, the machine was almost idle, and
there was 1 UPDATE on a table running, a lot of inserts on the same
table waiting for a lock held by the transaction of the UPDATE, and a
VACUUM (started by autovacuum) running on the same table. There was no
dead-lock detected by the system, and I guess it should have detected it
if it was one, cause this whole thing was running for more than 3 hours
according to pg_stat_activity, and I set the deadlock timeout to 2
seconds.

So this happened 2 days ago, and today it happened again with exactly
the same picture: update, inserts, (auto)vacuum, on exactly the same
table as 2 days ago. That can't be a coincidence, there is something bad
either with our code or postgres. I do suspect that our code is doing
something not completely kosher, but it still should not lock in this
way.

Now according to pg_locks, the transaction of the update was holding an
exclusive lock on the table, which I can't explain, as we don't acquire
any table lock in our whole application, and the SQL of the update is
coming from our application, I could identify the process it's part of.
The only explicit locking we do in that process is via a SELECT ... FOR
UPDATE on that table, but that should place no exclusive lock on the
table, right ? So where is that lock coming from ?

And then, why was I not able to cancel the backend via kill or
pg_cancel_backend ?

> When you say "locked" do you mean it was waiting on locks, was using all
> the CPU, unresponsive or just taking the query a long time?
>
> To prepare for next time I'd:
> 1. Leave ssh logged-in, run screen to get three sessions
> 2. Leave "top" running in the first - that'll show you process
> activity/general load
> 3. Run "vmstat 10" in the second - that'll show you overall
> memory/swap/disk/cpu usage.
> 4. The third session is then free to work in, if neither of the first
> two show anything useful.

In that particular case I had a few terminals open, one of them actually
running top and others free. They were not responding either, so the
machine was completely bogged down. I do routinely have a few terminals
open, a psql one included. But that kind of lock-down did not happen
anymore, and we changed server in the meantime.

Cheers,
Csaba.



Re: How to debug a locked backend ?

From
Csaba Nagy
Date:
Hmm, sometimes I get my emails immediately on the list, so it's not
always that bad - I had to try.

I have absolutely no experience using gdb... lots of experience
debugging Java, but that's a different animal.
I also don't have too much time right now learning and experimenting
with gdb, so I would appreciate some hints of specific manual pages to
check out on how to do the connecting/stack trace.

Thanks,
Csaba.

BTW, Tom, my mails are always bounced by your system because some
reverse DNS mixup on my side. I brought this to our admins attention but
it seems they won't fix it...


On Fri, 2005-11-18 at 16:10, Tom Lane wrote:
> Csaba Nagy <nagy@ecircle-ag.com> writes:
> > I have right now a locked backend, similar to what I had a few days ago.
> > It won't answer to pg_cancel_backend.
> > Is there a way to diagnoze what it is actually doing ?
>
> > If you answer, please do it quickly as in max 15 minutes I'll kill -9
> > ...
>
> This post was pretty much a complete waste of bandwidth then, seeing
> that pggeneral never has a turnaround time as short as 15 min (it seems
> to be more like 2 hours today :-().
>
> However, if it happens again, please see what you can learn by attaching
> to the stuck backend with gdb and getting a stack trace.
>
>             regards, tom lane


Re: How to debug a locked backend ?

From
Richard Huxton
Date:
Csaba Nagy wrote:
> Richar, Martijn,
>
> Thanks for answering, but I had to kill the process in the meantime. I
> tried kill -11 in the hope it will produce a core dump at least, but it
> either didn't dump core or I don't know where to look for it as I can't
> find it.
>
> In any case, this is the second time I experience such a lock-up this
> week, so I will definitely need to find out what's going on.
>
> I would exclude hardware failure, as it happened exactly with the same
> process, involving exactly the same queries/table and the same failure
> symptoms, which is not characteristic for hardware failures (that should
> be more random).
>
> So, in order to find out what's going on, what should I do if it happens
> again ? Use gdb, and do what ?
> Strace is a good idea, I'll do that too if there is a next time.

Well, I've had time to read your previous message too.

The first time you seem to imply the machine slowed down across all
processes - ssh etc. Was that the case this time?

When you say "locked" do you mean it was waiting on locks, was using all
the CPU, unresponsive or just taking the query a long time?

To prepare for next time I'd:
1. Leave ssh logged-in, run screen to get three sessions
2. Leave "top" running in the first - that'll show you process
activity/general load
3. Run "vmstat 10" in the second - that'll show you overall
memory/swap/disk/cpu usage.
4. The third session is then free to work in, if neither of the first
two show anything useful.

--
   Richard Huxton
   Archonet Ltd

Re: How to debug a locked backend ?

From
Andrew Sullivan
Date:
On Fri, Nov 18, 2005 at 05:00:06PM +0100, Csaba Nagy wrote:
>
> Now according to pg_locks, the transaction of the update was holding an
> exclusive lock on the table, which I can't explain, as we don't acquire
> any table lock in our whole application, and the SQL of the update is

On the table, or was it a ROW EXCLUSIVE lock?  That will be
automatically taken by an UPDATE.

> And then, why was I not able to cancel the backend via kill or
> pg_cancel_backend ?

That's a good question.  The return code from kill -2 <pid> (or
whatever SIGINT is on your system) might be useful here.  Also,
attaching to the pid with gdb might help.

A

--
Andrew Sullivan  | ajs@crankycanuck.ca
The plural of anecdote is not data.
        --Roger Brinner

Re: How to debug a locked backend ?

From
Csaba Nagy
Date:
Martijn,

Thanks for the info.
The binary is compiled by myself, with mostly the default options as
checked out from CVS, so it's stripped as much as the default make
options do that...
I already checked the pg_locks view, with some results documented in my
post about the first time this kind of problem happend:

http://archives.postgresql.org/pgsql-general/2005-11/msg00828.php

I can say it was not a deadlock, at least not detected by postgres.
I've used strace before occasionally, so that one is not a stranger for
me, just didn't cross my mind at all and needed to act fairly quickly.
Next time I'll use it.

Thanks,
Csaba.

On Fri, 2005-11-18 at 18:58, Martijn van Oosterhout wrote:
> On Fri, Nov 18, 2005 at 03:55:01PM +0100, Csaba Nagy wrote:
> > Richar, Martijn,
> >
> > Thanks for answering, but I had to kill the process in the meantime. I
> > tried kill -11 in the hope it will produce a core dump at least, but it
> > either didn't dump core or I don't know where to look for it as I can't
> > find it.
>
> <snip>
>
> > So, in order to find out what's going on, what should I do if it happens
> > again ? Use gdb, and do what ?
> > Strace is a good idea, I'll do that too if there is a next time.
>
> The most useful thing is to type "bt" to get a backtrace. Unfortunatly,
> if you've got a stripped binary the output will be totally unreadable.
> But we can hope.
>
> strace tells you the system calls it's doing. If it's stuck in a loop
> in userspace this won't print much useful, but that fact is useful in
> itself. Finally there is ltrace which lists library calls. Somewhat
> more invasive, but can sometimes produce something where strace
> doesn't.
>
> There are tables for the current locks, but you'll need to look in the
> docs for that.
>
> Have a nice day,


Re: How to debug a locked backend ?

From
Martijn van Oosterhout
Date:
On Fri, Nov 18, 2005 at 03:55:01PM +0100, Csaba Nagy wrote:
> Richar, Martijn,
>
> Thanks for answering, but I had to kill the process in the meantime. I
> tried kill -11 in the hope it will produce a core dump at least, but it
> either didn't dump core or I don't know where to look for it as I can't
> find it.

<snip>

> So, in order to find out what's going on, what should I do if it happens
> again ? Use gdb, and do what ?
> Strace is a good idea, I'll do that too if there is a next time.

The most useful thing is to type "bt" to get a backtrace. Unfortunatly,
if you've got a stripped binary the output will be totally unreadable.
But we can hope.

strace tells you the system calls it's doing. If it's stuck in a loop
in userspace this won't print much useful, but that fact is useful in
itself. Finally there is ltrace which lists library calls. Somewhat
more invasive, but can sometimes produce something where strace
doesn't.

There are tables for the current locks, but you'll need to look in the
docs for that.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment

Re: How to debug a locked backend ?

From
Martijn van Oosterhout
Date:
On Fri, Nov 18, 2005 at 05:00:06PM +0100, Csaba Nagy wrote:
> This current situation is different, the machine was almost idle, and
> there was 1 UPDATE on a table running, a lot of inserts on the same
> table waiting for a lock held by the transaction of the UPDATE, and a
> VACUUM (started by autovacuum) running on the same table. There was no
> dead-lock detected by the system, and I guess it should have detected it
> if it was one, cause this whole thing was running for more than 3 hours
> according to pg_stat_activity, and I set the deadlock timeout to 2
> seconds.

So what is the UPDATE doing? What is the query (see pg_stat_activity)
doing? Is it updating a lot of rows? If the query does run for a long
time holding any kind of lock, you're going to get strange effects like
this.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment

Re: How to debug a locked backend ?

From
Csaba Nagy
Date:
[snip]
> So what is the UPDATE doing? What is the query (see pg_stat_activity)
> doing? Is it updating a lot of rows? If the query does run for a long
> time holding any kind of lock, you're going to get strange effects like
> this.

The UPDATE is processing ~ 100 rows in a transaction (I'm not sure,
depends on how the application is set up, and it is a parameter in the
query so pg_stat or postgres logs will not show that). The rows are
previously locked by a SELECT ... FOR UPDATE. I can't give you the
actual queries, and I don't have time now to set up a similar case. I
plan to create a test case which does similar processing and let it run
on a test server maybe it can produce the lock.

In any case, I debugged through the process, and there's no exclusive
lock placed by it on any table (checked pg_locks after each executed
step). This is a straightforward processing, without if-branches, so I'm
100% sure it's always executing the same queries. So I'm puzzled by what
could have caused the ExclusiveLock I observed (based on the pg_locks
view during the lockup) on the updated table. You can see it in the
attachment to my post (look for "execute locks(5239)"):

http://archives.postgresql.org/pgsql-general/2005-11/msg00828.php

I have enough information now to know how to get a stack trace, so I'm
prepared for the next time... though I'd prefer there wouldn't be one
;-)

Thanks,
Csaba.