Thread: How to debug a locked backend?
Hi all,

I have right now a locked backend, similar to what I had a few days ago.
It won't respond to pg_cancel_backend(). Is there a way to diagnose what it
is actually doing?

If you answer, please do it quickly, as in max 15 minutes I'll kill -9 it...

Thanks,
Csaba.
Attach with gdb:

    gdb -p [pid]

You may need to specify the executable on some platforms.

On Fri, Nov 18, 2005 at 12:13:00PM +0100, Csaba Nagy wrote:
> Hi all,
>
> I have right now a locked backend, similar to what I had a few days ago.
> It won't respond to pg_cancel_backend(). Is there a way to diagnose what
> it is actually doing?
>
> If you answer, please do it quickly, as in max 15 minutes I'll kill -9 it...
>
> Thanks,
> Csaba.

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
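To make the gdb suggestion concrete, here is a minimal sketch of such a
session; the pid (12345) and the path to the postgres binary are made up and
depend on the installation:

    # find the pid of the stuck backend, e.g. via ps or pg_stat_activity
    ps auxww | grep postgres

    # attach; naming the executable explicitly helps gdb find the symbols
    gdb /usr/local/pgsql/bin/postgres 12345

    # then, at the (gdb) prompt:
    (gdb) bt        # print the stack trace - this is what the list will ask for
    (gdb) detach    # release the backend so it keeps running
    (gdb) quit

Attaching only pauses the backend while gdb is connected; after "detach" it
continues as before.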
First of all: "TIP 2: Don't 'kill -9' the postmaster"

On topic: what do you mean by "locked"? Is it using 100% CPU? Does it give
any error messages, and if so, what are they? If there is a deadlock
situation, the server should detect it and will then cancel one of the
queries.

Yours,
Aarjan

----- Original Message -----
From: "Csaba Nagy" <nagy@ecircle-ag.com>
To: "Postgres general mailing list" <pgsql-general@postgresql.org>
Sent: Friday, November 18, 2005 12:13 PM
Subject: [GENERAL] How to debug a locked backend?

> Hi all,
>
> I have right now a locked backend, similar to what I had a few days ago.
> It won't respond to pg_cancel_backend(). Is there a way to diagnose what
> it is actually doing?
>
> If you answer, please do it quickly, as in max 15 minutes I'll kill -9 it...
>
> Thanks,
> Csaba.
Csaba Nagy wrote:
> Hi all,
>
> I have right now a locked backend, similar to what I had a few days ago.
> It won't respond to pg_cancel_backend(). Is there a way to diagnose what
> it is actually doing?
>
> If you answer, please do it quickly, as in max 15 minutes I'll kill -9 it...

Have you tried "strace -p <pid>" to see what calls it's making?

--
Richard Huxton
Archonet Ltd
Csaba Nagy <nagy@ecircle-ag.com> writes:
> I have right now a locked backend, similar to what I had a few days ago.
> It won't respond to pg_cancel_backend(). Is there a way to diagnose what
> it is actually doing?
> If you answer, please do it quickly, as in max 15 minutes I'll kill -9 it...

This post was pretty much a complete waste of bandwidth then, seeing that
pgsql-general never has a turnaround time as short as 15 minutes (it seems
to be more like 2 hours today :-().

However, if it happens again, please see what you can learn by attaching to
the stuck backend with gdb and getting a stack trace.

			regards, tom lane
Richard, Martijn,

Thanks for answering, but I had to kill the process in the meantime. I tried
kill -11 in the hope it would produce a core dump at least, but either it
didn't dump core or I don't know where to look for it, as I can't find it.

In any case, this is the second time I've experienced such a lock-up this
week, so I will definitely need to find out what's going on. I would exclude
hardware failure, as it happened with exactly the same process, involving
exactly the same queries/table and the same failure symptoms, which is not
characteristic of hardware failures (those should be more random).

So, in order to find out what's going on, what should I do if it happens
again? Use gdb, and do what? Strace is a good idea, I'll do that too if
there is a next time.

In any case, this time I have more details of the locks and what was still
running. I'll try to make a test case out of it; maybe I'll catch the error
on a development machine...

Cheers,
Csaba.

On Fri, 2005-11-18 at 14:23, Richard Huxton wrote:
> Csaba Nagy wrote:
> > Hi all,
> >
> > I have right now a locked backend, similar to what I had a few days ago.
> > It won't respond to pg_cancel_backend(). Is there a way to diagnose what
> > it is actually doing?
> >
> > If you answer, please do it quickly, as in max 15 minutes I'll kill -9 it...
>
> Have you tried "strace -p <pid>" to see what calls it's making?
>
> --
> Richard Huxton
> Archonet Ltd
> Well, I've had time to read your previous message too.
>
> The first time you seem to imply the machine slowed down across all
> processes - ssh etc. Was that the case this time?

OK, the slowdown/crash was a different problem, which might have been caused
by a "too many open files" problem combined with a huge load. There was no
evidence that the server actually crashed, but we needed to reboot it because
it became completely unresponsive.

My current problem is this:
http://archives.postgresql.org/pgsql-general/2005-11/msg00828.php

This current situation is different: the machine was almost idle, and there
was 1 UPDATE running on a table, a lot of INSERTs on the same table waiting
for a lock held by the UPDATE's transaction, and a VACUUM (started by
autovacuum) running on the same table. There was no deadlock detected by the
system, and I guess it should have detected one if there was, because this
whole thing was running for more than 3 hours according to pg_stat_activity,
and I set the deadlock timeout to 2 seconds.

So this happened 2 days ago, and today it happened again with exactly the
same picture: update, inserts, (auto)vacuum, on exactly the same table as 2
days ago. That can't be a coincidence; there is something bad either with our
code or with postgres. I do suspect that our code is doing something not
completely kosher, but it still should not lock up in this way.

Now according to pg_locks, the transaction of the update was holding an
exclusive lock on the table, which I can't explain, as we don't acquire any
table lock in our whole application, and the SQL of the update is coming from
our application - I could identify the process it's part of. The only
explicit locking we do in that process is via a SELECT ... FOR UPDATE on that
table, but that should place no exclusive lock on the table, right? So where
is that lock coming from? And then, why was I not able to cancel the backend
via kill or pg_cancel_backend?

> When you say "locked" do you mean it was waiting on locks, was using all
> the CPU, unresponsive or just taking the query a long time?
>
> To prepare for next time I'd:
> 1. Leave ssh logged-in, run screen to get three sessions
> 2. Leave "top" running in the first - that'll show you process
>    activity/general load
> 3. Run "vmstat 10" in the second - that'll show you overall
>    memory/swap/disk/cpu usage.
> 4. The third session is then free to work in, if neither of the first
>    two show anything useful.

In that particular case I had a few terminals open, one of them actually
running top and others free. They were not responding either, so the machine
was completely bogged down. I do routinely have a few terminals open, a psql
one included. But that kind of lock-down did not happen any more, and we
changed server in the meantime.

Cheers,
Csaba.
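When this recurs, a query along the following lines shows in one place which
backend holds which lock on the table, which ones are waiting, and what each
backend is running. This is only a sketch: the database name is invented,
and the column names (procpid, current_query) are the ones used by the 8.x
releases of that era:

    # run from a spare shell/psql session while the backends are stuck;
    # current_query needs stats_command_string = on in postgresql.conf
    psql -d mydb -c "
      SELECT l.pid, c.relname, l.mode, l.granted, a.current_query
        FROM pg_locks l
        LEFT JOIN pg_class c ON c.oid = l.relation
        JOIN pg_stat_activity a ON a.procpid = l.pid
       ORDER BY c.relname, l.granted;
    "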
Hmm, sometimes I get my emails immediately on the list, so it's not always
that bad - I had to try.

I have absolutely no experience using gdb... lots of experience debugging
Java, but that's a different animal. I also don't have too much time right
now for learning and experimenting with gdb, so I would appreciate some hints
about specific manual pages to check out on how to do the connecting/stack
trace.

Thanks,
Csaba.

BTW, Tom, my mails are always bounced by your system because of a reverse DNS
mixup on my side. I brought this to our admins' attention, but it seems they
won't fix it...

On Fri, 2005-11-18 at 16:10, Tom Lane wrote:
> Csaba Nagy <nagy@ecircle-ag.com> writes:
> > I have right now a locked backend, similar to what I had a few days ago.
> > It won't respond to pg_cancel_backend(). Is there a way to diagnose what
> > it is actually doing?
> > If you answer, please do it quickly, as in max 15 minutes I'll kill -9 it...
>
> This post was pretty much a complete waste of bandwidth then, seeing
> that pgsql-general never has a turnaround time as short as 15 minutes (it
> seems to be more like 2 hours today :-().
>
> However, if it happens again, please see what you can learn by attaching
> to the stuck backend with gdb and getting a stack trace.
>
> 			regards, tom lane
Csaba Nagy wrote:
> Richard, Martijn,
>
> Thanks for answering, but I had to kill the process in the meantime. I
> tried kill -11 in the hope it would produce a core dump at least, but
> either it didn't dump core or I don't know where to look for it, as I
> can't find it.
>
> In any case, this is the second time I've experienced such a lock-up this
> week, so I will definitely need to find out what's going on.
>
> I would exclude hardware failure, as it happened with exactly the same
> process, involving exactly the same queries/table and the same failure
> symptoms, which is not characteristic of hardware failures (those should
> be more random).
>
> So, in order to find out what's going on, what should I do if it happens
> again? Use gdb, and do what?
> Strace is a good idea, I'll do that too if there is a next time.

Well, I've had time to read your previous message too.

The first time you seem to imply the machine slowed down across all
processes - ssh etc. Was that the case this time?

When you say "locked" do you mean it was waiting on locks, was using all the
CPU, unresponsive, or just taking the query a long time?

To prepare for next time I'd:
1. Leave ssh logged in, run screen to get three sessions.
2. Leave "top" running in the first - that'll show you process
   activity/general load.
3. Run "vmstat 10" in the second - that'll show you overall
   memory/swap/disk/cpu usage.
4. The third session is then free to work in, if neither of the first two
   shows anything useful.

--
Richard Huxton
Archonet Ltd
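A rough sketch of that setup (the session name is arbitrary); screen's
default key bindings are Ctrl-a c to open a new window and Ctrl-a n / Ctrl-a
p to switch between them:

    # keep a monitoring session ready before the problem recurs
    screen -S pgwatch      # named session; reattach later with: screen -r pgwatch

    # window 1:
    top                    # per-process CPU and memory

    # window 2 (Ctrl-a c to create it):
    vmstat 10              # overall memory/swap/disk/cpu, sampled every 10 seconds

    # window 3: left free for psql, gdb, strace, ...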
On Fri, Nov 18, 2005 at 05:00:06PM +0100, Csaba Nagy wrote:
> Now according to pg_locks, the transaction of the update was holding an
> exclusive lock on the table, which I can't explain, as we don't acquire
> any table lock in our whole application, and the SQL of the update is

On the table, or was it a ROW EXCLUSIVE lock? That will be automatically
taken by an UPDATE.

> And then, why was I not able to cancel the backend via kill or
> pg_cancel_backend?

That's a good question. The return code from kill -2 <pid> (or whatever
SIGINT is on your system) might be useful here. Also, attaching to the pid
with gdb might help.

A

--
Andrew Sullivan | ajs@crankycanuck.ca
The plural of anecdote is not data.
	--Roger Brinner
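For reference, a way to check that return code next time (pid made up).
pg_cancel_backend() just sends SIGINT, and a backend only acts on it at its
next interrupt check, so successful delivery does not guarantee the query
actually gets cancelled:

    kill -INT 12345        # same signal pg_cancel_backend() sends
    echo $?                # 0 = signal delivered; non-zero = no such pid or no permission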
Martijn,

Thanks for the info. The binary is compiled by myself, with mostly the
default options, as checked out from CVS, so it's only as stripped as the
default make options make it...

I already checked the pg_locks view, with some results documented in my post
about the first time this kind of problem happened:
http://archives.postgresql.org/pgsql-general/2005-11/msg00828.php
I can say it was not a deadlock, at least not one detected by postgres.

I've used strace occasionally before, so that one is not a stranger to me; it
just didn't cross my mind at all and I needed to act fairly quickly. Next
time I'll use it.

Thanks,
Csaba.

On Fri, 2005-11-18 at 18:58, Martijn van Oosterhout wrote:
> On Fri, Nov 18, 2005 at 03:55:01PM +0100, Csaba Nagy wrote:
> > Richard, Martijn,
> >
> > Thanks for answering, but I had to kill the process in the meantime. I
> > tried kill -11 in the hope it would produce a core dump at least, but
> > either it didn't dump core or I don't know where to look for it, as I
> > can't find it.
>
> <snip>
>
> > So, in order to find out what's going on, what should I do if it happens
> > again? Use gdb, and do what?
> > Strace is a good idea, I'll do that too if there is a next time.
>
> The most useful thing is to type "bt" to get a backtrace. Unfortunately,
> if you've got a stripped binary the output will be totally unreadable.
> But we can hope.
>
> strace tells you the system calls it's doing. If it's stuck in a loop
> in userspace this won't print much useful, but that fact is useful in
> itself. Finally there is ltrace, which lists library calls. Somewhat
> more invasive, but it can sometimes produce something where strace
> doesn't.
>
> There are system views for the current locks, but you'll need to look
> in the docs for that.
>
> Have a nice day,
On Fri, Nov 18, 2005 at 03:55:01PM +0100, Csaba Nagy wrote:
> Richard, Martijn,
>
> Thanks for answering, but I had to kill the process in the meantime. I
> tried kill -11 in the hope it would produce a core dump at least, but
> either it didn't dump core or I don't know where to look for it, as I
> can't find it.

<snip>

> So, in order to find out what's going on, what should I do if it happens
> again? Use gdb, and do what?
> Strace is a good idea, I'll do that too if there is a next time.

The most useful thing is to type "bt" to get a backtrace. Unfortunately, if
you've got a stripped binary the output will be totally unreadable. But we
can hope.

strace tells you the system calls it's doing. If it's stuck in a loop in
userspace this won't print much useful, but that fact is useful in itself.
Finally there is ltrace, which lists library calls. Somewhat more invasive,
but it can sometimes produce something where strace doesn't.

There are system views for the current locks, but you'll need to look in the
docs for that.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
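A couple of example invocations against a stuck backend (pid made up); the
timestamps make it easier to see whether the process is cycling through calls
or sitting in one:

    # system calls, with timestamps; Ctrl-C detaches and the backend keeps running
    strace -tt -p 12345

    # if strace shows nothing, the loop is in userspace; library calls may still show:
    ltrace -p 12345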
On Fri, Nov 18, 2005 at 05:00:06PM +0100, Csaba Nagy wrote:
> This current situation is different: the machine was almost idle, and
> there was 1 UPDATE running on a table, a lot of INSERTs on the same table
> waiting for a lock held by the UPDATE's transaction, and a VACUUM (started
> by autovacuum) running on the same table. There was no deadlock detected
> by the system, and I guess it should have detected one if there was,
> because this whole thing was running for more than 3 hours according to
> pg_stat_activity, and I set the deadlock timeout to 2 seconds.

So what is the UPDATE doing? What is the query (see pg_stat_activity) doing?
Is it updating a lot of rows? If the query does run for a long time holding
any kind of lock, you're going to get strange effects like this.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
[snip]
> So what is the UPDATE doing? What is the query (see pg_stat_activity)
> doing? Is it updating a lot of rows? If the query does run for a long
> time holding any kind of lock, you're going to get strange effects like
> this.

The UPDATE is processing ~ 100 rows in a transaction (I'm not sure, it
depends on how the application is set up, and it is a parameter in the query,
so pg_stat_activity or the postgres logs will not show that). The rows are
previously locked by a SELECT ... FOR UPDATE. I can't give you the actual
queries, and I don't have time now to set up a similar case. I plan to create
a test case which does similar processing and let it run on a test server;
maybe it can reproduce the lock.

In any case, I debugged through the process, and there's no exclusive lock
placed by it on any table (I checked pg_locks after each executed step). This
is straightforward processing, without if-branches, so I'm 100% sure it's
always executing the same queries. So I'm puzzled by what could have caused
the ExclusiveLock I observed (based on the pg_locks view during the lockup)
on the updated table. You can see it in the attachment to my post (look for
"execute locks(5239)"):
http://archives.postgresql.org/pgsql-general/2005-11/msg00828.php

I have enough information now to know how to get a stack trace, so I'm
prepared for the next time... though I'd prefer there wasn't one ;-)

Thanks,
Csaba.
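For comparison, a minimal sketch of that SELECT ... FOR UPDATE / UPDATE
pattern, with invented table and column names. At the table level it normally
takes only RowShareLock and RowExclusiveLock, not a table-level
ExclusiveLock. (Worth ruling out: every transaction also holds an
ExclusiveLock on its own transaction ID in pg_locks - a row with a NULL
relation - which is normal and easy to mistake for a table lock.)

    # session 1: mimic the application's pattern (names are made up)
    psql -d mydb
      mydb=# BEGIN;
      mydb=# SELECT id FROM queue_items WHERE batch_id = 42 FOR UPDATE;   -- RowShareLock on the table
      mydb=# UPDATE queue_items SET status = 'done' WHERE batch_id = 42;  -- RowExclusiveLock on the table
      -- leave the transaction open for a moment

    # session 2: see what session 1 is actually holding
    psql -d mydb -c "SELECT relation, mode, granted FROM pg_locks WHERE pid <> pg_backend_pid();"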