Thread: postgres process got stuck in "notify interrupt waiting" status
Hi.

We use LISTEN/NOTIFY quite a bit, but today something unusual (bad) happened. The number of processes waiting for a lock just kept going up.

I finally found that the object being locked was pg_listener, which RhodiumToad on IRC kindly informed me is locked during LISTEN/NOTIFY. The process that held the lock (in pg_locks it had granted = t) was shown by ps in status "notify interrupt waiting" and had held the lock for over half an hour. (Usually these notifications are very quick.)

The process would not respond to kill, so I kill -9'ed it.

The only reference I could find to a similar problem was at http://archives.postgresql.org/pgsql-performance/2008-02/msg00345.php which seemed to indicate a process should not be in this state for very long.

We are on postgres 8.4.12.

I'd like to figure out what happened.

There is a web server that talks to this database server (amongst other clients), and the client addr and port mapped to this web server, but there was no process on the web server matching the port number. That's when I decided to kill the postgres process.

Anything I should know or read up on? Any suggestions?

I'd like the system to be able to recover, and for the process to terminate if the client is no longer around.

Best,
Aleksey
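For readers who want to repeat the pg_locks check described above, here is a minimal sketch in Python. It is not taken from the thread: the query text, the function names, and the sample rows (pids and lock modes) are all assumptions made for illustration. The idea is only to group lock rows by relation into holders (granted = t) and waiters (granted = f), so that a relation such as pg_listener with one holder and a growing list of waiters stands out.

```python
from collections import defaultdict

# Illustrative query (an assumption, not from the thread). It joins pg_locks
# to pg_class so that each relation lock is reported by relation name.
LOCK_QUERY = """
SELECT c.relname, l.pid, l.mode, l.granted
FROM pg_locks l
JOIN pg_class c ON c.oid = l.relation
WHERE l.locktype = 'relation'
"""


def summarize_locks(rows):
    """Group lock rows by relation into holder pids and waiter pids."""
    summary = defaultdict(lambda: {"holders": [], "waiters": []})
    for row in rows:
        bucket = "holders" if row["granted"] else "waiters"
        summary[row["relname"]][bucket].append(row["pid"])
    return dict(summary)


def contended(summary):
    """Return relations with at least one waiter, most waiters first."""
    names = [name for name, s in summary.items() if s["waiters"]]
    return sorted(names, key=lambda n: len(summary[n]["waiters"]), reverse=True)


# Made-up rows shaped like the query result above (hypothetical pids/modes).
sample_rows = [
    {"relname": "pg_listener", "pid": 4242, "mode": "ExclusiveLock", "granted": True},
    {"relname": "pg_listener", "pid": 4301, "mode": "ExclusiveLock", "granted": False},
    {"relname": "pg_listener", "pid": 4302, "mode": "ExclusiveLock", "granted": False},
    {"relname": "pg_class", "pid": 4500, "mode": "AccessShareLock", "granted": True},
]

summary = summarize_locks(sample_rows)
hot_relations = contended(summary)
```

With real data, the rows would come from running something like LOCK_QUERY through whatever database driver is in use; the holder pid can then be matched against ps output, as was done here.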
BTW, after I signalled TERM, the process status changed from

notify interrupt waiting

to

notify interrupt waiting waiting

which I thought looked kind of odd.

Then I signalled KILL.

Aleksey
On 09/04/12 7:09 PM, Aleksey Tsalolikhin wrote:
> BTW, after I signalled TERM, the process status changed from
>
> notify interrupt waiting
>
> to
>
> notify interrupt waiting waiting
>
> which I thought looked kind of odd.
>
> Then I signalled KILL.

was this a client process or a postgres process? kill -9 on postgres processes can easily trigger data corruption.

--
john r pierce N 37, W 122
santa cruz ca mid-left coast
On Tue, Sep 4, 2012 at 7:21 PM, John R Pierce <pierce@hogranch.com> wrote:
> was this a client process or a postgres process? kill -9 on postgres
> processes can easily trigger data corruption.

This was a postgres process. I certainly won't signal KILL to postgres processes anymore; thanks for that warning, John.

Aleksey
John R Pierce wrote:
> was this a client process or a postgres process? kill -9 on postgres
> processes can easily trigger data corruption.

It definitely shouldn't cause data corruption, otherwise PostgreSQL would not be crash safe.

Yours,
Laurenz Albe
On 09/05/2012 12:21 PM, John R Pierce wrote:
> was this a client process or a postgres process? kill -9 on postgres
> processes can easily trigger data corruption.

It certainly shouldn't.

kill -9 of the postmaster, deletion of postmaster.pid, and re-starting postgresql *might* but AFAIK even then you'll have to bypass the shared memory lockout (unless you're on Windows).

--
Craig Ringer
Craig Ringer <ringerc@ringerc.id.au> writes:
> On 09/05/2012 12:21 PM, John R Pierce wrote:
>> was this a client process or a postgres process? kill -9 on postgres
>> processes can easily trigger data corruption.
>
> It certainly shouldn't.
>
> kill -9 of the postmaster, deletion of postmaster.pid, and re-starting
> postgresql *might* but AFAIK even then you'll have to bypass the shared
> memory lockout (unless you're on Windows).

Correction on that: manually deleting postmaster.pid *does* bypass the shared memory lock. If there are still any live backends from the old postmaster, you can get corruption as a result of this, because the old backends and the new ones will be modifying the database independently. This is why we recommend that you never delete postmaster.pid manually, and certainly not as part of an automatic startup script.

Having said that, a kill -9 on an individual backend (*not* the postmaster) should be safe enough, if you don't mind the fact that it'll kill all your other sessions too.

regards, tom lane
On Wed, Sep 5, 2012 at 7:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Having said that, a kill -9 on an individual backend (*not* the
> postmaster) should be safe enough, if you don't mind the fact that
> it'll kill all your other sessions too.

Got it, thanks. Why will it kill all your other sessions too? Isn't there a separate backend process for each session?

Best,
-at
Aleksey Tsalolikhin <atsaloli.tech@gmail.com> wrote:
> Why will it kill all your other sessions too? Isn't there a
> separate backend process for each session?

When stopped that abruptly, the process has no chance to clean up its pending state in shared memory. A fresh copy of shared memory is needed, so it is necessary to effectively do an immediate restart on the whole PostgreSQL instance.

-Kevin
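The reaction described here can be modelled with a small toy sketch in Python. It is a simplification under stated assumptions, not PostgreSQL's actual logic: the function names and pids are hypothetical, return codes follow Python's subprocess convention (a negative value -N means "killed by signal N"), and "killed by a signal" is used as the only crash criterion, whereas the real postmaster's rules are more detailed. SIGQUIT is assumed as the signal sent to the surviving backends.

```python
import signal


def is_crash(returncode):
    """Toy rule: a child that was killed by a signal counts as a crash.

    Assumes subprocess-style return codes, where -N means "killed by
    signal N". This is a simplification of what a real supervisor checks.
    """
    return returncode < 0


def react_to_child_exit(backends, dead_pid, returncode):
    """Return the (pid, signal) pairs a supervisor would send after a child exits."""
    if not is_crash(returncode):
        return []  # ordinary exit: other sessions are left alone
    # Crash: the dead child had no chance to clean up shared state, so every
    # other backend is told to quit before a fresh start.
    return [(pid, signal.SIGQUIT) for pid in backends if pid != dead_pid]


backends = [101, 102, 103]  # made-up pids
after_kill9 = react_to_child_exit(backends, 102, -signal.SIGKILL)
after_clean_exit = react_to_child_exit(backends, 102, 0)
```

In this model, a kill -9 on backend 102 results in SIGQUIT for 101 and 103, which is why every other session is dropped, while an ordinary exit of 102 leaves the others untouched.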
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Aleksey Tsalolikhin <atsaloli.tech@gmail.com> wrote:
>> Why will it kill all your other sessions too? Isn't there a
>> separate backend process for each session?
>
> When stopped that abruptly, the process has no chance to clean up
> its pending state in shared memory. A fresh copy of shared memory
> is needed, so it is necessary to effectively do an immediate restart
> on the whole PostgreSQL instance.

Right. On seeing one child die unexpectedly, the postmaster forcibly SIGQUITs all its other children and initiates a crash recovery sequence. The reason for this is exactly that we can't trust the contents of shared memory anymore. An example is that the dying backend may have held some critical lock, which there is no way to release, so that every other session will shortly be stuck anyway.

regards, tom lane
Got it, thanks, Kevin, Tom.

So what about the process that went into "notify interrupt waiting waiting" status after I SIGTERM'ed it? Is the double "waiting" expected?

Aleksey
Aleksey Tsalolikhin <atsaloli.tech@gmail.com> writes:
> So how about that this process that was in "notify interrupt waiting
> waiting" status after I SIGTERM'ed it. Is the double "waiting"
> expected?

That sounded a bit fishy to me too. But unless you can reproduce it in something newer than 8.4.x, nobody's likely to take much of an interest. The LISTEN/NOTIFY infrastructure got completely rewritten in 9.0, so any bugs in the legacy version are probably just going to get benign neglect at this point ... especially if we don't know how to reproduce them.

regards, tom lane
On Wed, Sep 5, 2012 at 10:03 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> That sounded a bit fishy to me too. But unless you can reproduce it in
> something newer than 8.4.x, nobody's likely to take much of an interest.
> The LISTEN/NOTIFY infrastructure got completely rewritten in 9.0, so
> any bugs in the legacy version are probably just going to get benign
> neglect at this point ... especially if we don't know how to reproduce
> them.

Got it, thanks, Tom! Will urge our shop to upgrade to 9.1.

Best,
-at