Thread: postgres (zombie)

postgres (zombie)

From: Constantin Teodorescu

I am sending this mail to the hackers list because the bugs list seems
to have disappeared!

Something strange has started happening on my PostgreSQL server:

Linux RedHat 5.2, i386 Pentium machine, 64 MB RAM
PostgreSQL 6.4.2 official release

There are at most 6 users working simultaneously, and not very hard, on
a database that isn't big (2 MB dumped).
The clients are Tcl/Tk programs. 3 clients access the server from a
local network; 3 or 4 clients access the server over a 115 kb serial
line through a CISCO.

Until now everything went OK, but a few times in the last few days I
found some postgres (<zombie>) processes, and every time a client logs
out, another postgres <zombie> process appears. I had to kill -SIGTERM
the master, wait 5 or 6 seconds, and then restart it.

Once a postgres <zombie> process has appeared, the clients that are
already connected can keep working with no problem at all, but new
connections aren't accepted.

=======
I am not sure, but I think the serial line sometimes breaks and the
client-server communication suffers short interruptions.
Could these problems hang PostgreSQL so badly?

Constantin Teodorescu
FLEX Consulting Braila, ROMANIA


Re: [HACKERS] postgres (zombie)

From: Tom Lane

Constantin Teodorescu <teo@flex.ro> writes:
> Until now everything went OK, but a few times in the last few days I
> found some postgres (<zombie>) processes, and every time a client logs
> out, another postgres <zombie> process appears. I had to kill -SIGTERM
> the master, wait 5 or 6 seconds, and then restart it.
> Once a postgres <zombie> process has appeared, the clients that are
> already connected can keep working with no problem at all, but new
> connections aren't accepted.

This sounds like the postmaster process has gotten hung up somehow ---
it's not responding to incoming connection requests, nor is it noticing
SIGCHLD (signal that one of its child processes exited --- the zombies
are there because the postmaster hasn't done a wait() to reap them).
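
For readers unfamiliar with reaping, here is a minimal sketch in Python
of what "doing a wait()" means. It is an illustration only, not the
postmaster's actual code; the function names reap_exited_children and
spawn_and_reap are made up for this example. A parent forks a few
short-lived children; each one stays a zombie from the moment it exits
until the parent collects its status with waitpid().

```python
import os
import time


def reap_exited_children():
    """Collect every child that has already exited, without blocking."""
    reaped = []
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children left at all
        if pid == 0:
            break  # children exist, but none has exited yet
        # Once waitpid() returns the pid, its zombie entry is gone.
        reaped.append((pid, os.WEXITSTATUS(status)))
    return reaped


def spawn_and_reap(count=3, timeout=5.0):
    """Fork `count` children that exit at once, then reap them all."""
    pids = []
    for i in range(count):
        pid = os.fork()
        if pid == 0:
            os._exit(i)  # child: exit immediately with a distinct code
        pids.append(pid)
    collected = {}
    deadline = time.monotonic() + timeout
    # Until each child is reaped here, it shows up as a zombie.
    while len(collected) < count and time.monotonic() < deadline:
        collected.update(reap_exited_children())
        time.sleep(0.01)
    return pids, collected
```

A parent that never reaches the waitpid() loop, for example because it
is stuck elsewhere, leaves every exited child behind as a zombie, which
matches the symptom reported above.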

I've never seen this myself, but it sure sounds like a bug.

Next time you see the condition, would you kill the postmaster with a
signal that will produce a coredump (SIGABRT or SIGSEGV should work)
and extract a backtrace from the core file?  That will give us more
to go on.  Note it will help if you've compiled the backend with -g ...
and don't throw away the corefile, we may need to ask more questions.
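
As a rough sketch of the signalling step, assuming a Unix system: the
Python below sends SIGABRT to a stand-in child process rather than to a
real postmaster. The helper names (set_core_limit, abort_for_core,
demo) are hypothetical. Whether a core file is actually written depends
on the core-size resource limit of the target process, so the sketch
shows how that limit can be set when the process is started; in the
demo it is left at 0 so that no file is produced.

```python
import os
import resource
import signal
import subprocess
import sys


def set_core_limit(enable):
    """Return a function that sets the soft core-file limit in a child."""
    def apply():
        soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
        # Soft limit 0 suppresses core files; the hard limit is the maximum.
        resource.setrlimit(resource.RLIMIT_CORE, (hard if enable else 0, hard))
    return apply


def abort_for_core(pid):
    """Send SIGABRT; its default action terminates and may dump core."""
    os.kill(pid, signal.SIGABRT)


def demo(enable_core=False):
    """Abort a stand-in child process and return its exit status."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"],
        preexec_fn=set_core_limit(enable_core),
    )
    abort_for_core(child.pid)
    # A negative return code -N means the child was killed by signal N.
    return child.wait(timeout=10)
```

The backtrace itself would then typically come from loading the
executable and the core file into gdb and issuing its "bt" command; the
exact paths depend on the installation.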
        regards, tom lane