[HACKERS] Active zombies at AIX - Mailing list pgsql-hackers

From Konstantin Knizhnik
Subject [HACKERS] Active zombies at AIX
Date
Msg-id 06f4d085-e2a5-83a7-919a-cb5a878f9e42@postgrespro.ru
Responses Re: [HACKERS] Active zombies at AIX  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: [HACKERS] Active zombies at AIX  (Konstantin Knizhnik <k.knizhnik@postgrespro.ru>)
List pgsql-hackers
Hi hackers,

Yet another story about AIX. For some reason, AIX is very slow to clean up zombie processes.
If we launch pgbench with the -C parameter, the limit on the maximum number of connections is exhausted very quickly.
With the maximum number of connections set to 1000, after ten seconds of pgbench activity we get about 900 zombie processes, and it takes about 100 seconds (!)
before all of them are terminated.

proctree shows a lot of defunct processes:

[14:44:41]root@postgres:~ # proctree 26084446
26084446 /opt/postgresql/xlc/9.6/bin/postgres -D /postg_fs/postgresql/xlc
4784362 <defunct>
4980786 <defunct>
11403448 <defunct>
11468930 <defunct>
11993176 <defunct>
12189710 <defunct>
12517390 <defunct>
13238374 <defunct>
13565974 <defunct>
13893826 postgres: wal writer process
14024716 <defunct>
15401000 <defunct>
...
25691556 <defunct>

But ps shows that the status of the process is <exiting>:

[14:46:02]root@postgres:~ # ps -elk | grep 25691556

  • A - 25691556 - - - - - <exiting>
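
To measure how quickly the zombies drain, the number of <defunct> entries in saved proctree output can be counted at intervals. The following is only a minimal sketch, assuming the output has the same "<pid> <defunct>" line shape as the listing above; the function name count_defunct and the shortened sample text are made up for illustration.

```python
import re

# Matches a line consisting only of a PID followed by "<defunct>".
DEFUNCT_LINE = re.compile(r"^\s*\d+\s+<defunct>\s*$")


def count_defunct(proctree_output):
    """Count '<pid> <defunct>' lines in proctree-style text."""
    return sum(
        1 for line in proctree_output.splitlines() if DEFUNCT_LINE.match(line)
    )


# Shortened sample in the same shape as the listing above (illustrative only).
sample = """\
26084446 /opt/postgresql/xlc/9.6/bin/postgres -D /postg_fs/postgresql/xlc
4784362 <defunct>
4980786 <defunct>
13893826 postgres: wal writer process
14024716 <defunct>
"""

if __name__ == "__main__":
    print(count_defunct(sample))  # prints 3
```

Running this against snapshots taken a few seconds apart would give a rough drain rate; for the figures reported here (about 900 zombies gone in about 100 seconds) that works out to roughly 9 processes per second.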

A breakpoint set in the reaper() function in the postmaster shows that each invocation of this function (called by the SIGCHLD handler) processes 5-10 PIDs.
So there are two hypotheses: either AIX is very slow to deliver SIGCHLD to the parent, or process exit takes too much time.
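
For readers unfamiliar with the pattern, a SIGCHLD-driven reaper typically loops over a non-blocking waitpid() and collects every child that has finished exiting so far, which is why one invocation can handle several PIDs. The sketch below is not the postmaster's code (which is C); it is a small Python illustration, assuming a POSIX system with fork(), and the names reap_available and spawn_and_reap are hypothetical. A child that is still in the middle of exiting is not yet returned by waitpid(), so slow process exit shows up here as small or empty batches.

```python
import os
import time


def reap_available():
    """Collect every child that has already exited, without blocking."""
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children left
        if pid == 0:
            break  # children exist, but none is ready to be reaped yet
        reaped.append(pid)
    return reaped


def spawn_and_reap(n, timeout=10.0):
    """Fork n children that exit at once; return the size of each reaped batch."""
    children = set()
    for _ in range(n):
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child: exit immediately, becoming a zombie until reaped
        children.add(pid)

    batches = []
    deadline = time.monotonic() + timeout
    while children and time.monotonic() < deadline:
        mine = [p for p in reap_available() if p in children]
        if mine:
            batches.append(len(mine))
            children.difference_update(mine)
        else:
            time.sleep(0.01)  # nothing ready yet; poll again shortly
    return batches


if __name__ == "__main__":
    # Batch sizes vary from run to run and from system to system.
    print(spawn_and_reap(20))
```

Comparing batch sizes and total drain time from such a test on AIX and on another system might help separate slow signal delivery from slow process exit, though it exercises far less state than a real backend does.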

The fact that the backends are in the exiting state makes the second hypothesis more plausible.
We have tried different Postgres configurations: with local and TCP sockets, with different amounts of shared buffers, and built with both gcc and xlc.
In all cases the behavior is similar: zombies do not want to die.
Since it is not possible to attach a debugger to a defunct process, it is not clear how to find out what is going on.

I wonder if somebody has encountered similar problems on AIX and can maybe suggest a solution.
Thanks in advance
-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
