Thread: postgres backend process hang on "D" state
Hi,
We are running a performance test on PostgreSQL 13.4 on RHEL 8.4. Just after a connection storm (952 new connections arriving within 1 minute), a lot of backends entered "D" (uninterruptible sleep) state, and as more sessions disconnected, the backends did not exit cleanly but instead became "defunct" (zombie) processes. There are no errors in postgresql.log. Right after the connection storm, some pg_cron workers could not be started either. The server is a virtual machine, and there is no I/O hang; in fact, I/O load is very low. Could this be a PostgreSQL bug or abnormal OS behavior?
top - 13:18:02 up 4 days, 6:59, 6 users, load average: 308.68, 307.93, 307.40
Tasks: 1690 total, 1 running, 853 sleeping, 0 stopped, 836 zombie
%Cpu(s): 0.1 us, 0.8 sy, 0.0 ni, 99.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 128657.6 total, 1188.7 free, 52921.5 used, 74547.4 buff/cache
MiB Swap: 3072.0 total, 3066.7 free, 5.3 used. 74757.3 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1070325 postgres 20 0 45.6g 15356 11776 D 0.5 0.0 19:21.37 postgres: xuser xdb 192.x.x.132(33318) BIND
1070328 postgres 20 0 45.6g 16120 11660 D 0.5 0.0 19:21.96 postgres: xuser xdb 192.x.x.121(34372) BIND
1070329 postgres 20 0 45.6g 15380 11872 D 0.5 0.0 19:20.90 postgres: xuser xdb 192.x.x.126(41316) BIND
1070397 postgres 20 0 45.6g 14804 11604 D 0.5 0.0 19:23.57 postgres: xuser xdb 192.x.x.132(33324) BIND
1070434 postgres 20 0 45.6g 14928 11812 D 0.5 0.0 19:21.14 postgres: xuser xdb 192.x.x.129(57298) BIND
1070480 postgres 20 0 45.6g 14612 11660 D 0.5 0.0 19:19.88 postgres: xuser xdb 192.x.x.127(52424) BIND
1070508 postgres 20 0 45.6g 14928 11812 D 0.5 0.0 19:20.48 postgres: xuser xdb 192.x.x.127(52428) BIND
1070523 postgres 20 0 45.6g 14544 11716 D 0.5 0.0 19:22.53 postgres: xuser xdb 192.x.x.130(33678) BIND
1070647 postgres 20 0 45.6g 14444 11660 D 0.5 0.0 19:24.36 postgres: xuser xdb 192.x.x.129(57316) BIND
1070648 postgres 20 0 45.6g 14352 11524 D 0.5 0.0 19:22.86 postgres: xuser xdb 192.x.x.133(48796) BIND
1070676 postgres 20 0 45.6g 14456 11660 D 0.5 0.0 19:21.92 postgres: xuser xdb 192.x.x.128(54614) BIND
1070724 postgres 20 0 45.6g 14352 11524 D 0.5 0.0 19:20.90 postgres: xuser xdb 192.x.x.126(41370) BIND
1070739 postgres 20 0 45.6g 14008 11412 D 0.5 0.0 19:22.69 postgres: xuser xdb 192.x.x.123(56164) BIND
1070786 postgres 20 0 45.6g 14352 11524 D 0.5 0.0 19:22.51 postgres: xuser xdb 192.x.x.121(34428) BIND
1070801 postgres 20 0 45.6g 13240 10688 D 0.5 0.0 19:22.19 postgres: xuser xdb 192.x.x.126(41382) BIND
1070815 postgres 20 0 45.6g 13240 10688 D 0.5 0.0 19:21.36 postgres: xuser xdb 192.x.x.53(55950) BIND
1070830 postgres 20 0 45.6g 13240 10688 D 0.5 0.0 19:23.80 postgres: xuser xdb 192.x.x.131(41704) BIND
1070841 postgres 20 0 45.6g 13304 10744 D 0.5 0.0 19:24.25 postgres: xuser xdb 192.x.x.131(41706) BIND
1070884 postgres 20 0 45.6g 13264 10688 D 0.5 0.0 19:20.61 postgres: xuser xdb 192.x.x.122(33734) BIND
1070903 postgres 20 0 45.6g 14456 11660 D 0.5 0.0 19:23.43 postgres: xuser xdb 192.x.x.132(33384) BIND
1070915 postgres 20 0 45.5g 7280 5372 D 0.5 0.0 19:21.20 postgres: xuser xdb 192.x.x.129(57350) initializing
1070941 postgres 20 0 45.5g 7280 5372 D 0.5 0.0 19:20.58 postgres: xuser xdb 192.x.x.124(35934) initializing
1070944 postgres 20 0 45.5g 7280 5372 D 0.5 0.0 19:18.57 postgres: xuser xdb 192.x.x.50(58964) initializing
1070963 postgres 20 0 45.5g 7280 5372 D 0.5 0.0 19:21.98 postgres: xuser xdb 192.x.x.132(33362) initializing
1070974 postgres 20 0 45.5g 7280 5372 D 0.5 0.0 19:22.76 postgres: xuser xdb 192.x.x.54(56774) initializing
1070986 postgres 20 0 45.5g 7284 5372 D 0.5 0.0 19:21.89 postgres: xuser xdb 192.x.x.132(33394) initializing
…
postgres 1071160 1951 0 May27 ? 00:00:02 [postmaster] <defunct>
postgres 1071161 1951 0 May27 ? 00:00:02 [postmaster] <defunct>
postgres 1071162 1951 0 May27 ? 00:00:02 [postmaster] <defunct>
postgres 1071163 1951 0 May27 ? 00:00:02 [postmaster] <defunct>
postgres 1071164 1951 0 May27 ? 00:00:03 [postmaster] <defunct>
postgres 1071167 1951 0 May27 ? 00:00:21 [postmaster] <defunct>
postgres 1071168 1951 0 May27 ? 00:00:03 [postmaster] <defunct>
postgres 1071170 1951 0 May27 ? 00:00:03 [postmaster] <defunct>
postgres 1071171 1951 0 May27 ? 00:00:02 [postmaster] <defunct>
postgres 1071174 1951 0 May27 ? 00:00:02 [postmaster] <defunct>
postgres 1071175 1951 0 May27 ? 00:00:03 [postmaster] <defunct>
postgres 1071176 1951 0 May27 ? 00:00:02 [postmaster] <defunct>
postgres 1071179 1951 0 May27 ? 00:00:02 [postmaster] <defunct>
postgres 1071181 1951 0 May27 ? 00:00:03 [postmaster] <defunct>
postgres 1071184 1951 0 May27 ? 00:00:02 [postmaster] <defunct>
postgres 1071185 1951 0 May27 ? 00:00:02 [postmaster] <defunct>
postgres 1071187 1951 0 May27 ? 00:00:03 [postmaster] <defunct>
avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 0.16 0.00 0.00 99.84
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
scd0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-8 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-9 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
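For reference, here is how the D-state processes and the kernel wait channel each one is blocked in can be listed straight from /proc (a minimal sketch; Linux only, and wchan may read back empty without sufficient privilege):

```shell
#!/bin/sh
# List uninterruptible (D-state) processes with their kernel wait channel.
# The process state is the first field after the parenthesized command
# name in /proc/<pid>/stat; /proc/<pid>/wchan names the kernel function
# the process is sleeping in.
for pid in /proc/[0-9]*; do
    state=$(sed 's/^.*) //' "$pid/stat" 2>/dev/null | cut -d' ' -f1)
    if [ "$state" = "D" ]; then
        printf '%s\t%s\t%s\n' "${pid#/proc/}" \
            "$(cat "$pid/wchan" 2>/dev/null)" \
            "$(tr '\0' ' ' < "$pid/cmdline" 2>/dev/null)"
    fi
done
```

With several hundred backends stuck, piping this through `sort -k2` groups them by wait channel, which quickly shows whether they are all blocked in the same kernel path.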
Thanks,
James
On Sun, May 29, 2022 at 01:20:12PM +0000, James Pang (chaolpan) wrote:
> We have a performance test on PostgreSQL 13.4 on RHEL 8.4. Just after a connection storm (952 new connections coming in 1 minute), a lot of backends start in "D" state, and when more sessions got disconnected, they do not exit successfully but instead became "defunct". No errors in postgresql.log. Just after the connection storm, some pg_cron workers could not be started either. The server is a virtual machine and there is no I/O hang; actually I/O load is very low. Could this be a PostgreSQL bug or abnormal OS behavior?

What settings have you set?
https://wiki.postgresql.org/wiki/Server_Configuration

What extensions do you have loaded? \dx

Send the output of SELECT * FROM pg_stat_activity, either as an attachment or in \x mode.

What is your data dir? Is it on the VM's root filesystem or something else? Show the output of "mount".

Are there any kernel messages in /var/log/messages or `dmesg`?

How many relations are in your schema? Are you using temp tables? Long-running transactions?

-- 
Justin
1. Extensions:
   shared_preload_libraries = 'orafce,pgaudit,pg_cron,pg_stat_statements,set_user'
2. psql cannot log in now; it hangs too, so we cannot check anything from the pg_stat_* views.
3. One main app user and 2 schemas; no long-running transactions.
4. We use /pgdata; it is xfs on LVM (RHEL 8.4), on shared storage, not on the root filesystem.
   /dev/mapper/pgdatavg-pgdatalv     500G  230G  271G  46% /pgdata
   /dev/mapper/pgdatavg-pgarchivelv  190G  1.5G  189G   1% /pgarchive
   /dev/mapper/pgdatavg-pgwallv      100G   34G   67G  34% /pgwal

Regards,
James

-----Original Message-----
From: Justin Pryzby <pryzby@telsasoft.com>
Sent: Sunday, May 29, 2022 11:02 PM
To: James Pang (chaolpan) <chaolpan@cisco.com>
Cc: pgsql-performance@lists.postgresql.org
Subject: Re: postgres backend process hang on "D" state
On Mon, May 30, 2022 at 01:19:56AM +0000, James Pang (chaolpan) wrote:
> 1. Extensions:
>    shared_preload_libraries = 'orafce,pgaudit,pg_cron,pg_stat_statements,set_user'
> 2. psql cannot log in now; it hangs too, so we cannot check anything from the pg_stat_* views.
> 3. One main app user and 2 schemas; no long-running transactions.
> 4. We use /pgdata; it is xfs on LVM (RHEL 8.4), on shared storage, not on the root filesystem.
>    /dev/mapper/pgdatavg-pgdatalv     500G  230G  271G  46% /pgdata
>    /dev/mapper/pgdatavg-pgarchivelv  190G  1.5G  189G   1% /pgarchive
>    /dev/mapper/pgdatavg-pgwallv      100G   34G   67G  34% /pgwal

What are the LVM PVs? Is it a scsi/virt device? Or iscsi/drbd/???

I didn't hear back whether there are any kernel errors. Is the storage broken/stuck/disconnected?

Can you run "time find /pgdata /pgarchive /pgwal -ls |wc"?

Could you run "ps -u postgres -O wchan============================="?

Can you strace one of the stuck backends?

It sounds like you'll have to restart the service or VM (forcibly if necessary) to resolve the immediate issue, then collect the other info, and leave a "psql" session open so that, if the problem recurs, you can check pg_stat_activity and other DB info.
"James Pang (chaolpan)" <chaolpan@cisco.com> writes:
> 1. Extensions:
>    shared_preload_libraries = 'orafce,pgaudit,pg_cron,pg_stat_statements,set_user'

Can you still reproduce this if you remove all of those?

			regards, tom lane
Updates on your questions:

time find /pgdata /pgarchive /pgwal -ls |wc
  82165  903817 8397391

real    0m1.120s
user    0m0.432s
sys     0m0.800s

ps -u postgres -O wchan=============================
    PID ============================ S TTY          TIME COMMAND
   1951 -                            D ?        00:26:37 /usr/pgsql-13/bin/postmaster -D /pgdata -c config_file=/pgdata/postgresql.conf
   2341 -                            S ?        00:00:06 postgres: logger
   2361 -                            S ?        00:01:02 postgres: checkpointer
   2362 -                            S ?        00:00:27 postgres: background writer
   2363 -                            S ?        00:00:59 postgres: walwriter
   2364 -                            S ?        00:02:00 postgres: autovacuum launcher
   2365 -                            Z ?        00:00:04 [postmaster] <defunct>
   2366 do_epoll_wait                S ?        00:13:30 postgres: stats collector
   2367 do_epoll_wait                S ?        00:00:18 postgres: pg_cron launcher
   2368 -                            S ?        00:00:00 postgres: logical replication launcher
1053144 -                            Z ?        00:05:36 [postmaster] <defunct>
1053319 -                            Z ?        00:05:29 [postmaster] <defunct>
1053354 -                            Z ?        00:05:53 [postmaster] <defunct>
1053394 -                            Z ?        00:05:51 [postmaster] <defunct>
...
1064387 -                            Z ?        00:05:13 [postmaster] <defunct>
1070257 -                            D ?        00:24:23 postgres: test pbwd 192.168.205.53(55886) BIND
1070258 -                            D ?        00:24:24 postgres: test pbwd 192.168.205.50(58910) BIND
1070259 -                            D ?        00:24:22 postgres: test pbwd 192.168.205.133(48754) SELECT
1070260 -                            Z ?        00:05:02 [postmaster] <defunct>
...

strace and gdb also hang when trying to attach to one of these processes.

Regards,
James

-----Original Message-----
From: Justin Pryzby <pryzby@telsasoft.com>
Sent: Monday, May 30, 2022 10:20 AM
To: James Pang (chaolpan) <chaolpan@cisco.com>
Cc: pgsql-performance@lists.postgresql.org
Subject: Re: postgres backend process hang on "D" state
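A side note for anyone hitting the same wall: strace and gdb attach via ptrace, which waits for the target to stop, and a process in uninterruptible sleep never does, so the tracers hang too. The kernel-side view can often still be read without ptrace (the PID below is illustrative; reading "stack" needs root):

```shell
pid=1070257                        # illustrative PID of a stuck backend
grep '^State:' /proc/$pid/status   # e.g. "State:  D (disk sleep)"
cat /proc/$pid/wchan; echo         # symbol the process is waiting in
cat /proc/$pid/stack               # full kernel call chain (root only)
```

The kernel call chain is usually the fastest way to tell an I/O stall from a security module or lock wait.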
Maybe a bug in one of these extensions? I can try removing all of them, but we do need these extensions.

Thanks,
James

-----Original Message-----
From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: Monday, May 30, 2022 10:21 AM
To: James Pang (chaolpan) <chaolpan@cisco.com>
Cc: Justin Pryzby <pryzby@telsasoft.com>; pgsql-performance@lists.postgresql.org
Subject: Re: postgres backend process hang on "D" state

> shared_preload_libraries = 'orafce,pgaudit,pg_cron,pg_stat_statements,set_user'

Can you still reproduce this if you remove all of those?
It turns out all the processes were blocked by a system service, fapolicyd. When I stopped that service, the defunct processes disappeared, the pending backends were released and moved on, and the database started accepting new connections again. fapolicyd on RHEL 8.4 was enabled by the system admins to meet security compliance requirements.

systemctl stop fapolicyd

After that, everything soon went back to normal.

Regards,
James

-----Original Message-----
From: James Pang (chaolpan)
Sent: Monday, May 30, 2022 11:00 AM
To: Tom Lane <tgl@sss.pgh.pa.us>
Cc: Justin Pryzby <pryzby@telsasoft.com>; pgsql-performance@lists.postgresql.org
Subject: RE: postgres backend process hang on "D" state
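Stopping fapolicyd confirms the cause but leaves the compliance control disabled. A gentler follow-up, sketched below, is to trust the PostgreSQL executables instead. The flags are from the RHEL 8 fapolicyd tooling and the binary path assumes the PGDG packages; verify both against your installed version before relying on them.

```shell
# To confirm fapolicyd is the blocker, run it in the foreground and
# print only the denials it would enforce (Ctrl-C to stop):
systemctl stop fapolicyd
fapolicyd --debug-deny

# Then, rather than leaving it disabled, add the PostgreSQL binaries to
# the fapolicyd trust database and reload it:
fapolicyd-cli --file add /usr/pgsql-13/bin/postgres
fapolicyd-cli --update   # notify the daemon to re-read the trust db
systemctl start fapolicyd
```

This keeps the security policy active for everything else while exempting the database binaries that were being stalled.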