Thread: postgres backend process hang on " D " state

postgres backend process hang on " D " state

From
"James Pang (chaolpan)"
Date:

Hi,

   We are running a performance test on PostgreSQL 13.4 on RHEL 8.4. Just after a connection storm (952 new connections arriving within one minute), a lot of backends went into "D" (uninterruptible sleep) state, and when more sessions got disconnected they did not exit successfully but instead became "defunct" (zombie) processes. There are no errors in postgresql.log. Also, just after the connection storm, some pg_cron workers could not be started. The server is a virtual machine and there is no I/O hang; the I/O load is actually very low. Could this be a PostgreSQL bug or abnormal OS behavior?
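
For reference, a quick way to see the process-state breakdown described above (a rough sketch; nothing here depends on our particular setup):

# Count postgres processes per state (D = uninterruptible sleep, Z = zombie/defunct).
ps -u postgres -o stat= | cut -c1 | sort | uniq -c

# List the backends stuck in D state together with their kernel wait channel.
ps -u postgres -o pid,stat,wchan:32,cmd | awk 'NR==1 || $2 ~ /^D/'

top currently shows: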

 

top - 13:18:02 up 4 days,  6:59,  6 users,  load average: 308.68, 307.93, 307.40

Tasks: 1690 total,   1 running, 853 sleeping,   0 stopped, 836 zombie

%Cpu(s):  0.1 us,  0.8 sy,  0.0 ni, 99.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

MiB Mem : 128657.6 total,   1188.7 free,  52921.5 used,  74547.4 buff/cache

MiB Swap:   3072.0 total,   3066.7 free,      5.3 used.  74757.3 avail Mem

 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND

1070325 postgres  20   0   45.6g  15356  11776 D   0.5   0.0  19:21.37 postgres: xuser  xdb 192.x.x.132(33318) BIND

1070328 postgres  20   0   45.6g  16120  11660 D   0.5   0.0  19:21.96 postgres: xuser  xdb 192.x.x.121(34372) BIND

1070329 postgres  20   0   45.6g  15380  11872 D   0.5   0.0  19:20.90 postgres: xuser  xdb 192.x.x.126(41316) BIND

1070397 postgres  20   0   45.6g  14804  11604 D   0.5   0.0  19:23.57 postgres: xuser  xdb 192.x.x.132(33324) BIND

1070434 postgres  20   0   45.6g  14928  11812 D   0.5   0.0  19:21.14 postgres: xuser  xdb 192.x.x.129(57298) BIND

1070480 postgres  20   0   45.6g  14612  11660 D   0.5   0.0  19:19.88 postgres: xuser  xdb 192.x.x.127(52424) BIND

1070508 postgres  20   0   45.6g  14928  11812 D   0.5   0.0  19:20.48 postgres: xuser  xdb 192.x.x.127(52428) BIND

1070523 postgres  20   0   45.6g  14544  11716 D   0.5   0.0  19:22.53 postgres: xuser  xdb 192.x.x.130(33678) BIND

1070647 postgres  20   0   45.6g  14444  11660 D   0.5   0.0  19:24.36 postgres: xuser  xdb 192.x.x.129(57316) BIND

1070648 postgres  20   0   45.6g  14352  11524 D   0.5   0.0  19:22.86 postgres: xuser  xdb 192.x.x.133(48796) BIND

1070676 postgres  20   0   45.6g  14456  11660 D   0.5   0.0  19:21.92 postgres: xuser  xdb 192.x.x.128(54614) BIND

1070724 postgres  20   0   45.6g  14352  11524 D   0.5   0.0  19:20.90 postgres: xuser  xdb 192.x.x.126(41370) BIND

1070739 postgres  20   0   45.6g  14008  11412 D   0.5   0.0  19:22.69 postgres: xuser  xdb 192.x.x.123(56164) BIND

1070786 postgres  20   0   45.6g  14352  11524 D   0.5   0.0  19:22.51 postgres: xuser  xdb 192.x.x.121(34428) BIND

1070801 postgres  20   0   45.6g  13240  10688 D   0.5   0.0  19:22.19 postgres: xuser  xdb 192.x.x.126(41382) BIND

1070815 postgres  20   0   45.6g  13240  10688 D   0.5   0.0  19:21.36 postgres: xuser  xdb 192.x.x.53(55950) BIND

1070830 postgres  20   0   45.6g  13240  10688 D   0.5   0.0  19:23.80 postgres: xuser  xdb 192.x.x.131(41704) BIND

1070841 postgres  20   0   45.6g  13304  10744 D   0.5   0.0  19:24.25 postgres: xuser  xdb 192.x.x.131(41706) BIND

1070884 postgres  20   0   45.6g  13264  10688 D   0.5   0.0  19:20.61 postgres: xuser  xdb 192.x.x.122(33734) BIND

1070903 postgres  20   0   45.6g  14456  11660 D   0.5   0.0  19:23.43 postgres: xuser  xdb 192.x.x.132(33384) BIND

1070915 postgres  20   0   45.5g   7280   5372 D   0.5   0.0  19:21.20 postgres: xuser  xdb 192.x.x.129(57350) initializing

1070941 postgres  20   0   45.5g   7280   5372 D   0.5   0.0  19:20.58 postgres: xuser  xdb 192.x.x.124(35934) initializing

1070944 postgres  20   0   45.5g   7280   5372 D   0.5   0.0  19:18.57 postgres: xuser  xdb 192.x.x.50(58964) initializing

1070963 postgres  20   0   45.5g   7280   5372 D   0.5   0.0  19:21.98 postgres: xuser  xdb 192.x.x.132(33362) initializing

1070974 postgres  20   0   45.5g   7280   5372 D   0.5   0.0  19:22.76 postgres: xuser  xdb 192.x.x.54(56774) initializing

1070986 postgres  20   0   45.5g   7284   5372 D   0.5   0.0  19:21.89 postgres: xuser  xdb 192.x.x.132(33394) initializing

 

postgres 1071160    1951  0 May27 ?        00:00:02 [postmaster] <defunct>

postgres 1071161    1951  0 May27 ?        00:00:02 [postmaster] <defunct>

postgres 1071162    1951  0 May27 ?        00:00:02 [postmaster] <defunct>

postgres 1071163    1951  0 May27 ?        00:00:02 [postmaster] <defunct>

postgres 1071164    1951  0 May27 ?        00:00:03 [postmaster] <defunct>

postgres 1071167    1951  0 May27 ?        00:00:21 [postmaster] <defunct>

postgres 1071168    1951  0 May27 ?        00:00:03 [postmaster] <defunct>

postgres 1071170    1951  0 May27 ?        00:00:03 [postmaster] <defunct>

postgres 1071171    1951  0 May27 ?        00:00:02 [postmaster] <defunct>

postgres 1071174    1951  0 May27 ?        00:00:02 [postmaster] <defunct>

postgres 1071175    1951  0 May27 ?        00:00:03 [postmaster] <defunct>

postgres 1071176    1951  0 May27 ?        00:00:02 [postmaster] <defunct>

postgres 1071179    1951  0 May27 ?        00:00:02 [postmaster] <defunct>

postgres 1071181    1951  0 May27 ?        00:00:03 [postmaster] <defunct>

postgres 1071184    1951  0 May27 ?        00:00:02 [postmaster] <defunct>

postgres 1071185    1951  0 May27 ?        00:00:02 [postmaster] <defunct>

postgres 1071187    1951  0 May27 ?        00:00:03 [postmaster] <defunct>

 

 

avg-cpu:  %user   %nice %system %iowait  %steal   %idle

           0.00    0.00    0.16    0.00    0.00   99.84

 

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util

sdb              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00

sdd              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00

sdc              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00

sda              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00

scd0             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00

dm-0             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00

dm-1             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00

dm-2             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00

dm-3             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00

dm-4             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00

dm-5             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00

dm-6             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00

dm-7             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00

dm-8             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00

dm-9             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00

 

 

Thanks,

 

James

 

Re: postgres backend process hang on " D " state

From
Ranier Vilela
Date:
On Sun, May 29, 2022 at 10:20 AM, James Pang (chaolpan) <chaolpan@cisco.com> wrote:

Hi,

   We are running a performance test on PostgreSQL 13.4 on RHEL 8.4, [...]

Hard to say with this information alone, but what does this "test" look like? And why not use 13.7, which has all the related bug fixes?
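
For what it's worth, a minor-version update within 13.x is just a package swap plus a restart, with no dump/restore needed (a sketch, assuming the PGDG packages and the stock postgresql-13 unit name; your service name may differ):

# Update the 13.x server packages in place; minor releases keep the on-disk format.
sudo dnf update postgresql13-server postgresql13-contrib

# Restart so the new binaries are picked up.
sudo systemctl restart postgresql-13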

regards,
Ranier Vilela

Re: postgres backend process hang on " D " state

From
Justin Pryzby
Date:
On Sun, May 29, 2022 at 01:20:12PM +0000, James Pang (chaolpan) wrote:
> We are running a performance test on PostgreSQL 13.4 on RHEL 8.4. Just after a connection storm (952 new connections
> arriving within one minute), a lot of backends went into "D" (uninterruptible sleep) state, and when more sessions got
> disconnected they did not exit successfully but instead became "defunct" (zombie) processes. There are no errors in
> postgresql.log. Also, just after the connection storm, some pg_cron workers could not be started. The server is a
> virtual machine and there is no I/O hang; the I/O load is actually very low. Could this be a PostgreSQL bug or
> abnormal OS behavior?
 

What settings have you set ?
https://wiki.postgresql.org/wiki/Server_Configuration

What extensions do you have loaded?  \dx

Send the output of SELECT * FROM pg_stat_activity either as an attachment or in
\x mode?

What is your data dir ?  Is it on the VM's root filesystem or something else ?
Show the output of "mount".  Are there any kernel messages in /var/log/messages
or `dmesg` ?

How many relations are in your schema ?
Are you using temp tables ?
Long-running transactions ?
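
If it helps, most of the above can be gathered in one pass along these lines (a sketch; the psql calls of course require the server to still accept connections, and the log locations are the usual RHEL defaults):

# Non-default server settings, as suggested by the wiki page above.
psql -XAt -c "SELECT name, current_setting(name), source FROM pg_settings WHERE source NOT IN ('default', 'override');"

# Installed extensions, and current activity written to a file in expanded (\x) format.
psql -Xc '\dx'
psql -Xxc "SELECT * FROM pg_stat_activity;" -o pg_stat_activity.out

# Filesystems and kernel messages.
mount
dmesg -T | tail -n 100
sudo tail -n 200 /var/log/messages

# Rough count of relations in the database.
psql -XAt -c "SELECT count(*) FROM pg_class;"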

-- 
Justin



RE: postgres backend process hang on " D " state

From
"James Pang (chaolpan)"
Date:
1. Extensions:
      shared_preload_libraries = 'orafce,pgaudit,pg_cron,pg_stat_statements,set_user'
2. psql cannot log in now; it hangs there too, so we cannot check anything from the pg_stat_* views.
3. One main application user and 2 schemas; no long-running transactions.
4. We use /pgdata; it is on XFS over LVM (VG/LV) on RHEL 8.4. It is shared storage, not the root filesystem.
/dev/mapper/pgdatavg-pgdatalv     500G  230G  271G  46% /pgdata
/dev/mapper/pgdatavg-pgarchivelv  190G  1.5G  189G   1% /pgarchive
/dev/mapper/pgdatavg-pgwallv      100G   34G   67G  34% /pgwal

Regards,

James
-----Original Message-----
From: Justin Pryzby <pryzby@telsasoft.com>
Sent: Sunday, May 29, 2022 11:02 PM
To: James Pang (chaolpan) <chaolpan@cisco.com>
Cc: pgsql-performance@lists.postgresql.org
Subject: Re: postgres backend process hang on " D " state

On Sun, May 29, 2022 at 01:20:12PM +0000, James Pang (chaolpan) wrote:
> We are running a performance test on PostgreSQL 13.4 on RHEL 8.4. Just after a connection storm (952 new connections
> arriving within one minute), a lot of backends went into "D" (uninterruptible sleep) state, and when more sessions got
> disconnected they did not exit successfully but instead became "defunct" (zombie) processes. There are no errors in
> postgresql.log. Also, just after the connection storm, some pg_cron workers could not be started. The server is a
> virtual machine and there is no I/O hang; the I/O load is actually very low. Could this be a PostgreSQL bug or
> abnormal OS behavior?

What settings have you set ?
https://wiki.postgresql.org/wiki/Server_Configuration

What extensions do you have loaded?  \dx

Send the output of SELECT * FROM pg_stat_activity either as an attachment or in \x mode?

What is your data dir ?  Is it on the VM's root filesystem or something else ?
Show the output of "mount".  Are there any kernel messages in /var/log/messages or `dmesg` ?

How many relations are in your schema ?
Are you using temp tables ?
Long-running transactions ?

--
Justin



Re: postgres backend process hang on " D " state

From
Justin Pryzby
Date:
On Mon, May 30, 2022 at 01:19:56AM +0000, James Pang (chaolpan) wrote:
> 1. Extensions:
>       shared_preload_libraries = 'orafce,pgaudit,pg_cron,pg_stat_statements,set_user'
> 2. psql cannot log in now; it hangs there too, so we cannot check anything from the pg_stat_* views.
> 3. One main application user and 2 schemas; no long-running transactions.
> 4. We use /pgdata; it is on XFS over LVM (VG/LV) on RHEL 8.4. It is shared storage, not the root filesystem.
> /dev/mapper/pgdatavg-pgdatalv     500G  230G  271G  46% /pgdata
> /dev/mapper/pgdatavg-pgarchivelv  190G  1.5G  189G   1% /pgarchive
> /dev/mapper/pgdatavg-pgwallv      100G   34G   67G  34% /pgwal

What are the LVM PVs ?  Is it a scsi/virt device ?  Or iscsi/drbd/???
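
For example, something along these lines would show that (a sketch; device names will of course differ):

# Which block devices back the volume group, and what transport they sit on.
sudo pvs -o pv_name,vg_name
lsblk -o NAME,TYPE,SIZE,TRAN,MODEL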

I didn't hear back whether there are any kernel errors.
Is the storage broken/stuck/disconnected ?
Can you run "time find /pgdata /pgarchive /pgwal -ls |wc" ?

Could you run "ps -u postgres -O wchan============================="

Can you strace one of the stuck backends ?

It sounds like you'll have to restart the service or the VM (forcibly if necessary)
to resolve the immediate issue, then collect the other info, and keep a psql
session open so that, if the problem recurs, you can check pg_stat_activity and
other DB info.



Re: postgres backend process hang on " D " state

From
Tom Lane
Date:
"James Pang (chaolpan)" <chaolpan@cisco.com> writes:
> 1. extensions
>       shared_preload_libraries = 'orafce,pgaudit,pg_cron,pg_stat_statements,set_user'

Can you still reproduce this if you remove all of those?

            regards, tom lane



RE: postgres backend process hang on " D " state

From
"James Pang (chaolpan)"
Date:
Updates on your questions:

time find /pgdata /pgarchive /pgwal -ls |wc
  82165  903817 8397391

real    0m1.120s
user    0m0.432s
sys     0m0.800s

ps -u postgres -O wchan=============================
    PID ============================ S TTY          TIME COMMAND
   1951 -                            D ?        00:26:37 /usr/pgsql-13/bin/postmaster -D /pgdata -c config_file=/pgdata/postgresql.conf
   2341 -                            S ?        00:00:06 postgres: logger
   2361 -                            S ?        00:01:02 postgres: checkpointer
   2362 -                            S ?        00:00:27 postgres: background writer
   2363 -                            S ?        00:00:59 postgres: walwriter
   2364 -                            S ?        00:02:00 postgres: autovacuum launcher
   2365 -                            Z ?        00:00:04 [postmaster] <defunct>
   2366 do_epoll_wait                S ?        00:13:30 postgres: stats collector
   2367 do_epoll_wait                S ?        00:00:18 postgres: pg_cron launcher
   2368 -                            S ?        00:00:00 postgres: logical replication launcher
1053144 -                            Z ?        00:05:36 [postmaster] <defunct>
1053319 -                            Z ?        00:05:29 [postmaster] <defunct>
1053354 -                            Z ?        00:05:53 [postmaster] <defunct>
1053394 -                            Z ?        00:05:51 [postmaster] <defunct>
...
1064387 -                            Z ?        00:05:13 [postmaster] <defunct>
1070257 -                            D ?        00:24:23 postgres: test pbwd 192.168.205.53(55886) BIND
1070258 -                            D ?        00:24:24 postgres: test pbwd 192.168.205.50(58910) BIND
1070259 -                            D ?        00:24:22 postgres: test pbwd 192.168.205.133(48754) SELECT
1070260 -                            Z ?        00:05:02 [postmaster] <defunct>
...

strace/gdb also hang when trying to trace one of these processes.
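
One thing that does still work against a D-state process is reading its /proc entries, since that needs no ptrace attach (a sketch, using one of the stuck PIDs above):

# Current state and kernel wait channel of one stuck backend.
grep '^State' /proc/1070257/status
cat /proc/1070257/wchan; echo

# Kernel-side stack of the stuck process (needs root).
sudo cat /proc/1070257/stack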

Regards,

James

-----Original Message-----
From: Justin Pryzby <pryzby@telsasoft.com>
Sent: Monday, May 30, 2022 10:20 AM
To: James Pang (chaolpan) <chaolpan@cisco.com>
Cc: pgsql-performance@lists.postgresql.org
Subject: Re: postgres backend process hang on " D " state

On Mon, May 30, 2022 at 01:19:56AM +0000, James Pang (chaolpan) wrote:
> 1. Extensions:
>       shared_preload_libraries = 'orafce,pgaudit,pg_cron,pg_stat_statements,set_user'
> 2. psql cannot log in now; it hangs there too, so we cannot check anything from the pg_stat_* views.
> 3. One main application user and 2 schemas; no long-running transactions.
> 4. We use /pgdata; it is on XFS over LVM (VG/LV) on RHEL 8.4. It is shared storage, not the root filesystem.
> /dev/mapper/pgdatavg-pgdatalv     500G  230G  271G  46% /pgdata
> /dev/mapper/pgdatavg-pgarchivelv  190G  1.5G  189G   1% /pgarchive
> /dev/mapper/pgdatavg-pgwallv      100G   34G   67G  34% /pgwal

What are the LVM PVs ?  Is it a scsi/virt device ?  Or iscsi/drbd/???

I didn't hear back whether there are any kernel errors.
Is the storage broken/stuck/disconnected ?
Can you run "time find /pgdata /pgarchive /pgwal -ls |wc" ?

Could you run "ps -u postgres -O wchan============================="

Can you strace one of the stuck backends ?

It sounds like you'll have to restart the service or the VM (forcibly if necessary) to resolve the immediate issue, then
collect the other info, and keep a psql session open so that, if the problem recurs, you can check pg_stat_activity and
other DB info.



RE: postgres backend process hang on " D " state

From
"James Pang (chaolpan)"
Date:
   Could this be a bug in one of these extensions? I can try removing all of them, but we do need these extensions.
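
For the record, the test itself is small (a minimal sketch, assuming the stock postgresql-13 unit name):

# In /pgdata/postgresql.conf, disable the preloaded libraries for the test run:
#   shared_preload_libraries = ''

# shared_preload_libraries only changes on restart.
sudo systemctl restart postgresql-13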

Thanks,

James

-----Original Message-----
From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: Monday, May 30, 2022 10:21 AM
To: James Pang (chaolpan) <chaolpan@cisco.com>
Cc: Justin Pryzby <pryzby@telsasoft.com>; pgsql-performance@lists.postgresql.org
Subject: Re: postgres backend process hang on " D " state

"James Pang (chaolpan)" <chaolpan@cisco.com> writes:
> 1. extensions
>       shared_preload_libraries = 'orafce,pgaudit,pg_cron,pg_stat_statements,set_user'

Can you still reproduce this if you remove all of those?

            regards, tom lane



RE: postgres backend process hang on " D " state

From
"James Pang (chaolpan)"
Date:
It seems all the processes were blocked by a system service, fapolicyd. When I stopped the service, the defunct processes
disappeared, the pending backends were released and moved on, and the database started to accept new connections again.
   This fapolicyd service on RHEL 8.4 is enabled by the system admin to meet security compliance requirements.
   After "systemctl stop fapolicyd", everything went back to normal soon.
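
If stopping the daemon outright is not acceptable for compliance, a less drastic option we may try is running it in permissive (log-only) mode and watching its decisions (a sketch; the permissive switch is per the RHEL 8 fapolicyd.conf, worth double-checking on your version):

# Confirm fapolicyd is the blocker and watch what it is denying.
systemctl status fapolicyd
sudo journalctl -u fapolicyd --since "1 hour ago"

# In /etc/fapolicyd/fapolicyd.conf, switch to log-only instead of blocking:
#   permissive = 1

sudo systemctl restart fapolicyd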

Regards,

James


-----Original Message-----
From: James Pang (chaolpan)
Sent: Monday, May 30, 2022 11:00 AM
To: Tom Lane <tgl@sss.pgh.pa.us>
Cc: Justin Pryzby <pryzby@telsasoft.com>; pgsql-performance@lists.postgresql.org
Subject: RE: postgres backend process hang on " D " state

   Could this be a bug in one of these extensions? I can try removing all of them, but we do need these extensions.

Thanks,

James

-----Original Message-----
From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: Monday, May 30, 2022 10:21 AM
To: James Pang (chaolpan) <chaolpan@cisco.com>
Cc: Justin Pryzby <pryzby@telsasoft.com>; pgsql-performance@lists.postgresql.org
Subject: Re: postgres backend process hang on " D " state

"James Pang (chaolpan)" <chaolpan@cisco.com> writes:
> 1. extensions
>       shared_preload_libraries = 'orafce,pgaudit,pg_cron,pg_stat_statements,set_user'

Can you still reproduce this if you remove all of those?

            regards, tom lane