Re: backends stuck in "startup" - Mailing list pgsql-general

From Justin Pryzby
Subject Re: backends stuck in "startup"
Date
Msg-id 20171122064702.GQ5668@telsasoft.com
Whole thread Raw
In response to Re: backends stuck in "startup"  (Andres Freund <andres@anarazel.de>)
Responses Re: backends stuck in "startup"
List pgsql-general
On Tue, Nov 21, 2017 at 03:40:27PM -0800, Andres Freund wrote:
> Hi,
> 
> On 2017-11-21 17:09:26 -0600, Justin Pryzby wrote:
> > I'm sorry to report this previously reported problem is happening again,
> > starting shortly after pg_upgrading a customer to PG10.1 from 9.6.5.
> > 
> > As $subject: backends are stuck in startup for minutes at a time.  I didn't
> > strace this time, but I believe last time I saw one was waiting in a futex.
> 
> Could you try stracing next time?

I straced all the "startup" PIDs, which were all in futex, without exception:

[pryzbyj@TS-DB ~]$ sudo strace `ps h --sort start_time -o pid,args -u postgres |awk '/startup$/{print "-p"$1}'` 2>&1
|tee-a /tmp/strace-pg-backends
 

Process 3611 attached
Process 6081 attached
[pid  6081] futex(0x7ff4314000f8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  3611] futex(0x7ff431400138, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  3354] futex(0x7ff431400258, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  3117] futex(0x7ff4314002b8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  2948] futex(0x7ff431400378, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  2925] futex(0x7ff4314003d8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  2789] futex(0x7ff431400438, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  2752] futex(0x7ff4314004b8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  2663] futex(0x7ff431400618, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  2661] futex(0x7ff431400638, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  2624] futex(0x7ff431400678, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  2576] futex(0x7ff4314006f8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid   365] futex(0x7ff431400718, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid   326] futex(0x7ff431400798, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 32704] futex(0x7ff431400898, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 32309] futex(0x7ff431400938, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 32292] futex(0x7ff431400a18, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 32170] futex(0x7ff431400ad8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 32134] futex(0x7ff431400b58, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 31987] futex(0x7ff431400bf8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 31962] futex(0x7ff431400c58, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 31817] futex(0x7ff431400d38, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 31807] futex(0x7ff431400d98, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 29664] futex(0x7ff431400e98, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 29462] futex(0x7ff431400ef8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 29426] futex(0x7ff431400f38, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 29258] futex(0x7ff431401058, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 29227] futex(0x7ff431402f58, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 29107] futex(0x7ff431401138, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 29071] futex(0x7ff431401198, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 28942] futex(0x7ff431401258, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 28924] futex(0x7ff431401298, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 26942] futex(0x7ff431401418, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 25278] futex(0x7ff431401438, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24909] futex(0x7ff431401598, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24908] futex(0x7ff4314015b8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24852] futex(0x7ff4314015f8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24851] futex(0x7ff431401618, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24585] futex(0x7ff4314016f8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24584] futex(0x7ff431401718, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24478] futex(0x7ff431401918, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24476] futex(0x7ff431401958, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24470] futex(0x7ff4314019d8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24464] futex(0x7ff431401ad8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24458] futex(0x7ff431401bb8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24456] futex(0x7ff431401b98, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24450] futex(0x7ff431401c58, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24448] futex(0x7ff431401c98, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24439] futex(0x7ff431401cf8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24323] futex(0x7ff431401e38, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24322] futex(0x7ff431401e58, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 23057] futex(0x7ff431401fd8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 23055] futex(0x7ff431402038, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21826] futex(0x7ff4314021f8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21824] futex(0x7ff431402238, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21794] futex(0x7ff431402298, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21792] futex(0x7ff431402638, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21791] futex(0x7ff431402af8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21790] futex(0x7ff431402c98, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21786] futex(0x7ff431402478, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21785] futex(0x7ff431402758, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21784] futex(0x7ff431403098, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21768] futex(0x7ff431402718, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21662] futex(0x7ff431402338, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21648] futex(0x7ff431402c38, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21630] futex(0x7ff4314024d8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21591] futex(0x7ff431402278, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21590] futex(0x7ff431402bf8, FUTEX_WAIT, 0, NULL

Also:

[pryzbyj@TS-DB ~]$ date;ps -O lstart,wchan=9999999999999999999999 -u postgres --sort start_time |grep startup$
Wed Nov 22 00:47:41 EST 20171990 Wed Nov 22 00:36:44 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx
xx.247(53934)startup2007 Wed Nov 22 00:36:44 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx
xx.247(53945)startup2014 Wed Nov 22 00:36:44 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx
xx.247(53946)startup2042 Wed Nov 22 00:36:48 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx
xx.248(51353)startup2043 Wed Nov 22 00:36:48 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx
xx.252(52122)startup2047 Wed Nov 22 00:36:49 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx
xx.253(46696)startup2049 Wed Nov 22 00:36:49 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx
xx.249(60135)startup2052 Wed Nov 22 00:36:50 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx
xx.247(53947)startup2095 Wed Nov 22 00:37:13 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx
xx.248(51355)startup2096 Wed Nov 22 00:37:14 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx
xx.249(60137)startup2097 Wed Nov 22 00:37:15 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx
xx.247(53949)startup2178 Wed Nov 22 00:37:43 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx
xx.253(46698)startup2188 Wed Nov 22 00:38:00 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx
xx.252(52124)startup2331 Wed Nov 22 00:38:30 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx
xx.248(51357)startup2332 Wed Nov 22 00:38:30 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx
xx.249(60139)startup2333 Wed Nov 22 00:38:31 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx
xx.247(53951)startup2529 Wed Nov 22 00:39:00 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx
xx.253(46700)startup2530 Wed Nov 22 00:39:01 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx
xx.248(51358)startup2533 Wed Nov 22 00:39:01 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx
xx.247(53952)startup2535 Wed Nov 22 00:39:01 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx
xx.252(52125)startup2549 Wed Nov 22 00:39:01 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx
xx.253(46701)startup
 
[...]

> Don't think it's conclusive that they're all continually waiting rather than
> contending on a few locks.  As soon as you attach to one backend a contended
> lock situation can quickly turn into standstill if you end up interrupting a
> process that holds a lock.

I didn't understand what you mean.

By "attach" do you mean attach with strace?  gdb?  Or ??

I also tried stracing all (400) postgres PIDs, which didn't magically unlock them..
sudo strace `ps h --sort start_time -o pid,args -u postgres |awk '/|startup$/{print "-p"$1}'` 2>>
/tmp/strace-pg-backends3&
 

FYI these AREN'T being rebuilt very often:

[pryzbyj@TS-DB ~]$ sudo sh -c "find /var/lib/pgsql/ -name 'pg_internal.init' -print0 |xargs -r0 ls -lt"
-rw------- 1 postgres postgres 136164 Nov 21 21:50 /var/lib/pgsql/10/data/base/16404/pg_internal.init
-rw------- 1 postgres postgres  21340 Nov 21 21:50 /var/lib/pgsql/10/data/global/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 21:47 /var/lib/pgsql/10/data/base/16406/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 21:32 /var/lib/pgsql/10/data/base/16410/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 21:28 /var/lib/pgsql/10/data/base/16401/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 21:21 /var/lib/pgsql/10/data/base/16403/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 21:13 /var/lib/pgsql/10/data/base/13451/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 21:13 /var/lib/pgsql/10/data/base/16408/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 20:59 /var/lib/pgsql/10/data/base/16411/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 20:56 /var/lib/pgsql/10/data/base/16402/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 20:01 /var/lib/pgsql/10/data/base/1/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 20:01 /var/lib/pgsql/10/data/base/16409/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 20:00 /var/lib/pgsql/10/data/base/16407/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 20:00 /var/lib/pgsql/10/data/base/16405/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 19:59 /var/lib/pgsql/10/data/base/16400/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 17:17 /var/lib/pgsql/10/data/base/13450/pg_internal.init
[pryzbyj@TS-DB ~]$ date
Wed Nov 22 00:45:14 EST 2017

And it's not a minor overhead:

[pryzbyj@TS-DB ~]$ time psql newn -c 'SELECT 1'
psql: FATAL:  terminating connection due to administrator command
real    10m22.089s

Justin


pgsql-general by date:

Previous
From: Tom Lane
Date:
Subject: Re: On Judging the Value of Tests
Next
From: legrand legrand
Date:
Subject: Re: One Partition by list is always chosen by planner