Re: backends stuck in "startup" - Mailing list pgsql-general
From | Justin Pryzby |
---|---|
Subject | Re: backends stuck in "startup" |
Date | |
Msg-id | 20171122064702.GQ5668@telsasoft.com Whole thread Raw |
In response to | Re: backends stuck in "startup" (Andres Freund <andres@anarazel.de>) |
Responses |
Re: backends stuck in "startup"
|
List | pgsql-general |
On Tue, Nov 21, 2017 at 03:40:27PM -0800, Andres Freund wrote: > Hi, > > On 2017-11-21 17:09:26 -0600, Justin Pryzby wrote: > > I'm sorry to report this previously reported problem is happening again, > > starting shortly after pg_upgrading a customer to PG10.1 from 9.6.5. > > > > As $subject: backends are stuck in startup for minutes at a time. I didn't > > strace this time, but I believe last time I saw one was waiting in a futex. > > Could you try stracing next time? I straced all the "startup" PIDs, which were all in futex, without exception: [pryzbyj@TS-DB ~]$ sudo strace `ps h --sort start_time -o pid,args -u postgres |awk '/startup$/{print "-p"$1}'` 2>&1 |tee-a /tmp/strace-pg-backends Process 3611 attached Process 6081 attached [pid 6081] futex(0x7ff4314000f8, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 3611] futex(0x7ff431400138, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 3354] futex(0x7ff431400258, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 3117] futex(0x7ff4314002b8, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 2948] futex(0x7ff431400378, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 2925] futex(0x7ff4314003d8, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 2789] futex(0x7ff431400438, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 2752] futex(0x7ff4314004b8, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 2663] futex(0x7ff431400618, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 2661] futex(0x7ff431400638, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 2624] futex(0x7ff431400678, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 2576] futex(0x7ff4314006f8, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 365] futex(0x7ff431400718, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 326] futex(0x7ff431400798, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 32704] futex(0x7ff431400898, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 32309] futex(0x7ff431400938, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 32292] futex(0x7ff431400a18, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 32170] futex(0x7ff431400ad8, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 32134] futex(0x7ff431400b58, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 31987] futex(0x7ff431400bf8, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 31962] futex(0x7ff431400c58, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 31817] futex(0x7ff431400d38, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 31807] futex(0x7ff431400d98, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 29664] futex(0x7ff431400e98, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 29462] futex(0x7ff431400ef8, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 29426] futex(0x7ff431400f38, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 29258] futex(0x7ff431401058, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 29227] futex(0x7ff431402f58, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 29107] futex(0x7ff431401138, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 29071] futex(0x7ff431401198, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 28942] futex(0x7ff431401258, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 28924] futex(0x7ff431401298, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 26942] futex(0x7ff431401418, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 25278] futex(0x7ff431401438, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 24909] futex(0x7ff431401598, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 24908] futex(0x7ff4314015b8, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 24852] futex(0x7ff4314015f8, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 24851] futex(0x7ff431401618, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 24585] futex(0x7ff4314016f8, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 24584] futex(0x7ff431401718, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 24478] futex(0x7ff431401918, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 24476] futex(0x7ff431401958, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 24470] futex(0x7ff4314019d8, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 24464] futex(0x7ff431401ad8, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 24458] futex(0x7ff431401bb8, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 24456] futex(0x7ff431401b98, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 24450] futex(0x7ff431401c58, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 24448] futex(0x7ff431401c98, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 24439] futex(0x7ff431401cf8, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 24323] futex(0x7ff431401e38, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 24322] futex(0x7ff431401e58, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 23057] futex(0x7ff431401fd8, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 23055] futex(0x7ff431402038, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 21826] futex(0x7ff4314021f8, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 21824] futex(0x7ff431402238, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 21794] futex(0x7ff431402298, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 21792] futex(0x7ff431402638, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 21791] futex(0x7ff431402af8, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 21790] futex(0x7ff431402c98, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 21786] futex(0x7ff431402478, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 21785] futex(0x7ff431402758, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 21784] futex(0x7ff431403098, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 21768] futex(0x7ff431402718, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 21662] futex(0x7ff431402338, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 21648] futex(0x7ff431402c38, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 21630] futex(0x7ff4314024d8, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 21591] futex(0x7ff431402278, FUTEX_WAIT, 0, NULL <unfinished ...> [pid 21590] futex(0x7ff431402bf8, FUTEX_WAIT, 0, NULL Also: [pryzbyj@TS-DB ~]$ date;ps -O lstart,wchan=9999999999999999999999 -u postgres --sort start_time |grep startup$ Wed Nov 22 00:47:41 EST 20171990 Wed Nov 22 00:36:44 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.247(53934)startup2007 Wed Nov 22 00:36:44 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.247(53945)startup2014 Wed Nov 22 00:36:44 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.247(53946)startup2042 Wed Nov 22 00:36:48 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.248(51353)startup2043 Wed Nov 22 00:36:48 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.252(52122)startup2047 Wed Nov 22 00:36:49 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.253(46696)startup2049 Wed Nov 22 00:36:49 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.249(60135)startup2052 Wed Nov 22 00:36:50 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.247(53947)startup2095 Wed Nov 22 00:37:13 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.248(51355)startup2096 Wed Nov 22 00:37:14 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.249(60137)startup2097 Wed Nov 22 00:37:15 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.247(53949)startup2178 Wed Nov 22 00:37:43 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.253(46698)startup2188 Wed Nov 22 00:38:00 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.252(52124)startup2331 Wed Nov 22 00:38:30 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.248(51357)startup2332 Wed Nov 22 00:38:30 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.249(60139)startup2333 Wed Nov 22 00:38:31 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.247(53951)startup2529 Wed Nov 22 00:39:00 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.253(46700)startup2530 Wed Nov 22 00:39:01 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.248(51358)startup2533 Wed Nov 22 00:39:01 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.247(53952)startup2535 Wed Nov 22 00:39:01 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.252(52125)startup2549 Wed Nov 22 00:39:01 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.253(46701)startup [...] > Don't think it's conclusive that they're all continually waiting rather than > contending on a few locks. As soon as you attach to one backend a contended > lock situation can quickly turn into standstill if you end up interrupting a > process that holds a lock. I didn't understand what you mean. By "attach" do you mean attach with strace? gdb? Or ?? I also tried stracing all (400) postgres PIDs, which didn't magically unlock them.. sudo strace `ps h --sort start_time -o pid,args -u postgres |awk '/|startup$/{print "-p"$1}'` 2>> /tmp/strace-pg-backends3& FYI these AREN'T being rebuilt very often: [pryzbyj@TS-DB ~]$ sudo sh -c "find /var/lib/pgsql/ -name 'pg_internal.init' -print0 |xargs -r0 ls -lt" -rw------- 1 postgres postgres 136164 Nov 21 21:50 /var/lib/pgsql/10/data/base/16404/pg_internal.init -rw------- 1 postgres postgres 21340 Nov 21 21:50 /var/lib/pgsql/10/data/global/pg_internal.init -rw------- 1 postgres postgres 136164 Nov 21 21:47 /var/lib/pgsql/10/data/base/16406/pg_internal.init -rw------- 1 postgres postgres 136164 Nov 21 21:32 /var/lib/pgsql/10/data/base/16410/pg_internal.init -rw------- 1 postgres postgres 136164 Nov 21 21:28 /var/lib/pgsql/10/data/base/16401/pg_internal.init -rw------- 1 postgres postgres 136164 Nov 21 21:21 /var/lib/pgsql/10/data/base/16403/pg_internal.init -rw------- 1 postgres postgres 136164 Nov 21 21:13 /var/lib/pgsql/10/data/base/13451/pg_internal.init -rw------- 1 postgres postgres 136164 Nov 21 21:13 /var/lib/pgsql/10/data/base/16408/pg_internal.init -rw------- 1 postgres postgres 136164 Nov 21 20:59 /var/lib/pgsql/10/data/base/16411/pg_internal.init -rw------- 1 postgres postgres 136164 Nov 21 20:56 /var/lib/pgsql/10/data/base/16402/pg_internal.init -rw------- 1 postgres postgres 136164 Nov 21 20:01 /var/lib/pgsql/10/data/base/1/pg_internal.init -rw------- 1 postgres postgres 136164 Nov 21 20:01 /var/lib/pgsql/10/data/base/16409/pg_internal.init -rw------- 1 postgres postgres 136164 Nov 21 20:00 /var/lib/pgsql/10/data/base/16407/pg_internal.init -rw------- 1 postgres postgres 136164 Nov 21 20:00 /var/lib/pgsql/10/data/base/16405/pg_internal.init -rw------- 1 postgres postgres 136164 Nov 21 19:59 /var/lib/pgsql/10/data/base/16400/pg_internal.init -rw------- 1 postgres postgres 136164 Nov 21 17:17 /var/lib/pgsql/10/data/base/13450/pg_internal.init [pryzbyj@TS-DB ~]$ date Wed Nov 22 00:45:14 EST 2017 And it's not a minor overhead: [pryzbyj@TS-DB ~]$ time psql newn -c 'SELECT 1' psql: FATAL: terminating connection due to administrator command real 10m22.089s Justin
pgsql-general by date: