Re: Missed condition-variable wakeups on FreeBSD - Mailing list pgsql-hackers
From: Melanie Plageman
Subject: Re: Missed condition-variable wakeups on FreeBSD
Date:
Msg-id: CAAKRu_aps6Jz41do4AwBjyZUG6QkWkyxtWZj125CRPybKX-KYw@mail.gmail.com
In response to: Missed condition-variable wakeups on FreeBSD (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Missed condition-variable wakeups on FreeBSD
List: pgsql-hackers
On Sat, Feb 26, 2022 at 2:07 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> About once a month over the last six months, my buildfarm animal
> florican has gotten stuck while running the core regression tests.
> The symptoms have looked very much the same each time: there is
> a backend with two parallel worker processes that are just sitting
> and not consuming any CPU time. Each time I've attached to these
> processes with gdb to check their stack traces, and seen pretty
> much the same story every time (traces below). What is really
> interesting is that after I detach from the second worker, the
> processes resume running and finish out the test successfully!
> I don't know much about how gdb interacts with kernel calls on
> FreeBSD, but I speculate that the poll(2) call returns with EINTR
> after gdb releases the process, and then things resume fine,
> suggesting that we lost an interrupt somewhere.
>
> I have observed this three times in the REL_11 branch, once
> in REL_12, and a couple of times last summer before it occurred
> to me to start keeping notes. Over that time the machine has
> been running various patchlevels of FreeBSD 13.0.
>
> Here's the stack trace from the leader process in the most
> recent event (REL_11 as of yesterday). It's not always the
> same query that gets stuck, but it's always a parallel hash join:
>
> (gdb) p debug_query_string
> $1 = 0x21873090 "select count(*) from simple r join simple s using (id);"
> (gdb) bt
> #0  _poll () at _poll.S:4
> #1  0x21701361 in __thr_poll (fds=0x219dc170, nfds=2, timeout=-1) at /usr/src/lib/libthr/thread/thr_syscalls.c:338
> #2  0x215eaf3f in poll (pfd=0x219dc170, nfds=2, timeout=-1) at /usr/src/lib/libc/sys/poll.c:47
> #3  0x0097b0fd in WaitEventSetWaitBlock (set=<optimized out>, cur_timeout=-1, occurred_events=<optimized out>, nevents=<optimized out>) at latch.c:1171
> #4  WaitEventSetWait (set=0x219dc138, timeout=-1, occurred_events=<optimized out>, nevents=1, wait_event_info=134217738) at latch.c:1000
> #5  0x0099842b in ConditionVariableSleep (cv=0x2bf0b230, wait_event_info=134217738) at condition_variable.c:157
> #6  0x00977e12 in BarrierArriveAndWait (barrier=0x2bf0b218, wait_event_info=134217738) at barrier.c:191
> #7  0x00873441 in ExecHashJoinImpl (pstate=<optimized out>, parallel=true) at nodeHashjoin.c:328
> #8  ExecParallelHashJoin (pstate=0x2a887740) at nodeHashjoin.c:600
> #9  0x0086a509 in ExecProcNode (node=0x2a887740) at ../../../src/include/executor/executor.h:248
> #10 fetch_input_tuple (aggstate=aggstate@entry=0x2a887500) at nodeAgg.c:407
> #11 0x00869646 in agg_retrieve_direct (aggstate=<optimized out>) at nodeAgg.c:1746
> #12 ExecAgg (pstate=0x2a887500) at nodeAgg.c:1561
> #13 0x0086e181 in ExecProcNode (node=0x2a887500) at ../../../src/include/executor/executor.h:248
> #14 gather_getnext (gatherstate=0x2a887414) at nodeGather.c:276
> #15 ExecGather (pstate=0x2a887414) at nodeGather.c:207
> #16 0x0086a509 in ExecProcNode (node=0x2a887414) at ../../../src/include/executor/executor.h:248
> #17 fetch_input_tuple (aggstate=aggstate@entry=0x2a8871b8) at nodeAgg.c:407
> #18 0x00869646 in agg_retrieve_direct (aggstate=<optimized out>) at nodeAgg.c:1746
> #19 ExecAgg (pstate=0x2a8871b8) at nodeAgg.c:1561
> #20 0x0085956b in ExecProcNode (node=0x2a8871b8) at ../../../src/include/executor/executor.h:248
> #21 ExecutePlan (estate=0x2a887090, planstate=0x2a8871b8, use_parallel_mode=<optimized out>, operation=CMD_SELECT, sendTuples=<optimized out>,
>     numberTuples=<optimized out>, direction=<optimized out>, dest=0x2ab629b4, execute_once=<optimized out>) at execMain.c:1712
> #22 standard_ExecutorRun (queryDesc=0x218a1490, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at execMain.c:353
> #23 0x00859445 in ExecutorRun (queryDesc=0x218a1490, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at execMain.c:296
> #24 0x009a3d57 in PortalRunSelect (portal=portal@entry=0x2197f090, forward=<optimized out>, count=0, dest=0x2ab629b4) at pquery.c:941
> #25 0x009a39d6 in PortalRun (portal=0x2197f090, count=2147483647, isTopLevel=<optimized out>, run_once=<optimized out>, dest=0x2ab629b4, altdest=0x2ab629b4,
>     completionTag=0xffbfdc70 "") at pquery.c:782
> #26 0x009a2b53 in exec_simple_query (query_string=query_string@entry=0x21873090 "select count(*) from simple r join simple s using (id);") at postgres.c:1145
> #27 0x009a04ec in PostgresMain (argc=1, argv=0x21954ed4, dbname=<optimized out>, username=0x21954d38 "buildfarm") at postgres.c:4052
>
> Worker 1 looks like this:
>
> (gdb) bt
> #0  _poll () at _poll.S:4
> #1  0x21701361 in __thr_poll (fds=0x2187cb48, nfds=2, timeout=-1) at /usr/src/lib/libthr/thread/thr_syscalls.c:338
> #2  0x215eaf3f in poll (pfd=0x2187cb48, nfds=2, timeout=-1) at /usr/src/lib/libc/sys/poll.c:47
> #3  0x0097b0fd in WaitEventSetWaitBlock (set=<optimized out>, cur_timeout=-1, occurred_events=<optimized out>, nevents=<optimized out>) at latch.c:1171
> #4  WaitEventSetWait (set=0x2187cb10, timeout=-1, occurred_events=<optimized out>, nevents=1, wait_event_info=134217738) at latch.c:1000
> #5  0x0099842b in ConditionVariableSleep (cv=0x2196b230, wait_event_info=134217738) at condition_variable.c:157
> #6  0x00977e12 in BarrierArriveAndWait (barrier=0x2196b218, wait_event_info=134217738) at barrier.c:191
> #7  0x00873441 in ExecHashJoinImpl (pstate=<optimized out>, parallel=true) at nodeHashjoin.c:328
> #8  ExecParallelHashJoin (pstate=0x2a8378d8) at nodeHashjoin.c:600
> #9  0x0086a509 in ExecProcNode (node=0x2a8378d8) at ../../../src/include/executor/executor.h:248
> #10 fetch_input_tuple (aggstate=aggstate@entry=0x2a8376c8) at nodeAgg.c:407
> #11 0x00869646 in agg_retrieve_direct (aggstate=<optimized out>) at nodeAgg.c:1746
> #12 ExecAgg (pstate=0x2a8376c8) at nodeAgg.c:1561
> #13 0x0085956b in ExecProcNode (node=0x2a8376c8) at ../../../src/include/executor/executor.h:248
> #14 ExecutePlan (estate=0x2a837090, planstate=0x2a8376c8, use_parallel_mode=<optimized out>, operation=CMD_SELECT, sendTuples=<optimized out>,
>     numberTuples=<optimized out>, direction=<optimized out>, dest=0x219ac780, execute_once=<optimized out>) at execMain.c:1712
> #15 standard_ExecutorRun (queryDesc=0x219ad3e0, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at execMain.c:353
> #16 0x00859445 in ExecutorRun (queryDesc=0x219ad3e0, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at execMain.c:296
> #17 0x0085d6de in ParallelQueryMain (seg=0x2187c89c, toc=0x2196b000) at execParallel.c:1402
> #18 0x00750ee0 in ParallelWorkerMain (main_arg=1903254317) at parallel.c:1433
>
> Worker 2 looks like this:
>
> (gdb) bt
> #0  _poll () at _poll.S:4
> #1  0x21701361 in __thr_poll (fds=0x2187cb48, nfds=2, timeout=-1) at /usr/src/lib/libthr/thread/thr_syscalls.c:338
> #2  0x215eaf3f in poll (pfd=0x2187cb48, nfds=2, timeout=-1) at /usr/src/lib/libc/sys/poll.c:47
> #3  0x0097b0fd in WaitEventSetWaitBlock (set=<optimized out>, cur_timeout=-1, occurred_events=<optimized out>, nevents=<optimized out>) at latch.c:1171
> #4  WaitEventSetWait (set=0x2187cb10, timeout=-1, occurred_events=<optimized out>, nevents=1, wait_event_info=134217737) at latch.c:1000
> #5  0x0099842b in ConditionVariableSleep (cv=0x2196b230, wait_event_info=134217737) at condition_variable.c:157
> #6  0x00977e12 in BarrierArriveAndWait (barrier=0x2196b218, wait_event_info=134217737) at barrier.c:191
> #7  0x0086ea44 in MultiExecParallelHash (node=<optimized out>) at nodeHash.c:312
> #8  MultiExecHash (node=0x2a837e88) at nodeHash.c:112
> #9  0x0085fa03 in MultiExecProcNode (node=0x2a837e88) at execProcnode.c:501
> #10 0x00872f80 in ExecHashJoinImpl (pstate=<optimized out>, parallel=true) at nodeHashjoin.c:290
> #11 ExecParallelHashJoin (pstate=0x2a8378d8) at nodeHashjoin.c:600
> #12 0x0086a509 in ExecProcNode (node=0x2a8378d8) at ../../../src/include/executor/executor.h:248
> #13 fetch_input_tuple (aggstate=aggstate@entry=0x2a8376c8) at nodeAgg.c:407
> #14 0x00869646 in agg_retrieve_direct (aggstate=<optimized out>) at nodeAgg.c:1746
> #15 ExecAgg (pstate=0x2a8376c8) at nodeAgg.c:1561
> #16 0x0085956b in ExecProcNode (node=0x2a8376c8) at ../../../src/include/executor/executor.h:248
> #17 ExecutePlan (estate=0x2a837090, planstate=0x2a8376c8, use_parallel_mode=<optimized out>, operation=CMD_SELECT, sendTuples=<optimized out>,
>     numberTuples=<optimized out>, direction=<optimized out>, dest=0x219ac780, execute_once=<optimized out>) at execMain.c:1712
> #18 standard_ExecutorRun (queryDesc=0x219ad3e0, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at execMain.c:353
> #19 0x00859445 in ExecutorRun (queryDesc=0x219ad3e0, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at execMain.c:296
> #20 0x0085d6de in ParallelQueryMain (seg=0x2187c89c, toc=0x2196b000) at execParallel.c:1402
> #21 0x00750ee0 in ParallelWorkerMain (main_arg=1903254317) at parallel.c:1433
> #22 0x009193d3 in StartBackgroundWorker () at bgworker.c:911
>
> In each case I attached to the process, ran the gdb commands shown
> above, and detached. As I've grown used to seeing, the processes
> resumed running after I detached from the second worker. I've not
> tried attaching in something other than PID order, but I probably
> will next time.
>
> Thoughts? Ideas on debugging this?

How could it be that worker 2 is waiting on the build barrier in
PHJ_BUILD_HASHING_INNER and worker 1 and the leader are waiting on it
with it supposedly in PHJ_BUILD_HASHING_OUTER?
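
[Editorial note] For readers unfamiliar with the machinery in these traces: ConditionVariableSleep() parks the process in WaitEventSetWait() on its own latch with no timeout, which is the poll(2) frame at the top of all three backtraces. The only expected wakeups are another backend setting that latch via ConditionVariableSignal()/ConditionVariableBroadcast(), or a latch-setting signal, which fits Tom's speculation about a lost interrupt. A minimal sketch of the documented prepare/test/sleep/cancel pattern follows; sleep_until_released() and phase_has_advanced() are made-up names standing in for the real caller and its exit condition, not PostgreSQL code:

```c
#include "postgres.h"
#include "storage/condition_variable.h"

/* Hypothetical predicate standing in for "the barrier's phase moved on". */
extern bool phase_has_advanced(void);

/*
 * Sketch of the caller-side condition-variable pattern that waiters such
 * as BarrierArriveAndWait() follow.
 */
static void
sleep_until_released(ConditionVariable *cv, uint32 wait_event_info)
{
	ConditionVariablePrepareToSleep(cv);
	while (!phase_has_advanced())
	{
		/*
		 * Blocks in WaitEventSetWait() on the process latch with
		 * timeout = -1 -- the poll(2) seen in the traces.  It returns
		 * only when another backend sets our latch (or a latch-setting
		 * signal arrives); if that wakeup is lost, we stay parked here
		 * indefinitely, which matches the observed hang.
		 */
		ConditionVariableSleep(cv, wait_event_info);
	}
	ConditionVariableCancelSleep();
}
```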
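[Editorial note] On the closing question: BarrierArriveAndWait() advances the phase only when the last attached participant arrives, and that last arriver is the one that broadcasts on the barrier's condition variable; everyone else sleeps until they observe the new phase. So if the leader and worker 1 are already parked at the PHJ_BUILD_HASHING_OUTER wait (nodeHashjoin.c:328), the phase would have to have moved past PHJ_BUILD_HASHING_INNER, and worker 2 (still parked at the nodeHash.c:312 wait) should have been woken at that point; seeing it still asleep is consistent with a lost wakeup rather than a barrier logic problem. A deliberately simplified sketch of that arrive-and-wait logic, using a made-up SketchBarrier struct rather than PostgreSQL's real Barrier (see src/backend/storage/ipc/barrier.c for the actual code):

```c
#include "postgres.h"
#include "storage/condition_variable.h"
#include "storage/spin.h"

/* A deliberately simplified barrier, NOT PostgreSQL's actual struct Barrier. */
typedef struct SketchBarrier
{
	slock_t		mutex;
	int			phase;			/* how many times the barrier has been released */
	int			participants;	/* number of attached participants */
	int			arrived;		/* participants arrived in the current phase */
	ConditionVariable cv;
} SketchBarrier;

/*
 * Arrive at the barrier and wait for the current phase to complete.
 * Returns true in the single participant that advanced the phase.
 */
static bool
sketch_barrier_arrive_and_wait(SketchBarrier *b, uint32 wait_event_info)
{
	int			start_phase;
	bool		last;

	SpinLockAcquire(&b->mutex);
	start_phase = b->phase;
	last = (++b->arrived == b->participants);
	if (last)
	{
		b->arrived = 0;
		b->phase = start_phase + 1; /* advance to the next phase */
	}
	SpinLockRelease(&b->mutex);

	if (last)
	{
		/*
		 * The last arriver releases everyone else.  The wakeup is delivered
		 * by setting each sleeping waiter's latch; if that is lost, a waiter
		 * stays parked in the previous phase even though the phase counter
		 * has already moved on -- the situation the traces appear to show.
		 */
		ConditionVariableBroadcast(&b->cv);
		return true;
	}

	/* Everyone else sleeps until the phase has moved past start_phase. */
	ConditionVariablePrepareToSleep(&b->cv);
	while (b->phase == start_phase)
		ConditionVariableSleep(&b->cv, wait_event_info);
	ConditionVariableCancelSleep();

	return false;
}
```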