Thread: Slave server sometimes locks up
PgSQL 10.7, Ubuntu 16.04 LTS
Symptoms:
- server accepts new queries until connections exhausted (all queries are SELECT)
- queries are active, never end, but no disk IO
- queries can't be killed with kill -TERM or pg_terminate_backend()
- system load is minimal (vmstat shows 100% idle)
- perf top shows nothing
- statement_timeout is ignored
- no locks with SELECT relation::regclass, * FROM pg_locks WHERE NOT GRANTED;
- server exits only on kill -9
- strace on SELECT process indefinitely shows:
futex(0x7f00fe94c938, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, ffffffff^Cstrace: Process 121319 detached
<detached ...>
<detached ...>
GDB backtrace:
(gdb) bt
#0 0x00007f05256f1827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f00fe94ba38) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1 do_futex_wait (sem=sem@entry=0x7f00fe94ba38, abstime=0x0) at sem_waitcommon.c:111
#2 0x00007f05256f18d4 in __new_sem_wait_slow (sem=0x7f00fe94ba38, abstime=0x0) at sem_waitcommon.c:181
#3 0x00007f05256f197a in __new_sem_wait (sem=sem@entry=0x7f00fe94ba38) at sem_wait.c:29
#4 0x000055c9b95eb792 in PGSemaphoreLock (sema=0x7f00fe94ba38) at pg_sema.c:316
#5 0x000055c9b965eaec in LWLockAcquire (lock=0x7f00fe96f880, mode=mode@entry=LW_EXCLUSIVE) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/storage/lmgr/lwlock.c:1233
#6 0x000055c9b96497f7 in dsm_create (size=size@entry=105544, flags=flags@entry=1) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/storage/ipc/dsm.c:493
#7 0x000055c9b94139ff in InitializeParallelDSM (pcxt=pcxt@entry=0x55c9bb8d9d58) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/access/transam/parallel.c:268
#8 0x000055c9b9538187 in ExecInitParallelPlan (planstate=0x55c9bb949178, estate=0x55c9bb948568, nworkers=1) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/executor/execParallel.c:470
#9 0x000055c9b9547d20 in ExecGather (pstate=0x55c9bb948e58) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/executor/nodeGather.c:158
#10 0x000055c9b95543d5 in ExecProcNode (node=0x55c9bb948e58) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/include/executor/executor.h:250
#11 ExecNestLoop (pstate=0x55c9bb948c68) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/executor/nodeNestloop.c:109
#12 0x000055c9b9556b56 in ExecProcNode (node=0x55c9bb948c68) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/include/executor/executor.h:250
#13 ExecSort (pstate=0x55c9bb9489f8) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/executor/nodeSort.c:106
#14 0x000055c9b9556e57 in ExecProcNode (node=0x55c9bb9489f8) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/include/executor/executor.h:250
#15 ExecUnique (pstate=0x55c9bb948808) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/executor/nodeUnique.c:73
#16 0x000055c9b9533fcb in ExecProcNode (node=0x55c9bb948808) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/include/executor/executor.h:250
#17 ExecutePlan (execute_once=<optimized out>, dest=0x55c9bb8a9328, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x55c9bb948808, estate=0x55c9bb948568)
at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/executor/execMain.c:1723
#18 standard_ExecutorRun (queryDesc=0x55c9bb98ba78, direction=<optimized out>, count=0, execute_once=<optimized out>) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/executor/execMain.c:364
#19 0x000055c9b9670d6b in PortalRunSelect (portal=portal@entry=0x55c9bb8f1d68, forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807, dest=dest@entry=0x55c9bb8a9328)
at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/tcop/pquery.c:932
#20 0x000055c9b9672380 in PortalRun (portal=portal@entry=0x55c9bb8f1d68, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', run_once=<optimized out>, dest=dest@entry=0x55c9bb8a9328, altdest=altdest@entry=0x55c9bb8a9328,
completionTag=0x7ffdde0387c0 "") at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/tcop/pquery.c:773
#21 0x000055c9b966eef9 in exec_execute_message (max_rows=9223372036854775807, portal_name=0x55c9bb8a8f18 "") at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/tcop/postgres.c:2007
#22 PostgresMain (argc=<optimized out>, argv=argv@entry=0x55c9bb845328, dbname=0x55c9bb7ec9e8 "prod", username=<optimized out>) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/tcop/postgres.c:4180
#23 0x000055c9b93a46b9 in BackendRun (port=0x55c9bb840ce0) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/postmaster/postmaster.c:4405
#24 BackendStartup (port=0x55c9bb840ce0) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/postmaster/postmaster.c:4077
#25 ServerLoop () at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/postmaster/postmaster.c:1755
#26 0x000055c9b95fd6fb in PostmasterMain (argc=5, argv=<optimized out>) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/postmaster/postmaster.c:1363
#27 0x000055c9b93a5ab5 in main (argc=5, argv=0x55c9bb7ea850) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/main/main.c:228
#0 0x00007f05256f1827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f00fe94ba38) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1 do_futex_wait (sem=sem@entry=0x7f00fe94ba38, abstime=0x0) at sem_waitcommon.c:111
#2 0x00007f05256f18d4 in __new_sem_wait_slow (sem=0x7f00fe94ba38, abstime=0x0) at sem_waitcommon.c:181
#3 0x00007f05256f197a in __new_sem_wait (sem=sem@entry=0x7f00fe94ba38) at sem_wait.c:29
#4 0x000055c9b95eb792 in PGSemaphoreLock (sema=0x7f00fe94ba38) at pg_sema.c:316
#5 0x000055c9b965eaec in LWLockAcquire (lock=0x7f00fe96f880, mode=mode@entry=LW_EXCLUSIVE) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/storage/lmgr/lwlock.c:1233
#6 0x000055c9b96497f7 in dsm_create (size=size@entry=105544, flags=flags@entry=1) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/storage/ipc/dsm.c:493
#7 0x000055c9b94139ff in InitializeParallelDSM (pcxt=pcxt@entry=0x55c9bb8d9d58) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/access/transam/parallel.c:268
#8 0x000055c9b9538187 in ExecInitParallelPlan (planstate=0x55c9bb949178, estate=0x55c9bb948568, nworkers=1) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/executor/execParallel.c:470
#9 0x000055c9b9547d20 in ExecGather (pstate=0x55c9bb948e58) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/executor/nodeGather.c:158
#10 0x000055c9b95543d5 in ExecProcNode (node=0x55c9bb948e58) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/include/executor/executor.h:250
#11 ExecNestLoop (pstate=0x55c9bb948c68) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/executor/nodeNestloop.c:109
#12 0x000055c9b9556b56 in ExecProcNode (node=0x55c9bb948c68) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/include/executor/executor.h:250
#13 ExecSort (pstate=0x55c9bb9489f8) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/executor/nodeSort.c:106
#14 0x000055c9b9556e57 in ExecProcNode (node=0x55c9bb9489f8) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/include/executor/executor.h:250
#15 ExecUnique (pstate=0x55c9bb948808) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/executor/nodeUnique.c:73
#16 0x000055c9b9533fcb in ExecProcNode (node=0x55c9bb948808) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/include/executor/executor.h:250
#17 ExecutePlan (execute_once=<optimized out>, dest=0x55c9bb8a9328, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x55c9bb948808, estate=0x55c9bb948568)
at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/executor/execMain.c:1723
#18 standard_ExecutorRun (queryDesc=0x55c9bb98ba78, direction=<optimized out>, count=0, execute_once=<optimized out>) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/executor/execMain.c:364
#19 0x000055c9b9670d6b in PortalRunSelect (portal=portal@entry=0x55c9bb8f1d68, forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807, dest=dest@entry=0x55c9bb8a9328)
at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/tcop/pquery.c:932
#20 0x000055c9b9672380 in PortalRun (portal=portal@entry=0x55c9bb8f1d68, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', run_once=<optimized out>, dest=dest@entry=0x55c9bb8a9328, altdest=altdest@entry=0x55c9bb8a9328,
completionTag=0x7ffdde0387c0 "") at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/tcop/pquery.c:773
#21 0x000055c9b966eef9 in exec_execute_message (max_rows=9223372036854775807, portal_name=0x55c9bb8a8f18 "") at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/tcop/postgres.c:2007
#22 PostgresMain (argc=<optimized out>, argv=argv@entry=0x55c9bb845328, dbname=0x55c9bb7ec9e8 "prod", username=<optimized out>) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/tcop/postgres.c:4180
#23 0x000055c9b93a46b9 in BackendRun (port=0x55c9bb840ce0) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/postmaster/postmaster.c:4405
#24 BackendStartup (port=0x55c9bb840ce0) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/postmaster/postmaster.c:4077
#25 ServerLoop () at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/postmaster/postmaster.c:1755
#26 0x000055c9b95fd6fb in PostmasterMain (argc=5, argv=<optimized out>) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/postmaster/postmaster.c:1363
#27 0x000055c9b93a5ab5 in main (argc=5, argv=0x55c9bb7ea850) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/main/main.c:228
It looks like it might be a semaphore issue(?):
cat /proc/sys/kernel/sem
32000 1024000000 500 32000
32000 1024000000 500 32000
Thanks,
BorisOn Wed, Mar 6, 2019 at 1:39 AM Boris Sagadin <boris@infosplet.com> wrote: > PgSQL 10.7, Ubuntu 16.04 LTS > > Symptoms: > > - server accepts new queries until connections exhausted (all queries are SELECT) > - queries are active, never end, but no disk IO > - queries can't be killed with kill -TERM or pg_terminate_backend() > - system load is minimal (vmstat shows 100% idle) > - perf top shows nothing > - statement_timeout is ignored > - no locks with SELECT relation::regclass, * FROM pg_locks WHERE NOT GRANTED; > - server exits only on kill -9 > - strace on SELECT process indefinitely shows: > > futex(0x7f00fe94c938, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, ffffffff^Cstrace: Process 121319 detached > <detached ...> > > GDB backtrace: > > (gdb) bt > #0 0x00007f05256f1827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f00fe94ba38)at ../sysdeps/unix/sysv/linux/futex-internal.h:205 > #1 do_futex_wait (sem=sem@entry=0x7f00fe94ba38, abstime=0x0) at sem_waitcommon.c:111 > #2 0x00007f05256f18d4 in __new_sem_wait_slow (sem=0x7f00fe94ba38, abstime=0x0) at sem_waitcommon.c:181 > #3 0x00007f05256f197a in __new_sem_wait (sem=sem@entry=0x7f00fe94ba38) at sem_wait.c:29 > #4 0x000055c9b95eb792 in PGSemaphoreLock (sema=0x7f00fe94ba38) at pg_sema.c:316 > #5 0x000055c9b965eaec in LWLockAcquire (lock=0x7f00fe96f880, mode=mode@entry=LW_EXCLUSIVE) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/storage/lmgr/lwlock.c:1233 > #6 0x000055c9b96497f7 in dsm_create (size=size@entry=105544, flags=flags@entry=1) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/storage/ipc/dsm.c:493 > #7 0x000055c9b94139ff in InitializeParallelDSM (pcxt=pcxt@entry=0x55c9bb8d9d58) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/access/transam/parallel.c:268 Hello Boris, This looks like a known symptom of a pair of bugs we recently tracked down and fixed: 1. "dsa_area could not attach to segment": dsm.c, fixed in commit 6c0fb941. 2. "cannot unpin a segment that is not pinned": dsm.c, fixed in commit 0b55aaac. Do you see one of those messages earlier in your logs? The bug was caused by a failure to allow for a corner case where a new shared memory segment has the same ID as a recently/concurrently destroyed one, and is made more likely to occur by not-very-random random numbers. If one of these errors occurs while cleaning up from a parallel query, a backend can self-deadlock while trying to cleanup the same thing again in the error-handling path, and then other backends will later block on that lock if they try to run a parallel query. The fix will be included in the next set of releases, but in the meantime you could consider turning off parallel query (set max_parallel_workers_per_gather = 0). In practice I think you could also avoid this problem by loading a library that calls something like srandom(getpid()) in _PG_init() (so it runs in every parallel worker making ID collisions extremely unlikely), but that's not really a serious recommendation since it requires writing C code. -- Thomas Munro https://enterprisedb.com
Hi Thomas,
thank you for your quick reply, much appreciated.
1. "dsa_area could not attach to segment": dsm.c, fixed in commit 6c0fb941.
2. "cannot unpin a segment that is not pinned": dsm.c, fixed in commit 0b55aaac.
Yes, I found both entries in our logs, each once per incident.
The fix will be included in the next set of releases, but in the
meantime you could consider turning off parallel query (set
max_parallel_workers_per_gather = 0). In practice I think you could
also avoid this problem by loading a library that calls something like
srandom(getpid()) in _PG_init() (so it runs in every parallel worker
making ID collisions extremely unlikely), but that's not really a
serious recommendation since it requires writing C code.
This is very reassuring, as I was slowly starting to doubt my DBA competence here. :)
I'll try this workaround until the next release is out, will report if anything out of ordinary re-occurs.
Boris