Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash - Mailing list pgsql-bugs
From | Tomas Vondra
Subject | Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash
Date |
Msg-id | 20191216210249.ht2udaevrgyyzbmr@development
In response to | Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses | Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash
List | pgsql-bugs
On Mon, Dec 16, 2019 at 03:09:22PM +0100, Tomas Vondra wrote:
>On Sat, Dec 14, 2019 at 02:39:12PM +0100, Tomas Vondra wrote:
>>On Sat, Dec 14, 2019 at 05:01:05PM +1300, Thomas Munro wrote:
>>>On Wed, Dec 11, 2019 at 6:32 AM Tomas Vondra
>>><tomas.vondra@2ndquadrant.com> wrote:
>>>>The annoying thing is that it's the workers that crash, and the leader
>>>>failed to notice that, so it was waiting in WaitForParallelWorkersToExit
>>>>forever. Not sure what the issue is.
>>>
>>>That's bad. Some edge case in
>>>WaitForBackgroundWorkerShutdown()/GetBackgroundWorkerPid() that is
>>>b0rked perhaps? Can you reproduce it, and tell us what the stack
>>>looks like, maybe on a new thread?
>>
>>Sure. I'll try reproducing it and get this extra info. I was planning to
>>do that, but then I lost the data for the report because of a silly
>>mistake. Early next week, hopefully.
>>
>
>I've tried reproducing this, but so far no luck. I've only managed to do
>two runs, though, because this particular machine only has space for the
>1TB of temp files on spinning rust. I wonder if it's possible to
>reproduce the issue with a smaller data set? I've tried using smaller
>work_mem, but then it does not trigger the issue :-(
>

And of course, the very next run after I sent the "can't reproduce"
message actually reproduced the issue. How typical ...

So, there were ~8 workers while the query was running, with the leader
process being 29417. At the moment, the list of processes looks like this:

bench2 ~ # ps ax | grep postgres
11877 pts/2    S      0:00 su - postgres
14082 ?        Ss    12:38 postgres: parallel worker for PID 29417
14083 ?        Ss    12:28 postgres: parallel worker for PID 29417
17334 pts/0    S+     0:00 grep --colour=auto postgres
29408 ?        Ss     0:01 /var/lib/postgresql/pg-master/bin/postgres -D /mnt/data/data-master
29410 ?        Ss     0:00 postgres: checkpointer
29411 ?        Ss     0:01 postgres: background writer
29412 ?        Ss     0:01 postgres: walwriter
29413 ?        Ss     0:02 postgres: autovacuum launcher
29414 ?        Ss     0:14 postgres: stats collector
29415 ?        Ss     0:00 postgres: logical replication launcher
29417 ?        Ss    75:11 postgres: postgres test [local] EXPLAIN
and it fails like this:

2019-12-16 20:25:00.617 CET [14081] ERROR: invalid DSA memory alloc request size 1711276032
2019-12-16 20:25:00.617 CET [14081] STATEMENT: explain analyze select count(*) from sevenb t1 join sevenb t2 using (i);
2019-12-16 20:25:00.617 CET [29417] ERROR: invalid DSA memory alloc request size 1711276032
2019-12-16 20:25:00.617 CET [29417] CONTEXT: parallel worker
2019-12-16 20:25:00.617 CET [29417] STATEMENT: explain analyze select count(*) from sevenb t1 join sevenb t2 using (i);
2019-12-16 20:25:00.618 CET [14080] FATAL: terminating connection due to administrator command
2019-12-16 20:25:00.618 CET [14080] STATEMENT: explain analyze select count(*) from sevenb t1 join sevenb t2 using (i);
2019-12-16 20:25:00.618 CET [14085] FATAL: terminating connection due to administrator command
2019-12-16 20:25:00.618 CET [14085] STATEMENT: explain analyze select count(*) from sevenb t1 join sevenb t2 using (i);
2019-12-16 20:25:00.618 CET [14084] FATAL: terminating connection due to administrator command
2019-12-16 20:25:00.618 CET [14084] STATEMENT: explain analyze select count(*) from sevenb t1 join sevenb t2 using (i);
2019-12-16 20:25:00.618 CET [14079] FATAL: terminating connection due to administrator command
2019-12-16 20:25:00.618 CET [14079] STATEMENT: explain analyze select count(*) from sevenb t1 join sevenb t2 using (i);
2019-12-16 20:25:00.618 CET [14078] FATAL: terminating connection due to administrator command
2019-12-16 20:25:00.618 CET [14078] STATEMENT: explain analyze select count(*) from sevenb t1 join sevenb t2 using (i);
2019-12-16 20:25:00.746 CET [29408] LOG: background worker "parallel worker" (PID 14081) exited with exit code 1
2019-12-16 20:25:00.816 CET [29408] LOG: background worker "parallel worker" (PID 14079) exited with exit code 1
2019-12-16 20:25:00.816 CET [29408] LOG: background worker "parallel worker" (PID 14085) exited with exit code 1
2019-12-16 20:25:00.817 CET [29408] LOG: background worker "parallel worker" (PID 14084) exited with exit code 1
2019-12-16 20:25:00.817 CET [29408] LOG: background worker "parallel worker" (PID 14080) exited with exit code 1
2019-12-16 20:25:00.818 CET [29408] LOG: background worker "parallel worker" (PID 14078) exited with exit code 1

which is ~2h ago. There's been no activity at all since then (no I/O or
anything like that), and the temp directory still holds ~1TB of allocated
files:

bench2 /var/lib/postgresql # du -s /mnt/raid/pgsql_tmp/
1099760800      /mnt/raid/pgsql_tmp/

which means nothing got cleaned up.
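As for the error itself: 1711276032 bytes is ~1.6GB, which is well above
MaxAllocSize (0x3fffffff, i.e. just under 1GB), and AFAICS DSA applies the
same limit as regular palloc for non-huge allocations. Here is a minimal
standalone sketch of that size check - not the actual dsa.c code, just an
illustration of why this particular request size gets rejected:

#include <stdio.h>
#include <stddef.h>

/* same values/macros as in memutils.h */
#define MaxAllocSize ((size_t) 0x3fffffff)      /* 1GB - 1 */
#define AllocSizeIsValid(size) ((size_t) (size) <= MaxAllocSize)

int main(void)
{
    size_t request = 1711276032;    /* the size reported in the log above */

    if (!AllocSizeIsValid(request))
        printf("invalid DSA memory alloc request size %zu\n", request);
    else
        printf("request of %zu bytes would be fine\n", request);

    return 0;
}

So the check itself is unsurprising; the interesting part is how the
parallel hash join ends up asking for a single DSA allocation that large.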
The backtraces for the remaining backends look like this:

1) 29417 (i.e. the leader)

0x00007f64684d2bfb in epoll_wait (epfd=5, events=0x55d887416dc0, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
30              return SYSCALL_CANCEL (epoll_wait, epfd, events, maxevents, timeout);
(gdb) bt
#0  0x00007f64684d2bfb in epoll_wait (epfd=5, events=0x55d887416dc0, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x000055d8864d4e3c in WaitEventSetWaitBlock (set=0x55d887416d48, cur_timeout=-1, occurred_events=0x7ffc6e805c70, nevents=1) at latch.c:1080
#2  0x000055d8864d4d00 in WaitEventSetWait (set=0x55d887416d48, timeout=-1, occurred_events=0x7ffc6e805c70, nevents=1, wait_event_info=134217728) at latch.c:1032
#3  0x000055d8864d4565 in WaitLatchOrSocket (latch=0x7f64670ac194, wakeEvents=17, sock=-1, timeout=-1, wait_event_info=134217728) at latch.c:407
#4  0x000055d8864d4421 in WaitLatch (latch=0x7f64670ac194, wakeEvents=17, timeout=0, wait_event_info=134217728) at latch.c:347
#5  0x000055d88644c18b in WaitForBackgroundWorkerShutdown (handle=0x55d88742abc8) at bgworker.c:1153
#6  0x000055d886147690 in WaitForParallelWorkersToExit (pcxt=0x55d88742a338) at parallel.c:842
#7  0x000055d8861478af in DestroyParallelContext (pcxt=0x55d88742a338) at parallel.c:924
#8  0x000055d886147feb in AtEOXact_Parallel (isCommit=false) at parallel.c:1197
#9  0x000055d886153774 in AbortTransaction () at xact.c:2654
#10 0x000055d886153f80 in AbortCurrentTransaction () at xact.c:3196
#11 0x000055d8865044e7 in PostgresMain (argc=1, argv=0x55d8873d41e8, dbname=0x55d8873d4118 "test", username=0x55d8873d40f8 "postgres") at postgres.c:4081
#12 0x000055d88645f0d6 in BackendRun (port=0x55d8873ca150) at postmaster.c:4498
#13 0x000055d88645e7fc in BackendStartup (port=0x55d8873ca150) at postmaster.c:4189
#14 0x000055d88645abd0 in ServerLoop () at postmaster.c:1727
#15 0x000055d88645a366 in PostmasterMain (argc=3, argv=0x55d8873a3f70) at postmaster.c:1400
#16 0x000055d88636fab5 in main (argc=3, argv=0x55d8873a3f70) at main.c:210

2) 14082

0x00007f64684d2bfb in epoll_wait (epfd=6, events=0x55d8873d53d8, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
30              return SYSCALL_CANCEL (epoll_wait, epfd, events, maxevents, timeout);
(gdb) bt
#0  0x00007f64684d2bfb in epoll_wait (epfd=6, events=0x55d8873d53d8, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x000055d8864d4e3c in WaitEventSetWaitBlock (set=0x55d8873d5378, cur_timeout=-1, occurred_events=0x7ffc6e805130, nevents=1) at latch.c:1080
#2  0x000055d8864d4d00 in WaitEventSetWait (set=0x55d8873d5378, timeout=-1, occurred_events=0x7ffc6e805130, nevents=1, wait_event_info=134217742) at latch.c:1032
#3  0x000055d8864e1848 in ConditionVariableTimedSleep (cv=0x7f63602a060c, timeout=-1, wait_event_info=134217742) at condition_variable.c:186
#4  0x000055d8864e17b2 in ConditionVariableSleep (cv=0x7f63602a060c, wait_event_info=134217742) at condition_variable.c:126
#5  0x000055d8864d05f2 in BarrierArriveAndWait (barrier=0x7f63602a05f4, wait_event_info=134217742) at barrier.c:191
#6  0x000055d886328d4e in ExecParallelHashIncreaseNumBatches (hashtable=0x55d887481e70) at nodeHash.c:1189
#7  0x000055d88632ca1f in ExecParallelHashTuplePrealloc (hashtable=0x55d887481e70, batchno=237, size=40) at nodeHash.c:3308
#8  0x000055d886329dfe in ExecParallelHashTableInsert (hashtable=0x55d887481e70, slot=0x55d887470e10, hashvalue=994179508) at nodeHash.c:1725
#9  0x000055d8863275dc in MultiExecParallelHash (node=0x55d887470860) at nodeHash.c:288
#10 0x000055d8863271e3 in MultiExecHash (node=0x55d887470860) at nodeHash.c:111
#11 0x000055d88630b79c in MultiExecProcNode (node=0x55d887470860) at execProcnode.c:500
#12 0x000055d88632cfac in ExecHashJoinImpl (pstate=0x55d8874700a0, parallel=true) at nodeHashjoin.c:291
#13 0x000055d88632d617 in ExecParallelHashJoin (pstate=0x55d8874700a0) at nodeHashjoin.c:588
#14 0x000055d88630b6e9 in ExecProcNodeInstr (node=0x55d8874700a0) at execProcnode.c:460
#15 0x000055d88630b6ba in ExecProcNodeFirst (node=0x55d8874700a0) at execProcnode.c:444
#16 0x000055d886318a14 in ExecProcNode (node=0x55d8874700a0) at ../../../src/include/executor/executor.h:240
#17 0x000055d886318e8b in fetch_input_tuple (aggstate=0x55d88746f628) at nodeAgg.c:405
#18 0x000055d88631b2fa in agg_retrieve_direct (aggstate=0x55d88746f628) at nodeAgg.c:1757
#19 0x000055d88631af4a in ExecAgg (pstate=0x55d88746f628) at nodeAgg.c:1572
#20 0x000055d88630b6e9 in ExecProcNodeInstr (node=0x55d88746f628) at execProcnode.c:460
#21 0x000055d88630b6ba in ExecProcNodeFirst (node=0x55d88746f628) at execProcnode.c:444
#22 0x000055d886300882 in ExecProcNode (node=0x55d88746f628) at ../../../src/include/executor/executor.h:240
#23 0x000055d886303055 in ExecutePlan (estate=0x55d88746f398, planstate=0x55d88746f628, use_parallel_mode=false, operation=CMD_SELECT, sendTuples=true, numberTuples=0, direction=ForwardScanDirection, dest=0x55d88741d408, execute_once=true) at execMain.c:1646
#24 0x000055d886300d9a in standard_ExecutorRun (queryDesc=0x55d88741e9e0, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:364
#25 0x000055d886300c3c in ExecutorRun (queryDesc=0x55d88741e9e0, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:308
#26 0x000055d886307c2b in ParallelQueryMain (seg=0x55d8873d23d8, toc=0x7f63602a0000) at execParallel.c:1401
#27 0x000055d886148632 in ParallelWorkerMain (main_arg=2143010421) at parallel.c:1430
#28 0x000055d88644ba56 in StartBackgroundWorker () at bgworker.c:834
#29 0x000055d88645ff82 in do_start_bgworker (rw=0x55d8873ca340) at postmaster.c:5839
#30 0x000055d88646038f in maybe_start_bgworkers () at postmaster.c:6065
#31 0x000055d88645f265 in sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5234
#32 <signal handler called>
#33 0x00007f64684c956b in __GI___select (nfds=7, readfds=0x7ffc6e8060e0, writefds=0x0, exceptfds=0x0, timeout=0x7ffc6e806050) at ../sysdeps/unix/sysv/linux/select.c:41
#34 0x000055d88645aa3c in ServerLoop () at postmaster.c:1691
#35 0x000055d88645a366 in PostmasterMain (argc=3, argv=0x55d8873a3f70) at postmaster.c:1400
#36 0x000055d88636fab5 in main (argc=3, argv=0x55d8873a3f70) at main.c:210

3) 14083

0x00007f64684d2bfb in epoll_wait (epfd=6, events=0x55d8873d53d8, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
30              return SYSCALL_CANCEL (epoll_wait, epfd, events, maxevents, timeout);
(gdb) bt
#0  0x00007f64684d2bfb in epoll_wait (epfd=6, events=0x55d8873d53d8, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x000055d8864d4e3c in WaitEventSetWaitBlock (set=0x55d8873d5378, cur_timeout=-1, occurred_events=0x7ffc6e805120, nevents=1) at latch.c:1080
#2  0x000055d8864d4d00 in WaitEventSetWait (set=0x55d8873d5378, timeout=-1, occurred_events=0x7ffc6e805120, nevents=1, wait_event_info=134217742) at latch.c:1032
#3  0x000055d8864e1848 in ConditionVariableTimedSleep (cv=0x7f63602a060c, timeout=-1, wait_event_info=134217742) at condition_variable.c:186
#4  0x000055d8864e17b2 in ConditionVariableSleep (cv=0x7f63602a060c, wait_event_info=134217742) at condition_variable.c:126
#5  0x000055d8864d05f2 in BarrierArriveAndWait (barrier=0x7f63602a05f4, wait_event_info=134217742) at barrier.c:191
#6  0x000055d886328d4e in ExecParallelHashIncreaseNumBatches (hashtable=0x55d887481e70) at nodeHash.c:1189
#7  0x000055d88632b93c in ExecParallelHashTupleAlloc (hashtable=0x55d887481e70, size=40, shared=0x7ffc6e8052b8) at nodeHash.c:2805
#8  0x000055d886329d46 in ExecParallelHashTableInsert (hashtable=0x55d887481e70, slot=0x55d887470e10, hashvalue=2006491) at nodeHash.c:1702
#9  0x000055d8863275dc in MultiExecParallelHash (node=0x55d887470860) at nodeHash.c:288
#10 0x000055d8863271e3 in MultiExecHash (node=0x55d887470860) at nodeHash.c:111
#11 0x000055d88630b79c in MultiExecProcNode (node=0x55d887470860) at execProcnode.c:500
#12 0x000055d88632cfac in ExecHashJoinImpl (pstate=0x55d8874700a0, parallel=true) at nodeHashjoin.c:291
#13 0x000055d88632d617 in ExecParallelHashJoin (pstate=0x55d8874700a0) at nodeHashjoin.c:588
#14 0x000055d88630b6e9 in ExecProcNodeInstr (node=0x55d8874700a0) at execProcnode.c:460
#15 0x000055d88630b6ba in ExecProcNodeFirst (node=0x55d8874700a0) at execProcnode.c:444
#16 0x000055d886318a14 in ExecProcNode (node=0x55d8874700a0) at ../../../src/include/executor/executor.h:240
#17 0x000055d886318e8b in fetch_input_tuple (aggstate=0x55d88746f628) at nodeAgg.c:405
#18 0x000055d88631b2fa in agg_retrieve_direct (aggstate=0x55d88746f628) at nodeAgg.c:1757
#19 0x000055d88631af4a in ExecAgg (pstate=0x55d88746f628) at nodeAgg.c:1572
#20 0x000055d88630b6e9 in ExecProcNodeInstr (node=0x55d88746f628) at execProcnode.c:460
#21 0x000055d88630b6ba in ExecProcNodeFirst (node=0x55d88746f628) at execProcnode.c:444
#22 0x000055d886300882 in ExecProcNode (node=0x55d88746f628) at ../../../src/include/executor/executor.h:240
#23 0x000055d886303055 in ExecutePlan (estate=0x55d88746f398, planstate=0x55d88746f628, use_parallel_mode=false, operation=CMD_SELECT, sendTuples=true, numberTuples=0, direction=ForwardScanDirection, dest=0x55d88741d408, execute_once=true) at execMain.c:1646
#24 0x000055d886300d9a in standard_ExecutorRun (queryDesc=0x55d88741e9e0, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:364
#25 0x000055d886300c3c in ExecutorRun (queryDesc=0x55d88741e9e0, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:308
#26 0x000055d886307c2b in ParallelQueryMain (seg=0x55d8873d23d8, toc=0x7f63602a0000) at execParallel.c:1401
#27 0x000055d886148632 in ParallelWorkerMain (main_arg=2143010421) at parallel.c:1430
#28 0x000055d88644ba56 in StartBackgroundWorker () at bgworker.c:834
#29 0x000055d88645ff82 in do_start_bgworker (rw=0x55d8873ca590) at postmaster.c:5839
#30 0x000055d88646038f in maybe_start_bgworkers () at postmaster.c:6065
#31 0x000055d88645f265 in sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5234
#32 <signal handler called>
#33 0x00007f64684c956b in __GI___select (nfds=7, readfds=0x7ffc6e8060e0, writefds=0x0, exceptfds=0x0, timeout=0x7ffc6e806050) at ../sysdeps/unix/sysv/linux/select.c:41
#34 0x000055d88645aa3c in ServerLoop () at postmaster.c:1691
#35 0x000055d88645a366 in PostmasterMain (argc=3, argv=0x55d8873a3f70) at postmaster.c:1400
#36 0x000055d88636fab5 in main (argc=3, argv=0x55d8873a3f70) at main.c:210

So the only difference between the two remaining worker processes is
frame #7, where 14082 calls ExecParallelHashTuplePrealloc while 14083
calls ExecParallelHashTupleAlloc.

I'll leave the system in this "stuck" state in case we need to do
further investigation.
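For what it's worth, here is a plain POSIX-threads sketch of the kind of
hang those backtraces suggest. This is explicitly not PostgreSQL code, and
whether it matches the actual mechanism here is an assumption that still
needs to be verified: a barrier wait blocks forever when an expected
participant goes away without ever arriving, which is roughly what the two
surviving workers sitting in BarrierArriveAndWait would look like if the
workers that hit the ERROR never detached from that barrier.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define PARTICIPANTS 3              /* the barrier expects three arrivals */

static pthread_barrier_t barrier;

static void *worker(void *arg)
{
    long id = (long) arg;

    if (id == 2)
    {
        /* simulate a participant bailing out (e.g. on ERROR) before arriving */
        fprintf(stderr, "worker %ld: exiting before the barrier\n", id);
        return NULL;
    }

    fprintf(stderr, "worker %ld: waiting at the barrier\n", id);
    pthread_barrier_wait(&barrier); /* never returns: only 2 of 3 arrive */
    fprintf(stderr, "worker %ld: past the barrier\n", id);
    return NULL;
}

int main(void)
{
    pthread_t threads[PARTICIPANTS];

    pthread_barrier_init(&barrier, NULL, PARTICIPANTS);
    for (long i = 0; i < PARTICIPANTS; i++)
        pthread_create(&threads[i], NULL, worker, (void *) i);

    sleep(2);       /* workers 0 and 1 are still stuck at this point */
    return 0;
}

Unlike a POSIX barrier, the barriers in barrier.c are dynamic (participants
can attach and detach), so in principle the failed workers detaching should
have released the remaining ones - whether that detach actually happens on
this error path is probably the thing to check.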
regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services