Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash - Mailing list pgsql-bugs

From Tomas Vondra
Subject Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash
Date
Msg-id 20191216210249.ht2udaevrgyyzbmr@development
Whole thread Raw
In response to Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash
List pgsql-bugs
On Mon, Dec 16, 2019 at 03:09:22PM +0100, Tomas Vondra wrote:
>On Sat, Dec 14, 2019 at 02:39:12PM +0100, Tomas Vondra wrote:
>>On Sat, Dec 14, 2019 at 05:01:05PM +1300, Thomas Munro wrote:
>>>On Wed, Dec 11, 2019 at 6:32 AM Tomas Vondra
>>><tomas.vondra@2ndquadrant.com> wrote:
>>>>The annoying thing is that it's the workers that crash, and the leader
>>>>failed to notice that, so it was waiting in WaitForParallelWorkersToExit
>>>>forever. Not sure what the issue is.
>>>
>>>That's bad.  Some edge case in
>>>WaitForBackgroundWorkerShutdown()/GetBackgroundWorkerPid() that is
>>>b0rked perhaps?  Can you reproduce it, and tell us what the stack
>>>looks like, maybe on a new thread?
>>
>>Sure. I'll try reproducing it and get this extra info. I was planning to
>>do that, but then I lost the data for the report because of a silly
>>mistake. Early next week, hopefully.
>>
>
>I've tried reproducing this, but so far no luck. I've only managed to do
>two runs, though, because this particular machine only has space for the
>1TB of temp files on spinning rust. I wonder if it's possible to
>reproduce the issue with a smaller data set? I've tried using smaller
>work_mem, but then it does not trigger the issue :-(
>

And of course, the very next run after I send the "can't reproduce"
message actually reproduces the issue. How typical ...

So, there were ~8 workers while the query was running, with the leader
process being 29417. At the moment, the list of processes looks like
this:

bench2 ~ # ps ax | grep postgres
11877 pts/2    S      0:00 su - postgres
14082 ?        Ss    12:38 postgres: parallel worker for PID 29417
14083 ?        Ss    12:28 postgres: parallel worker for PID 29417
17334 pts/0    S+     0:00 grep --colour=auto postgres
29408 ?        Ss     0:01 /var/lib/postgresql/pg-master/bin/postgres -D /mnt/data/data-master
29410 ?        Ss     0:00 postgres: checkpointer
29411 ?        Ss     0:01 postgres: background writer
29412 ?        Ss     0:01 postgres: walwriter
29413 ?        Ss     0:02 postgres: autovacuum launcher
29414 ?        Ss     0:14 postgres: stats collector
29415 ?        Ss     0:00 postgres: logical replication launcher
29417 ?        Ss    75:11 postgres: postgres test [local] EXPLAIN

and it failes like this:

2019-12-16 20:25:00.617 CET [14081] ERROR:  invalid DSA memory alloc request size 1711276032
2019-12-16 20:25:00.617 CET [14081] STATEMENT:  explain analyze select count(*) from sevenb t1 join sevenb t2 using
(i);
2019-12-16 20:25:00.617 CET [29417] ERROR:  invalid DSA memory alloc request size 1711276032
2019-12-16 20:25:00.617 CET [29417] CONTEXT:  parallel worker
2019-12-16 20:25:00.617 CET [29417] STATEMENT:  explain analyze select count(*) from sevenb t1 join sevenb t2 using
(i);
2019-12-16 20:25:00.618 CET [14080] FATAL:  terminating connection due to administrator command
2019-12-16 20:25:00.618 CET [14080] STATEMENT:  explain analyze select count(*) from sevenb t1 join sevenb t2 using
(i);
2019-12-16 20:25:00.618 CET [14085] FATAL:  terminating connection due to administrator command
2019-12-16 20:25:00.618 CET [14085] STATEMENT:  explain analyze select count(*) from sevenb t1 join sevenb t2 using
(i);
2019-12-16 20:25:00.618 CET [14084] FATAL:  terminating connection due to administrator command
2019-12-16 20:25:00.618 CET [14084] STATEMENT:  explain analyze select count(*) from sevenb t1 join sevenb t2 using
(i);
2019-12-16 20:25:00.618 CET [14079] FATAL:  terminating connection due to administrator command
2019-12-16 20:25:00.618 CET [14079] STATEMENT:  explain analyze select count(*) from sevenb t1 join sevenb t2 using
(i);
2019-12-16 20:25:00.618 CET [14078] FATAL:  terminating connection due to administrator command
2019-12-16 20:25:00.618 CET [14078] STATEMENT:  explain analyze select count(*) from sevenb t1 join sevenb t2 using
(i);
2019-12-16 20:25:00.746 CET [29408] LOG:  background worker "parallel worker" (PID 14081) exited with exit code 1
2019-12-16 20:25:00.816 CET [29408] LOG:  background worker "parallel worker" (PID 14079) exited with exit code 1
2019-12-16 20:25:00.816 CET [29408] LOG:  background worker "parallel worker" (PID 14085) exited with exit code 1
2019-12-16 20:25:00.817 CET [29408] LOG:  background worker "parallel worker" (PID 14084) exited with exit code 1
2019-12-16 20:25:00.817 CET [29408] LOG:  background worker "parallel worker" (PID 14080) exited with exit code 1
2019-12-16 20:25:00.818 CET [29408] LOG:  background worker "parallel worker" (PID 14078) exited with exit code 1

which is ~2h ago. There's no activity at all (no I/O or anything like
that),  the temp directory is ~1TB of allocated files:

   bench2 /var/lib/postgresql # du -s /mnt/raid/pgsql_tmp/
   1099760800    /mnt/raid/pgsql_tmp/

which means nothing got cleaned up. The backtraces for the remaining
backends look like this:

1) 29417 (i.e. the leader)

0x00007f64684d2bfb in epoll_wait (epfd=5, events=0x55d887416dc0, maxevents=1, timeout=-1) at
../sysdeps/unix/sysv/linux/epoll_wait.c:30
30      return SYSCALL_CANCEL (epoll_wait, epfd, events, maxevents, timeout);
(gdb) bt
#0  0x00007f64684d2bfb in epoll_wait (epfd=5, events=0x55d887416dc0, maxevents=1, timeout=-1) at
../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x000055d8864d4e3c in WaitEventSetWaitBlock (set=0x55d887416d48, cur_timeout=-1, occurred_events=0x7ffc6e805c70,
nevents=1)at latch.c:1080
 
#2  0x000055d8864d4d00 in WaitEventSetWait (set=0x55d887416d48, timeout=-1, occurred_events=0x7ffc6e805c70, nevents=1,
wait_event_info=134217728)at latch.c:1032
 
#3  0x000055d8864d4565 in WaitLatchOrSocket (latch=0x7f64670ac194, wakeEvents=17, sock=-1, timeout=-1,
wait_event_info=134217728)at latch.c:407
 
#4  0x000055d8864d4421 in WaitLatch (latch=0x7f64670ac194, wakeEvents=17, timeout=0, wait_event_info=134217728) at
latch.c:347
#5  0x000055d88644c18b in WaitForBackgroundWorkerShutdown (handle=0x55d88742abc8) at bgworker.c:1153
#6  0x000055d886147690 in WaitForParallelWorkersToExit (pcxt=0x55d88742a338) at parallel.c:842
#7  0x000055d8861478af in DestroyParallelContext (pcxt=0x55d88742a338) at parallel.c:924
#8  0x000055d886147feb in AtEOXact_Parallel (isCommit=false) at parallel.c:1197
#9  0x000055d886153774 in AbortTransaction () at xact.c:2654
#10 0x000055d886153f80 in AbortCurrentTransaction () at xact.c:3196
#11 0x000055d8865044e7 in PostgresMain (argc=1, argv=0x55d8873d41e8, dbname=0x55d8873d4118 "test",
username=0x55d8873d40f8"postgres") at postgres.c:4081
 
#12 0x000055d88645f0d6 in BackendRun (port=0x55d8873ca150) at postmaster.c:4498
#13 0x000055d88645e7fc in BackendStartup (port=0x55d8873ca150) at postmaster.c:4189
#14 0x000055d88645abd0 in ServerLoop () at postmaster.c:1727
#15 0x000055d88645a366 in PostmasterMain (argc=3, argv=0x55d8873a3f70) at postmaster.c:1400
#16 0x000055d88636fab5 in main (argc=3, argv=0x55d8873a3f70) at main.c:210

2) 14082

0x00007f64684d2bfb in epoll_wait (epfd=6, events=0x55d8873d53d8, maxevents=1, timeout=-1) at
../sysdeps/unix/sysv/linux/epoll_wait.c:30
30      return SYSCALL_CANCEL (epoll_wait, epfd, events, maxevents, timeout);
(gdb) bt
#0  0x00007f64684d2bfb in epoll_wait (epfd=6, events=0x55d8873d53d8, maxevents=1, timeout=-1) at
../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x000055d8864d4e3c in WaitEventSetWaitBlock (set=0x55d8873d5378, cur_timeout=-1, occurred_events=0x7ffc6e805130,
nevents=1)at latch.c:1080
 
#2  0x000055d8864d4d00 in WaitEventSetWait (set=0x55d8873d5378, timeout=-1, occurred_events=0x7ffc6e805130, nevents=1,
wait_event_info=134217742)at latch.c:1032
 
#3  0x000055d8864e1848 in ConditionVariableTimedSleep (cv=0x7f63602a060c, timeout=-1, wait_event_info=134217742) at
condition_variable.c:186
#4  0x000055d8864e17b2 in ConditionVariableSleep (cv=0x7f63602a060c, wait_event_info=134217742) at
condition_variable.c:126
#5  0x000055d8864d05f2 in BarrierArriveAndWait (barrier=0x7f63602a05f4, wait_event_info=134217742) at barrier.c:191
#6  0x000055d886328d4e in ExecParallelHashIncreaseNumBatches (hashtable=0x55d887481e70) at nodeHash.c:1189
#7  0x000055d88632ca1f in ExecParallelHashTuplePrealloc (hashtable=0x55d887481e70, batchno=237, size=40) at
nodeHash.c:3308
#8  0x000055d886329dfe in ExecParallelHashTableInsert (hashtable=0x55d887481e70, slot=0x55d887470e10,
hashvalue=994179508)at nodeHash.c:1725
 
#9  0x000055d8863275dc in MultiExecParallelHash (node=0x55d887470860) at nodeHash.c:288
#10 0x000055d8863271e3 in MultiExecHash (node=0x55d887470860) at nodeHash.c:111
#11 0x000055d88630b79c in MultiExecProcNode (node=0x55d887470860) at execProcnode.c:500
#12 0x000055d88632cfac in ExecHashJoinImpl (pstate=0x55d8874700a0, parallel=true) at nodeHashjoin.c:291
#13 0x000055d88632d617 in ExecParallelHashJoin (pstate=0x55d8874700a0) at nodeHashjoin.c:588
#14 0x000055d88630b6e9 in ExecProcNodeInstr (node=0x55d8874700a0) at execProcnode.c:460
#15 0x000055d88630b6ba in ExecProcNodeFirst (node=0x55d8874700a0) at execProcnode.c:444
#16 0x000055d886318a14 in ExecProcNode (node=0x55d8874700a0) at ../../../src/include/executor/executor.h:240
#17 0x000055d886318e8b in fetch_input_tuple (aggstate=0x55d88746f628) at nodeAgg.c:405
#18 0x000055d88631b2fa in agg_retrieve_direct (aggstate=0x55d88746f628) at nodeAgg.c:1757
#19 0x000055d88631af4a in ExecAgg (pstate=0x55d88746f628) at nodeAgg.c:1572
#20 0x000055d88630b6e9 in ExecProcNodeInstr (node=0x55d88746f628) at execProcnode.c:460
#21 0x000055d88630b6ba in ExecProcNodeFirst (node=0x55d88746f628) at execProcnode.c:444
#22 0x000055d886300882 in ExecProcNode (node=0x55d88746f628) at ../../../src/include/executor/executor.h:240
#23 0x000055d886303055 in ExecutePlan (estate=0x55d88746f398, planstate=0x55d88746f628, use_parallel_mode=false,
operation=CMD_SELECT,sendTuples=true, numberTuples=0, direction=ForwardScanDirection, dest=0x55d88741d408,
execute_once=true)at execMain.c:1646
 
#24 0x000055d886300d9a in standard_ExecutorRun (queryDesc=0x55d88741e9e0, direction=ForwardScanDirection, count=0,
execute_once=true)at execMain.c:364
 
#25 0x000055d886300c3c in ExecutorRun (queryDesc=0x55d88741e9e0, direction=ForwardScanDirection, count=0,
execute_once=true)at execMain.c:308
 
#26 0x000055d886307c2b in ParallelQueryMain (seg=0x55d8873d23d8, toc=0x7f63602a0000) at execParallel.c:1401
#27 0x000055d886148632 in ParallelWorkerMain (main_arg=2143010421) at parallel.c:1430
#28 0x000055d88644ba56 in StartBackgroundWorker () at bgworker.c:834
#29 0x000055d88645ff82 in do_start_bgworker (rw=0x55d8873ca340) at postmaster.c:5839
#30 0x000055d88646038f in maybe_start_bgworkers () at postmaster.c:6065
#31 0x000055d88645f265 in sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5234
#32 <signal handler called>
#33 0x00007f64684c956b in __GI___select (nfds=7, readfds=0x7ffc6e8060e0, writefds=0x0, exceptfds=0x0,
timeout=0x7ffc6e806050)at ../sysdeps/unix/sysv/linux/select.c:41
 
#34 0x000055d88645aa3c in ServerLoop () at postmaster.c:1691
#35 0x000055d88645a366 in PostmasterMain (argc=3, argv=0x55d8873a3f70) at postmaster.c:1400
#36 0x000055d88636fab5 in main (argc=3, argv=0x55d8873a3f70) at main.c:210

3) 14083

0x00007f64684d2bfb in epoll_wait (epfd=6, events=0x55d8873d53d8, maxevents=1, timeout=-1) at
../sysdeps/unix/sysv/linux/epoll_wait.c:30
30      return SYSCALL_CANCEL (epoll_wait, epfd, events, maxevents, timeout);
(gdb) bt
#0  0x00007f64684d2bfb in epoll_wait (epfd=6, events=0x55d8873d53d8, maxevents=1, timeout=-1) at
../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x000055d8864d4e3c in WaitEventSetWaitBlock (set=0x55d8873d5378, cur_timeout=-1, occurred_events=0x7ffc6e805120,
nevents=1)at latch.c:1080
 
#2  0x000055d8864d4d00 in WaitEventSetWait (set=0x55d8873d5378, timeout=-1, occurred_events=0x7ffc6e805120, nevents=1,
wait_event_info=134217742)at latch.c:1032
 
#3  0x000055d8864e1848 in ConditionVariableTimedSleep (cv=0x7f63602a060c, timeout=-1, wait_event_info=134217742) at
condition_variable.c:186
#4  0x000055d8864e17b2 in ConditionVariableSleep (cv=0x7f63602a060c, wait_event_info=134217742) at
condition_variable.c:126
#5  0x000055d8864d05f2 in BarrierArriveAndWait (barrier=0x7f63602a05f4, wait_event_info=134217742) at barrier.c:191
#6  0x000055d886328d4e in ExecParallelHashIncreaseNumBatches (hashtable=0x55d887481e70) at nodeHash.c:1189
#7  0x000055d88632b93c in ExecParallelHashTupleAlloc (hashtable=0x55d887481e70, size=40, shared=0x7ffc6e8052b8) at
nodeHash.c:2805
#8  0x000055d886329d46 in ExecParallelHashTableInsert (hashtable=0x55d887481e70, slot=0x55d887470e10,
hashvalue=2006491)at nodeHash.c:1702
 
#9  0x000055d8863275dc in MultiExecParallelHash (node=0x55d887470860) at nodeHash.c:288
#10 0x000055d8863271e3 in MultiExecHash (node=0x55d887470860) at nodeHash.c:111
#11 0x000055d88630b79c in MultiExecProcNode (node=0x55d887470860) at execProcnode.c:500
#12 0x000055d88632cfac in ExecHashJoinImpl (pstate=0x55d8874700a0, parallel=true) at nodeHashjoin.c:291
#13 0x000055d88632d617 in ExecParallelHashJoin (pstate=0x55d8874700a0) at nodeHashjoin.c:588
#14 0x000055d88630b6e9 in ExecProcNodeInstr (node=0x55d8874700a0) at execProcnode.c:460
#15 0x000055d88630b6ba in ExecProcNodeFirst (node=0x55d8874700a0) at execProcnode.c:444
#16 0x000055d886318a14 in ExecProcNode (node=0x55d8874700a0) at ../../../src/include/executor/executor.h:240
#17 0x000055d886318e8b in fetch_input_tuple (aggstate=0x55d88746f628) at nodeAgg.c:405
#18 0x000055d88631b2fa in agg_retrieve_direct (aggstate=0x55d88746f628) at nodeAgg.c:1757
#19 0x000055d88631af4a in ExecAgg (pstate=0x55d88746f628) at nodeAgg.c:1572
#20 0x000055d88630b6e9 in ExecProcNodeInstr (node=0x55d88746f628) at execProcnode.c:460
#21 0x000055d88630b6ba in ExecProcNodeFirst (node=0x55d88746f628) at execProcnode.c:444
#22 0x000055d886300882 in ExecProcNode (node=0x55d88746f628) at ../../../src/include/executor/executor.h:240
#23 0x000055d886303055 in ExecutePlan (estate=0x55d88746f398, planstate=0x55d88746f628, use_parallel_mode=false,
operation=CMD_SELECT,sendTuples=true, numberTuples=0, direction=ForwardScanDirection, dest=0x55d88741d408,
execute_once=true)at execMain.c:1646
 
#24 0x000055d886300d9a in standard_ExecutorRun (queryDesc=0x55d88741e9e0, direction=ForwardScanDirection, count=0,
execute_once=true)at execMain.c:364
 
#25 0x000055d886300c3c in ExecutorRun (queryDesc=0x55d88741e9e0, direction=ForwardScanDirection, count=0,
execute_once=true)at execMain.c:308
 
#26 0x000055d886307c2b in ParallelQueryMain (seg=0x55d8873d23d8, toc=0x7f63602a0000) at execParallel.c:1401
#27 0x000055d886148632 in ParallelWorkerMain (main_arg=2143010421) at parallel.c:1430
#28 0x000055d88644ba56 in StartBackgroundWorker () at bgworker.c:834
#29 0x000055d88645ff82 in do_start_bgworker (rw=0x55d8873ca590) at postmaster.c:5839
#30 0x000055d88646038f in maybe_start_bgworkers () at postmaster.c:6065
#31 0x000055d88645f265 in sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5234
#32 <signal handler called>
#33 0x00007f64684c956b in __GI___select (nfds=7, readfds=0x7ffc6e8060e0, writefds=0x0, exceptfds=0x0,
timeout=0x7ffc6e806050)at ../sysdeps/unix/sysv/linux/select.c:41
 
#34 0x000055d88645aa3c in ServerLoop () at postmaster.c:1691
#35 0x000055d88645a366 in PostmasterMain (argc=3, argv=0x55d8873a3f70) at postmaster.c:1400
#36 0x000055d88636fab5 in main (argc=3, argv=0x55d8873a3f70) at main.c:210

So the only difference between the two remaining worker processes is
frame #7, where 14082 calls ExecParallelHashTuplePrealloc while 14083
calls ExecParallelHashTupleAlloc.


I'll leave the system in this "stuck" space in case we need to do
further investigation.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



pgsql-bugs by date:

Previous
From: Tom Lane
Date:
Subject: Re: BUG #16161: pg_ctl stop fails sometimes (on Windows)
Next
From: Alexander Lakhin
Date:
Subject: Re: BUG #16161: pg_ctl stop fails sometimes (on Windows)