Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash - Mailing list pgsql-bugs
From: Tomas Vondra
Subject: Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash
Msg-id: 20191217232124.3dtrycatgfm6oxxb@development
In response to: Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash (Thomas Munro <thomas.munro@gmail.com>)
Responses: Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash
List: pgsql-bugs
On Wed, Dec 18, 2019 at 11:21:11AM +1300, Thomas Munro wrote:
>On Tue, Dec 17, 2019 at 10:02 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>> On Mon, Dec 16, 2019 at 03:09:22PM +0100, Tomas Vondra wrote:
>> >On Sat, Dec 14, 2019 at 02:39:12PM +0100, Tomas Vondra wrote:
>> >>On Sat, Dec 14, 2019 at 05:01:05PM +1300, Thomas Munro wrote:
>> >>>On Wed, Dec 11, 2019 at 6:32 AM Tomas Vondra
>> >>><tomas.vondra@2ndquadrant.com> wrote:
>> >>>>The annoying thing is that it's the workers that crash, and the leader
>> >>>>failed to notice that, so it was waiting in WaitForParallelWorkersToExit
>> >>>>forever. Not sure what the issue is.
>> >>>
>> >>>That's bad. Some edge case in
>> >>>WaitForBackgroundWorkerShutdown()/GetBackgroundWorkerPid() that is
>> >>>b0rked perhaps? Can you reproduce it, and tell us what the stack
>> >>>looks like, maybe on a new thread?
>> >>
>> >>Sure. I'll try reproducing it and get this extra info. I was planning to
>> >>do that, but then I lost the data for the report because of a silly
>> >>mistake. Early next week, hopefully.
>> >>
>> >
>> >I've tried reproducing this, but so far no luck. I've only managed to do
>> >two runs, though, because this particular machine only has space for the
>> >1TB of temp files on spinning rust. I wonder if it's possible to
>> >reproduce the issue with a smaller data set? I've tried using smaller
>> >work_mem, but then it does not trigger the issue :-(
>> >
>>
>> And of course, the very next run after I send the "can't reproduce"
>> message actually reproduces the issue. How typical ...
>
>Thanks for reproducing this.
>
>> 2019-12-16 20:25:00.617 CET [14081] ERROR: invalid DSA memory alloc request size 1711276032
>> 2019-12-16 20:25:00.617 CET [14081] STATEMENT: explain analyze select count(*) from sevenb t1 join sevenb t2 using (i);
>> 2019-12-16 20:25:00.617 CET [29417] ERROR: invalid DSA memory alloc request size 1711276032
>> 2019-12-16 20:25:00.617 CET [29417] CONTEXT: parallel worker
>
>Ok, so 14081 raised an error and exited, ...
>
>> 2019-12-16 20:25:00.746 CET [29408] LOG: background worker "parallel worker" (PID 14081) exited with exit code 1
>
>... and then the leader cancelled the query on 5 other workers,
>
>> 2019-12-16 20:25:00.816 CET [29408] LOG: background worker "parallel worker" (PID 14079) exited with exit code 1
>> 2019-12-16 20:25:00.816 CET [29408] LOG: background worker "parallel worker" (PID 14085) exited with exit code 1
>> 2019-12-16 20:25:00.817 CET [29408] LOG: background worker "parallel worker" (PID 14084) exited with exit code 1
>> 2019-12-16 20:25:00.817 CET [29408] LOG: background worker "parallel worker" (PID 14080) exited with exit code 1
>> 2019-12-16 20:25:00.818 CET [29408] LOG: background worker "parallel worker" (PID 14078) exited with exit code 1
>
>... but somehow there were 2 more workers that didn't get the message
>and are waiting:
>
>> #0 0x00007f64684d2bfb in epoll_wait (epfd=6, events=0x55d8873d53d8, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
>> #1 0x000055d8864d4e3c in WaitEventSetWaitBlock (set=0x55d8873d5378, cur_timeout=-1, occurred_events=0x7ffc6e805130, nevents=1) at latch.c:1080
>> #2 0x000055d8864d4d00 in WaitEventSetWait (set=0x55d8873d5378, timeout=-1, occurred_events=0x7ffc6e805130, nevents=1, wait_event_info=134217742) at latch.c:1032
>> #3 0x000055d8864e1848 in ConditionVariableTimedSleep (cv=0x7f63602a060c, timeout=-1, wait_event_info=134217742) at condition_variable.c:186
>> #4 0x000055d8864e17b2 in ConditionVariableSleep (cv=0x7f63602a060c, wait_event_info=134217742) at condition_variable.c:126
>> #5 0x000055d8864d05f2 in BarrierArriveAndWait (barrier=0x7f63602a05f4, wait_event_info=134217742) at barrier.c:191
>> #6 0x000055d886328d4e in ExecParallelHashIncreaseNumBatches (hashtable=0x55d887481e70) at nodeHash.c:1189
>
>They're waiting on a condition variable, in a barrier, while trying to
>coordinate an increase in the number of batches, while hashing. I
>suspect the details of what exactly it's waiting for are not relevant
>here though. ConditionVariableTimedSleep() contains a standard latch
>wait loop with CHECK_FOR_INTERRUPTS(), and parallel workers run with
>the standard SIGTERM handler die(). So what I expect to happen here
>is:
>
>1. DestroyParallelContext() calls TerminateBackgroundWorker() for all workers.
>2. TerminateBackgroundWorker() sets slot->terminate, and signals the
>postmaster.
>3. The postmaster sees the terminate flag and does kill(worker, SIGTERM).
>4. The worker handles SIGTERM by setting ProcDiePending and setting
>its own latch.
>5. The ConditionVariableTimedSleep()'s syscall returns,
>CHECK_FOR_INTERRUPTS() runs and sees ProcDiePending.
>
>So how did it escape this fate? Ooooh. Now I'm wondering if the
>problem is that ConditionVariablePrepareToSleep() contains a call to
>ResetLatch(), but doesn't then run CHECK_FOR_INTERRUPTS(). Perhaps
>that ate the latch that was set by die(). There isn't another
>CHECK_FOR_INTERRUPTS() until *after* the WaitEventSetWait() in
>ConditionVariableTimedSleep().
>
>> I'll leave the system in this "stuck" state in case we need to do
>> further investigation.
>
>What are the values of InterruptHoldoffCount, CritSectionCount,
>InterruptPending and ProcDiePending in the stuck workers?

In both processes it's like this:

(gdb) p InterruptHoldoffCount
$1 = 0
(gdb) p CritSectionCount
$2 = 0
(gdb) p InterruptPending
$3 = 1
(gdb) p ProcDiePending
$5 = 1

So yeah, that probably means the theory about the CHECK_FOR_INTERRUPTS()
call missing right after the latch reset is correct.
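To make the suspected interleaving easier to see, here's a tiny standalone
program that mimics the pattern. It's my own simplification, not the actual
latch.c / condition_variable.c code -- the self-pipe "latch" and the helpers
die_handler/reset_latch/check_for_interrupts are just analogues of
die()/ResetLatch()/CHECK_FOR_INTERRUPTS(). The SIGTERM handler sets a
pending flag and the "latch", the prepare-to-sleep step then resets the
latch without checking for interrupts, and the subsequent wait blocks
forever even though the die-pending flag is already set:

/*
 * latch_race.c -- simplified sketch of the suspected lost wakeup,
 * NOT the real PostgreSQL sources.
 *
 * Build and run:  cc -o latch_race latch_race.c && ./latch_race
 * (it hangs in poll() with die_pending = 1, like the stuck workers)
 */
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static int latch_pipe[2];                   /* self-pipe standing in for MyLatch */
static volatile sig_atomic_t die_pending;   /* analogue of ProcDiePending */

static void
die_handler(int signo)                      /* analogue of die() */
{
    die_pending = 1;
    (void) write(latch_pipe[1], "x", 1);    /* SetLatch() analogue */
}

static void
reset_latch(void)                           /* analogue of ResetLatch() */
{
    char        buf[16];

    while (read(latch_pipe[0], buf, sizeof(buf)) > 0)
        ;                                   /* drain the pipe */
}

static void
check_for_interrupts(void)                  /* analogue of CHECK_FOR_INTERRUPTS() */
{
    if (die_pending)
    {
        fprintf(stderr, "terminating\n");
        exit(1);
    }
}

int
main(void)
{
    struct pollfd pfd;

    if (pipe(latch_pipe) != 0 ||
        fcntl(latch_pipe[0], F_SETFL, O_NONBLOCK) != 0)
        return 1;
    signal(SIGTERM, die_handler);

    /* the "worker" is terminated before it starts to sleep ... */
    raise(SIGTERM);                         /* die_pending = 1, latch set */

    /*
     * "ConditionVariablePrepareToSleep": resets the latch, eating the
     * wakeup, and does not check for interrupts afterwards.
     */
    reset_latch();

    /* "ConditionVariableTimedSleep": wait first, check interrupts after */
    for (;;)
    {
        pfd.fd = latch_pipe[0];
        pfd.events = POLLIN;
        (void) poll(&pfd, 1, -1);           /* blocks forever: wakeup already eaten */
        reset_latch();
        check_for_interrupts();             /* never reached */
    }
}

(In the sketch, checking for interrupts right after the first reset_latch()
obviously unblocks it.)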
regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services