Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash - Mailing list pgsql-bugs

From Tomas Vondra
Subject Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash
Date
Msg-id 20191217232124.3dtrycatgfm6oxxb@development
In response to Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-bugs
On Wed, Dec 18, 2019 at 11:21:11AM +1300, Thomas Munro wrote:
>On Tue, Dec 17, 2019 at 10:02 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>> On Mon, Dec 16, 2019 at 03:09:22PM +0100, Tomas Vondra wrote:
>> >On Sat, Dec 14, 2019 at 02:39:12PM +0100, Tomas Vondra wrote:
>> >>On Sat, Dec 14, 2019 at 05:01:05PM +1300, Thomas Munro wrote:
>> >>>On Wed, Dec 11, 2019 at 6:32 AM Tomas Vondra
>> >>><tomas.vondra@2ndquadrant.com> wrote:
>> >>>>The annoying thing is that it's the workers that crash, and the leader
>> >>>>failed to notice that, so it was waiting in WaitForParallelWorkersToExit
>> >>>>forever. Not sure what the issue is.
>> >>>
>> >>>That's bad.  Some edge case in
>> >>>WaitForBackgroundWorkerShutdown()/GetBackgroundWorkerPid() that is
>> >>>b0rked perhaps?  Can you reproduce it, and tell us what the stack
>> >>>looks like, maybe on a new thread?
>> >>
>> >>Sure. I'll try reproducing it and get this extra info. I was planning to
>> >>do that, but then I lost the data for the report because of a silly
>> >>mistake. Early next week, hopefully.
>> >>
>> >
>> >I've tried reproducing this, but so far no luck. I've only managed to do
>> >two runs, though, because this particular machine only has space for the
>> >1TB of temp files on spinning rust. I wonder if it's possible to
>> >reproduce the issue with a smaller data set? I've tried using smaller
>> >work_mem, but then it does not trigger the issue :-(
>> >
>>
>> And of course, the very next run after I send the "can't reproduce"
>> message actually reproduces the issue. How typical ...
>
>Thanks for reproducing this.
>
>> 2019-12-16 20:25:00.617 CET [14081] ERROR:  invalid DSA memory alloc request size 1711276032
>> 2019-12-16 20:25:00.617 CET [14081] STATEMENT:  explain analyze select count(*) from sevenb t1 join sevenb t2 using (i);
>> 2019-12-16 20:25:00.617 CET [29417] ERROR:  invalid DSA memory alloc request size 1711276032
>> 2019-12-16 20:25:00.617 CET [29417] CONTEXT:  parallel worker
>
>Ok, so 14081 raised an error and exited, ...
>
>> 2019-12-16 20:25:00.746 CET [29408] LOG:  background worker "parallel worker" (PID 14081) exited with exit code 1
>
>... and then the leader cancelled the query on 5 other workers,
>
>> 2019-12-16 20:25:00.816 CET [29408] LOG:  background worker "parallel worker" (PID 14079) exited with exit code 1
>> 2019-12-16 20:25:00.816 CET [29408] LOG:  background worker "parallel worker" (PID 14085) exited with exit code 1
>> 2019-12-16 20:25:00.817 CET [29408] LOG:  background worker "parallel worker" (PID 14084) exited with exit code 1
>> 2019-12-16 20:25:00.817 CET [29408] LOG:  background worker "parallel worker" (PID 14080) exited with exit code 1
>> 2019-12-16 20:25:00.818 CET [29408] LOG:  background worker "parallel worker" (PID 14078) exited with exit code 1
>
>... but somehow there were 2 more workers that didn't get the message
>and are waiting:
>
>> #0  0x00007f64684d2bfb in epoll_wait (epfd=6, events=0x55d8873d53d8, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
>> #1  0x000055d8864d4e3c in WaitEventSetWaitBlock (set=0x55d8873d5378, cur_timeout=-1, occurred_events=0x7ffc6e805130, nevents=1) at latch.c:1080
>> #2  0x000055d8864d4d00 in WaitEventSetWait (set=0x55d8873d5378, timeout=-1, occurred_events=0x7ffc6e805130, nevents=1, wait_event_info=134217742) at latch.c:1032
>> #3  0x000055d8864e1848 in ConditionVariableTimedSleep (cv=0x7f63602a060c, timeout=-1, wait_event_info=134217742) at condition_variable.c:186
>> #4  0x000055d8864e17b2 in ConditionVariableSleep (cv=0x7f63602a060c, wait_event_info=134217742) at condition_variable.c:126
>> #5  0x000055d8864d05f2 in BarrierArriveAndWait (barrier=0x7f63602a05f4, wait_event_info=134217742) at barrier.c:191
>> #6  0x000055d886328d4e in ExecParallelHashIncreaseNumBatches (hashtable=0x55d887481e70) at nodeHash.c:1189
>
>They're waiting on a condition variable, in a barrier, while trying to
>coordinate an increase in the number of batches, while hashing.  I
>suspect the details of what exactly it's waiting for are not relevant
>here though.  ConditionVariableTimedSleep() contains a standard latch
>wait loop with CHECK_FOR_INTERRUPTS(), and parallel workers run with
>the standard SIGTERM handler die().  So what I expect to happen here
>is:
>
>1.  DestroyParallelContext() calls TerminateBackgroundWorker() for all workers.
>2.  TerminateBackgroundWorker() sets slot->terminate, and signals the
>postmaster.
>3.  The postmaster sees the terminate flag and does kill(worker, SIGTERM).
>4.  The worker handles SIGTERM by setting ProcDiePending and setting
>its own latch.
>5.  The ConditionVariableTimedSleep()'s syscall returns,
>CHECK_FOR_INTERRUPTS() runs and sees ProcDiePending.
>
>So how did it escape this fate?  Ooooh.  Now I'm wondering if the
>problem is that ConditionVariablePrepareToSleep() contains a call to
>ResetLatch(), but doesn't then run CHECK_FOR_INTERRUPTS().  Perhaps
>that ate the latch that was set by die().  There isn't another
>CHECK_FOR_INTERRUPTS() until *after* the WaitEventSetWait() in
>ConditionVariableTimedSleep().
>
>> I'll leave the system in this "stuck" space in case we need to do
>> further investigation.
>
>What are the values of InterruptHoldoffCount, CritSectionCount,

>InterruptPending and ProcDiePending in the stuck workers?

In both processes it's like this:

(gdb) p InterruptHoldoffCount
$1 = 0
(gdb) p CritSectionCount
$2 = 0
(gdb) p InterruptPending
$3 = 1
(gdb) p ProcDiePending
$5 = 1

So yeah, that probably means the theory about the CHECK_FOR_INTERRUPTS()
call missing right after the latch reset is correct.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 


