Re: pg11.5: ExecHashJoinNewBatch: glibc detected...double free orcorruption (!prev) - Mailing list pgsql-hackers
From | Thomas Munro |
---|---|
Subject | Re: pg11.5: ExecHashJoinNewBatch: glibc detected...double free orcorruption (!prev) |
Date | |
Msg-id | CA+hUKGJarB7fmBTZvvsBddN+XXDkKGBQ-yhpcum5Kk1M3dzY1g@mail.gmail.com Whole thread Raw |
In response to | Re: pg11.5: ExecHashJoinNewBatch: glibc detected...double free orcorruption (!prev) (Peter Geoghegan <pg@bowt.ie>) |
Responses |
Re: pg11.5: ExecHashJoinNewBatch: glibc detected...double free orcorruption (!prev)
Re: pg11.5: ExecHashJoinNewBatch: glibc detected...double free orcorruption (!prev) |
List | pgsql-hackers |
On Sun, Aug 25, 2019 at 3:15 PM Peter Geoghegan <pg@bowt.ie> wrote: > I was reminded of this issue from last year, which also appeared to > involve BufFileClose() and a double-free: > > https://postgr.es/m/87y3hmee19.fsf@news-spur.riddles.org.uk > > That was a BufFile that was under the control of a tuplestore, so it > was similar to but different from your case. I suspect it's related. Hmm. tuplestore.c follows the same coding pattern as nodeHashjoin.c: it always nukes its pointer after calling BufFileFlush(), so it shouldn't be capable of calling it twice for the same pointer, unless we have two copies of that pointer somehow. Merlin's reported a double-free apparently in ExecHashJoin(), not ExecHashJoinNewBatch() like this report. Unfortunately that tells us very little. On Sun, Aug 25, 2019 at 2:25 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > #4 0x00000039ff678dd0 in _int_free (av=0x39ff98e120, p=0x1d40b090, have_lock=0) at malloc.c:4846 > #5 0x00000000006269e5 in ExecHashJoinNewBatch (pstate=0x2771218) at nodeHashjoin.c:1058 Can you reproduce this or was it a one-off crash? Hmm. We don't have enough stack frames to know where in here, and presumably aset.c, this is, but as you noted it's got to be somewhere here: void BufFileClose(BufFile *file) { int i; /* flush any unwritten data */ BufFileFlush(file); /* close and delete the underlying file(s) */ for (i = 0; i < file->numFiles; i++) FileClose(file->files[i]); /* release the buffer space */ pfree(file->files); pfree(file->offsets); pfree(file); } BufFileFlush() and FileClose() don't seem to be able to reach free(). pfree() always reaches free() for sizes > allocChunkLimit (8KB in ExecutorState). Given numFiles = 1 (as you showed), I expect file->files and file->offsets to be small allocations, and file itself to be large due to the 8KB buffer inside it. Some possibilities: 1. Somehow we actually called BufFileClose() twice. Hard to see how we'd do that, as mentioned. 2. Somehow the BufFile was created in the wrong memory context, and the memory was freed earlier. Note that ExecHashJoinSaveTuple()'s comment requires you to call it with CurrentMemoryContext == the executor context, but nothing in the code enforces that. As for the tuplestore.c case, note also that tuplestore.c explicitly sets CurrentResourceOwner, but not CurrentMemoryContext. I suppose there might be some obscure path somewhere, possibly through a custom operator or suchlike, that leaves us in a strange memory context, or something like that? But then I feel like we'd have received reproducible reports and a test case by now. 3. Random memory corruption caused by buffer overrun who-knows-where. > glibc-2.12-1.192.el6.x86_64 > linux 2.6.32-754.3.5.el6.x86_64 Greetings, time traveller! I see that the tuplestore.c report was also on a system of that vintage. Hmm. > #10 ExecSort (pstate=0x2771108) at nodeSort.c:107 > plannode = <value optimized out> > outerNode = 0x2771218 > tupDesc = <value optimized out> > node = 0x2771108 > estate = 0x2770a40 > dir = ForwardScanDirection > tuplesortstate = 0x3c87160 > slot = <value optimized out> It's interesting that a sort was involved here and it owns a tuplestore, but hard to see the connection. > (gdb) p *innerFile > $2 = {numFiles = 1, files = 0xa421328, offsets = 0xa421310, isInterXact = false, dirty = false, readOnly = false, fileset= 0x0, name = 0x0, resowner = 0x24f93e0, curFile = 0, curOffset = 73016512, > > pos = 0, nbytes = 0, buffer = { > data = '\000' <repeats 44 times>, "Q\366\262h\220\004\000\000\000\000\000\000\000\000L\000\003\000(\377\377\377\377\377\177\372\377\377\017\000\000\000\000\000\000\000\000\257\321\345\333\063\002\000\003", '\000'<repeats 23 times>"\204, \003\000\000\000\000\000\000K\000\000\000\000\000\000\000K\000\000\000\000\000\000\000K\000\000\000\000\000\000\000K\000\000\000\000\000\000\000b\000\000\000\000\000\000\000c\000\000\000\000\000\000\000c\000\000\000\000\000\000\000d", '\000'<repeats 15 times>, "q\002\000\000\000\000\000\000d", '\000' <repeats 23 times>, "I\000\000\000\000\000\000\000\344H\a\000\000\000\000\000\017\000\210\026\000\310\024\000\024H\000\000\000\000\000\000\017\000\210\035\000H!\027\000\210\216\000T\vB\017\304\t\027\000\210\a\000\206\b\033\030.\"", '\000'<repeats 11 times>..., force_align_d = 0, force_align_i64 = 0}} > > (gdb) p innerFile->files[0] > $8 = 2397 Hmm. That all looks pretty sane from here, but tells us nothing about whether it was already freed and if so where. -- Thomas Munro https://enterprisedb.com
pgsql-hackers by date: