Re: pg11.5: ExecHashJoinNewBatch: glibc detected...double free or corruption (!prev) - Mailing list pgsql-hackers

From Justin Pryzby
Subject Re: pg11.5: ExecHashJoinNewBatch: glibc detected...double free or corruption (!prev)
Date
Msg-id 20190826014414.GC7201@telsasoft.com
In response to Re: pg11.5: ExecHashJoinNewBatch: glibc detected...double free or corruption (!prev)  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers
On Mon, Aug 26, 2019 at 01:09:19PM +1200, Thomas Munro wrote:
> On Sun, Aug 25, 2019 at 3:15 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > I was reminded of this issue from last year, which also appeared to
> > involve BufFileClose() and a double-free:
> >
> > https://postgr.es/m/87y3hmee19.fsf@news-spur.riddles.org.uk
> >
> > That was a BufFile that was under the control of a tuplestore, so it
> > was similar to but different from your case. I suspect it's related.
> 
> Hmm.  tuplestore.c follows the same coding pattern as nodeHashjoin.c:
> it always nukes its pointer after calling BufFileClose(), so it
> shouldn't be capable of calling it twice for the same pointer, unless
> we have two copies of that pointer somehow.
> 
> Merlin's reported a double-free apparently in ExecHashJoin(), not
> ExecHashJoinNewBatch() like this report.  Unfortunately that tells us
> very little.
> 
> On Sun, Aug 25, 2019 at 2:25 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > #4  0x00000039ff678dd0 in _int_free (av=0x39ff98e120, p=0x1d40b090, have_lock=0) at malloc.c:4846
> > #5  0x00000000006269e5 in ExecHashJoinNewBatch (pstate=0x2771218) at nodeHashjoin.c:1058
> 
> Can you reproduce this or was it a one-off crash?

The query was one of our large reports, and this job runs every 15min against
recently-loaded data; in the immediate case, between
2019-08-24 08:00:00 and 2019-08-24 09:00:00.

I can rerun it fine, and I ran it in a loop for a while last night with no
issues.

time psql ts -f tmp/sql-2019-08-24.1 |wc
   5416  779356 9793941

Since it was asked in the other thread Peter mentioned:

ts=# SHOW work_mem;
work_mem | 128MB

ts=# SHOW shared_buffers ;
shared_buffers | 1536MB

> might be some obscure path somewhere, possibly through a custom
> operator or suchlike, that leaves us in a strange memory context, or
> something like that?  But then I feel like we'd have received
> reproducible reports and a test case by now.

No custom operator in sight.  Just NATURAL JOIN on integers, and WHERE on
timestamp, some plpgsql and int[].

Justin


