On Mon, Aug 26, 2019 at 01:09:19PM +1200, Thomas Munro wrote:
> On Sun, Aug 25, 2019 at 3:15 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > I was reminded of this issue from last year, which also appeared to
> > involve BufFileClose() and a double-free:
> >
> > https://postgr.es/m/87y3hmee19.fsf@news-spur.riddles.org.uk
> >
> > That was a BufFile that was under the control of a tuplestore, so it
> > was similar to but different from your case. I suspect it's related.
>
> Hmm. tuplestore.c follows the same coding pattern as nodeHashjoin.c:
> it always nukes its pointer after calling BufFileClose(), so it
> shouldn't be capable of calling it twice for the same pointer, unless
> we have two copies of that pointer somehow.
>
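
To make the pattern described above concrete, here is a minimal standalone
sketch. It is not the code in nodeHashjoin.c or tuplestore.c: FakeFile,
fake_close() and close_and_clear() are hypothetical stand-ins, and the fake
close only counts calls instead of freeing memory, so the "double close" can
be observed without undefined behaviour.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a BufFile; NOT the real PostgreSQL struct. */
typedef struct FakeFile
{
    int close_count;    /* how many times fake_close() saw this object */
} FakeFile;

/*
 * Hypothetical stand-in for BufFileClose().  A real close would free the
 * object here, so a second call on the same object would be a double-free.
 * This sketch only counts the calls.
 */
static void
fake_close(FakeFile *file)
{
    file->close_count++;
}

/* The "close, then nuke the pointer" pattern. */
static void
close_and_clear(FakeFile **slot)
{
    assert(slot != NULL);
    if (*slot != NULL)
    {
        fake_close(*slot);
        *slot = NULL;       /* this slot can no longer reach the file */
    }
}

/* One pointer: the second call is a no-op, so the file is closed once. */
int
demo_single_pointer(void)
{
    FakeFile    file = {0};
    FakeFile   *p = &file;

    close_and_clear(&p);
    close_and_clear(&p);
    return file.close_count;
}

/* Two copies: clearing one copy does not protect the other. */
int
demo_two_copies(void)
{
    FakeFile    file = {0};
    FakeFile   *a = &file;
    FakeFile   *b = a;      /* second copy of the same pointer */

    close_and_clear(&a);
    close_and_clear(&b);    /* b is still non-NULL, so it closes again */
    return file.close_count;
}
```

In this sketch demo_single_pointer() returns 1 and demo_two_copies() returns
2, which is the "two copies of that pointer" case the quoted paragraph
describes.
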
> Merlin's reported a double-free apparently in ExecHashJoin(), not
> ExecHashJoinNewBatch() like this report. Unfortunately that tells us
> very little.
>
> On Sun, Aug 25, 2019 at 2:25 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > #4 0x00000039ff678dd0 in _int_free (av=0x39ff98e120, p=0x1d40b090, have_lock=0) at malloc.c:4846
> > #5 0x00000000006269e5 in ExecHashJoinNewBatch (pstate=0x2771218) at nodeHashjoin.c:1058
>
> Can you reproduce this or was it a one-off crash?

The query was one of our large reports, and this job runs every 15min against
recently-loaded data; in the immediate case, between
2019-08-24 08:00:00 and 2019-08-24 09:00:00.

I can rerun it fine, and I ran it in a loop for a while last night with no
issues.

time psql ts -f tmp/sql-2019-08-24.1 |wc
5416 779356 9793941

Since it was asked in the other thread Peter mentioned:

ts=# SHOW work_mem;
work_mem | 128MB
ts=# SHOW shared_buffers ;
shared_buffers | 1536MB

> might be some obscure path somewhere, possibly through a custom
> operator or suchlike, that leaves us in a strange memory context, or
> something like that? But then I feel like we'd have received
> reproducible reports and a test case by now.
No custom operator in sight. Just NATURAL JOIN on integers, and WHERE on
timestamp, some plpgsql and int[].
Justin