Thread: pg11.5: ExecHashJoinNewBatch: glibc detected...double free or corruption (!prev)


From: Justin Pryzby
Date:
Core was generated by `postgres: telsasoft ts 10.100.2.162(33500) SELECT     '.
Program terminated with signal 6, Aborted.
#0  0x00000039ff6325e5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64        return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
Missing separate debuginfos, use: debuginfo-install audit-libs-2.4.5-3.el6.x86_64
cyrus-sasl-lib-2.1.23-15.el6_6.2.x86_64 geos36-3.6.3-1.rhel6.1.x86_64 json-c-0.11-12.el6.x86_64
keyutils-libs-1.4-5.el6.x86_64 libcom_err-1.41.12-22.el6.x86_64 libgcc-4.4.7-17.el6.x86_64 libicu-4.2.1-14.el6.x86_64
libselinux-2.0.94-7.el6.x86_64 libstdc++-4.4.7-17.el6.x86_64 nspr-4.11.0-1.el6.x86_64 nss-3.21.0-8.el6.x86_64
nss-softokn-freebl-3.14.3-23.3.el6_8.x86_64 nss-util-3.21.0-2.el6.x86_64 postgis24_11-2.4.5-1.rhel6.1.x86_64
proj49-4.9.3-3.rhel6.1.x86_64 zlib-1.2.3-29.el6.x86_64
 
(gdb) bt
#0  0x00000039ff6325e5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00000039ff633dc5 in abort () at abort.c:92
#2  0x00000039ff6704f7 in __libc_message (do_abort=2, fmt=0x39ff758a60 "*** glibc detected *** %s: %s: 0x%s ***\n") at
../sysdeps/unix/sysv/linux/libc_fatal.c:198
#3  0x00000039ff675f3e in malloc_printerr (action=3, str=0x39ff758df0 "double free or corruption (!prev)",
ptr=<value optimized out>, ar_ptr=<value optimized out>) at malloc.c:6360
#4  0x00000039ff678dd0 in _int_free (av=0x39ff98e120, p=0x1d40b090, have_lock=0) at malloc.c:4846
#5  0x00000000006269e5 in ExecHashJoinNewBatch (pstate=0x2771218) at nodeHashjoin.c:1058
#6  ExecHashJoinImpl (pstate=0x2771218) at nodeHashjoin.c:539
#7  ExecHashJoin (pstate=0x2771218) at nodeHashjoin.c:565
#8  0x00000000006131d0 in ExecProcNodeInstr (node=0x2771218) at execProcnode.c:461
#9  0x0000000000633286 in ExecProcNode (pstate=0x2771108) at ../../../src/include/executor/executor.h:247
#10 ExecSort (pstate=0x2771108) at nodeSort.c:107
#11 0x00000000006131d0 in ExecProcNodeInstr (node=0x2771108) at execProcnode.c:461
#12 0x000000000061d51e in ExecProcNode (aggstate=0x2770d30) at ../../../src/include/executor/executor.h:247
#13 fetch_input_tuple (aggstate=0x2770d30) at nodeAgg.c:406
#14 0x000000000061ee70 in agg_retrieve_direct (pstate=0x2770d30) at nodeAgg.c:1755
#15 ExecAgg (pstate=0x2770d30) at nodeAgg.c:1570
#16 0x00000000006131d0 in ExecProcNodeInstr (node=0x2770d30) at execProcnode.c:461
#17 0x000000000060f4b7 in ExecProcNode (queryDesc=0x2940950, direction=<value optimized out>, count=0, execute_once=48)
at ../../../src/include/executor/executor.h:247
#18 ExecutePlan (queryDesc=0x2940950, direction=<value optimized out>, count=0, execute_once=48) at execMain.c:1723
#19 standard_ExecutorRun (queryDesc=0x2940950, direction=<value optimized out>, count=0, execute_once=48) at
execMain.c:364
#20 0x00007ff08f74f618 in pgss_ExecutorRun (queryDesc=0x2940950, direction=ForwardScanDirection, count=0,
execute_once=true) at pg_stat_statements.c:892
#21 0x00007ff08f2cc81d in explain_ExecutorRun (queryDesc=0x2940950, direction=ForwardScanDirection, count=0,
execute_once=true) at auto_explain.c:281
#22 0x000000000075444b in PortalRunSelect (portal=0x2571f00, forward=<value optimized out>, count=0, dest=<value
optimized out>) at pquery.c:932
#23 0x0000000000755631 in PortalRun (portal=0x2571f00, count=9223372036854775807, isTopLevel=true, run_once=true,
dest=0x928e7a8, altdest=0x928e7a8, completionTag=0x7ffcaa940670 "") at pquery.c:773
#24 0x0000000000752267 in exec_simple_query (
    query_string=0x2753e08 "--BEGIN SQL\nSELECT * FROM\n\n(SELECT site_office AS site_gran,\n \tsite_location AS
cell,\n\tdisplay_site_name(sect_name, site_name, sect_mscid) AS sitename,\n\tperiod, data_cell.bh_day,\n\n", '-'
<repeats 17 times>, "Ac"...) at postgres.c:1145
#25 0x0000000000753211 in PostgresMain (argc=<value optimized out>, argv=<value optimized out>, dbname=0x252b1f8 "ts",
username=<value optimized out>) at postgres.c:4182
#26 0x00000000006e2587 in BackendRun (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:4358
#27 BackendStartup (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:4030
#28 ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1707
#29 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1380
#30 0x0000000000656a80 in main (argc=3, argv=0x24f7570) at main.c:228

I found this report from 9.6 last month; copying those parties.
https://www.postgresql.org/message-id/CAHyXU0xzwzgnUTyK62ZkG0_1CQsBHwnTVT2TSX7iwTEv1ve9ag%40mail.gmail.com

Merlin: what OS are you running on and what postgres package, or was it
compiled locally ?

We're running:
postgresql11-server-11.5-1PGDG.rhel6.x86_64
CentOS release 6.8
glibc-2.12-1.192.el6.x86_64
linux 2.6.32-754.3.5.el6.x86_64
DMI: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015
Model name:            Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz

I see this is available but not installed: 2.12-1.212.el6_10.3

The VM has no known issues.

ts=# SELECT * FROM pg_config();
..
| CONFIGURE         | '--enable-rpath' '--prefix=/usr/pgsql-11' '--includedir=/usr/pgsql-11/include'
'--mandir=/usr/pgsql-11/share/man' '--datadir=/usr/pgsql-11/share' '--libdir=/usr/pgsql-11/lib' '--with-icu'
'--with-perl' '--with-python' '--with-tcl' '--with-tclconfig=/usr/lib64' '--with-openssl' '--with-pam' '--with-gssapi'
'--with-includes=/usr/include' '--with-libraries=/usr/lib64' '--enable-nls' '--with-uuid=e2fs' '--with-libxml'
'--with-libxslt' '--with-ldap' '--with-system-tzdata=/usr/share/zoneinfo' '--sysconfdir=/etc/sysconfig/pgsql'
'--docdir=/usr/pgsql-11/doc' '--htmldir=/usr/pgsql-11/doc/html' 'CFLAGS=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2
-fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic' 'LDFLAGS=-Wl,--as-needed'
'PKG_CONFIG_PATH=:/usr/lib64/pkgconfig:/usr/share/pkgconfig' 'ICU_CFLAGS=-I/usr/include' 'ICU_LIBS=-L/usr/lib64
-licui18n -licuuc -licudata'
| CC                | gcc
| CPPFLAGS          | -I/usr/include -D_GNU_SOURCE -I/usr/include/libxml2 -I/usr/include
| CFLAGS            | -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2
-fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic
| CFLAGS_SL         | -fPIC
| LDFLAGS           | -Wl,--as-needed -L/usr/lib64 -Wl,--as-needed -Wl,-rpath,'/usr/pgsql-11/lib',--enable-new-dtags
| LDFLAGS_EX        | 
| LDFLAGS_SL        | 
| LIBS              | -lpgcommon -lpgport -lpthread -lxslt -lxml2 -lpam -lssl -lcrypto -lgssapi_krb5 -lz -lreadline
-lrt -lcrypt -ldl -lm

#0  0x00000039ff6325e5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
        resultvar = 0
        pid = <value optimized out>
        selftid = <value optimized out>
#1  0x00000039ff633dc5 in abort () at abort.c:92
        save_stage = 2
        act = {__sigaction_handler = {sa_handler = 0x7ffcaa93f5c8, sa_sigaction = 0x7ffcaa93f5c8}, sa_mask = {__val =
          {140723170309552, 140723170319605, 49, 249099024494, 3, 140723170309562, 6, 249099024498, 2, 140723170309550,
          2, 249099017763, 1, 249099024494, 3, 140723170309558}}, sa_flags = 10, sa_restorer = 0x39ff757872}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2  0x00000039ff6704f7 in __libc_message (do_abort=2, fmt=0x39ff758a60 "*** glibc detected *** %s: %s: 0x%s ***\n") at
../sysdeps/unix/sysv/linux/libc_fatal.c:198
        ap = {{gp_offset = 40, fp_offset = 48, overflow_arg_area = 0x7ffcaa93ff30, reg_save_area = 0x7ffcaa93fe40}}
        ap_copy = {{gp_offset = 16, fp_offset = 48, overflow_arg_area = 0x7ffcaa93ff30, reg_save_area =
0x7ffcaa93fe40}}
        fd = 2
        on_2 = <value optimized out>
        list = <value optimized out>
        nlist = <value optimized out>
        cp = <value optimized out>
        written = <value optimized out>
#3  0x00000039ff675f3e in malloc_printerr (action=3, str=0x39ff758df0 "double free or corruption (!prev)",
ptr=<value optimized out>, ar_ptr=<value optimized out>) at malloc.c:6360
        buf = "000000001d40b0a0"
        cp = <value optimized out>
#4  0x00000039ff678dd0 in _int_free (av=0x39ff98e120, p=0x1d40b090, have_lock=0) at malloc.c:4846
        size = <value optimized out>
        fb = <value optimized out>
        nextchunk = <value optimized out>
        nextsize = <value optimized out>
        nextinuse = <value optimized out>
        prevsize = <value optimized out>
        bck = <value optimized out>
        fwd = <value optimized out>
        errstr = <value optimized out>
        locked = <value optimized out>
#5  0x00000000006269e5 in ExecHashJoinNewBatch (pstate=0x2771218) at nodeHashjoin.c:1058
        nbatch = <value optimized out>
        curbatch = <value optimized out>
        slot = <value optimized out>
        hashvalue = 1756558929
        hashtable = 0x3c4ed48
        innerFile = 0x1d40b0d8
#6  ExecHashJoinImpl (pstate=0x2771218) at nodeHashjoin.c:539
        outerNode = 0x27714d8
        hashNode = 0x2772348
        econtext = <value optimized out>
        node = 0x2771218
        joinqual = 0x0
        otherqual = 0x0
        hashtable = 0xc8
        hashvalue = 0
        batchno = 41357576
        parallel_state = <value optimized out>
#7  ExecHashJoin (pstate=0x2771218) at nodeHashjoin.c:565
No locals.
#8  0x00000000006131d0 in ExecProcNodeInstr (node=0x2771218) at execProcnode.c:461
        result = <value optimized out>
#9  0x0000000000633286 in ExecProcNode (pstate=0x2771108) at ../../../src/include/executor/executor.h:247
No locals.
#10 ExecSort (pstate=0x2771108) at nodeSort.c:107
        plannode = <value optimized out>
        outerNode = 0x2771218
        tupDesc = <value optimized out>
        node = 0x2771108
        estate = 0x2770a40
        dir = ForwardScanDirection
        tuplesortstate = 0x3c87160
        slot = <value optimized out>
...


If GDB isn't lying and I'm not being confused by git, that's:
                BufFileClose(innerFile);

Which does (at least):

        pfree(file->files);
        pfree(file->offsets);
        pfree(file);


(gdb) p *innerFile
$2 = {numFiles = 1, files = 0xa421328, offsets = 0xa421310, isInterXact = false, dirty = false, readOnly = false,
fileset = 0x0, name = 0x0, resowner = 0x24f93e0, curFile = 0, curOffset = 73016512,
  pos = 0, nbytes = 0, buffer = {
    data = '\000' <repeats 44 times>, "Q\366\262h\220\004\000\000\000\000\000\000\000\000L\000\003\000(\377\377\377\377\377\177\372\377\377\017\000\000\000\000\000\000\000\000\257\321\345\333\063\002\000\003", '\000' <repeats 23 times>, "\204\003\000\000\000\000\000\000K\000\000\000\000\000\000\000K\000\000\000\000\000\000\000K\000\000\000\000\000\000\000K\000\000\000\000\000\000\000b\000\000\000\000\000\000\000c\000\000\000\000\000\000\000c\000\000\000\000\000\000\000d", '\000' <repeats 15 times>, "q\002\000\000\000\000\000\000d", '\000' <repeats 23 times>, "I\000\000\000\000\000\000\000\344H\a\000\000\000\000\000\017\000\210\026\000\310\024\000\024H\000\000\000\000\000\000\017\000\210\035\000H!\027\000\210\216\000T\vB\017\304\t\027\000\210\a\000\206\b\033\030.\"", '\000' <repeats 11 times>..., force_align_d = 0, force_align_i64 = 0}}
 

(gdb) p innerFile->files[0]
$8 = 2397





On Sat, Aug 24, 2019 at 7:25 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> Merlin: what OS are you running on and what postgres package, or was it
> compiled locally ?

I was reminded of this issue from last year, which also appeared to
involve BufFileClose() and a double-free:

https://postgr.es/m/87y3hmee19.fsf@news-spur.riddles.org.uk

That was a BufFile that was under the control of a tuplestore, so it
was similar to but different from your case. I suspect it's related.

-- 
Peter Geoghegan



On Sun, Aug 25, 2019 at 3:15 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I was reminded of this issue from last year, which also appeared to
> involve BufFileClose() and a double-free:
>
> https://postgr.es/m/87y3hmee19.fsf@news-spur.riddles.org.uk
>
> That was a BufFile that was under the control of a tuplestore, so it
> was similar to but different from your case. I suspect it's related.

Hmm.  tuplestore.c follows the same coding pattern as nodeHashjoin.c:
it always nukes its pointer after calling BufFileClose(), so it
shouldn't be capable of calling it twice for the same pointer, unless
we have two copies of that pointer somehow.
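For what it's worth, the pattern both files follow looks like this in miniature (illustrative C only; "BatchSlot" and "close_batch" are invented names standing in for hashtable->innerBatchFile[curbatch] and the BufFileClose()-then-NULL idiom, not code from the tree):

```c
#include <assert.h>
#include <stdlib.h>

/* Invented stand-in for one slot of innerBatchFile[]. */
typedef struct BatchSlot
{
    char *file;                 /* stands in for a BufFile pointer */
} BatchSlot;

void close_batch(BatchSlot *slot)
{
    if (slot->file != NULL)
    {
        free(slot->file);       /* stands in for BufFileClose() */
        slot->file = NULL;      /* nuke the pointer: a second call is a no-op */
    }
}
```

With that shape, a second close of the same slot is harmless; the double free should only be reachable if a second copy of the raw pointer exists somewhere.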

Merlin's reported a double-free apparently in ExecHashJoin(), not
ExecHashJoinNewBatch() like this report.  Unfortunately that tells us
very little.

On Sun, Aug 25, 2019 at 2:25 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> #4  0x00000039ff678dd0 in _int_free (av=0x39ff98e120, p=0x1d40b090, have_lock=0) at malloc.c:4846
> #5  0x00000000006269e5 in ExecHashJoinNewBatch (pstate=0x2771218) at nodeHashjoin.c:1058

Can you reproduce this or was it a one-off crash?

Hmm.  We don't have enough stack frames to know where in here, and
presumably aset.c, this is, but as you noted it's got to be somewhere
here:

void
BufFileClose(BufFile *file)
{
        int                     i;

        /* flush any unwritten data */
        BufFileFlush(file);
        /* close and delete the underlying file(s) */
        for (i = 0; i < file->numFiles; i++)
                FileClose(file->files[i]);
        /* release the buffer space */
        pfree(file->files);
        pfree(file->offsets);
        pfree(file);
}

BufFileFlush() and FileClose() don't seem to be able to reach free().
pfree() always reaches free() for sizes > allocChunkLimit (8KB in
ExecutorState).  Given numFiles = 1 (as you showed), I expect
file->files and file->offsets to be small allocations, and file itself
to be large due to the 8KB buffer inside it.
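To make the size argument concrete, here is a rough sketch (the struct is a simplified stand-in, not the real BufFile layout):

```c
#include <assert.h>
#include <stddef.h>

#define BLCKSZ 8192             /* standard build */

/* Simplified stand-in for struct BufFile: a few header fields plus the
 * in-line BLCKSZ-sized I/O buffer that dominates its size. */
typedef struct FakeBufFile
{
    int    numFiles;
    int   *files;
    long  *offsets;
    char   buffer[BLCKSZ];
} FakeBufFile;

/* Any palloc() request larger than allocChunkLimit (8KB in ExecutorState)
 * is passed straight through to malloc() by aset.c, so the matching
 * pfree() reaches free() directly -- which is why glibc, not palloc,
 * is the one reporting the double free. */
size_t buffile_alloc_size(void)
{
    return sizeof(FakeBufFile);
}
```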

Some possibilities:

1.  Somehow we actually called BufFileClose() twice.  Hard to see how
we'd do that, as mentioned.

2.  Somehow the BufFile was created in the wrong memory context, and
the memory was freed earlier.  Note that ExecHashJoinSaveTuple()'s
comment requires you to call it with CurrentMemoryContext == the
executor context, but nothing in the code enforces that.  As for the
tuplestore.c case, note also that tuplestore.c explicitly sets
CurrentResourceOwner, but not CurrentMemoryContext.  I suppose there
might be some obscure path somewhere, possibly through a custom
operator or suchlike, that leaves us in a strange memory context, or
something like that?  But then I feel like we'd have received
reproducible reports and a test case by now.

3.  Random memory corruption caused by buffer overrun who-knows-where.
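Possibility 2 in miniature (toy code; every name here is invented, and the "arena" is only a crude stand-in for a MemoryContext): if the BufFile were allocated in a short-lived context, resetting that context frees it, and the later explicit BufFileClose() becomes the second free that glibc complains about.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy arena standing in for a MemoryContext. */
typedef struct Arena
{
    void *ptrs[8];
    int   n;
} Arena;

void *arena_alloc(Arena *a, size_t sz)  /* like palloc() in that context */
{
    void *p = malloc(sz);
    a->ptrs[a->n++] = p;
    return p;
}

int arena_reset(Arena *a)               /* like MemoryContextReset() */
{
    int freed = a->n;
    for (int i = 0; i < a->n; i++)
        free(a->ptrs[i]);
    a->n = 0;
    return freed;                       /* number of chunks released */
}

/* If a "BufFile" had been arena_alloc'd in a per-tuple arena, then after
 * arena_reset() an explicit free() of the same pointer -- the moral
 * equivalent of BufFileClose() -- would be the reported double free.
 * (Deliberately not executed here.) */
```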

> glibc-2.12-1.192.el6.x86_64
> linux 2.6.32-754.3.5.el6.x86_64

Greetings, time traveller!  I see that the tuplestore.c report was
also on a system of that vintage.  Hmm.

> #10 ExecSort (pstate=0x2771108) at nodeSort.c:107
>         plannode = <value optimized out>
>         outerNode = 0x2771218
>         tupDesc = <value optimized out>
>         node = 0x2771108
>         estate = 0x2770a40
>         dir = ForwardScanDirection
>         tuplesortstate = 0x3c87160
>         slot = <value optimized out>

It's interesting that a sort was involved here and it owns a
tuplestore, but hard to see the connection.

> (gdb) p *innerFile
> $2 = {numFiles = 1, files = 0xa421328, offsets = 0xa421310, isInterXact = false, dirty = false, readOnly = false,
> fileset = 0x0, name = 0x0, resowner = 0x24f93e0, curFile = 0, curOffset = 73016512,
>
>   pos = 0, nbytes = 0, buffer = {
>     data = '\000' <repeats 44 times>, "Q\366\262h\220\004\000\000\000\000\000\000\000\000L\000\003\000(\377\377\377\377\377\177\372\377\377\017\000\000\000\000\000\000\000\000\257\321\345\333\063\002\000\003", '\000' <repeats 23 times>, "\204\003\000\000\000\000\000\000K\000\000\000\000\000\000\000K\000\000\000\000\000\000\000K\000\000\000\000\000\000\000K\000\000\000\000\000\000\000b\000\000\000\000\000\000\000c\000\000\000\000\000\000\000c\000\000\000\000\000\000\000d", '\000' <repeats 15 times>, "q\002\000\000\000\000\000\000d", '\000' <repeats 23 times>, "I\000\000\000\000\000\000\000\344H\a\000\000\000\000\000\017\000\210\026\000\310\024\000\024H\000\000\000\000\000\000\017\000\210\035\000H!\027\000\210\216\000T\vB\017\304\t\027\000\210\a\000\206\b\033\030.\"", '\000' <repeats 11 times>..., force_align_d = 0, force_align_i64 = 0}}
>
> (gdb) p innerFile->files[0]
> $8 = 2397

Hmm.  That all looks pretty sane from here, but tells us nothing about
whether it was already freed and if so where.

--
Thomas Munro
https://enterprisedb.com



On Mon, Aug 26, 2019 at 01:09:19PM +1200, Thomas Munro wrote:
> On Sun, Aug 25, 2019 at 3:15 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > I was reminded of this issue from last year, which also appeared to
> > involve BufFileClose() and a double-free:
> >
> > https://postgr.es/m/87y3hmee19.fsf@news-spur.riddles.org.uk
> >
> > That was a BufFile that was under the control of a tuplestore, so it
> > was similar to but different from your case. I suspect it's related.
> 
> Hmm.  tuplestore.c follows the same coding pattern as nodeHashjoin.c:
> it always nukes its pointer after calling BufFileClose(), so it
> shouldn't be capable of calling it twice for the same pointer, unless
> we have two copies of that pointer somehow.
> 
> Merlin's reported a double-free apparently in ExecHashJoin(), not
> ExecHashJoinNewBatch() like this report.  Unfortunately that tells us
> very little.
> 
> On Sun, Aug 25, 2019 at 2:25 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > #4  0x00000039ff678dd0 in _int_free (av=0x39ff98e120, p=0x1d40b090, have_lock=0) at malloc.c:4846
> > #5  0x00000000006269e5 in ExecHashJoinNewBatch (pstate=0x2771218) at nodeHashjoin.c:1058
> 
> Can you reproduce this or was it a one-off crash?

The query was one of our large reports, and this job runs every 15min against
recently-loaded data; in the immediate case, between
2019-08-24 08:00:00 and 2019-08-24 09:00:00.

I can rerun it fine, and I ran it in a loop for a while last night with no
issues.

time psql ts -f tmp/sql-2019-08-24.1 |wc
   5416  779356 9793941

Since it was asked in the other thread Peter mentioned:

ts=# SHOW work_mem;
work_mem | 128MB

ts=# SHOW shared_buffers ;
shared_buffers | 1536MB

> might be some obscure path somewhere, possibly through a custom
> operator or suchlike, that leaves us in a strange memory context, or
> something like that?  But then I feel like we'd have received
> reproducible reports and a test case by now.

No custom operator in sight.  Just NATURAL JOIN on integers, and WHERE on
timestamp, some plpgsql and int[].

Justin



On Mon, Aug 26, 2019 at 1:44 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> On Mon, Aug 26, 2019 at 01:09:19PM +1200, Thomas Munro wrote:
> > On Sun, Aug 25, 2019 at 3:15 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > > I was reminded of this issue from last year, which also appeared to
> > > involve BufFileClose() and a double-free:
> > >
> > > https://postgr.es/m/87y3hmee19.fsf@news-spur.riddles.org.uk
> > >
> > > That was a BufFile that was under the control of a tuplestore, so it
> > > was similar to but different from your case. I suspect it's related.
> >
> > Hmm.  tuplestore.c follows the same coding pattern as nodeHashjoin.c:
> > it always nukes its pointer after calling BufFileClose(), so it
> > shouldn't be capable of calling it twice for the same pointer, unless
> > we have two copies of that pointer somehow.
> >
> > Merlin's reported a double-free apparently in ExecHashJoin(), not
> > ExecHashJoinNewBatch() like this report.  Unfortunately that tells us
> > very little.

Here's another one:

https://www.postgresql.org/message-id/flat/20170601081104.1500.56202%40wrigleys.postgresql.org

Hmm.  Also on RHEL/CentOS 6, and also involving sorting, hashing,
BufFileClose() but this time the glibc double free error is in
repalloc().

And another one (repeatedly happening):

https://www.postgresql.org/message-id/flat/3976998C-8D3B-4825-9B10-69ECB70A597A%40appnexus.com

Also on RHEL/CentOS 6, this time a sort in one case and a hash join
in another case.

Of course it's entirely possible that we have a bug here and I'm very
keen to find it, but I can't help noticing the common factor here is
that they're all running ancient RHEL 6.x releases, except Merlin who
didn't say.  Merlin?

-- 
Thomas Munro
https://enterprisedb.com



I'm not sure but maybe this is useful?

|(gdb) p VfdCache[2397]
|$9 = {fd = -1, fdstate = 0, resowner = 0x24f93e0, nextFree = 2393, lruMoreRecently = 0, lruLessRecently = 2360,
seekPos = 73016512, fileSize = 0, fileName = 0x0, fileFlags = 2, fileMode = 384}

Knowing this report, very possibly this was a tempfile, possibly a parallel fileset.
I don't see parallel workers in the query plan, but I do have an external sort:

                                                               ->  Sort  (cost=20801.16..20801.60 rows=175 width=628) (actual time=39150.619..40730.793 rows=2002440 loops=1)
                                                                     Sort Key: data_cell.sect_id
                                                                     Sort Method: external sort  Disk: 1613224kB
                                                                     ->  Hash Join  (cost=20649.46..20794.64 rows=175 width=628) (actual time=736.734..2423.020 rows=2002440 loops=1)
                                                                           Hash Cond: (eric_enodeb_cell_201908.start_time = data_cell.period)

Note, we run reports with enable_nestloop=off (say what you will).  And the
report runs within a transaction which we roll back.  I suspect that's maybe
relevant for cleaning up files.

Ah, it's more than 24h old, but I saved the logfile with the crash marker, so I found:

sudo zgrep 26188 /var/log/postgresql/crash-postgresql-2019-08-24_121600.log.gz
< 2019-08-24 12:16:17.037 CDT telsasoft >LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp26188.5", size 336726940
< 2019-08-24 12:16:17.038 CDT telsasoft >LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp26188.54", size 6354
< 2019-08-24 12:16:20.081 CDT telsasoft >LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp26188.6", size 270914376
< 2019-08-24 12:16:20.083 CDT telsasoft >LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp26188.39", size 7024
< 2019-08-24 12:16:23.464 CDT telsasoft >LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp26188.7", size 213332328
< 2019-08-24 12:16:23.465 CDT telsasoft >LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp26188.37", size 5984
...
< 2019-08-24 12:17:42.966 CDT telsasoft >LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp26188.125", size 6534
< 2019-08-24 12:17:43.035 CDT telsasoft >LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp26188.97", size 72828152
< 2019-08-24 12:17:43.036 CDT telsasoft >LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp26188.112", size 6474
< 2019-08-24 12:17:43.114 CDT telsasoft >LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp26188.90", size 73016512
< 2019-08-24 12:17:50.109 CDT  >LOG:  server process (PID 26188) was terminated by signal 6: Aborted

sudo zgrep -cFw 'pgsql_tmp26188' /var/log/postgresql/crash-postgresql-2019-08-24_121600.log.gz 
=> 116

sudo zgrep -Fw 'pgsql_tmp26188' /var/log/postgresql/crash-postgresql-2019-08-24_121600.log.gz |awk '{s+=$NF}END{print s/1024^3}'
9.86781

FWIW:

 log_temp_files       | 0
 stats_temp_directory | pg_stat_tmp
 temp_file_limit      | -1
 temp_tablespaces     | """"""""""""""

Apparently, the last is garbage from our upgrade script, but in any case it
went to PGDATA.

Justin



On Mon, Aug 26, 2019 at 02:34:31PM +1200, Thomas Munro wrote:
>On Mon, Aug 26, 2019 at 1:44 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
>> On Mon, Aug 26, 2019 at 01:09:19PM +1200, Thomas Munro wrote:
>> > On Sun, Aug 25, 2019 at 3:15 PM Peter Geoghegan <pg@bowt.ie> wrote:
>> > > I was reminded of this issue from last year, which also appeared to
>> > > involve BufFileClose() and a double-free:
>> > >
>> > > https://postgr.es/m/87y3hmee19.fsf@news-spur.riddles.org.uk
>> > >
>> > > That was a BufFile that was under the control of a tuplestore, so it
>> > > was similar to but different from your case. I suspect it's related.
>> >
>> > Hmm.  tuplestore.c follows the same coding pattern as nodeHashjoin.c:
>> > it always nukes its pointer after calling BufFileClose(), so it
>> > shouldn't be capable of calling it twice for the same pointer, unless
>> > we have two copies of that pointer somehow.
>> >
>> > Merlin's reported a double-free apparently in ExecHashJoin(), not
>> > ExecHashJoinNewBatch() like this report.  Unfortunately that tells us
>> > very little.
>
>Here's another one:
>
>https://www.postgresql.org/message-id/flat/20170601081104.1500.56202%40wrigleys.postgresql.org
>
>Hmm.  Also on RHEL/CentOS 6, and also involving sorting, hashing,
>BufFileClose() but this time the glibc double free error is in
>repalloc().
>
>And another one (repeatedly happening):
>
>https://www.postgresql.org/message-id/flat/3976998C-8D3B-4825-9B10-69ECB70A597A%40appnexus.com
>
>Also on RHEL/CentOS 6, this time a sort in one case and a hash join
>in another case.
>
>Of course it's entirely possible that we have a bug here and I'm very
>keen to find it, but I can't help noticing the common factor here is
>that they're all running ancient RHEL 6.x releases, except Merlin who
>didn't say.  Merlin?
>

It'd be interesting to know the exact glibc version for those machines.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Thomas Munro <thomas.munro@gmail.com> writes:
> Of course it's entirely possible that we have a bug here and I'm very
> keen to find it, but I can't help noticing the common factor here is
> that they're all running ancient RHEL 6.x releases, except Merlin who
> didn't say.  Merlin?

Hm, Justin said he had
glibc-2.12-1.192.el6.x86_64
which is a little bit behind the times but not *that* ancient.

For reference, attached is the rpm --changelog output on an up-to-date
RHEL6 box for glibc running back to 2.12-1.192.  Unfortunately a lot
of the referenced bugzilla entries aren't public, but the one-liner
descriptions don't seem to suggest that they found any heap overruns
that would be interesting for us.

However ... there is some pretty interesting info at
https://bugzilla.redhat.com/show_bug.cgi?id=1338673
suggesting that compiling with a late-model gcc against older RHEL6
headers could result in bad code.  I wonder whether the reporters'
servers were built using such a configuration.  (Although the linkage,
if any, to this report still wouldn't be very clear.)

            regards, tom lane


* Mon Apr 01 2019 Florian Weimer <fweimer@redhat.com> - 2.12-1.212.3
- Remove abort() warning in manual (#1577437)

* Mon Apr 01 2019 Florian Weimer <fweimer@redhat.com> - 2.12-1.212.2
- ja_JP: Add new Japanese Era name (#1555930)

* Mon Apr 01 2019 Florian Weimer <fweimer@redhat.com> - 2.12-1.212.1
- regex: Fix false match in trailing characters (#1668169)

* Fri Nov 17 2017 Patsy Franklin <pfrankli@redhat.com> - 2.12-1.212
- CVE-2017-15670: glob: Fix one-byte overflow with GLOB_TILDE (#1504810)
- CVE-2017-15804: glob: Fix buffer overflow in GLOB_TILDE unescaping (#1504810)

* Mon Jun 19 2017 Florian Weimer <fweimer@redhat.com> - 2.12-1.211
- Avoid large allocas in the dynamic linker (#1452717)

* Wed Mar 29 2017 Carlos O'Donell <carlos@redhat.com> - 2.12-1.210
- Fix thread cancellation issues for setmntent() and others (#1437147).

* Wed Jan 25 2017 Florian Weimer <fweimer@redhat.com> - 2.12-1.209
- Fix AF_INET6 getaddrinfo with nscd (#1416496)

* Tue Oct 18 2016 Carlos O'Donell <carlos@redhat.com> - 2.12-1.208
- Update tests for struct sockaddr_storage changes (#1338673)

* Mon Oct 17 2016 Martin Sebor <msebor@redhat.com> - 2.12-1.207
- Use FL_CLOEXEC in internal calls to fopen (#1012343).

* Mon Oct 17 2016 Carlos O'Donell <carlos@redhat.com> - 2.12-1.206
- Fix CVE-2015-8779 glibc: Unbounded stack allocation in catopen function
  (#1358015).

* Mon Oct 17 2016 DJ Delorie <dj@redhat.com> - 2.12-1.205
- Make padding in struct sockaddr_storage explicit (#1338673)

* Thu Oct 13 2016 Carlos O'Donell <carlos@redhat.com> - 2.12-1.204
- Fix detection of Intel FMA hardware (#1384281).

* Tue Oct 11 2016 Carlos O'Donell <carlos@redhat.com> - 2.12-1.203
- Add support for el_GR@euro, ur_IN, and wal_ET locales (#1101858).

* Tue Oct 11 2016 Patsy Franklin <pfrankli@redhat.com> - 2.12-1.202
- Change malloc/tst-malloc-thread-exit.c to use fewer threads and
  avoid timeout (#1318380).

* Tue Oct 11 2016 Patsy Franklin <pfrankli@redhat.com> - 2.12-1.201
- df can fail on some systems (#1307029).

* Wed Sep 21 2016 DJ Delorie <dj@redhat.com> - 2.12-1.200
- Log uname, cpuinfo, meminfo during build (#1307029).

* Mon Sep 12 2016 DJ Delorie <dj@redhat.com> - 2.12-1.199
- Draw graphs for heap and stack only if MAXSIZE_HEAP and MAXSIZE_STACK
  are non-zero (#1331304).

* Mon Sep 12 2016 DJ Delorie <dj@redhat.com> - 2.12-1.198
- Avoid unneeded calls to __check_pf in getaddrinfo (#1270950)

* Mon Sep 12 2016 Martin Sebor <msebor@redhat.com> - 2.12-1.197
- Fix CVE-2015-8778 glibc: Integer overflow in hcreate and hcreate_r
  (#1358013).

* Mon Sep 12 2016 Martin Sebor <msebor@redhat.com> - 2.12-1.196
- Fix CVE-2015-8776 glibc: Segmentation fault caused by passing
  out-of-range data to strftime() (#1358011).

* Mon Sep 12 2016 Florian Weimer <fweimer@redhat.com> - 2.12-1.195
- tzdata-update: Ignore umask setting (#1373646)

* Thu Sep 08 2016 Florian Weimer <fweimer@redhat.com> - 2.12-1.194
- CVE-2014-9761: Fix unbounded stack allocation in nan* (#1358014)

* Thu Feb 04 2016 Florian Weimer <fweimer@redhat.com> - 2.12-1.193
- Avoid using uninitialized data in getaddrinfo (#1223095)

* Thu Jan 28 2016 Carlos O'Donell <carlos@redhat.com> - 2.12-1.192
- Update fix for CVE-2015-7547 (#1296029).



On Mon, Aug 26, 2019 at 12:45:24PM -0400, Tom Lane wrote:
> However ... there is some pretty interesting info at
> https://bugzilla.redhat.com/show_bug.cgi?id=1338673
> suggesting that compiling with a late-model gcc against older RHEL6
> headers could result in bad code.  I wonder whether the reporters'
> servers were built using such a configuration.  (Although the linkage,
> if any, to this report still wouldn't be very clear.)

That's a question for Devrim ?

Can you tell which glibc headers were installed when compiling this package?
postgresql11-server-11.5-1PGDG.rhel6.x86_64

I can tell it was compiled using
version | PostgreSQL 11.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23), 64-bit

Maybe it doesn't matter, since, for better or worse, it is not a "late model" gcc.

https://www.gnu.org/software/gcc/gcc-4.4/
GCC 4.4.7
    March 13, 2012 (changes) 
This release series is no longer maintained.

Justin



Justin Pryzby <pryzby@telsasoft.com> writes:
> On Mon, Aug 26, 2019 at 12:45:24PM -0400, Tom Lane wrote:
>> However ... there is some pretty interesting info at
>> https://bugzilla.redhat.com/show_bug.cgi?id=1338673
>> suggesting that compiling with a late-model gcc against older RHEL6
>> headers could result in bad code.  I wonder whether the reporters'
>> servers were built using such a configuration.  (Although the linkage,
>> if any, to this report still wouldn't be very clear.)

> I can tell it was compiled using
> version | PostgreSQL 11.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23), 64-bit

Ah, that appears to be the default compiler for RHEL6, so that theory
is out the window.  It's still interesting that we're only seeing this
reported from RHEL6 ... maybe there's something specific to the code
that this gcc version generates?

            regards, tom lane



On Sun, Aug 25, 2019 at 9:35 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Mon, Aug 26, 2019 at 1:44 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > On Mon, Aug 26, 2019 at 01:09:19PM +1200, Thomas Munro wrote:
> > > On Sun, Aug 25, 2019 at 3:15 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > > > I was reminded of this issue from last year, which also appeared to
> > > > involve BufFileClose() and a double-free:
> > > >
> > > > https://postgr.es/m/87y3hmee19.fsf@news-spur.riddles.org.uk
> > > >
> > > > That was a BufFile that was under the control of a tuplestore, so it
> > > > was similar to but different from your case. I suspect it's related.
> > >
> > > Hmm.  tuplestore.c follows the same coding pattern as nodeHashjoin.c:
> > > it always nukes its pointer after calling BufFileClose(), so it
> > > shouldn't be capable of calling it twice for the same pointer, unless
> > > we have two copies of that pointer somehow.
> > >
> > > Merlin's reported a double-free apparently in ExecHashJoin(), not
> > > ExecHashJoinNewBatch() like this report.  Unfortunately that tells us
> > > very little.
>
> Here's another one:
>
> https://www.postgresql.org/message-id/flat/20170601081104.1500.56202%40wrigleys.postgresql.org
>
> Hmm.  Also on RHEL/CentOS 6, and also involving sorting, hashing, and
> BufFileClose(), but this time the glibc double free error is in
> repalloc().
>
> And another one (repeatedly happening):
>
> https://www.postgresql.org/message-id/flat/3976998C-8D3B-4825-9B10-69ECB70A597A%40appnexus.com
>
> Also on RHEL/CentOS 6, this time a sort in one case and a hash join
> in another case.
>
> Of course it's entirely possible that we have a bug here and I'm very
> keen to find it, but I can't help noticing the common factor here is
> that they're all running ancient RHEL 6.x releases, except Merlin who
> didn't say.  Merlin?

Just noticed this.
redhat-release: "Red Hat Enterprise Linux Server release 6.9 (Santiago)"

merlin



On Mon, Aug 26, 2019 at 12:01 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Justin Pryzby <pryzby@telsasoft.com> writes:
> > On Mon, Aug 26, 2019 at 12:45:24PM -0400, Tom Lane wrote:
> >> However ... there is some pretty interesting info at
> >> https://bugzilla.redhat.com/show_bug.cgi?id=1338673
> >> suggesting that compiling with a late-model gcc against older RHEL6
> >> headers could result in bad code.  I wonder whether the reporters'
> >> servers were built using such a configuration.  (Although the linkage,
> >> if any, to this report still wouldn't be very clear.)
>
> > I can tell it was compiled using
> > version | PostgreSQL 11.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23), 64-bit
>
> Ah, that appears to be the default compiler for RHEL6, so that theory
> is out the window.  It's still interesting that we're only seeing this
> reported from RHEL6 ... maybe there's something specific to the code
> that this gcc version generates?

FWIW, I've got the same compiler (which is natural, we are likely
using the same packaging):
PostgreSQL 9.6.12 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313
(Red Hat 4.4.7-23), 64-bit

merlin




From
Merlin Moncure
Date:
On Tue, Aug 27, 2019 at 5:52 PM Merlin Moncure <mmoncure@gmail.com> wrote:
>
> On Mon, Aug 26, 2019 at 12:01 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > Justin Pryzby <pryzby@telsasoft.com> writes:
> > > On Mon, Aug 26, 2019 at 12:45:24PM -0400, Tom Lane wrote:
> > >> However ... there is some pretty interesting info at
> > >> https://bugzilla.redhat.com/show_bug.cgi?id=1338673
> > >> suggesting that compiling with a late-model gcc against older RHEL6
> > >> headers could result in bad code.  I wonder whether the reporters'
> > >> servers were built using such a configuration.  (Although the linkage,
> > >> if any, to this report still wouldn't be very clear.)
> >
> > > I can tell it was compiled using
> > > version | PostgreSQL 11.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23), 64-bit
> >
> > Ah, that appears to be the default compiler for RHEL6, so that theory
> > is out the window.  It's still interesting that we're only seeing this
> > reported from RHEL6 ... maybe there's something specific to the code
> > that this gcc version generates?
>
> FWIW, I've got the same compiler (which is natural, we are likely
> using the same packaging):
> PostgreSQL 9.6.12 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313
> (Red Hat 4.4.7-23), 64-bit

I'm looking for more common threads here.  One interesting fact about
this server is that we just realized (!) that cron was attempting to
email huge log files and failing due to misconfigured email.  The mail
queue was building into the gigabytes, so that on a half-hour cadence the
server was running out of memory until the mail process crashed.
Postgres ran just fine through all of this (which is pretty cool).
So, question: were there any memory issues on the other server on or
around the crash?

merlin