Thread: pg11.5: ExecHashJoinNewBatch: glibc detected...double free orcorruption (!prev)
pg11.5: ExecHashJoinNewBatch: glibc detected...double free orcorruption (!prev)
From
Justin Pryzby
Date:
Core was generated by `postgres: telsasoft ts 10.100.2.162(33500) SELECT '. Program terminated with signal 6, Aborted. #0 0x00000039ff6325e5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 64 return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig); Missing separate debuginfos, use: debuginfo-install audit-libs-2.4.5-3.el6.x86_64 cyrus-sasl-lib-2.1.23-15.el6_6.2.x86_64geos36-3.6.3-1.rhel6.1.x86_64 json-c-0.11-12.el6.x86_64 keyutils-libs-1.4-5.el6.x86_64libcom_err-1.41.12-22.el6.x86_64 libgcc-4.4.7-17.el6.x86_64 libicu-4.2.1-14.el6.x86_64 libselinux-2.0.94-7.el6.x86_64libstdc++-4.4.7-17.el6.x86_64 nspr-4.11.0-1.el6.x86_64 nss-3.21.0-8.el6.x86_64 nss-softokn-freebl-3.14.3-23.3.el6_8.x86_64nss-util-3.21.0-2.el6.x86_64 postgis24_11-2.4.5-1.rhel6.1.x86_64 proj49-4.9.3-3.rhel6.1.x86_64zlib-1.2.3-29.el6.x86_64 (gdb) bt #0 0x00000039ff6325e5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 #1 0x00000039ff633dc5 in abort () at abort.c:92 #2 0x00000039ff6704f7 in __libc_message (do_abort=2, fmt=0x39ff758a60 "*** glibc detected *** %s: %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:198 #3 0x00000039ff675f3e in malloc_printerr (action=3, str=0x39ff758df0 "double free or corruption (!prev)", ptr=<value optimizedout>, ar_ptr=<value optimized out>) at malloc.c:6360 #4 0x00000039ff678dd0 in _int_free (av=0x39ff98e120, p=0x1d40b090, have_lock=0) at malloc.c:4846 #5 0x00000000006269e5 in ExecHashJoinNewBatch (pstate=0x2771218) at nodeHashjoin.c:1058 #6 ExecHashJoinImpl (pstate=0x2771218) at nodeHashjoin.c:539 #7 ExecHashJoin (pstate=0x2771218) at nodeHashjoin.c:565 #8 0x00000000006131d0 in ExecProcNodeInstr (node=0x2771218) at execProcnode.c:461 #9 0x0000000000633286 in ExecProcNode (pstate=0x2771108) at ../../../src/include/executor/executor.h:247 #10 ExecSort (pstate=0x2771108) at nodeSort.c:107 #11 0x00000000006131d0 in ExecProcNodeInstr (node=0x2771108) at execProcnode.c:461 #12 0x000000000061d51e in ExecProcNode (aggstate=0x2770d30) at ../../../src/include/executor/executor.h:247 #13 fetch_input_tuple (aggstate=0x2770d30) at nodeAgg.c:406 #14 0x000000000061ee70 in agg_retrieve_direct (pstate=0x2770d30) at nodeAgg.c:1755 #15 ExecAgg (pstate=0x2770d30) at nodeAgg.c:1570 #16 0x00000000006131d0 in ExecProcNodeInstr (node=0x2770d30) at execProcnode.c:461 #17 0x000000000060f4b7 in ExecProcNode (queryDesc=0x2940950, direction=<value optimized out>, count=0, execute_once=48) at../../../src/include/executor/executor.h:247 #18 ExecutePlan (queryDesc=0x2940950, direction=<value optimized out>, count=0, execute_once=48) at execMain.c:1723 #19 standard_ExecutorRun (queryDesc=0x2940950, direction=<value optimized out>, count=0, execute_once=48) at execMain.c:364 #20 0x00007ff08f74f618 in pgss_ExecutorRun (queryDesc=0x2940950, direction=ForwardScanDirection, count=0, execute_once=true)at pg_stat_statements.c:892 #21 0x00007ff08f2cc81d in explain_ExecutorRun (queryDesc=0x2940950, direction=ForwardScanDirection, count=0, execute_once=true)at auto_explain.c:281 #22 0x000000000075444b in PortalRunSelect (portal=0x2571f00, forward=<value optimized out>, count=0, dest=<value optimizedout>) at pquery.c:932 #23 0x0000000000755631 in PortalRun (portal=0x2571f00, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x928e7a8,altdest=0x928e7a8, completionTag=0x7ffcaa940670 "") at pquery.c:773 #24 0x0000000000752267 in exec_simple_query ( query_string=0x2753e08 "--BEGIN SQL\nSELECT * FROM\n\n(SELECT site_office AS site_gran,\n \tsite_location AS cell,\n\tdisplay_site_name(sect_name, site_name, sect_mscid) AS sitename,\n\tperiod, data_cell.bh_day,\n\n", '-' <repeats17 times>, "Ac"...) at postgres.c:1145 #25 0x0000000000753211 in PostgresMain (argc=<value optimized out>, argv=<value optimized out>, dbname=0x252b1f8 "ts", username=<valueoptimized out>) at postgres.c:4182 #26 0x00000000006e2587 in BackendRun (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:4358 #27 BackendStartup (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:4030 #28 ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1707 #29 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1380 #30 0x0000000000656a80 in main (argc=3, argv=0x24f7570) at main.c:228 I found this report from 9.6 last month; copying those parties. https://www.postgresql.org/message-id/CAHyXU0xzwzgnUTyK62ZkG0_1CQsBHwnTVT2TSX7iwTEv1ve9ag%40mail.gmail.com Merlin: what OS are you running on and what postgres package, or was it compiled locally ? We're running: postgresql11-server-11.5-1PGDG.rhel6.x86_64 CentOS release 6.8 glibc-2.12-1.192.el6.x86_64 linux 2.6.32-754.3.5.el6.x86_64 DMI: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015 Model name: Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz I see this is available but not installed: 2.12-1.212.el6_10.3 The VM has no known issues. ts=# SELECT * FROM pg_config(); .. | CONFIGURE | '--enable-rpath' '--prefix=/usr/pgsql-11' '--includedir=/usr/pgsql-11/include' '--mandir=/usr/pgsql-11/share/man''--datadir=/usr/pgsql-11/share' '--libdir=/usr/pgsql-11/lib' '--with-icu' '--with-perl''--with-python' '--with-tcl' '--with-tclconfig=/usr/lib64' '--with-openssl' '--with-pam' '--with-gssapi' '--with-includes=/usr/include''--with-libraries=/usr/lib64' '--enable-nls' '--with-uuid=e2fs' '--with-libxml' '--with-libxslt''--with-ldap' '--with-system-tzdata=/usr/share/zoneinfo' '--sysconfdir=/etc/sysconfig/pgsql' '--docdir=/usr/pgsql-11/doc''--htmldir=/usr/pgsql-11/doc/html' 'CFLAGS=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions-fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic' 'LDFLAGS=-Wl,--as-needed' 'PKG_CONFIG_PATH=:/usr/lib64/pkgconfig:/usr/share/pkgconfig''ICU_CFLAGS=-I/usr/include' 'ICU_LIBS=-L/usr/lib64 -licui18n-licuuc -licudata' | CC | gcc | CPPFLAGS | -I/usr/include -D_GNU_SOURCE -I/usr/include/libxml2 -I/usr/include | CFLAGS | -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute-Wformat-security -fno-strict-aliasing -fwrapv -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions-fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic | CFLAGS_SL | -fPIC | LDFLAGS | -Wl,--as-needed -L/usr/lib64 -Wl,--as-needed -Wl,-rpath,'/usr/pgsql-11/lib',--enable-new-dtags | LDFLAGS_EX | | LDFLAGS_SL | | LIBS | -lpgcommon -lpgport -lpthread -lxslt -lxml2 -lpam -lssl -lcrypto -lgssapi_krb5 -lz -lreadline -lrt-lcrypt -ldl -lm #0 0x00000039ff6325e5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 resultvar = 0 pid = <value optimized out> selftid = <value optimized out> #1 0x00000039ff633dc5 in abort () at abort.c:92 save_stage = 2 act = {__sigaction_handler = {sa_handler = 0x7ffcaa93f5c8, sa_sigaction = 0x7ffcaa93f5c8}, sa_mask = {__val = {140723170309552,140723170319605, 49, 249099024494, 3, 140723170309562, 6, 249099024498, 2, 140723170309550, 2, 249099017763, 1, 249099024494, 3, 140723170309558}}, sa_flags = 10, sa_restorer= 0x39ff757872} sigs = {__val = {32, 0 <repeats 15 times>}} #2 0x00000039ff6704f7 in __libc_message (do_abort=2, fmt=0x39ff758a60 "*** glibc detected *** %s: %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:198 ap = {{gp_offset = 40, fp_offset = 48, overflow_arg_area = 0x7ffcaa93ff30, reg_save_area = 0x7ffcaa93fe40}} ap_copy = {{gp_offset = 16, fp_offset = 48, overflow_arg_area = 0x7ffcaa93ff30, reg_save_area = 0x7ffcaa93fe40}} fd = 2 on_2 = <value optimized out> list = <value optimized out> nlist = <value optimized out> cp = <value optimized out> written = <value optimized out> #3 0x00000039ff675f3e in malloc_printerr (action=3, str=0x39ff758df0 "double free or corruption (!prev)", ptr=<value optimizedout>, ar_ptr=<value optimized out>) at malloc.c:6360 buf = "000000001d40b0a0" cp = <value optimized out> #4 0x00000039ff678dd0 in _int_free (av=0x39ff98e120, p=0x1d40b090, have_lock=0) at malloc.c:4846 size = <value optimized out> fb = <value optimized out> nextchunk = <value optimized out> nextsize = <value optimized out> nextinuse = <value optimized out> prevsize = <value optimized out> bck = <value optimized out> fwd = <value optimized out> errstr = <value optimized out> locked = <value optimized out> #5 0x00000000006269e5 in ExecHashJoinNewBatch (pstate=0x2771218) at nodeHashjoin.c:1058 nbatch = <value optimized out> curbatch = <value optimized out> slot = <value optimized out> hashvalue = 1756558929 hashtable = 0x3c4ed48 innerFile = 0x1d40b0d8 #6 ExecHashJoinImpl (pstate=0x2771218) at nodeHashjoin.c:539 outerNode = 0x27714d8 hashNode = 0x2772348 econtext = <value optimized out> node = 0x2771218 joinqual = 0x0 otherqual = 0x0 ---Type <return> to continue, or q <return> to quit--- hashtable = 0xc8 hashvalue = 0 batchno = 41357576 parallel_state = <value optimized out> #7 ExecHashJoin (pstate=0x2771218) at nodeHashjoin.c:565 No locals. #8 0x00000000006131d0 in ExecProcNodeInstr (node=0x2771218) at execProcnode.c:461 result = <value optimized out> #9 0x0000000000633286 in ExecProcNode (pstate=0x2771108) at ../../../src/include/executor/executor.h:247 No locals. #10 ExecSort (pstate=0x2771108) at nodeSort.c:107 plannode = <value optimized out> outerNode = 0x2771218 tupDesc = <value optimized out> node = 0x2771108 estate = 0x2770a40 dir = ForwardScanDirection tuplesortstate = 0x3c87160 slot = <value optimized out> ... If GDB isn't lying and I'm not being confuse by git, that's: BufFileClose(innerFile); Which does (at least): pfree(file->files); pfree(file->offsets); pfree(file); (gdb) p *innerFile $2 = {numFiles = 1, files = 0xa421328, offsets = 0xa421310, isInterXact = false, dirty = false, readOnly = false, fileset= 0x0, name = 0x0, resowner = 0x24f93e0, curFile = 0, curOffset = 73016512, pos = 0, nbytes = 0, buffer = { data = '\000' <repeats 44 times>, "Q\366\262h\220\004\000\000\000\000\000\000\000\000L\000\003\000(\377\377\377\377\377\177\372\377\377\017\000\000\000\000\000\000\000\000\257\321\345\333\063\002\000\003", '\000'<repeats 23 times>"\204, \003\000\000\000\000\000\000K\000\000\000\000\000\000\000K\000\000\000\000\000\000\000K\000\000\000\000\000\000\000K\000\000\000\000\000\000\000b\000\000\000\000\000\000\000c\000\000\000\000\000\000\000c\000\000\000\000\000\000\000d", '\000'<repeats 15 times>, "q\002\000\000\000\000\000\000d", '\000' <repeats 23 times>, "I\000\000\000\000\000\000\000\344H\a\000\000\000\000\000\017\000\210\026\000\310\024\000\024H\000\000\000\000\000\000\017\000\210\035\000H!\027\000\210\216\000T\vB\017\304\t\027\000\210\a\000\206\b\033\030.\"", '\000'<repeats 11 times>..., force_align_d = 0, force_align_i64 = 0}} (gdb) p innerFile->files[0] $8 = 2397
Re: pg11.5: ExecHashJoinNewBatch: glibc detected...double free orcorruption (!prev)
From
Peter Geoghegan
Date:
On Sat, Aug 24, 2019 at 7:25 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > Merlin: what OS are you running on and what postgres package, or was it > compiled locally ? I was reminded of this issue from last year, which also appeared to involve BufFileClose() and a double-free: https://postgr.es/m/87y3hmee19.fsf@news-spur.riddles.org.uk That was a BufFile that was under the control of a tuplestore, so it was similar to but different from your case. I suspect it's related. -- Peter Geoghegan
Re: pg11.5: ExecHashJoinNewBatch: glibc detected...double free orcorruption (!prev)
From
Thomas Munro
Date:
On Sun, Aug 25, 2019 at 3:15 PM Peter Geoghegan <pg@bowt.ie> wrote: > I was reminded of this issue from last year, which also appeared to > involve BufFileClose() and a double-free: > > https://postgr.es/m/87y3hmee19.fsf@news-spur.riddles.org.uk > > That was a BufFile that was under the control of a tuplestore, so it > was similar to but different from your case. I suspect it's related. Hmm. tuplestore.c follows the same coding pattern as nodeHashjoin.c: it always nukes its pointer after calling BufFileFlush(), so it shouldn't be capable of calling it twice for the same pointer, unless we have two copies of that pointer somehow. Merlin's reported a double-free apparently in ExecHashJoin(), not ExecHashJoinNewBatch() like this report. Unfortunately that tells us very little. On Sun, Aug 25, 2019 at 2:25 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > #4 0x00000039ff678dd0 in _int_free (av=0x39ff98e120, p=0x1d40b090, have_lock=0) at malloc.c:4846 > #5 0x00000000006269e5 in ExecHashJoinNewBatch (pstate=0x2771218) at nodeHashjoin.c:1058 Can you reproduce this or was it a one-off crash? Hmm. We don't have enough stack frames to know where in here, and presumably aset.c, this is, but as you noted it's got to be somewhere here: void BufFileClose(BufFile *file) { int i; /* flush any unwritten data */ BufFileFlush(file); /* close and delete the underlying file(s) */ for (i = 0; i < file->numFiles; i++) FileClose(file->files[i]); /* release the buffer space */ pfree(file->files); pfree(file->offsets); pfree(file); } BufFileFlush() and FileClose() don't seem to be able to reach free(). pfree() always reaches free() for sizes > allocChunkLimit (8KB in ExecutorState). Given numFiles = 1 (as you showed), I expect file->files and file->offsets to be small allocations, and file itself to be large due to the 8KB buffer inside it. Some possibilities: 1. Somehow we actually called BufFileClose() twice. Hard to see how we'd do that, as mentioned. 2. Somehow the BufFile was created in the wrong memory context, and the memory was freed earlier. Note that ExecHashJoinSaveTuple()'s comment requires you to call it with CurrentMemoryContext == the executor context, but nothing in the code enforces that. As for the tuplestore.c case, note also that tuplestore.c explicitly sets CurrentResourceOwner, but not CurrentMemoryContext. I suppose there might be some obscure path somewhere, possibly through a custom operator or suchlike, that leaves us in a strange memory context, or something like that? But then I feel like we'd have received reproducible reports and a test case by now. 3. Random memory corruption caused by buffer overrun who-knows-where. > glibc-2.12-1.192.el6.x86_64 > linux 2.6.32-754.3.5.el6.x86_64 Greetings, time traveller! I see that the tuplestore.c report was also on a system of that vintage. Hmm. > #10 ExecSort (pstate=0x2771108) at nodeSort.c:107 > plannode = <value optimized out> > outerNode = 0x2771218 > tupDesc = <value optimized out> > node = 0x2771108 > estate = 0x2770a40 > dir = ForwardScanDirection > tuplesortstate = 0x3c87160 > slot = <value optimized out> It's interesting that a sort was involved here and it owns a tuplestore, but hard to see the connection. > (gdb) p *innerFile > $2 = {numFiles = 1, files = 0xa421328, offsets = 0xa421310, isInterXact = false, dirty = false, readOnly = false, fileset= 0x0, name = 0x0, resowner = 0x24f93e0, curFile = 0, curOffset = 73016512, > > pos = 0, nbytes = 0, buffer = { > data = '\000' <repeats 44 times>, "Q\366\262h\220\004\000\000\000\000\000\000\000\000L\000\003\000(\377\377\377\377\377\177\372\377\377\017\000\000\000\000\000\000\000\000\257\321\345\333\063\002\000\003", '\000'<repeats 23 times>"\204, \003\000\000\000\000\000\000K\000\000\000\000\000\000\000K\000\000\000\000\000\000\000K\000\000\000\000\000\000\000K\000\000\000\000\000\000\000b\000\000\000\000\000\000\000c\000\000\000\000\000\000\000c\000\000\000\000\000\000\000d", '\000'<repeats 15 times>, "q\002\000\000\000\000\000\000d", '\000' <repeats 23 times>, "I\000\000\000\000\000\000\000\344H\a\000\000\000\000\000\017\000\210\026\000\310\024\000\024H\000\000\000\000\000\000\017\000\210\035\000H!\027\000\210\216\000T\vB\017\304\t\027\000\210\a\000\206\b\033\030.\"", '\000'<repeats 11 times>..., force_align_d = 0, force_align_i64 = 0}} > > (gdb) p innerFile->files[0] > $8 = 2397 Hmm. That all looks pretty sane from here, but tells us nothing about whether it was already freed and if so where. -- Thomas Munro https://enterprisedb.com
Re: pg11.5: ExecHashJoinNewBatch: glibc detected...double free orcorruption (!prev)
From
Justin Pryzby
Date:
On Mon, Aug 26, 2019 at 01:09:19PM +1200, Thomas Munro wrote: > On Sun, Aug 25, 2019 at 3:15 PM Peter Geoghegan <pg@bowt.ie> wrote: > > I was reminded of this issue from last year, which also appeared to > > involve BufFileClose() and a double-free: > > > > https://postgr.es/m/87y3hmee19.fsf@news-spur.riddles.org.uk > > > > That was a BufFile that was under the control of a tuplestore, so it > > was similar to but different from your case. I suspect it's related. > > Hmm. tuplestore.c follows the same coding pattern as nodeHashjoin.c: > it always nukes its pointer after calling BufFileFlush(), so it > shouldn't be capable of calling it twice for the same pointer, unless > we have two copies of that pointer somehow. > > Merlin's reported a double-free apparently in ExecHashJoin(), not > ExecHashJoinNewBatch() like this report. Unfortunately that tells us > very little. > > On Sun, Aug 25, 2019 at 2:25 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > > #4 0x00000039ff678dd0 in _int_free (av=0x39ff98e120, p=0x1d40b090, have_lock=0) at malloc.c:4846 > > #5 0x00000000006269e5 in ExecHashJoinNewBatch (pstate=0x2771218) at nodeHashjoin.c:1058 > > Can you reproduce this or was it a one-off crash? The query was of our large reports, and this job runs every 15min against recently-loaded data; in the immediate case, between 2019-08-24t08:00:00 and 2019-08-24 09:00:00 I can rerun it fine, and I ran it in a loop for awhile last night with no issues. time psql ts -f tmp/sql-2019-08-24.1 |wc 5416 779356 9793941 Since it was asked in other thread Peter mentioned: ts=# SHOW work_mem; work_mem | 128MB ts=# SHOW shared_buffers ; shared_buffers | 1536MB > might be some obscure path somewhere, possibly through a custom > operator or suchlike, that leaves us in a strange memory context, or > something like that? But then I feel like we'd have received > reproducible reports and a test case by now. No custom operator in sight. Just NATURAL JOIN on integers, and WHERE on timestamp, some plpgsql and int[]. Justin
Re: pg11.5: ExecHashJoinNewBatch: glibc detected...double free orcorruption (!prev)
From
Thomas Munro
Date:
On Mon, Aug 26, 2019 at 1:44 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > On Mon, Aug 26, 2019 at 01:09:19PM +1200, Thomas Munro wrote: > > On Sun, Aug 25, 2019 at 3:15 PM Peter Geoghegan <pg@bowt.ie> wrote: > > > I was reminded of this issue from last year, which also appeared to > > > involve BufFileClose() and a double-free: > > > > > > https://postgr.es/m/87y3hmee19.fsf@news-spur.riddles.org.uk > > > > > > That was a BufFile that was under the control of a tuplestore, so it > > > was similar to but different from your case. I suspect it's related. > > > > Hmm. tuplestore.c follows the same coding pattern as nodeHashjoin.c: > > it always nukes its pointer after calling BufFileFlush(), so it > > shouldn't be capable of calling it twice for the same pointer, unless > > we have two copies of that pointer somehow. > > > > Merlin's reported a double-free apparently in ExecHashJoin(), not > > ExecHashJoinNewBatch() like this report. Unfortunately that tells us > > very little. Here's another one: https://www.postgresql.org/message-id/flat/20170601081104.1500.56202%40wrigleys.postgresql.org Hmm. Also on RHEL/CentOS 6, and also involving sorting, hashing, BufFileClose() but this time the glibc double free error is in repalloc(). And another one (repeatedly happening): https://www.postgresql.org/message-id/flat/3976998C-8D3B-4825-9B10-69ECB70A597A%40appnexus.com Also on RHEL/CentOS 6, this time a sort in once case and a hash join in another case. Of course it's entirely possible that we have a bug here and I'm very keen to find it, but I can't help noticing the common factor here is that they're all running ancient RHEL 6.x releases, except Merlin who didn't say. Merlin? -- Thomas Munro https://enterprisedb.com
Re: pg11.5: ExecHashJoinNewBatch: glibc detected...double free orcorruption (!prev)
From
Justin Pryzby
Date:
I'm not sure but maybe this is useful ? |(gdb) p VfdCache[2397] |$9 = {fd = -1, fdstate = 0, resowner = 0x24f93e0, nextFree = 2393, lruMoreRecently = 0, lruLessRecently = 2360, seekPos= 73016512, fileSize = 0, fileName = 0x0, fileFlags = 2, fileMode = 384} Knowing this report, very possibly this was a tempfile, possible a parallel fileset. I don't see parallel workers in the query plan, but I do have external sort: -> Sort (cost=20801.16..20801.60 rows=175 width=628) (actualtime=39150.619..40730.793 rows=2002440 loops=1) Sort Key: data_cell.sect_id Sort Method: external sort Disk: 1613224kB -> Hash Join (cost=20649.46..20794.64 rows=175 width=628)(actual time=736.734..2423.020 rows=2002440 loops=1) Hash Cond: (eric_enodeb_cell_201908.start_time= data_cell.period) Note, we run report with enable_nestloop=off (say what you will). And the report runs within an transaction which we roll back. I suspect that's maybe relevant for cleaning up files. Ah, it's more than 24h old but saved the logfile with crash marker, so I found: sudo zgrep 26188 /var/log/postgresql/crash-postgresql-2019-08-24_121600.log.gz < 2019-08-24 12:16:17.037 CDT telsasoft >LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp26188.5", size 336726940 < 2019-08-24 12:16:17.038 CDT telsasoft >LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp26188.54", size 6354 < 2019-08-24 12:16:20.081 CDT telsasoft >LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp26188.6", size 270914376 < 2019-08-24 12:16:20.083 CDT telsasoft >LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp26188.39", size 7024 < 2019-08-24 12:16:23.464 CDT telsasoft >LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp26188.7", size 213332328 < 2019-08-24 12:16:23.465 CDT telsasoft >LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp26188.37", size 5984 ... < 2019-08-24 12:17:42.966 CDT telsasoft >LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp26188.125", size 6534 < 2019-08-24 12:17:43.035 CDT telsasoft >LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp26188.97", size 72828152 < 2019-08-24 12:17:43.036 CDT telsasoft >LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp26188.112", size 6474 < 2019-08-24 12:17:43.114 CDT telsasoft >LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp26188.90", size 73016512 < 2019-08-24 12:17:50.109 CDT >LOG: server process (PID 26188) was terminated by signal 6: Aborted sudo zgrep -cFw 'pgsql_tmp26188' /var/log/postgresql/crash-postgresql-2019-08-24_121600.log.gz => 116 sudo zgrep -Fw 'pgsql_tmp26188' /var/log/postgresql/crash-postgresql-2019-08-24_121600.log.gz |awk '{s+=$NF}END{print s/1024^3}' 9.86781 FWIW: log_temp_files | 0 stats_temp_directory | pg_stat_tmp temp_file_limit | -1 temp_tablespaces | """""""""""""" Apparently, the last is garbage from our upgrade script, but in any case it went to PGDATA. Justin
Re: pg11.5: ExecHashJoinNewBatch: glibc detected...double free orcorruption (!prev)
From
Tomas Vondra
Date:
On Mon, Aug 26, 2019 at 02:34:31PM +1200, Thomas Munro wrote: >On Mon, Aug 26, 2019 at 1:44 PM Justin Pryzby <pryzby@telsasoft.com> wrote: >> On Mon, Aug 26, 2019 at 01:09:19PM +1200, Thomas Munro wrote: >> > On Sun, Aug 25, 2019 at 3:15 PM Peter Geoghegan <pg@bowt.ie> wrote: >> > > I was reminded of this issue from last year, which also appeared to >> > > involve BufFileClose() and a double-free: >> > > >> > > https://postgr.es/m/87y3hmee19.fsf@news-spur.riddles.org.uk >> > > >> > > That was a BufFile that was under the control of a tuplestore, so it >> > > was similar to but different from your case. I suspect it's related. >> > >> > Hmm. tuplestore.c follows the same coding pattern as nodeHashjoin.c: >> > it always nukes its pointer after calling BufFileFlush(), so it >> > shouldn't be capable of calling it twice for the same pointer, unless >> > we have two copies of that pointer somehow. >> > >> > Merlin's reported a double-free apparently in ExecHashJoin(), not >> > ExecHashJoinNewBatch() like this report. Unfortunately that tells us >> > very little. > >Here's another one: > >https://www.postgresql.org/message-id/flat/20170601081104.1500.56202%40wrigleys.postgresql.org > >Hmm. Also on RHEL/CentOS 6, and also involving sorting, hashing, >BufFileClose() but this time the glibc double free error is in >repalloc(). > >And another one (repeatedly happening): > >https://www.postgresql.org/message-id/flat/3976998C-8D3B-4825-9B10-69ECB70A597A%40appnexus.com > >Also on RHEL/CentOS 6, this time a sort in once case and a hash join >in another case. > >Of course it's entirely possible that we have a bug here and I'm very >keen to find it, but I can't help noticing the common factor here is >that they're all running ancient RHEL 6.x releases, except Merlin who >didn't say. Merlin? > It'd be interesting to know the exact glibc version for those machines. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pg11.5: ExecHashJoinNewBatch: glibc detected...double free or corruption (!prev)
From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes: > Of course it's entirely possible that we have a bug here and I'm very > keen to find it, but I can't help noticing the common factor here is > that they're all running ancient RHEL 6.x releases, except Merlin who > didn't say. Merlin? Hm, Justin said he had glibc-2.12-1.192.el6.x86_64 which is a little bit behind the times but not *that* ancient. For reference, attached is the rpm --changelog output on an up-to-date RHEL6 box for glibc running back to 2.12-1.192. Unfortunately a lot of the referenced bugzilla entries aren't public, but the one-liner descriptions don't seem to suggest that they found any heap overruns that would be interesting for us. However ... there is some pretty interesting info at https://bugzilla.redhat.com/show_bug.cgi?id=1338673 suggesting that compiling with a late-model gcc against older RHEL6 headers could result in bad code. I wonder whether the reporters' servers were built using such a configuration. (Although the linkage, if any, to this report still wouldn't be very clear.) regards, tom lane * Mon Apr 01 2019 Florian Weimer <fweimer@redhat.com> - 2.12-1.212.3 - Remove abort() warning in manual (#1577437) * Mon Apr 01 2019 Florian Weimer <fweimer@redhat.com> - 2.12-1.212.2 - ja_JP: Add new Japanese Era name (#1555930) * Mon Apr 01 2019 Florian Weimer <fweimer@redhat.com> - 2.12-1.212.1 - regex: Fix false match in trailing characters (#1668169) * Fri Nov 17 2017 Patsy Franklin <pfrankli@redhat.com> - 2.12-1.212 - CVE-2017-15670: glob: Fix one-byte overflow with GLOB_TILDE (#1504810) - CVE-2017-15804: glob: Fix buffer overflow in GLOB_TILDE unescaping (#1504810) * Mon Jun 19 2017 Florian Weimer <fweimer@redhat.com> - 2.12-1.211 - Avoid large allocas in the dynamic linker (#1452717) * Wed Mar 29 2017 Carlos O'Donell <carlos@redhat.com> - 2.12-1.210 - Fix thread cancellation issues for setmntent() and others (#1437147). * Wed Jan 25 2017 Florian Weimer <fweimer@redhat.com> - 2.12-1.209 - Fix AF_INET6 getaddrinfo with nscd (#1416496) * Tue Oct 18 2016 Carlos O'Donell <carlos@redhat.com> - 2.12-1.208 - Update tests for struct sockaddr_storage changes (#1338673) * Mon Oct 17 2016 Martin Sebor <msebor@redhat.com> - 2.12-1.207 - Use FL_CLOEXEC in internal calls to fopen (#1012343). * Mon Oct 17 2016 Carlos O'Donell <carlos@redhat.com> - 2.12-1.206 - Fix CVE-2015-8779 glibc: Unbounded stack allocation in catopen function (#1358015). * Mon Oct 17 2016 DJ Delorie <dj@redhat.com> - 2.12-1.205 - Make padding in struct sockaddr_storage explicit (#1338673) * Thu Oct 13 2016 Carlos O'Donell <carlos@redhat.com> - 2.12-1.204 - Fix detection of Intel FMA hardware (#1384281). * Tue Oct 11 2016 Carlos O'Donell <carlos@redhat.com> - 2.12-1.203 - Add support for el_GR@euro, ur_IN, and wal_ET locales (#1101858). * Tue Oct 11 2016 Patsy Franklin <pfrankli@redhat.com> - 2.12-1.202 - Change malloc/tst-malloc-thread-exit.c to use fewer threads and avoid timeout (#1318380). * Tue Oct 11 2016 Patsy Franklin <pfrankli@redhat.com> - 2.12-1.201 - df can fail on some systems (#1307029). * Wed Sep 21 2016 DJ Delorie <dj@redhat.com> - 2.12-1.200 - Log uname, cpuinfo, meminfo during build (#1307029). * Mon Sep 12 2016 DJ Delorie <dj@redhat.com> - 2.12-1.199 - Draw graphs for heap and stack only if MAXSIZE_HEAP and MAXSIZE_STACK are non-zero (#1331304). * Mon Sep 12 2016 DJ Delorie <dj@redhat.com> - 2.12-1.198 - Avoid unneeded calls to __check_pf in getadddrinfo (#1270950) * Mon Sep 12 2016 Martin Sebor <msebor@redhat.com> - 2.12-1.197 - Fix CVE-2015-8778 glibc: Integer overflow in hcreate and hcreate_r (#1358013). * Mon Sep 12 2016 Martin Sebor <msebor@redhat.com> - 2.12-1.196 - Fix CVE-2015-8776 glibc: Segmentation fault caused by passing out-of-range data to strftime() (#1358011). * Mon Sep 12 2016 Florian Weimer <fweimer@redhat.com> - 2.12-1.195 - tzdata-update: Ignore umask setting (#1373646) * Thu Sep 08 2016 Florian Weimer <fweimer@redhat.com> - 2.12-1.194 - CVE-2014-9761: Fix unbounded stack allocation in nan* (#1358014) * Thu Feb 04 2016 Florian Weimer <fweimer@redhat.com> - 2.12-1.193 - Avoid using uninitialized data in getaddrinfo (#1223095) * Thu Jan 28 2016 Carlos O'Donell <carlos@redhat.com> - 2.12-1.192 - Update fix for CVE-2015-7547 (#1296029).
Re: pg11.5: ExecHashJoinNewBatch: glibc detected...double free orcorruption (!prev)
From
Justin Pryzby
Date:
On Mon, Aug 26, 2019 at 12:45:24PM -0400, Tom Lane wrote: > However ... there is some pretty interesting info at > https://bugzilla.redhat.com/show_bug.cgi?id=1338673 > suggesting that compiling with a late-model gcc against older RHEL6 > headers could result in bad code. I wonder whether the reporters' > servers were built using such a configuration. (Although the linkage, > if any, to this report still wouldn't be very clear.) That's a question for Devrim ? Can you tell which glibc headers were installed when compiling this package? postgresql11-server-11.5-1PGDG.rhel6.x86_64 I can tell it was compiled using version | PostgreSQL 11.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23), 64-bit Maybe it doesn't matter, since, for better or worse, is not a "late model" cc. https://www.gnu.org/software/gcc/gcc-4.4/ GCC 4.4.7 March 13, 2012 (changes) This release series is no longer maintained. Justin
Re: pg11.5: ExecHashJoinNewBatch: glibc detected...double free or corruption (!prev)
From
Tom Lane
Date:
Justin Pryzby <pryzby@telsasoft.com> writes: > On Mon, Aug 26, 2019 at 12:45:24PM -0400, Tom Lane wrote: >> However ... there is some pretty interesting info at >> https://bugzilla.redhat.com/show_bug.cgi?id=1338673 >> suggesting that compiling with a late-model gcc against older RHEL6 >> headers could result in bad code. I wonder whether the reporters' >> servers were built using such a configuration. (Although the linkage, >> if any, to this report still wouldn't be very clear.) > I can tell it was compiled using > version | PostgreSQL 11.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23), 64-bit Ah, that appears to be the default compiler for RHEL6, so that theory is out the window. It's still interesting that we're only seeing this reported from RHEL6 ... maybe there's something specific to the code that this gcc version generates? regards, tom lane
Re: pg11.5: ExecHashJoinNewBatch: glibc detected...double free orcorruption (!prev)
From
Merlin Moncure
Date:
On Sun, Aug 25, 2019 at 9:35 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Mon, Aug 26, 2019 at 1:44 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > > On Mon, Aug 26, 2019 at 01:09:19PM +1200, Thomas Munro wrote: > > > On Sun, Aug 25, 2019 at 3:15 PM Peter Geoghegan <pg@bowt.ie> wrote: > > > > I was reminded of this issue from last year, which also appeared to > > > > involve BufFileClose() and a double-free: > > > > > > > > https://postgr.es/m/87y3hmee19.fsf@news-spur.riddles.org.uk > > > > > > > > That was a BufFile that was under the control of a tuplestore, so it > > > > was similar to but different from your case. I suspect it's related. > > > > > > Hmm. tuplestore.c follows the same coding pattern as nodeHashjoin.c: > > > it always nukes its pointer after calling BufFileFlush(), so it > > > shouldn't be capable of calling it twice for the same pointer, unless > > > we have two copies of that pointer somehow. > > > > > > Merlin's reported a double-free apparently in ExecHashJoin(), not > > > ExecHashJoinNewBatch() like this report. Unfortunately that tells us > > > very little. > > Here's another one: > > https://www.postgresql.org/message-id/flat/20170601081104.1500.56202%40wrigleys.postgresql.org > > Hmm. Also on RHEL/CentOS 6, and also involving sorting, hashing, > BufFileClose() but this time the glibc double free error is in > repalloc(). > > And another one (repeatedly happening): > > https://www.postgresql.org/message-id/flat/3976998C-8D3B-4825-9B10-69ECB70A597A%40appnexus.com > > Also on RHEL/CentOS 6, this time a sort in once case and a hash join > in another case. > > Of course it's entirely possible that we have a bug here and I'm very > keen to find it, but I can't help noticing the common factor here is > that they're all running ancient RHEL 6.x releases, except Merlin who > didn't say. Merlin? Just noticed this. redhat-release: "Red Hat Enterprise Linux Server release 6.9 (Santiago)" merlin
Re: pg11.5: ExecHashJoinNewBatch: glibc detected...double free orcorruption (!prev)
From
Merlin Moncure
Date:
On Mon, Aug 26, 2019 at 12:01 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Justin Pryzby <pryzby@telsasoft.com> writes: > > On Mon, Aug 26, 2019 at 12:45:24PM -0400, Tom Lane wrote: > >> However ... there is some pretty interesting info at > >> https://bugzilla.redhat.com/show_bug.cgi?id=1338673 > >> suggesting that compiling with a late-model gcc against older RHEL6 > >> headers could result in bad code. I wonder whether the reporters' > >> servers were built using such a configuration. (Although the linkage, > >> if any, to this report still wouldn't be very clear.) > > > I can tell it was compiled using > > version | PostgreSQL 11.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23), 64-bit > > Ah, that appears to be the default compiler for RHEL6, so that theory > is out the window. It's still interesting that we're only seeing this > reported from RHEL6 ... maybe there's something specific to the code > that this gcc version generates? FWIW, I've got the same compiler (which is natural, we are likely using the same packaging): PostgreSQL 9.6.12 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23), 64-bit merlin
Re: pg11.5: ExecHashJoinNewBatch: glibc detected...double free orcorruption (!prev)
From
Merlin Moncure
Date:
On Tue, Aug 27, 2019 at 5:52 PM Merlin Moncure <mmoncure@gmail.com> wrote: > > On Mon, Aug 26, 2019 at 12:01 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > Justin Pryzby <pryzby@telsasoft.com> writes: > > > On Mon, Aug 26, 2019 at 12:45:24PM -0400, Tom Lane wrote: > > >> However ... there is some pretty interesting info at > > >> https://bugzilla.redhat.com/show_bug.cgi?id=1338673 > > >> suggesting that compiling with a late-model gcc against older RHEL6 > > >> headers could result in bad code. I wonder whether the reporters' > > >> servers were built using such a configuration. (Although the linkage, > > >> if any, to this report still wouldn't be very clear.) > > > > > I can tell it was compiled using > > > version | PostgreSQL 11.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23), 64-bit > > > > Ah, that appears to be the default compiler for RHEL6, so that theory > > is out the window. It's still interesting that we're only seeing this > > reported from RHEL6 ... maybe there's something specific to the code > > that this gcc version generates? > > FWIW, I've got the same compiler (which is natural, we are likely > using the same packaging): > PostgreSQL 9.6.12 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 > (Red Hat 4.4.7-23), 64-bit I'm looking for more common threads here. One interesting fact about this server is that we just realized (!) that cron was attempting to email huge log files and failing due to misconfigured email. The mail queue was building into the gigabytes so that on half hour cadence the server was running out of memory until the mail process crashed. Postgres ran just fine through this process (which is pretty cool). So, question: were there any memory issues on the other server on or around the crash? merlin