Thread: BUG #14344: string_agg(DISTINCT ..) crash
The following bug has been logged on the website:

Bug reference:      14344
Logged by:          Regina Obe
Email address:      lr@pcorp.us
PostgreSQL version: 9.6rc1
Operating system:   Windows 2012 and Windows 7 64-bit
Description:

Let me know if you need more info or if this has already been fixed.

I was testing under

PostgreSQL 9.6rc1, compiled by Visual C++ build 1800, 64-bit

And was able to trigger a crash with this:

-- create test table
CREATE TABLE string_agg_crash_test AS
SELECT i AS id, states[mod(i,6) + 1] AS state,
       'junk' || chr(mod(i, 26) + 1) AS junk
FROM generate_series(1,100000) AS i,
     (SELECT '{NULL,MA,HI,IL,KS,NY,NJ}'::varchar[] AS states) AS s;

-- try to summarize; crashes with parallel or not,
-- so doesn't seem to be parallel worker related.
set max_parallel_workers_per_gather = 0;
SELECT string_agg(DISTINCT state, ',' ORDER BY state)
FROM string_agg_crash_test;

-- it must have something to do with the DISTINCT, because
-- this doesn't crash
set max_parallel_workers_per_gather = 0;
SELECT string_agg(state, ',' ORDER BY state)
FROM string_agg_crash_test;

The logs don't give me any useful info, just:

WARNING:  terminating connection because of crash of another server process
lr@pcorp.us writes:
> Let me know if you need more info or if this has already been fixed.

Hm, works OK for me:

regression=# SELECT string_agg(DISTINCT state, ',' ORDER BY state)
FROM string_agg_crash_test;
    string_agg
-------------------
 HI,IL,KS,MA,MA,NY
(1 row)

But I see no post-rc1 fixes that look like they might explain this.

			regards, tom lane
On Wed, Sep 28, 2016 at 9:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> lr@pcorp.us writes:
>> Let me know if you need more info or if this has already been fixed.
> regression=# SELECT string_agg(DISTINCT state, ',' ORDER BY state)
> FROM string_agg_crash_test;
>     string_agg
> -------------------
>  HI,IL,KS,MA,MA,NY
> (1 row)
>
> But I see no post-rc1 fixes that look like they might explain this.

No problems on OSX or Linux, but I am getting a crash on Windows
(Win7, MSVC 2010) even with latest REL9_6_STABLE. Oops.
--
Michael
> lr@pcorp.us writes:
> Let me know if you need more info or if this has already been fixed.
> Hm, works OK for me:
> regression=# SELECT string_agg(DISTINCT state, ',' ORDER BY state)
> FROM string_agg_crash_test;
>     string_agg
> -------------------
>  HI,IL,KS,MA,MA,NY
> (1 row)
> But I see no post-rc1 fixes that look like they might explain this.
> regards, tom lane

I was able to get it to crash under my mingw64 9.6beta4ish build too.
Gdb looks like this:

Reading symbols from C:\ming64gcc48\projects\postgresql\rel\pg9.6w64gcc48\bin\postgres.exe...done.
0x000000007732ae11 in ntdll!DbgBreakPoint () from C:\Windows\SYSTEM32\ntdll.dll
(gdb) cont
Continuing.
[Thread 6944.0x61c exited with code 0]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 6944.0x2384]
toast_raw_datum_size (value=242751291) at tuptoaster.c:358
358     tuptoaster.c: No such file or directory.
(gdb) bt
#0  toast_raw_datum_size (value=242751291) at tuptoaster.c:358
#1  0x000000000078e616 in texteq (fcinfo=0x299e760) at varlena.c:1615
#2  0x00000000007bb045 in FunctionCall2Coll (flinfo=flinfo@entry=0x50ea400,
    collation=collation@entry=0, arg1=arg1@entry=85010827,
    arg2=<optimized out>) at fmgr.c:1319
#3  0x0000000000585548 in execTuplesMatch (slot1=slot1@entry=0x50e9e28,
    slot2=slot2@entry=0x50ea248, numCols=numCols@entry=2,
    matchColIdx=0x50ea350, eqfunctions=0x50ea3d0,
    evalContext=evalContext@entry=0x38d550) at execGrouping.c:100
#4  0x00000000005984e5 in process_ordered_aggregate_multi (
    pergroupstate=0x50e9980, pertrans=0x50e8968, aggstate=0x50aaaf0)
    at nodeAgg.c:1244
#5  finalize_aggregates (aggstate=aggstate@entry=0x50aaaf0,
    peraggs=peraggs@entry=0x50abfb0, pergroup=pergroup@entry=0x50e9980,
    currentSet=currentSet@entry=0) at nodeAgg.c:1552
#6  0x0000000000598c7a in agg_retrieve_direct (aggstate=0x50aaaf0)
    at nodeAgg.c:2157
#7  ExecAgg (node=node@entry=0x50aaaf0) at nodeAgg.c:1837
#8  0x000000000058b228 in ExecProcNode (node=node@entry=0x50aaaf0)
    at execProcnode.c:503
#9  0x00000000005871c5 in ExecutePlan (dest=0x50e5838,
    direction=<optimized out>, numberTuples=0, sendTuples=1 '\001',
    operation=CMD_SELECT, use_parallel_mode=<optimized out>,
    planstate=0x50aaaf0, estate=0x50aa9d8) at execMain.c:1567
#10 standard_ExecutorRun (queryDesc=0x5057288, direction=<optimized out>,
    count=0) at execMain.c:338
#11 0x00000000006b1fe6 in PortalRunSelect (portal=portal@entry=0x509e978,
    forward=forward@entry=1 '\001', count=0, count@entry=43643472,
    dest=dest@entry=0x0) at pquery.c:948
#12 0x00000000006b3686 in PortalRun (portal=0x299ee90, portal@entry=0x509e978,
    count=43643472, count@entry=2147483647,
    isTopLevel=isTopLevel@entry=0 '\000', dest=0x0, dest@entry=0x50e5838,
    altdest=altdest@entry=0x50e5838, completionTag=0x299f210 "",
    completionTag@entry=0x4000000007b <error: Cannot access memory at
    address 0x4000000007b>) at pquery.c:789
#13 0x00000000006b0e54 in exec_simple_query (
    query_string=0x100000000000000 <error: Cannot access memory at
    address 0x100000000000000>) at postgres.c:1094
#14 PostgresMain (argc=<optimized out>, argv=argv@entry=0x2893e0,
    dbname=0x18001700160015 <error: Cannot access memory at address
    0x18001700160015>, username=<optimized out>) at postgres.c:4074
#15 0x0000000000647aad in BackendRun (port=0x299f400) at postmaster.c:4262
#16 SubPostmasterMain (argc=argc@entry=3, argv=argv@entry=0x2f7ea0)
    at postmaster.c:4752
#17 0x0000000000803d68 in main (argc=3, argv=0x2f7ea0) at main.c:205
(gdb)

I'll compile against the latest 9.6 branch then and see if it crashes
there too. Might be windows specific perhaps.

Thanks,
Regina
On Wed, Sep 28, 2016 at 9:23 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Wed, Sep 28, 2016 at 9:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> lr@pcorp.us writes:
>>> Let me know if you need more info or if this has already been fixed.
>> regression=# SELECT string_agg(DISTINCT state, ',' ORDER BY state)
>> FROM string_agg_crash_test;
>>     string_agg
>> -------------------
>>  HI,IL,KS,MA,MA,NY
>> (1 row)
>>
>> But I see no post-rc1 fixes that look like they might explain this.
>
> No problems on OSX or Linux, but I am getting a crash on Windows
> (Win7, MSVC 2010) even with latest REL9_6_STABLE. Oops.

After rolling in crashdumps/ and remembering some stuff about windbg:
- .sympath to point to symbols
- !analyze -v to load the backtrace:

00000000`0042eff0 00000001`40380293 : 000007fe`fda07a01 000007fe`fda07a11 00000000`ffff0000 00000000`000004e4 : postgres!toast_raw_datum_size+0x4 [c:\users\ioltas\git\postgres\src\backend\access\heap\tuptoaster.c @ 358]
00000000`0042f030 00000001`403ab6dc : 00000000`009f0000 00000000`00000002 00000000`008491f8 00000000`008f3caa : postgres!texteq+0x33 [c:\users\ioltas\git\postgres\src\backend\utils\adt\varlena.c @ 1616]
00000000`0042f060 00000001`4012a29e : 00000000`00000002 00000000`008491f8 00000000`008f3caa 00000000`00000001 : postgres!FunctionCall2Coll+0x5c [c:\users\ioltas\git\postgres\src\backend\utils\fmgr\fmgr.c @ 1322]
00000000`0042f450 00000001`4013d05c : 00000000`00000002 00000000`01b33b30 00000000`008f3b00 00000000`008f3830 : postgres!execTuplesMatch+0xae [c:\users\ioltas\git\postgres\src\backend\executor\execgrouping.c @ 101]
00000000`0042f4a0 00000001`4013d61c : 00000000`00895f00 00000000`00000000 00000000`01b359d8 00000000`00849308 : postgres!process_ordered_aggregate_multi+0xfc [c:\users\ioltas\git\postgres\src\backend\executor\nodeagg.c @ 1248]
00000000`0042f520 00000001`4013f081 : 00000000`00000001 00000000`00000000 00000000`01b33f98 00000000`00000000 : postgres!finalize_aggregates+0x9c [c:\users\ioltas\git\postgres\src\backend\executor\nodeagg.c @ 1557]
00000000`0042f580 00000001`4013fe9a : 00000000`008ee800 00000000`00000003 00000000`01b33ee8 00000000`01b33f98 : postgres!agg_retrieve_direct+0x461 [c:\users\ioltas\git\postgres\src\backend\executor\nodeagg.c @ 2163]
00000000`0042f620 00000001`40130352 : 00000000`00000054 00000001`4029eaea 00000000`00000000 00000000`00000001 : postgres!ExecAgg+0x4a [c:\users\ioltas\git\postgres\src\backend\executor\nodeagg.c @ 1838]
00000000`0042f650 00000001`4012c98d : 00000000`01b33a20 00000001`4000a42c 00000000`ffffffff 00000000`00000001 : postgres!ExecProcNode+0x202 [c:\users\ioltas\git\postgres\src\backend\executor\execprocnode.c @ 503]
00000000`0042f680 00000001`4012dc58 : 00000000`008d9b60 00000000`00000001 00000000`01b33a20 00000000`00000000 : postgres!ExecutePlan+0x7d [c:\users\ioltas\git\postgres\src\backend\executor\execmain.c @ 1573]
00000000`0042f6d0 00000001`4029f659 : 00000000`008df510 00000000`00000000 00000000`008df510 00000001`403cb144 : postgres!standard_ExecutorRun+0xb8 [c:\users\ioltas\git\postgres\src\backend\executor\execmain.c @ 350]
00000000`0042f740 00000001`4029fe6a : 00000000`008df510 00000000`008df510 00000000`00843d50 00000000`0042f8b0 : postgres!PortalRunSelect+0x99 [c:\users\ioltas\git\postgres\src\backend\tcop\pquery.c @ 949]
00000000`0042f7a0 00000001`4029d723 : 00000000`00000000 00000000`00843d50 00000000`00000000 00000000`00000000 : postgres!PortalRun+0x1ca [c:\users\ioltas\git\postgres\src\backend\tcop\pquery.c @ 796]
00000000`0042f980 00000001`4029e1fc : 00000000`00000051 00000000`00894e60 00000000`0042fb70 00000000`009f8220 : postgres!exec_simple_query+0x573 [c:\users\ioltas\git\postgres\src\backend\tcop\postgres.c @ 1101]
00000000`0042fa70 00000001`4022eaea : 00000001`405f4300 00000001`404a37b8 00000000`000010a6 00000001`404a3938 : postgres!PostgresMain+0x63c [c:\users\ioltas\git\postgres\src\backend\tcop\postgres.c @ 4072]
00000000`0042fc60 00000001`40232645 : 00000001`00000001 00000000`000861df 00000000`1f7f5412 00000000`00000000 : postgres!BackendRun+0x1ca [c:\users\ioltas\git\postgres\src\backend\postmaster\postmaster.c @ 4271]
00000000`0042fcc0 00000001`4016ca71 : 00000000`00000003 00000000`009f2329 00000000`00000003 00000000`00000000 : postgres!SubPostmasterMain+0x255 [c:\users\ioltas\git\postgres\src\backend\postmaster\postmaster.c @ 4762]
00000000`0042ff00 00000001`403f1bfa : 00000000`00000000 00000000`00000001 00000000`00000000 00000000`00000000 : postgres!main+0x3c1 [c:\users\ioltas\git\postgres\src\backend\main\main.c @ 216]
00000000`0042ff30 00000000`76cc59cd : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : postgres!__tmainCRTStartup+0x11a [f:\dd\vctools\crt_bld\self_64_amd64\crt\src\crtexe.c @ 555]
00000000`0042ff60 00000000`76efa2e1 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : kernel32!BaseThreadInitThunk+0xd
00000000`0042ff90 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x21

STACK_COMMAND:  ~0s; .ecxr ; kb

FAULTING_SOURCE_CODE:
   354: {
   355:     struct varlena *attr = (struct varlena *) DatumGetPointer(value);
   356:     Size        result;
   357:
>  358:     if (VARATT_IS_EXTERNAL_ONDISK(attr))
   359:     {
   360:         /* va_rawsize is the size of the original datum -- including header */
   361:         struct varatt_external toast_pointer;
   362:
   363:         VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
--
Michael
On Wed, Sep 28, 2016 at 9:28 PM, Regina Obe <lr@pcorp.us> wrote:
> I'll compile against the latest 9.6 branch then and see if it crashes
> there too. Might be windows specific perhaps.

Confirmed. I got the same trace. That's an aggregate bug in the executor.
--
Michael
On 29 September 2016 at 17:36, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Wed, Sep 28, 2016 at 9:28 PM, Regina Obe <lr@pcorp.us> wrote:
>> I'll compile against the latest 9.6 branch then and see if it crashes
>> there too. Might be windows specific perhaps.
>
> Confirmed. I got the same trace. That's an aggregate bug in the executor.

hmm, I would be surprised about that. The planner would never generate
a parallel aggregate plan for that, given that string_agg has no
combine function, and there's also a DISTINCT in there, which is never
made parallel. The code for the non-parallel case is not really
changed very much in 9.6.

My guess would be a sorting bug...

Let me check....

Looks like:

commit 0011c0091e886b874e485a46ff2c94222ffbf550
Author: Robert Haas <rhaas@postgresql.org>
Date:   Thu Mar 17 16:05:05 2016 -0400

    Improve memory management for external sorts.

    Introduce a new memory context which stores tuple data, and reset it
    at the end of each merge pass; this helps avoid memory fragmentation
    and, consequently, overallocation.  Also, for the final merge patch,
    eliminate memory context chunk header overhead entirely by allocating
    all of the memory used for buffering tuples during the merge in a
    single chunk.  Since this modestly increases the number of tuples we
    can store, grow the memtuples array a bit so that we're less likely
    to run short of slots there.

    Peter Geoghegan. Review and testing of patches in this series by
    Jeff Janes, Greg Stark, Mithun Cy, and me.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
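[Editorial note: the commit quoted above changes how tuplesort recycles the memory backing already-returned tuples. The hazard class it can introduce is easy to sketch outside PostgreSQL; the following is a hypothetical illustration, not PostgreSQL code. A reader hands out references into one internal buffer that it reuses on every call, while a caller keeps the previous row around to compare it with the next one, much as nodeAgg's DISTINCT path does. The caller that copies is safe; the caller that holds a raw view sees the data change underneath it.]

```python
# Hypothetical illustration (not PostgreSQL code) of a buffer-recycling
# hazard: next_value() reuses one internal buffer for every row, the way
# tuplesort reuses its batch memory between returned tuples.
buf = bytearray(2)

def next_value(v: bytes) -> memoryview:
    buf[:] = v              # recycle: overwrites whatever the last row was
    return memoryview(buf)  # a view into the shared storage

prev_view = next_value(b"MA")   # unsafe: raw reference into the buffer
prev_copy = bytes(prev_view)    # safe: take a private copy first
cur = next_value(b"NY")         # the buffer is recycled here

assert bytes(prev_view) == b"NY"  # the stale view now shows the new row
assert prev_copy == b"MA"         # the copy still holds the old row
```

An aggregate that compares each sorted value with the previous one must copy the previous value, or it risks reading memory the sort has already recycled.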
On 29 September 2016 at 18:41, David Rowley <david.rowley@2ndquadrant.com> wrote:
> My guess would be a sorting bug...
>
> Let me check....
>
> Looks like:
>
> commit 0011c0091e886b874e485a46ff2c94222ffbf550

Just for the record, I can only get this to crash on Windows too, but
I don't have much confidence that it's a Windows-only issue.

I compiled with TRACE_SORT on Windows and Linux to find out what the
sort does differently on each platform.

On Windows:

LOG:  begin tuple sort: nkeys = 2, workMem = 4096, randomAccess = f
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) from string_Agg_crash_test;
LOG:  switching to external sort with 15 tapes: CPU 0.00s/0.03u sec elapsed 0.02 sec
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) from string_Agg_crash_test;
LOG:  performsort starting: CPU 0.00s/0.37u sec elapsed 0.37 sec
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) from string_Agg_crash_test;
LOG:  finished writing run 1 to tape 0: CPU 0.00s/0.43u sec elapsed 0.43 sec
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) from string_Agg_crash_test;
LOG:  finished writing final run 2 to tape 1: CPU 0.00s/0.48u sec elapsed 0.48 sec
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) from string_Agg_crash_test;
LOG:  grew memtuples 1.29x from 58253 (1366 KB) to 74896 (1756 KB) for final merge
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) from string_Agg_crash_test;
LOG:  tape 0 initially used 878 KB of 1147 KB batch (0.765) and 37446 out of 37447 slots (1.000)
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) from string_Agg_crash_test;
LOG:  tape 1 initially used 309 KB of 1147 KB batch (0.269) and 13182 out of 37447 slots (0.352)
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) from string_Agg_crash_test;
LOG:  performsort done (except 2-way final merge): CPU 0.01s/0.48u sec elapsed 0.49 sec
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) from string_Agg_crash_test;
LOG:  server process (PID 10848) was terminated by exception 0xC0000005

On Linux:

LOG:  begin tuple sort: nkeys = 2, workMem = 4096, randomAccess = f
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) FROM string_agg_crash_test;
LOG:  switching to external sort with 15 tapes: CPU 0.00s/0.03u sec elapsed 0.03 sec
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) FROM string_agg_crash_test;
LOG:  replacement selection will sort 58253 first run tuples
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) FROM string_agg_crash_test;
LOG:  performsort starting: CPU 0.00s/0.16u sec elapsed 0.16 sec
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) FROM string_agg_crash_test;
LOG:  finished incrementally writing first run 1 to tape 0: CPU 0.00s/0.20u sec elapsed 0.20 sec
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) FROM string_agg_crash_test;
LOG:  starting quicksort of run 2: CPU 0.00s/0.20u sec elapsed 0.20 sec
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) FROM string_agg_crash_test;
LOG:  finished quicksort of run 2: CPU 0.00s/0.20u sec elapsed 0.20 sec
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) FROM string_agg_crash_test;
LOG:  finished writing run 2 to tape 1: CPU 0.00s/0.21u sec elapsed 0.21 sec
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) FROM string_agg_crash_test;
LOG:  grew memtuples 1.29x from 58253 (1366 KB) to 74896 (1756 KB) for final merge
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) FROM string_agg_crash_test;
LOG:  tape 0 initially used 878 KB of 1147 KB batch (0.765) and 37446 out of 37447 slots (1.000)
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) FROM string_agg_crash_test;
LOG:  tape 1 initially used 309 KB of 1147 KB batch (0.269) and 13182 out of 37447 slots (0.352)
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) FROM string_agg_crash_test;
LOG:  performsort done (except 2-way final merge): CPU 0.00s/0.21u sec elapsed 0.21 sec
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) FROM string_agg_crash_test;
LOG:  external sort ended, 177 disk blocks used: CPU 0.00s/0.24u sec elapsed 0.24 sec
STATEMENT:  SELECT string_agg(DISTINCT state, ',' ORDER BY state) FROM string_agg_crash_test;

Both versions are 6ad8ac6

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Sep 29, 2016 at 6:41 AM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> Let me check....
>
> Looks like:
>
> commit 0011c0091e886b874e485a46ff2c94222ffbf550

Why do you say that? My first suspicion is
f1f5ec1efafe74ca45e24e0bf3371b1d6985c8ee, fwiw.
--
Peter Geoghegan
On 30 September 2016 at 00:29, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Sep 29, 2016 at 6:41 AM, David Rowley
> <david.rowley@2ndquadrant.com> wrote:
>> Let me check....
>>
>> Looks like:
>>
>> commit 0011c0091e886b874e485a46ff2c94222ffbf550
>
> Why do you say that? My first suspicion is
> f1f5ec1efafe74ca45e24e0bf3371b1d6985c8ee, fwiw.

git bisect

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
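[Editorial note: a bisect like the one David ran is a binary search over commit history for the first commit at which a test starts failing. A minimal sketch of the idea, with integers standing in as hypothetical commits and `is_bad` standing in for rebuilding PostgreSQL and re-running the crashing query:]

```python
# Sketch of what `git bisect` automates: binary search over a linear
# history for the first bad commit.  Integers stand in for commits;
# is_bad stands in for a rebuild-and-retest of each candidate.
def first_bad(good: int, bad: int, is_bad) -> int:
    """good is a known-good commit, bad a known-bad one, good < bad."""
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(mid):
            bad = mid    # bug already present at mid; look earlier
        else:
            good = mid   # mid is still clean; bug was introduced later
    return bad

# If the crash first appears at "commit" 42 in a 100-commit range, the
# search needs only about log2(100), i.e. roughly 7, builds to find it.
assert first_bad(0, 100, lambda c: c >= 42) == 42
```

Like git bisect itself, this assumes the failure, once introduced, stays present in every later commit; a crash that comes and goes can make the bisect converge on the wrong commit.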
On Thu, Sep 29, 2016 at 12:30 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> git bisect

That commit might just change things in a way that makes the test case
accidentally not crash. The characteristics of the merge will be
significantly different. Could you tell me if reverting
f1f5ec1efafe74ca45e24e0bf3371b1d6985c8ee on Windows helps?
--
Peter Geoghegan
On Thu, Sep 29, 2016 at 12:34 PM, Peter Geoghegan <pg@heroku.com> wrote:
> That commit might just change things in a way that makes the test case
> accidentally not crash.

Of course, I meant that reverting that commit might merely mask the
problem, rather than removing it.
--
Peter Geoghegan
On 30 September 2016 at 00:34, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Sep 29, 2016 at 12:30 PM, David Rowley
> <david.rowley@2ndquadrant.com> wrote:
>> git bisect
>
> That commit might just change things in a way that makes the test case
> accidentally not crash. The characteristics of the merge will be
> significantly different. Could you tell me if reverting
> f1f5ec1efafe74ca45e24e0bf3371b1d6985c8ee on Windows helps?

I didn't get a clean revert due to some conflicts. I manually ended up
with the attached, which, when applied, still crashes on Windows.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Thu, Sep 29, 2016 at 1:04 PM, David Rowley <david.rowley@2ndquadrant.com> wrote: > I didn't get a clean revert due to some conflicts. I manually ended up > with the attached, which, when applied still crashes on windows. Thanks. I have been able to reproduce the problem, in a fashion, so can now probably debug it easily enough. Specifically, I can get Valgrind to complain about Regina's test case. Valgrind memcheck with track_origins enabled, running Postgres git tip, prints the following: 2016-09-29 05:06:56.382 PDT 15292 LOG: tape 1 initially used 409 KB of 1147 KB batch (0.357) and 17448 out of 37447 slots (0.466) 2016-09-29 05:06:56.383 PDT 15292 LOG: performsort done (except 2-way final merge): CPU 0.02s/13.32u sec elapsed 13.35 sec ==15292== Invalid read of size 1 ==15292== at 0x4A2F13: toast_raw_datum_size (tuptoaster.c:358) ==15292== by 0x7A41EF: texteq (varlena.c:1615) ==15292== by 0x7CA538: FunctionCall2Coll (fmgr.c:1319) ==15292== by 0x5BBC48: execTuplesMatch (execGrouping.c:100) ==15292== by 0x5CD7D5: process_ordered_aggregate_multi (nodeAgg.c:1244) ==15292== by 0x5CD7D5: finalize_aggregates (nodeAgg.c:1552) ==15292== by 0x5CE328: agg_retrieve_direct (nodeAgg.c:2157) ==15292== by 0x5CE328: ExecAgg (nodeAgg.c:1837) ==15292== by 0x5C135B: ExecProcNode (execProcnode.c:503) ==15292== by 0x5BD86C: ExecutePlan (execMain.c:1567) ==15292== by 0x5BD86C: standard_ExecutorRun (execMain.c:338) ==15292== by 0x5BD969: ExecutorRun (execMain.c:286) ==15292== by 0x6CE99E: PortalRunSelect (pquery.c:948) ==15292== by 0x6CFDC7: PortalRun (pquery.c:789) ==15292== by 0x6CCC43: exec_simple_query (postgres.c:1094) ==15292== by 0x6CCC43: PostgresMain (postgres.c:4070) ==15292== Address 0xc3e573b is 399,099 bytes inside a block of size 1,173,864 free'd ==15292== at 0x4C2EDEB: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) ==15292== by 0x7E51C6: AllocSetFree (aset.c:989) ==15292== by 0x7E6C5D: pfree (mcxt.c:1012) ==15292== by 0x7ED2CC: mergebatchfreetape 
(tuplesort.c:3100) ==15292== by 0x7ED2CC: tuplesort_gettuple_common (tuplesort.c:2017) ==15292== by 0x7F11A0: tuplesort_gettupleslot (tuplesort.c:2066) ==15292== by 0x5CD876: process_ordered_aggregate_multi (nodeAgg.c:1231) ==15292== by 0x5CD876: finalize_aggregates (nodeAgg.c:1552) ==15292== by 0x5CE328: agg_retrieve_direct (nodeAgg.c:2157) ==15292== by 0x5CE328: ExecAgg (nodeAgg.c:1837) ==15292== by 0x5C135B: ExecProcNode (execProcnode.c:503) ==15292== by 0x5BD86C: ExecutePlan (execMain.c:1567) ==15292== by 0x5BD86C: standard_ExecutorRun (execMain.c:338) ==15292== by 0x5BD969: ExecutorRun (execMain.c:286) ==15292== by 0x6CE99E: PortalRunSelect (pquery.c:948) ==15292== by 0x6CFDC7: PortalRun (pquery.c:789) ==15292== Block was alloc'd at ==15292== at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) ==15292== by 0x7E4897: AllocSetAlloc (aset.c:691) ==15292== by 0x7E6F6D: MemoryContextAllocHuge (mcxt.c:1083) ==15292== by 0x7ECBF0: mergebatch (tuplesort.c:2983) ==15292== by 0x7ECBF0: beginmerge (tuplesort.c:2793) ==15292== by 0x7F10F0: mergeruns (tuplesort.c:2547) ==15292== by 0x7F10F0: tuplesort_performsort (tuplesort.c:1797) ==15292== by 0x5CD757: process_ordered_aggregate_multi (nodeAgg.c:1225) ==15292== by 0x5CD757: finalize_aggregates (nodeAgg.c:1552) ==15292== by 0x5CE328: agg_retrieve_direct (nodeAgg.c:2157) ==15292== by 0x5CE328: ExecAgg (nodeAgg.c:1837) ==15292== by 0x5C135B: ExecProcNode (execProcnode.c:503) ==15292== by 0x5BD86C: ExecutePlan (execMain.c:1567) ==15292== by 0x5BD86C: standard_ExecutorRun (execMain.c:338) ==15292== by 0x5BD969: ExecutorRun (execMain.c:286) ==15292== by 0x6CE99E: PortalRunSelect (pquery.c:948) ==15292== by 0x6CFDC7: PortalRun (pquery.c:789) ==15292== { <insert_a_suppression_name_here> Memcheck:Addr1 fun:toast_raw_datum_size fun:texteq fun:FunctionCall2Coll fun:execTuplesMatch fun:process_ordered_aggregate_multi fun:finalize_aggregates fun:agg_retrieve_direct fun:ExecAgg fun:ExecProcNode fun:ExecutePlan 
fun:standard_ExecutorRun fun:ExecutorRun fun:PortalRunSelect fun:PortalRun fun:exec_simple_query fun:PostgresMain } ==15292== Invalid read of size 1 ==15292== at 0x7CC7AF: pg_detoast_datum_packed (fmgr.c:2214) ==15292== by 0x7A420C: texteq (varlena.c:1621) ==15292== by 0x7CA538: FunctionCall2Coll (fmgr.c:1319) ==15292== by 0x5BBC48: execTuplesMatch (execGrouping.c:100) ==15292== by 0x5CD7D5: process_ordered_aggregate_multi (nodeAgg.c:1244) ==15292== by 0x5CD7D5: finalize_aggregates (nodeAgg.c:1552) ==15292== by 0x5CE328: agg_retrieve_direct (nodeAgg.c:2157) ==15292== by 0x5CE328: ExecAgg (nodeAgg.c:1837) ==15292== by 0x5C135B: ExecProcNode (execProcnode.c:503) ==15292== by 0x5BD86C: ExecutePlan (execMain.c:1567) ==15292== by 0x5BD86C: standard_ExecutorRun (execMain.c:338) ==15292== by 0x5BD969: ExecutorRun (execMain.c:286) ==15292== by 0x6CE99E: PortalRunSelect (pquery.c:948) ==15292== by 0x6CFDC7: PortalRun (pquery.c:789) ==15292== by 0x6CCC43: exec_simple_query (postgres.c:1094) ==15292== by 0x6CCC43: PostgresMain (postgres.c:4070) ==15292== Address 0xc3e573b is 399,099 bytes inside a block of size 1,173,864 free'd ==15292== at 0x4C2EDEB: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) ==15292== by 0x7E51C6: AllocSetFree (aset.c:989) ==15292== by 0x7E6C5D: pfree (mcxt.c:1012) ==15292== by 0x7ED2CC: mergebatchfreetape (tuplesort.c:3100) ==15292== by 0x7ED2CC: tuplesort_gettuple_common (tuplesort.c:2017) ==15292== by 0x7F11A0: tuplesort_gettupleslot (tuplesort.c:2066) ==15292== by 0x5CD876: process_ordered_aggregate_multi (nodeAgg.c:1231) ==15292== by 0x5CD876: finalize_aggregates (nodeAgg.c:1552) ==15292== by 0x5CE328: agg_retrieve_direct (nodeAgg.c:2157) ==15292== by 0x5CE328: ExecAgg (nodeAgg.c:1837) ==15292== by 0x5C135B: ExecProcNode (execProcnode.c:503) ==15292== by 0x5BD86C: ExecutePlan (execMain.c:1567) ==15292== by 0x5BD86C: standard_ExecutorRun (execMain.c:338) ==15292== by 0x5BD969: ExecutorRun (execMain.c:286) ==15292== by 0x6CE99E: 
PortalRunSelect (pquery.c:948) ==15292== by 0x6CFDC7: PortalRun (pquery.c:789) ==15292== Block was alloc'd at ==15292== at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) ==15292== by 0x7E4897: AllocSetAlloc (aset.c:691) ==15292== by 0x7E6F6D: MemoryContextAllocHuge (mcxt.c:1083) ==15292== by 0x7ECBF0: mergebatch (tuplesort.c:2983) ==15292== by 0x7ECBF0: beginmerge (tuplesort.c:2793) ==15292== by 0x7F10F0: mergeruns (tuplesort.c:2547) ==15292== by 0x7F10F0: tuplesort_performsort (tuplesort.c:1797) ==15292== by 0x5CD757: process_ordered_aggregate_multi (nodeAgg.c:1225) ==15292== by 0x5CD757: finalize_aggregates (nodeAgg.c:1552) ==15292== by 0x5CE328: agg_retrieve_direct (nodeAgg.c:2157) ==15292== by 0x5CE328: ExecAgg (nodeAgg.c:1837) ==15292== by 0x5C135B: ExecProcNode (execProcnode.c:503) ==15292== by 0x5BD86C: ExecutePlan (execMain.c:1567) ==15292== by 0x5BD86C: standard_ExecutorRun (execMain.c:338) ==15292== by 0x5BD969: ExecutorRun (execMain.c:286) ==15292== by 0x6CE99E: PortalRunSelect (pquery.c:948) ==15292== by 0x6CFDC7: PortalRun (pquery.c:789) ==15292== { <insert_a_suppression_name_here> Memcheck:Addr1 fun:pg_detoast_datum_packed fun:texteq fun:FunctionCall2Coll fun:execTuplesMatch fun:process_ordered_aggregate_multi fun:finalize_aggregates fun:agg_retrieve_direct fun:ExecAgg fun:ExecProcNode fun:ExecutePlan fun:standard_ExecutorRun fun:ExecutorRun fun:PortalRunSelect fun:PortalRun fun:exec_simple_query fun:PostgresMain } ==15292== Invalid read of size 1 ==15292== at 0x7A4214: texteq (varlena.c:1623) ==15292== by 0x7CA538: FunctionCall2Coll (fmgr.c:1319) ==15292== by 0x5BBC48: execTuplesMatch (execGrouping.c:100) ==15292== by 0x5CD7D5: process_ordered_aggregate_multi (nodeAgg.c:1244) ==15292== by 0x5CD7D5: finalize_aggregates (nodeAgg.c:1552) ==15292== by 0x5CE328: agg_retrieve_direct (nodeAgg.c:2157) ==15292== by 0x5CE328: ExecAgg (nodeAgg.c:1837) ==15292== by 0x5C135B: ExecProcNode (execProcnode.c:503) ==15292== by 0x5BD86C: 
ExecutePlan (execMain.c:1567)
==15292==    by 0x5BD86C: standard_ExecutorRun (execMain.c:338)
==15292==    by 0x5BD969: ExecutorRun (execMain.c:286)
==15292==    by 0x6CE99E: PortalRunSelect (pquery.c:948)
==15292==    by 0x6CFDC7: PortalRun (pquery.c:789)
==15292==    by 0x6CCC43: exec_simple_query (postgres.c:1094)
==15292==    by 0x6CCC43: PostgresMain (postgres.c:4070)
==15292==    by 0x670D36: BackendRun (postmaster.c:4260)
==15292==    by 0x670D36: BackendStartup (postmaster.c:3934)
==15292==    by 0x670D36: ServerLoop (postmaster.c:1691)
==15292==    by 0x670D36: PostmasterMain (postmaster.c:1299)
==15292==  Address 0xc3e573b is 399,099 bytes inside a block of size 1,173,864 free'd
==15292==    at 0x4C2EDEB: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15292==    by 0x7E51C6: AllocSetFree (aset.c:989)
==15292==    by 0x7E6C5D: pfree (mcxt.c:1012)
==15292==    by 0x7ED2CC: mergebatchfreetape (tuplesort.c:3100)
==15292==    by 0x7ED2CC: tuplesort_gettuple_common (tuplesort.c:2017)
==15292==    by 0x7F11A0: tuplesort_gettupleslot (tuplesort.c:2066)
==15292==    by 0x5CD876: process_ordered_aggregate_multi (nodeAgg.c:1231)
==15292==    by 0x5CD876: finalize_aggregates (nodeAgg.c:1552)
==15292==    by 0x5CE328: agg_retrieve_direct (nodeAgg.c:2157)
==15292==    by 0x5CE328: ExecAgg (nodeAgg.c:1837)
==15292==    by 0x5C135B: ExecProcNode (execProcnode.c:503)
==15292==    by 0x5BD86C: ExecutePlan (execMain.c:1567)
==15292==    by 0x5BD86C: standard_ExecutorRun (execMain.c:338)
==15292==    by 0x5BD969: ExecutorRun (execMain.c:286)
==15292==    by 0x6CE99E: PortalRunSelect (pquery.c:948)
==15292==    by 0x6CFDC7: PortalRun (pquery.c:789)
==15292==  Block was alloc'd at
==15292==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15292==    by 0x7E4897: AllocSetAlloc (aset.c:691)
==15292==    by 0x7E6F6D: MemoryContextAllocHuge (mcxt.c:1083)
==15292==    by 0x7ECBF0: mergebatch (tuplesort.c:2983)
==15292==    by 0x7ECBF0: beginmerge (tuplesort.c:2793)
==15292==    by 0x7F10F0: mergeruns (tuplesort.c:2547)
==15292==    by 0x7F10F0: tuplesort_performsort (tuplesort.c:1797)
==15292==    by 0x5CD757: process_ordered_aggregate_multi (nodeAgg.c:1225)
==15292==    by 0x5CD757: finalize_aggregates (nodeAgg.c:1552)
==15292==    by 0x5CE328: agg_retrieve_direct (nodeAgg.c:2157)
==15292==    by 0x5CE328: ExecAgg (nodeAgg.c:1837)
==15292==    by 0x5C135B: ExecProcNode (execProcnode.c:503)
==15292==    by 0x5BD86C: ExecutePlan (execMain.c:1567)
==15292==    by 0x5BD86C: standard_ExecutorRun (execMain.c:338)
==15292==    by 0x5BD969: ExecutorRun (execMain.c:286)
==15292==    by 0x6CE99E: PortalRunSelect (pquery.c:948)
==15292==    by 0x6CFDC7: PortalRun (pquery.c:789)
==15292==
{
   <insert_a_suppression_name_here>
   Memcheck:Addr1
   fun:texteq
   fun:FunctionCall2Coll
   fun:execTuplesMatch
   fun:process_ordered_aggregate_multi
   fun:finalize_aggregates
   fun:agg_retrieve_direct
   fun:ExecAgg
   fun:ExecProcNode
   fun:ExecutePlan
   fun:standard_ExecutorRun
   fun:ExecutorRun
   fun:PortalRunSelect
   fun:PortalRun
   fun:exec_simple_query
   fun:PostgresMain
   fun:BackendRun
   fun:BackendStartup
   fun:ServerLoop
   fun:PostmasterMain
}
==15292== Invalid read of size 1
==15292==    at 0x4C33D28: __memcmp_sse4_1 (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15292==    by 0x7A4241: texteq (varlena.c:1623)
==15292==    by 0x7CA538: FunctionCall2Coll (fmgr.c:1319)
==15292==    by 0x5BBC48: execTuplesMatch (execGrouping.c:100)
==15292==    by 0x5CD7D5: process_ordered_aggregate_multi (nodeAgg.c:1244)
==15292==    by 0x5CD7D5: finalize_aggregates (nodeAgg.c:1552)
==15292==    by 0x5CE328: agg_retrieve_direct (nodeAgg.c:2157)
==15292==    by 0x5CE328: ExecAgg (nodeAgg.c:1837)
==15292==    by 0x5C135B: ExecProcNode (execProcnode.c:503)
==15292==    by 0x5BD86C: ExecutePlan (execMain.c:1567)
==15292==    by 0x5BD86C: standard_ExecutorRun (execMain.c:338)
==15292==    by 0x5BD969: ExecutorRun (execMain.c:286)
==15292==    by 0x6CE99E: PortalRunSelect (pquery.c:948)
==15292==    by 0x6CFDC7: PortalRun (pquery.c:789)
==15292==    by 0x6CCC43: exec_simple_query (postgres.c:1094)
==15292==    by 0x6CCC43: PostgresMain (postgres.c:4070)
==15292==  Address 0xc3e573c is 399,100 bytes inside a block of size 1,173,864 free'd
==15292==    at 0x4C2EDEB: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15292==    by 0x7E51C6: AllocSetFree (aset.c:989)
==15292==    by 0x7E6C5D: pfree (mcxt.c:1012)
==15292==    by 0x7ED2CC: mergebatchfreetape (tuplesort.c:3100)
==15292==    by 0x7ED2CC: tuplesort_gettuple_common (tuplesort.c:2017)
==15292==    by 0x7F11A0: tuplesort_gettupleslot (tuplesort.c:2066)
==15292==    by 0x5CD876: process_ordered_aggregate_multi (nodeAgg.c:1231)
==15292==    by 0x5CD876: finalize_aggregates (nodeAgg.c:1552)
==15292==    by 0x5CE328: agg_retrieve_direct (nodeAgg.c:2157)
==15292==    by 0x5CE328: ExecAgg (nodeAgg.c:1837)
==15292==    by 0x5C135B: ExecProcNode (execProcnode.c:503)
==15292==    by 0x5BD86C: ExecutePlan (execMain.c:1567)
==15292==    by 0x5BD86C: standard_ExecutorRun (execMain.c:338)
==15292==    by 0x5BD969: ExecutorRun (execMain.c:286)
==15292==    by 0x6CE99E: PortalRunSelect (pquery.c:948)
==15292==    by 0x6CFDC7: PortalRun (pquery.c:789)
==15292==  Block was alloc'd at
==15292==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15292==    by 0x7E4897: AllocSetAlloc (aset.c:691)
==15292==    by 0x7E6F6D: MemoryContextAllocHuge (mcxt.c:1083)
==15292==    by 0x7ECBF0: mergebatch (tuplesort.c:2983)
==15292==    by 0x7ECBF0: beginmerge (tuplesort.c:2793)
==15292==    by 0x7F10F0: mergeruns (tuplesort.c:2547)
==15292==    by 0x7F10F0: tuplesort_performsort (tuplesort.c:1797)
==15292==    by 0x5CD757: process_ordered_aggregate_multi (nodeAgg.c:1225)
==15292==    by 0x5CD757: finalize_aggregates (nodeAgg.c:1552)
==15292==    by 0x5CE328: agg_retrieve_direct (nodeAgg.c:2157)
==15292==    by 0x5CE328: ExecAgg (nodeAgg.c:1837)
==15292==    by 0x5C135B: ExecProcNode (execProcnode.c:503)
==15292==    by 0x5BD86C: ExecutePlan (execMain.c:1567)
==15292==    by 0x5BD86C: standard_ExecutorRun (execMain.c:338)
==15292==    by 0x5BD969: ExecutorRun (execMain.c:286)
==15292==    by 0x6CE99E: PortalRunSelect (pquery.c:948)
==15292==    by 0x6CFDC7: PortalRun (pquery.c:789)
==15292==
{
   <insert_a_suppression_name_here>
   Memcheck:Addr1
   fun:__memcmp_sse4_1
   fun:texteq
   fun:FunctionCall2Coll
   fun:execTuplesMatch
   fun:process_ordered_aggregate_multi
   fun:finalize_aggregates
   fun:agg_retrieve_direct
   fun:ExecAgg
   fun:ExecProcNode
   fun:ExecutePlan
   fun:standard_ExecutorRun
   fun:ExecutorRun
   fun:PortalRunSelect
   fun:PortalRun
   fun:exec_simple_query
   fun:PostgresMain
}
==15292== Invalid read of size 1
==15292==    at 0x4C33D45: __memcmp_sse4_1 (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15292==    by 0x7A4241: texteq (varlena.c:1623)
==15292==    by 0x7CA538: FunctionCall2Coll (fmgr.c:1319)
==15292==    by 0x5BBC48: execTuplesMatch (execGrouping.c:100)
==15292==    by 0x5CD7D5: process_ordered_aggregate_multi (nodeAgg.c:1244)
==15292==    by 0x5CD7D5: finalize_aggregates (nodeAgg.c:1552)
==15292==    by 0x5CE328: agg_retrieve_direct (nodeAgg.c:2157)
==15292==    by 0x5CE328: ExecAgg (nodeAgg.c:1837)
==15292==    by 0x5C135B: ExecProcNode (execProcnode.c:503)
==15292==    by 0x5BD86C: ExecutePlan (execMain.c:1567)
==15292==    by 0x5BD86C: standard_ExecutorRun (execMain.c:338)
==15292==    by 0x5BD969: ExecutorRun (execMain.c:286)
==15292==    by 0x6CE99E: PortalRunSelect (pquery.c:948)
==15292==    by 0x6CFDC7: PortalRun (pquery.c:789)
==15292==    by 0x6CCC43: exec_simple_query (postgres.c:1094)
==15292==    by 0x6CCC43: PostgresMain (postgres.c:4070)
==15292==  Address 0xc3e573a is 399,098 bytes inside a block of size 1,173,864 free'd
==15292==    at 0x4C2EDEB: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15292==    by 0x7E51C6: AllocSetFree (aset.c:989)
==15292==    by 0x7E6C5D: pfree (mcxt.c:1012)
==15292==    by 0x7ED2CC: mergebatchfreetape (tuplesort.c:3100)
==15292==    by 0x7ED2CC: tuplesort_gettuple_common (tuplesort.c:2017)
==15292==    by 0x7F11A0: tuplesort_gettupleslot (tuplesort.c:2066)
==15292==    by 0x5CD876: process_ordered_aggregate_multi (nodeAgg.c:1231)
==15292==    by 0x5CD876: finalize_aggregates (nodeAgg.c:1552)
==15292==    by 0x5CE328: agg_retrieve_direct (nodeAgg.c:2157)
==15292==    by 0x5CE328: ExecAgg (nodeAgg.c:1837)
==15292==    by 0x5C135B: ExecProcNode (execProcnode.c:503)
==15292==    by 0x5BD86C: ExecutePlan (execMain.c:1567)
==15292==    by 0x5BD86C: standard_ExecutorRun (execMain.c:338)
==15292==    by 0x5BD969: ExecutorRun (execMain.c:286)
==15292==    by 0x6CE99E: PortalRunSelect (pquery.c:948)
==15292==    by 0x6CFDC7: PortalRun (pquery.c:789)
==15292==  Block was alloc'd at
==15292==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15292==    by 0x7E4897: AllocSetAlloc (aset.c:691)
==15292==    by 0x7E6F6D: MemoryContextAllocHuge (mcxt.c:1083)
==15292==    by 0x7ECBF0: mergebatch (tuplesort.c:2983)
==15292==    by 0x7ECBF0: beginmerge (tuplesort.c:2793)
==15292==    by 0x7F10F0: mergeruns (tuplesort.c:2547)
==15292==    by 0x7F10F0: tuplesort_performsort (tuplesort.c:1797)
==15292==    by 0x5CD757: process_ordered_aggregate_multi (nodeAgg.c:1225)
==15292==    by 0x5CD757: finalize_aggregates (nodeAgg.c:1552)
==15292==    by 0x5CE328: agg_retrieve_direct (nodeAgg.c:2157)
==15292==    by 0x5CE328: ExecAgg (nodeAgg.c:1837)
==15292==    by 0x5C135B: ExecProcNode (execProcnode.c:503)
==15292==    by 0x5BD86C: ExecutePlan (execMain.c:1567)
==15292==    by 0x5BD86C: standard_ExecutorRun (execMain.c:338)
==15292==    by 0x5BD969: ExecutorRun (execMain.c:286)
==15292==    by 0x6CE99E: PortalRunSelect (pquery.c:948)
==15292==    by 0x6CFDC7: PortalRun (pquery.c:789)
==15292==
{
   <insert_a_suppression_name_here>
   Memcheck:Addr1
   fun:__memcmp_sse4_1
   fun:texteq
   fun:FunctionCall2Coll
   fun:execTuplesMatch
   fun:process_ordered_aggregate_multi
   fun:finalize_aggregates
   fun:agg_retrieve_direct
   fun:ExecAgg
   fun:ExecProcNode
   fun:ExecutePlan
   fun:standard_ExecutorRun
   fun:ExecutorRun
   fun:PortalRunSelect
   fun:PortalRun
   fun:exec_simple_query
   fun:PostgresMain
}
--15292-- warning: evaluate_Dwarf3_Expr: unhandled DW_OP_ 0xf2
--15292-- warning: evaluate_Dwarf3_Expr: unhandled DW_OP_ 0xf3
2016-09-29 05:06:58.308 PDT 15292 LOG:  external sort ended, 177 disk blocks used: CPU 0.02s/15.24u sec elapsed 15.27 sec

-- 
Peter Geoghegan
On Thu, Sep 29, 2016 at 5:10 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Hm, works OK for me:
>
> regression=# SELECT string_agg(DISTINCT state, ',' ORDER BY state)
> FROM string_agg_crash_test;
>     string_agg
> -------------------
>  HI,IL,KS,MA,MA,NY
> (1 row)

Actually, it doesn't work for you, since "MA" appears twice. :-(

A fix is in the works.

-- 
Peter Geoghegan
Peter Geoghegan <pg@heroku.com> writes:
> On Thu, Sep 29, 2016 at 5:10 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Hm, works OK for me:
>>
>> regression=# SELECT string_agg(DISTINCT state, ',' ORDER BY state)
>> FROM string_agg_crash_test;
>>     string_agg
>> -------------------
>>  HI,IL,KS,MA,MA,NY
>> (1 row)

> Actually, it doesn't work for you, since "MA" appears twice. :-(

Hah ... it was too late at night for me to notice that ;-)

> A fix is in the works.

So you've identified the problem and the rest of us don't need to
chase it anymore?

			regards, tom lane
On Thu, Sep 29, 2016 at 2:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So you've identified the problem and the rest of us don't need to
> chase it anymore?

Yes. Just verifying my understanding in detail. Shouldn't be long.

-- 
Peter Geoghegan
On Thu, Sep 29, 2016 at 1:10 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I have been able to reproduce the problem, in a fashion, so can now
> probably debug it easily enough. Specifically, I can get Valgrind to
> complain about Regina's test case.

I see what the problem is. This use-after-free bug occurs because, for
some reason, tuplesort_gettupleslot()'s contract was not updated by me
alongside analogous routines like tuplesort_getindextuple() as part of
the batch memory commit. This meant that at least one particular case
(process_ordered_aggregate_multi() calls to tuplesort_gettupleslot(),
with distinct columns) felt entitled to reuse some slot's tuple, a
tuple located in memory managed by tuplesort.c as batch memory. The
reuse was across calls to tuplesort_gettupleslot().

Attached patch fixes the bug, and updates the contract of
tuplesort_gettupleslot() so it's no longer the odd one out. However,
I'm not recommending this be committed without further discussion,
because the performance overhead could be a concern. Also, it looks
like hypothetical_dense_rank_final() requires similar treatment, which
isn't addressed by this patch.

I did audit other callers of tuplesort_gettupleslot() and analogous
routines for other classes of caller-tuple (e.g.,
tuplesort_getindextuple() callers were examined too). Looks like it's
just these two tuplesort callers that are affected.

How much could it hurt performance to fix the bug in this way? Does
anyone see an alternative? Pushing knowledge of this special case into
tuplesort.c seems unappealing.

-- 
Peter Geoghegan
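[Editor's note: the hazard Peter describes -- a caller holding on to a pointer into batch memory that the producer recycles on the next fetch -- can be illustrated with a minimal generic C sketch. All names here (BatchReader, reader_next, count_distinct) are hypothetical illustrations, not PostgreSQL code.]

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* A reader that hands out pointers into an internal batch buffer,
 * which it may free and replace on the next call -- the pattern
 * behind the use-after-free described above. */
typedef struct BatchReader {
    char *batch;        /* current batch buffer, owned by the reader */
    int   next;         /* next value to hand out */
} BatchReader;

/* Returns a pointer valid only until the next call: the reader
 * recycles its batch buffer at every call boundary. */
static const char *reader_next(BatchReader *r)
{
    free(r->batch);                 /* recycle previous batch memory */
    r->batch = malloc(16);
    snprintf(r->batch, 16, "val%d", r->next++);
    return r->batch;
}

/* Safe caller: copies the value into caller-owned storage before the
 * next call, mirroring the defensive copy in the fix. Keeping the
 * raw pointer across calls instead would be a use-after-free. */
static int count_distinct(BatchReader *r, int n)
{
    char prev[16] = "";
    int distinct = 0;
    for (int i = 0; i < n; i++) {
        const char *cur = reader_next(r);
        if (strcmp(prev, cur) != 0)
            distinct++;
        snprintf(prev, sizeof(prev), "%s", cur); /* copy, don't keep cur */
    }
    return distinct;
}
```

The buggy variant would stash `cur` itself (as process_ordered_aggregate_multi() effectively did with the slot's tuple) and compare it on the next iteration, after `reader_next()` has already freed the memory it points into.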
Peter Geoghegan <pg@heroku.com> writes:
> I see what the problem is. This use-after-free bug occurs because, for
> some reason, tuplesort_gettupleslot()'s contract was not updated by me
> alongside analogous routines like tuplesort_getindextuple() as part of
> the batch memory commit. This meant that at least one particular case
> (process_ordered_aggregate_multi() calls to tuplesort_gettupleslot(),
> with distinct columns) felt entitled to reuse some slot's tuple, a
> tuple located in memory managed by tuplesort.c as batch memory.

My immediate reaction to this is WTF. It seems like you have completely
broken the expected contract of tuplesort_gettupleslot, which is that
it copies the data into caller-owned memory. That cannot stand. Even if
you are willing to hack all the core callers so that they no longer
expect that, you've probably broken --- subtly --- callers in
extensions.

			regards, tom lane
On Thu, Sep 29, 2016 at 4:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> My immediate reaction to this is WTF. It seems like you have completely
> broken the expected contract of tuplesort_gettupleslot, which is that
> it copies the data into caller-owned memory. That cannot stand.

Other comparable routines, like tuplesort_getindextuple(), could be
argued to have always had a contract that allows for this kind of
thing (recycling batch memory) because they get to examine a
*should_free pointer -- do you accept that much?

tuplesort_gettupleslot() does that for callers that happen to want to
use a tuple slot, rather than a direct caller tuple. So, it doesn't
seem like there is a very hard distinction there; that could also have
broken something in an extension, since the lifetime of memory
controlled by tuplesort (should_free = false cases, managed by the
tuplesort memory context) was always a bit unclear.

-- 
Peter Geoghegan
On 09/29/2016 07:04 PM, Peter Geoghegan wrote:
> On Thu, Sep 29, 2016 at 4:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> My immediate reaction to this is WTF. It seems like you have completely
>> broken the expected contract of tuplesort_gettupleslot, which is that
>> it copies the data into caller-owned memory. That cannot stand.
>
> Other comparable routines, like tuplesort_getindextuple(), could be
> argued to have always had a contract that allows for this kind of
> thing (recycling batch memory) because they get to examine a
> *should_free pointer -- do you accept that much?

The comments in those other routines explicitly say that "If it is not
set [*should_free], caller should not use tuple following next call
here". I also didn't notice until now that tuplesort_gettupleslot
doesn't say that.

That's unfortunate. AFAICS, we have no choice but palloc(), when
tuplesort_gettupleslot() is used. For version 10, maybe we should
reconsider that API.

- Heikki
On Thu, Sep 29, 2016 at 10:08 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> The comments in those other routines explicitly say that "If it is not set
> [*should_free], caller should not use tuple following next call here". I
> also didn't notice until now that tuplesort_gettupleslot doesn't say that.

They say that because the patch with the bug in question made them say
that. The fact that tuplesort_gettupleslot() did not have such a
comment added at the same time (and, more importantly, verification
that that would be okay for all existing callers) was arguably the
oversight in that patch/commit.

That said, it probably wouldn't have been a good idea to make it work.
There is an implicit requirement for tuplesort_gettupleslot() alone
around the ownership of memory, per Tom -- the caller wants ownership
of the memory returned. This is confused by the fact that the memory
is and was always allocated in the sortcontext memory context. So, the
caller wants ownership of the memory, but gets it in the sortcontext.
It looks like the callers sometimes ultimately call
ExecCopySlotTuple() again, just to be safe (e.g.,
agg_retrieve_direct()), possibly because the contained-in-slot tuple
may need to outlive the tuplesort itself (which represents the entire
sort operation's lifetime).

> That's unfortunate. AFAICS, we have no choice but palloc(), when
> tuplesort_gettupleslot() is used. For version 10, maybe we should reconsider
> that API.

We have to do something to fix this bug, and performing an extra
retail palloc() looks like the easiest solution. This does not
actually defeat the purpose of batch memory, which was mostly about
cache efficiency. I tend to think that the possible uses of the slot
by tuplesort_gettupleslot() callers are too complex to tie down.

-- 
Peter Geoghegan
On Wed, Oct 12, 2016 at 2:20 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> That's unfortunate. AFAICS, we have no choice but palloc(), when
>> tuplesort_gettupleslot() is used. For version 10, maybe we should reconsider
>> that API.
>
> We have to do something to fix this bug, and performing an extra
> retail palloc() looks like the easiest solution. This does not
> actually defeat the purpose of batch memory, which was mostly about
> cache efficiency. I tend to think that the possible uses of the slot
> by tuplesort_gettupleslot() callers are too complex to tie down.

Attached is a fix inspired by the similar routine in tuplestore.c:
tuplestore_gettupleslot(). This bug seems totally analogous to the one
fixed by 25bf7f8b -- history repeats itself.

This fix has us copy the MinimalTuple into sortcontext palloc() memory
within tuplesort_gettupleslot() (based on commit 25bf7f8b). This still
differs a little from tuplestore_gettupleslot(), which explicitly uses
the current context of the caller, but we've always done things that
way for tuplesort.c.

It wouldn't be very hard to also add a tuplestore_gettupleslot()-style
"copy" boolean argument to tuplesort_gettupleslot(), but I haven't
done that here. That way, callers that are happy to rely on the
current behavior (the sometimes dangerous behavior) could do so, and
so could still benefit from avoiding the new
heap_copy_minimal_tuple() call (doing so isn't broken for some callers
-- those that are happy to not reuse the contents of the tuple slot
across calls). For now, I haven't added that. I guess that's Postgres
10 work, as Heikki mentioned.

Apologies for the delay on this.

-- 
Peter Geoghegan
On Wed, Oct 12, 2016 at 4:36 PM, Peter Geoghegan <pg@heroku.com> wrote:
> This fix has us copy the MinimalTuple into sortcontext palloc() memory
> within tuplesort_gettupleslot() (based on commit 25bf7f8b). This still
> differs a little from tuplestore_gettupleslot(), which explicitly uses
> the current context of the caller, but we've always done things that
> way for tuplesort.c.

Actually, it's only true that the tuplesort sortcontext is used when a
copy isn't needed, which is not predictable to the caller, so the new
comment is a bit inaccurate. The inconsistency seems inconsequential,
since we've always assumed that caller tuples allocated within the
sortcontext may be "owned" by the caller (when should_free = true),
despite not being in the caller's own memory context.

Attached is a revision with a tiny tweak to the relevant comment.

-- 
Peter Geoghegan
On 10/13/2016 03:02 AM, Peter Geoghegan wrote:
> On Wed, Oct 12, 2016 at 4:36 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> This fix has us copy the MinimalTuple into sortcontext palloc() memory
>> within tuplesort_gettupleslot() (based on commit 25bf7f8b). This still
>> differs a little from tuplestore_gettupleslot(), which explicitly uses
>> the current context of the caller, but we've always done things that
>> way for tuplesort.c.
>
> Actually, it's only true that the tuplesort sortcontext is used when a
> copy isn't needed, which is not predictable to the caller, so the new
> comment is a bit inaccurate. The inconsistency seems inconsequential,
> since we've always assumed that caller tuples allocated within the
> sortcontext may be "owned" by the caller (when should_free = true),
> despite not being in the caller's own memory context.
>
> Attached is a revision with a tiny tweak to the relevant comment.

Hmm. That also adds a copy to the sorted-in-mem case. That's safe, but
should we be worried about performance? Or is the extra copy so cheap
that it doesn't matter? We could easily also check for
TSS_SORTEDINMEM, if that matters.

- Heikki
On Thu, Oct 13, 2016 at 12:59 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Hmm. That also adds a copy to the sorted-in-mem case. That's safe, but
> should we be worried about performance? Or is the extra copy so cheap
> that it doesn't matter?

I think that simply reading random locations in memory is the dominant
cost, but the exact overhead should be investigated before proceeding.

> We could easily also check for TSS_SORTEDINMEM, if that matters.

You're asking about this because TSS_SORTEDINMEM is a case that
changes (gets the caller's tuple copied) from 9.5 to a patched 9.6 for
what might seem like no particular reason. That's fair, but if you
consider the code and not the history, then TSS_SORTEDINMEM isn't
really a special case (plus, you can say the same thing about the
equivalent code within tuplestore.c). That is, unnecessary copying
will occur for perhaps over 99% of calls here -- this waste occurs
with all TSS_SORTEDINMEM-state calls, but also with most
TSS_FINALMERGE-state calls.

The point I'm making is that we might be better off worrying about the
general problem, by adding a tuplestore_gettupleslot()-style "copy"
boolean argument at the same time, and having some callers pass
"false" to avoid copying (when they determine there is no risk of
use-after-free, by not keeping the contents of a slot active across
calls to tuplesort_gettupleslot()). You indicated that you don't
really want to go there for 9.6, but maybe it's worth reconsidering
that. For example, maybe ABI breakage is avoided by making
tuplesort_gettupleslot() into a shim. Or, maybe it's okay to put it in
the release notes of 9.6.1 -- I'm not sure how manageable that is.

-- 
Peter Geoghegan
On 10/13/2016 08:27 PM, Peter Geoghegan wrote:
> On Thu, Oct 13, 2016 at 12:59 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Hmm. That also adds a copy to the sorted-in-mem case. That's safe, but
>> should we be worried about performance? Or is the extra copy so cheap
>> that it doesn't matter?
>
> I think that simply reading random locations in memory is the dominant
> cost, but the exact overhead should be investigated before proceeding.

Ok. In quick testing, the extra palloc() indeed didn't seem to cost much.

> The point I'm making is that we might be better off worrying about the
> general problem, by adding a tuplestore_gettupleslot()-style "copy"
> boolean argument at the same time, and having some callers pass
> "false" to avoid copying (when they determine there is no risk of
> use-after-free, by not keeping the contents of a slot active across
> calls to tuplesort_gettupleslot()). You indicated that you don't
> really want to go there for 9.6, but maybe it's worth reconsidering
> that. For example, maybe ABI breakage is avoided by making
> tuplesort_gettupleslot() into a shim. Or, maybe it's okay to put it in
> the release notes of 9.6.1 -- I'm not sure how manageable that is.

Pushed this isolated fix for now.

- Heikki
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On 10/13/2016 08:27 PM, Peter Geoghegan wrote:
>> The point I'm making is that we might be better off worrying about the
>> general problem, by adding a tuplestore_gettupleslot()-style "copy"
>> boolean argument at the same time, and having some callers pass
>> "false" to avoid copying (when they determine there is no risk of
>> use-after-free, by not keeping the contents of a slot active across
>> calls to tuplesort_gettupleslot()). You indicated that you don't
>> really want to go there for 9.6, but maybe it's worth reconsidering
>> that. For example, maybe ABI breakage is avoided by making
>> tuplesort_gettupleslot() into a shim. Or, maybe it's okay to put it in
>> the release notes of 9.6.1 -- I'm not sure how manageable that is.

> Pushed this isolated fix for now.

FWIW, I think this is an appropriate solution for 9.6, but I'd favor
changing the function's API to be like tuplestore_gettupleslot's
in HEAD.

			regards, tom lane
On Mon, Oct 17, 2016 at 6:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> FWIW, I think this is an appropriate solution for 9.6, but I'd favor
> changing the function's API to be like tuplestore_gettupleslot's
> in HEAD.

In light of the fact that the minimal fix has low overhead, I agree.
I've drafted a patch to do this. I expect to be able to post it in the
next few days.

-- 
Peter Geoghegan
On Mon, Oct 17, 2016 at 6:29 PM, Peter Geoghegan <pg@heroku.com> wrote:
> In light of the fact that the minimal fix has low overhead, I agree.
> I've drafted a patch to do this. I expect to be able to post it in the
> next few days.

I've put this off until I can get some buy-in from Heikki about a more
general issue with "should_free" arguments in tuplesort.c [1].

[1] https://www.postgresql.org/message-id/CAM3SWZQWZZ_N=DmmL7tKy_OUjGH_5mN=N=A6h7kHyyDvEhg2DA@mail.gmail.com

-- 
Peter Geoghegan