Thread: PostgreSQL crashes with SIGSEGV
A customer recently reported a crash in a postgres backend. The backend encountered a SIGSEGV, crashing during SELECTs from a fairly complicated view using a grouping set directive. I've managed to reproduce it by tracking it down to a specific SELECT, but unfortunately couldn't yet manage to strip it down to a small, repeatable test case which doesn't involve the whole (sensitive) dataset. I'm reporting my findings so far; maybe it already helps to track it down.

The crashed backend gives the following backtrace:

Program received signal SIGSEGV, Segmentation fault.
0x000000000099be84 in pfree (pointer=0x17194e8) at mcxt.c:1007
1007            context = ((StandardChunkHeader *)
(gdb) bt
#0  0x000000000099be84 in pfree (pointer=0x17194e8) at mcxt.c:1007
#1  0x00000000004798d8 in heap_free_minimal_tuple (mtup=0x17194e8) at heaptuple.c:1403
#2  0x0000000000687202 in ExecClearTuple (slot=0x10a3f88) at execTuples.c:455
#3  0x0000000000686b89 in ExecResetTupleTable (tupleTable=0x10a06a0, shouldFree=0 '\000') at execTuples.c:169
#4  0x00000000006773ff in ExecEndPlan (planstate=0x10a0398, estate=0x10a01f0) at execMain.c:1469
#5  0x0000000000675927 in standard_ExecutorEnd (queryDesc=0x10910c0) at execMain.c:468
#6  0x0000000000675865 in ExecutorEnd (queryDesc=0x10910c0) at execMain.c:439
#7  0x000000000062c71c in PortalCleanup (portal=0xf79ae0) at portalcmds.c:280
#8  0x000000000099ce97 in PortalDrop (portal=0xf79ae0, isTopCommit=0 '\000') at portalmem.c:510
#9  0x0000000000813639 in exec_simple_query (
    query_string=0xf3dbe0 "SELECT * FROM \"xxxx\".\"xxxx\" WHERE (xxxx = 'xxxx') LIMIT 1000;") at postgres.c:1095
#10 0x00000000008177ea in PostgresMain (argc=1, argv=0xee9e10, dbname=0xeb9a78 "xxxx",
    username=0xee9c78 "bernd") at postgres.c:4072
#11 0x000000000078cc27 in BackendRun (port=0xedc700) at postmaster.c:4342
#12 0x000000000078c377 in BackendStartup (port=0xedc700) at postmaster.c:4016
#13 0x0000000000788983 in ServerLoop () at postmaster.c:1721
#14 0x0000000000787fb8 in PostmasterMain (argc=1, argv=0xeb7960) at postmaster.c:1329
#15 0x00000000006d1b52 in main (argc=1, argv=0xeb7960) at main.c:228

I've tested this so far against very current REL9_6_STABLE and REL9_5_STABLE and got them to crash with the same backtrace. The crash is dependent on the chosen plan; experiments with work_mem show that the crash seems to happen only if you get external sorts into the execution plan. REL10_STABLE seems not affected, as my extracted application query doesn't crash there.

Running the query against REL9_6_STABLE with valgrind shows the following results:

==00:00:01:33.067 13158== Invalid write of size 8
==00:00:01:33.067 13158==    at 0x93D53B: AllocSetFree (aset.c:998)
==00:00:01:33.067 13158==    by 0x93EC03: pfree (mcxt.c:1012)
==00:00:01:33.067 13158==    by 0x476E34: heap_free_minimal_tuple (heaptuple.c:1403)
==00:00:01:33.067 13158==    by 0x6521A3: ExecClearTuple (execTuples.c:455)
==00:00:01:33.067 13158==    by 0x651D4B: ExecResetTupleTable (execTuples.c:169)

==00:00:03:34.540 14381== Invalid read of size 8
==00:00:03:34.540 14381==    at 0x99E29C: pfree (mcxt.c:1007)
==00:00:03:34.540 14381==    by 0x4798D7: heap_free_minimal_tuple (heaptuple.c:1403)
==00:00:03:34.540 14381==    by 0x68753E: ExecClearTuple (execTuples.c:455)
==00:00:03:34.540 14381==    by 0x686EC5: ExecResetTupleTable (execTuples.c:169)
==00:00:03:34.540 14381==    by 0x67773B: ExecEndPlan (execMain.c:1469)
==00:00:03:34.540 14381==    by 0x675C63: standard_ExecutorEnd (execMain.c:468)
==00:00:03:34.540 14381==    by 0x675BA1: ExecutorEnd (execMain.c:439)
==00:00:03:34.540 14381==    by 0x62CA58: PortalCleanup (portalcmds.c:280)
==00:00:03:34.540 14381==    by 0x99F412: PortalDrop (portalmem.c:510)
==00:00:03:34.540 14381==    by 0x8142A4: exec_simple_query (postgres.c:1095)
==00:00:03:34.540 14381==    by 0x818455: PostgresMain (postgres.c:4072)
==00:00:03:34.540 14381==    by 0x78CF63: BackendRun (postmaster.c:4342)
==00:00:03:34.540 14381==  Address 0x188cd220 is 474,080 bytes inside a block of size 1,048,576 free'd
==00:00:03:34.540 14381==    at 0x4C2FD18: free (vg_replace_malloc.c:530)
==00:00:03:34.540 14381==    by 0x99A8DB: AllocSetDelete (aset.c:653)
==00:00:03:34.540 14381==    by 0x99C7A4: MemoryContextDelete (mcxt.c:225)
==00:00:03:34.540 14381==    by 0x99C855: MemoryContextDeleteChildren (mcxt.c:245)
==00:00:03:34.540 14381==    by 0x99C772: MemoryContextDelete (mcxt.c:208)
==00:00:03:34.540 14381==    by 0x9A5F10: tuplesort_end (tuplesort.c:1198)
==00:00:03:34.540 14381==    by 0x692568: ExecEndAgg (nodeAgg.c:3449)
==00:00:03:34.540 14381==    by 0x67BB46: ExecEndNode (execProcnode.c:755)
==00:00:03:34.540 14381==    by 0x6AD652: ExecEndSubqueryScan (nodeSubqueryscan.c:181)
==00:00:03:34.540 14381==    by 0x67BA69: ExecEndNode (execProcnode.c:697)
==00:00:03:34.540 14381==    by 0x69DF6E: ExecEndLimit (nodeLimit.c:438)
==00:00:03:34.540 14381==    by 0x67BB9D: ExecEndNode (execProcnode.c:779)
==00:00:03:34.540 14381==  Block was alloc'd at
==00:00:03:34.540 14381==    at 0x4C2EB6B: malloc (vg_replace_malloc.c:299)
==00:00:03:34.540 14381==    by 0x99AED4: AllocSetAlloc (aset.c:866)
==00:00:03:34.540 14381==    by 0x99DC7A: palloc (mcxt.c:904)
==00:00:03:34.540 14381==    by 0x4798F6: heap_copy_minimal_tuple (heaptuple.c:1417)
==00:00:03:34.540 14381==    by 0x6878D7: ExecCopySlotMinimalTuple (execTuples.c:593)
==00:00:03:34.540 14381==    by 0x9AC1D2: copytup_heap (tuplesort.c:3998)
==00:00:03:34.540 14381==    by 0x9A616C: tuplesort_puttupleslot (tuplesort.c:1345)
==00:00:03:34.540 14381==    by 0x68D7D8: fetch_input_tuple (nodeAgg.c:601)
==00:00:03:34.540 14381==    by 0x68FEC1: agg_retrieve_direct (nodeAgg.c:2168)
==00:00:03:34.540 14381==    by 0x68F9AB: ExecAgg (nodeAgg.c:1903)
==00:00:03:34.540 14381==    by 0x67B70A: ExecProcNode (execProcnode.c:503)
==00:00:03:34.540 14381==    by 0x6AD40F: SubqueryNext (nodeSubqueryscan.c:53)
{
   <insert_a_suppression_name_here>
   Memcheck:Addr8
   fun:pfree
   fun:heap_free_minimal_tuple
   fun:ExecClearTuple
   fun:ExecResetTupleTable
   fun:ExecEndPlan
   fun:standard_ExecutorEnd
   fun:ExecutorEnd
   fun:PortalCleanup
   fun:PortalDrop
   fun:exec_simple_query
   fun:PostgresMain
   fun:BackendRun
}
==00:00:03:34.548 14381== Invalid read of size 4
==00:00:03:34.548 14381==    at 0x99E2AE: pfree (mcxt.c:1010)
==00:00:03:34.548 14381==    by 0x4798D7: heap_free_minimal_tuple (heaptuple.c:1403)
==00:00:03:34.548 14381==    by 0x68753E: ExecClearTuple (execTuples.c:455)
==00:00:03:34.548 14381==    by 0x686EC5: ExecResetTupleTable (execTuples.c:169)
==00:00:03:34.548 14381==    by 0x67773B: ExecEndPlan (execMain.c:1469)
==00:00:03:34.548 14381==    by 0x675C63: standard_ExecutorEnd (execMain.c:468)
==00:00:03:34.548 14381==    by 0x675BA1: ExecutorEnd (execMain.c:439)
==00:00:03:34.548 14381==    by 0x62CA58: PortalCleanup (portalcmds.c:280)
==00:00:03:34.548 14381==    by 0x99F412: PortalDrop (portalmem.c:510)
==00:00:03:34.548 14381==    by 0x8142A4: exec_simple_query (postgres.c:1095)
==00:00:03:34.548 14381==    by 0x818455: PostgresMain (postgres.c:4072)
==00:00:03:34.548 14381==    by 0x78CF63: BackendRun (postmaster.c:4342)
==00:00:03:34.548 14381==  Address 0x7f7f7f7f7f7f7f7f is not stack'd, malloc'd or (recently) free'd
{
   <insert_a_suppression_name_here>
   Memcheck:Addr4
   fun:pfree
   fun:heap_free_minimal_tuple
   fun:ExecClearTuple
   fun:ExecResetTupleTable
   fun:ExecEndPlan
   fun:standard_ExecutorEnd
   fun:ExecutorEnd
   fun:PortalCleanup
   fun:PortalDrop
   fun:exec_simple_query
   fun:PostgresMain
   fun:BackendRun
}
==00:00:03:34.548 14381==
==00:00:03:34.548 14381== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==00:00:03:34.548 14381==  General Protection Fault
==00:00:03:34.548 14381==    at 0x99E2AE: pfree (mcxt.c:1010)
==00:00:03:34.548 14381==    by 0x4798D7: heap_free_minimal_tuple (heaptuple.c:1403)
==00:00:03:34.548 14381==    by 0x68753E: ExecClearTuple (execTuples.c:455)
==00:00:03:34.548 14381==    by 0x686EC5: ExecResetTupleTable (execTuples.c:169)
==00:00:03:34.548 14381==    by 0x67773B: ExecEndPlan (execMain.c:1469)
==00:00:03:34.548 14381==    by 0x675C63: standard_ExecutorEnd (execMain.c:468)
==00:00:03:34.548 14381==    by 0x675BA1: ExecutorEnd (execMain.c:439)
==00:00:03:34.548 14381==    by 0x62CA58: PortalCleanup (portalcmds.c:280)
==00:00:03:34.548 14381==    by 0x99F412: PortalDrop (portalmem.c:510)
==00:00:03:34.548 14381==    by 0x8142A4: exec_simple_query (postgres.c:1095)
==00:00:03:34.548 14381==    by 0x818455: PostgresMain (postgres.c:4072)
==00:00:03:34.548 14381==    by 0x78CF63: BackendRun (postmaster.c:4342)
==00:00:03:35.088 14381==
==00:00:03:35.088 14381== Process terminating with default action of signal 11 (SIGSEGV)
==00:00:03:35.088 14381==  General Protection Fault
==00:00:03:35.088 14381==    at 0x632016C: _dl_catch_error (in /usr/lib64/libc-2.25.so)
==00:00:03:35.088 14381==    by 0x631F8E6: __libc_dlclose (in /usr/lib64/libc-2.25.so)
==00:00:03:35.088 14381==    by 0x634B5E4: free_mem (in /usr/lib64/libc-2.25.so)
==00:00:03:35.088 14381==    by 0x634B1E1: __libc_freeres (in /usr/lib64/libc-2.25.so)
==00:00:03:35.088 14381==    by 0x4A296DB: _vgnU_freeres (vg_preloaded.c:77)
==00:00:03:35.088 14381==    by 0x18118EA7: ???
==00:00:03:35.088 14381==    by 0x4798D7: heap_free_minimal_tuple (heaptuple.c:1403)
==00:00:03:35.088 14381==    by 0x68753E: ExecClearTuple (execTuples.c:455)
==00:00:03:35.088 14381==    by 0x686EC5: ExecResetTupleTable (execTuples.c:169)
==00:00:03:35.088 14381==    by 0x67773B: ExecEndPlan (execMain.c:1469)
==00:00:03:35.088 14381==    by 0x675C63: standard_ExecutorEnd (execMain.c:468)
==00:00:03:35.088 14381==    by 0x675BA1: ExecutorEnd (execMain.c:439)
==00:00:03:35.088 14381==
==00:00:03:35.088 14381== HEAP SUMMARY:
==00:00:03:35.088 14381==     in use at exit: 6,833,625 bytes in 531 blocks
==00:00:03:35.088 14381==   total heap usage: 814,111 allocs, 1,264 frees, 118,761,978 bytes allocated
==00:00:03:35.088 14381==
==00:00:03:35.088 14381== For a detailed leak analysis, rerun with: --leak-check=full
==00:00:03:35.088 14381==
==00:00:03:35.088 14381== For counts of detected and suppressed errors, rerun with: -v
==00:00:03:35.088 14381== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 35 from 6)

It seems the backend tries to free a minimal tuple at executor end which is already gone, because the memory context it was allocated in was deleted earlier.
ExecResetTupleTable() is looping through a list with 70 entries; after six or seven rounds it encounters the first tuple slot with tts_shouldFreeMin set (none of the slots before it have it set):

(gdb) p *slot
$6 = {type = T_TupleTableSlot, tts_isempty = 0 '\000', tts_shouldFree = 0 '\000',
  tts_shouldFreeMin = 1 '\001', tts_slow = 1 '\001', tts_tuple = 0x10a6c48,
  tts_tupleDescriptor = 0x10d1be8, tts_mcxt = 0xf59668, tts_buffer = 0,
  tts_nvalid = 6, tts_values = 0x10d2640, tts_isnull = 0x10d26d8 "",
  tts_mintuple = 0x171c168, tts_minhdr = {t_len = 162, t_self = {ip_blkid = {
        bi_hi = 0, bi_lo = 0}, ip_posid = 0}, t_tableOid = 0,
    t_data = 0x171c160}, tts_off = 88}
(gdb) p list_length(tupleTable)

tts_mintuple is invalid, though:

(gdb) p *slot->tts_mintuple
Cannot access memory at address 0x171c168

The following attempt to clean it up in ExecClearTuple() then crashes the backend.

Bernd
On Fri, Dec 8, 2017 at 12:47 AM, Bernd Helmle <mailings@oopsware.de> wrote:
> A customer recently reported a crash in a postgres backend. The backend
> encountered a SIGSEGV, crashing during SELECTs from a fairly
> complicated view using a grouping set directive. I've managed to
> reproduce it by tracking it down to a specific SELECT, but
> unfortunately couldn't yet manage to strip it down to a small,
> repeatable test case which doesn't involve the whole (sensitive)
> dataset. I'm reporting my findings so far, maybe it helps to track it
> down already.

Hmm. Even if you cannot reproduce an isolated test case, could it be possible to get an idea of the shape of the SELECT query involved, and of the schema plus the view? No need for sensitive data here. This would help in reproducing a test case. Also, what are the sizes involved? Would even a small data set with low work_mem trigger the problem?

> I've tested this so far against very current REL9_6_STABLE and
> REL9_5_STABLE and got them to crash with the same backtrace. The crash
> is dependent on the chosen plan, experiments with work_mem show that
> the crash seems to happen only if you get external sorts into the
> execution plan.
> REL10_STABLE seems not affected, as my extracted application query
> doesn't crash there.

That's one thing to begin with. So HEAD is not affected as well?
--
Michael
On Thu, Dec 7, 2017 at 7:47 AM, Bernd Helmle <mailings@oopsware.de> wrote:
> I've tested this so far against very current REL9_6_STABLE and
> REL9_5_STABLE and got them to crash with the same backtrace. The crash
> is dependent on the chosen plan, experiments with work_mem show that
> the crash seems to happen only if you get external sorts into the
> execution plan.
>
> REL10_STABLE seems not affected, as my extracted application query
> doesn't crash there.

Does "set replacement_sort_tuples = 0" change anything on REL9_6_STABLE? If you only get a crash when there is very little work_mem, then that might be a good avenue of investigation. Note that my changes to external sorting started in REL9_6_STABLE, so they can't be involved here.

Are you aware of commit 512f67c8d02cc558f9c269cc848b0f0f788c4fe1, which fixed a bug affecting external sorts? Are you sure that you have that fix on REL9_5_STABLE + REL9_6_STABLE?
--
Peter Geoghegan
On Thu, Dec 7, 2017 at 7:47 AM, Bernd Helmle <mailings@oopsware.de> wrote:
> It seems the backend tries to free a minimal tuple at executor end,
> which is already gone by deleting the memory context it was allocated
> in before. ExecResetTupleTable() is looping through a list with 70
> entries, it encounters (after 6 or seven rounds) the first tuple slot
> with tts_shouldFreeMin set, all others before don't have it set:

On second thought, it seems more likely that the reason that REL10_STABLE is unaffected is commit 3856cf9607f41245ec9462519c53f1109e781fc5. As of that commit (which built on earlier v10 work), there is no question about memory for tuples retrieved via tuplesort_gettupleslot() not belonging to the caller -- it must belong to the caller. The old interface already resulted in bugs in early 9.6 point releases that looked similar to this one, so I was glad to remove it. (Note also that this picture was slightly complicated by the performance optimization commit fa117ee40330db401da776e7b003f047098a7d4c that followed, which made some callers opt out of copying when that was clearly safe, but that probably isn't important.)

So, as you said, the question that we probably need to answer is: just how did the grouping sets/nodeAgg.c code end up getting tuple memory lifetime wrong? One good way to get more information is to rerun Valgrind, but this time with track origins enabled. I myself run Valgrind like this when I want to see the origin of memory involved in an error. I specify:

$ valgrind --leak-check=no --gen-suppressions=all --trace-children=yes
--track-origins=yes --read-var-info=yes
--suppressions=/home/pg/postgresql/root/source/src/tools/valgrind.supp
-v postgres --log_line_prefix="%m %p " --log_statement=all
--shared_buffers=64MB 2>&1 | tee postmaster.log

(Probably the only change that you'll need to make is to run Valgrind with the extra "--track-origins=yes".)
--track-origins=yes is usually something I use when I already know that Valgrind will complain, but want more information about the nature of the problem. -- Peter Geoghegan
On Thursday, Dec 7, 2017 at 18:23 -0800, Peter Geoghegan wrote:
> Does "set replacement_sort_tuples = 0" change anything on
> REL9_6_STABLE? If you only get a crash when there is very little
> work_mem, then that might be a good avenue of investigation. Note that
> my changes to external sorting started in REL9_6_STABLE, so they can't
> be involved here.

replacement_sort_tuples = 0 changes the picture, indeed. With that setting, the query runs without problems on REL9_6_STABLE.

> Are you aware of commit 512f67c8d02cc558f9c269cc848b0f0f788c4fe1,
> which fixed a bug affecting external sorts? Are you sure that you have
> that fix on REL9_5_STABLE + REL9_6_STABLE?

My test instances are built from a fresh pull from yesterday.
On Thursday, Dec 7, 2017 at 18:54 -0800, Peter Geoghegan wrote:
> So, as you said, the question that we probably need to answer is: just
> how did grouping sets/nodeAgg.c code end up getting tuple memory
> lifetime wrong. One good way to get more information is to rerun
> Valgrind, but this time with track origins enabled. I myself run
> Valgrind like this when I want to see the origin of memory involved in
> an error. I specify:
>
> $ valgrind --leak-check=no --gen-suppressions=all --trace-children=yes
> --track-origins=yes --read-var-info=yes
> --suppressions=/home/pg/postgresql/root/source/src/tools/valgrind.supp
> -v postgres --log_line_prefix="%m %p " --log_statement=all
> --shared_buffers=64MB 2>&1 | tee postmaster.log
>
> (Probably the only change that you'll need is to make is to run
> Valgrind with an the extra "--track-origins=yes".)
>
> --track-origins=yes is usually something I use when I already know
> that Valgrind will complain, but want more information about the
> nature of the problem.

That's what I already did. My usage of valgrind was this:

valgrind --leak-check=no --gen-suppressions=all \
    --track-origins=yes --suppressions=src/tools/valgrind.supp \
    --time-stamp=yes --trace-children=yes postgres
On Fri, Dec 8, 2017 at 6:19 PM, Bernd Helmle <mailings@oopsware.de> wrote:
> valgrind --leak-check=no --gen-suppressions=all \
>     --track-origins=yes --suppressions=src/tools/valgrind.supp \
>     --time-stamp=yes --trace-children=yes postgres

I have been trying for a couple of hours to reproduce a failure using views with grouping sets, enforcing an external sort with things like this:

create table aa (a1 int, a2 int, a3 int, a4 int, a5 int, a6 int);
insert into aa
  SELECT (random() * 2000000000)::int, (random() * 2000000000)::int,
         (random() * 2000000000)::int, (random() * 2000000000)::int,
         (random() * 2000000000)::int, (random() * 2000000000)::int
  FROM generate_series(1,1000000);
set work_mem = '128kB';
create or replace view aav as
  SELECT a1, a2, a3, a4, a5, avg(a6)
  FROM aa GROUP BY GROUPING SETS (a1, a2, a3, a4, a5);
explain analyze select * from aav order by a1, a2, a3, a4, a5;

Not sure if you can provide this information, but what does the plan of the query look like?
--
Michael
Michael Paquier writes:
> Not sure if you can provide this information, but what does the plan
> of the query look like?

We did some more reducing work on the original query and data. The following testcase reproduces the double free reported by valgrind for me when run against a vanilla REL9_6_STABLE build.

regards,
Andreas

--8<---------------cut here---------------start------------->8---
drop table if exists bug;
create table bug (n text, v text, b text, t text);
insert into bug
  select i%9, i%9, i%16 ,i%4096 from generate_series(1,100000) g(i);
analyze bug;

explain select * from (
  select thecube.nv
  from ( select
             (n || ' ') || coalesce(v, '') as nv
         from bug
         group by ((n || ' ') || coalesce(v, '')) ,cube(b, t)
       ) thecube
  where nv = '8 8'
) sub limit 7000;
--8<---------------cut here---------------end--------------->8---

                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
 Limit  (cost=13422.32..15747.28 rows=7000 width=32)
   ->  Subquery Scan on thecube  (cost=13422.32..27622.49 rows=42754 width=32)
         ->  GroupAggregate  (cost=13422.32..27194.95 rows=42754 width=38)
               Group Key: (((bug.n || ' '::text) || COALESCE(bug.v, ''::text))), bug.b, bug.t
               Group Key: (((bug.n || ' '::text) || COALESCE(bug.v, ''::text))), bug.b
               Group Key: (((bug.n || ' '::text) || COALESCE(bug.v, ''::text)))
               Sort Key: (((bug.n || ' '::text) || COALESCE(bug.v, ''::text))), bug.t
                 Group Key: (((bug.n || ' '::text) || COALESCE(bug.v, ''::text))), bug.t
               Filter: ((((bug.n || ' '::text) || COALESCE(bug.v, ''::text))) = '8 8'::text)
               ->  Sort  (cost=13422.32..13672.32 rows=100000 width=42)
                     Sort Key: (((bug.n || ' '::text) || COALESCE(bug.v, ''::text))), bug.b, bug.t
                     ->  Seq Scan on bug  (cost=0.00..2041.00 rows=100000 width=42)
On Thu, Dec 14, 2017 at 10:47 AM, Andreas Seltenreich <andreas.seltenreich@credativ.de> wrote:
> We did some more reducing work on the original query and data. The
> following testcase reproduces the double free reported by valgrind for
> me when run against a vanilla REL9_6_STABLE build.
> --8<---------------cut here---------------start------------->8---
> drop table if exists bug;
> create table bug (n text, v text, b text, t text);
> insert into bug
>   select i%9, i%9, i%16 ,i%4096 from generate_series(1,100000) g(i);
> analyze bug;
>
> explain select * from (
>   select thecube.nv
>   from ( select
>              (n || ' ') || coalesce(v, '') as nv
>          from bug
>          group by ((n || ' ') || coalesce(v, '')) ,cube(b, t)
>        ) thecube
>   where nv = '8 8'
> ) sub limit 7000;
> --8<---------------cut here---------------end--------------->8---

I can reproduce this against REL9_6_STABLE, once work_mem is set to 4MB, and replacement_sort_tuples is set to 150000.
--
Peter Geoghegan
On Thu, Dec 14, 2017 at 12:06 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> I can reproduce this against REL9_6_STABLE, once work_mem is set to
> 4MB, and replacement_sort_tuples is set to 150000.

I've figured this one out. It's a double free. Consider this code (although REL9_5_STABLE is also affected, the following remarks reference REL9_6_STABLE-only code):

 * The slot receives a copied tuple (sometimes allocated in caller memory
 * context) that will stay valid regardless of future manipulations of the
 * tuplesort's state.
 */
bool
tuplesort_gettupleslot(/* SNIP */)

I wrote that comment block in commit a5f0bd77a, an early bugfix for 9.6 (I actually mentioned this specific commit to Bernd in passing already). This bug is very similar to the one fixed by a5f0bd77a, and is arguably a broader version of that same bug. Note that I didn't do anything with external sorts until 9.6, so I expect that this affects all stable branches. That said, it may not be possible to produce a hard crash on all versions, more or less by accident.

The function opens with this code:

    MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
    SortTuple   stup;
    bool        should_free;

    if (!tuplesort_gettuple_common(state, forward, &stup, &should_free))
        stup.tuple = NULL;

    MemoryContextSwitchTo(oldcontext);
    /* SNIP */
    ...

As long as the slot receives a copied tuple that may or may not be in the caller's memory context (it may instead get allocated in a tuplesort-private memory context), and as long as destroying a tuplesort via tuplesort_end() counts as a "future manipulation", it's possible to crash with a double-pfree() while not violating the tuplesort_gettupleslot() contract. The memory may or may not be freed as part of tuplesort_end() teardown, based only on accidental implementation factors (that's the first time the memory gets freed when we do crash).
Specifically, if state->status == TSS_SORTEDONTAPE, we risk this crash on versions before Postgres 10 (more on why 10 appears to be okay tomorrow), when tuplesort_end() just so happens to have been called (destroying the 2 memory contexts) before the executor's superfluous pfree() within ExecResetTupleTable().

The ambiguity about who owns the tuple memory when is the basic problem, once again. One could argue that it's the caller's fault for not knowing not to pfree() the tuple after tuplesort_end() is called, but I don't buy that, because it seems like an undue burden on callers. It seems okay to either provide a very weak, simple guarantee about tuple memory lifetime, or a very strong, simple guarantee about tuple memory lifetime. That's what routines like tuplesort_gettupleslot() and tuplesort_getindextuple() should limit their contracts to -- we've had enough problems in this area over the years that that seems pretty clear to me.

ISTM that an appropriate fix is one that results in having tuplesort_gettupleslot() tuple memory *consistently* allocated within the caller's own memory context, at least on versions prior to Postgres 10 (it's a bit more complicated in Postgres 10). I'll begin work on a patch for this tomorrow. It's not obvious what approach to take, because on 9.6+, memory is allocated in tuplecontext, not sortcontext. A well-targeted fix will not burden low-level READTUP() routines (that use tuplecontext) with direct knowledge of the problem.
--
Peter Geoghegan
On Thu, Dec 14, 2017 at 5:58 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> The ambiguity about who owns the tuple memory when is the basic
> problem, once again. One could argue that it's the caller's fault for
> not knowing to not pfree() the tuple after tuplesort_end() is called,
> but I don't buy that because it seems like an undue burden on callers
> to do that. It seems okay to either provide a very weak, simple
> guarantee about tuple memory lifetime, or very strong simple guarantee
> about tuple memory lifetime. That's what routines like
> tuplesort_gettupleslot() and tuplesort_getindextuple() should limit
> their contracts to -- we've had enough problems in this area over the
> years that that seems pretty clear to me.
>
> ISTM that an appropriate fix is one that results in having
> tuplesort_gettupleslot() tuple memory that is *consistently* allocated
> within caller's own memory context, at least on versions prior to
> Postgres 10 (it's a bit more complicated in Postgres 10).

This took longer than expected. Attached patch, which targets 9.6, shows what I have in mind. I'm going on vacation shortly, but wanted to post this patch before I leave. It could definitely use some more testing, since it was written under time pressure. I expect to be back early in the new year, so if someone feels like taking this off my hands, they're more than welcome to do so.

Postgres v10 isn't affected because there is no such thing as should_free from commit f1f5ec1ef onwards (commit fa117ee40 reintroduced a no-copy optimization a while later, which looks like it made things unsafe again, but it's actually fine -- the ExecClearTuple() pfree() won't happen, so no crash). Everything is fine on v10 because the design was improved for v10.
On v10, tuplesort_gettupleslot() calls ExecStoreMinimalTuple() with arguments that indicate that the memory still belongs to the tuplesort, unless it was an explicit copy made with heap_copy_minimal_tuple() there and then, in which case it belongs to the caller. This works because on v10 we're always managing some temp buffers, regardless of state->status. There is no question of a routine like readtup_alloc() allocating or using memory without knowing about its lifetime, because it's always using the smallish slab allocator buffers.

To put it another way, on v10 everything is fine because callers of routines like tuplesort_gettupleslot() either have very limited guarantees about fetched tuple memory that is owned by the tuplesort (it can go away almost immediately, on the next call to tuplesort_gettupleslot()), or infinitely flexible guarantees (it's unambiguously the caller's copy, and lives in the caller's memory context). There is no awkward mix of these two situations, as is the case in earlier versions.

That awkward mix is: "You (caller) should free this memory, and can rely on this memory sticking around until tuplesort_end() is called, at which point I (tuplesort) will free it if you haven't -- though I might *not* free it then, based on whether or not I felt like using your memory context!". This is an accident waiting to happen. Remember, the crash isn't a use-after-free; it's a double free caused by conflated responsibilities across two low-level, refcount-like mechanisms.

The idea of a tts_shouldFree=true tuple whose memory exists in a context that isn't under the executor's direct control is fundamentally bogus. It kinda works most of the time, but only to the extent that the lifetime of a tuplesort is accidentally tied to the lifetime of an executor node, without the tuple/slot making it into something like estate->es_tupleTable (as was the case in Andreas' example).
An alternative approach to fixing this issue, which I also want to throw out there, is to perform a simple change within ExecResetTupleTable() to make ExecClearTuple()'d slots force "tts_shouldFree = false" -- a second patch shows what I mean here. All ExecResetTupleTable() callers pass "shouldFree = false" anyway, because all memory is about to go away (within FreeExecutorState()).

This second approach seems like a real kludge to me, but it's also a fix that requires only a tiny tweak, rather than non-trivial tuplesort.c memory management changes. I would feel worse about the idea of applying this kludge if we had to rely on it forever, but since v10 is unaffected anyway, we really don't. But it's still an awful kludge. I find it hard to know what to do here (the vacation is overdue, it seems).

(Note that v10 does have a minor, related bug, noted in the commit messages of both patches.)
--
Peter Geoghegan
On Sun, Dec 17, 2017 at 10:40 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Dec 14, 2017 at 5:58 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> The ambiguity about who owns the tuple memory when is the basic
>> problem, once again. One could argue that it's the caller's fault for
>> not knowing to not pfree() the tuple after tuplesort_end() is called,
>> but I don't buy that because it seems like an undue burden on callers
>> to do that. It seems okay to either provide a very weak, simple
>> guarantee about tuple memory lifetime, or very strong simple guarantee
>> about tuple memory lifetime. That's what routines like
>> tuplesort_gettupleslot() and tuplesort_getindextuple() should limit
>> their contracts to -- we've had enough problems in this area over the
>> years that that seems pretty clear to me.
>>
>> ISTM that an appropriate fix is one that results in having
>> tuplesort_gettupleslot() tuple memory that is *consistently* allocated
>> within caller's own memory context, at least on versions prior to
>> Postgres 10 (it's a bit more complicated in Postgres 10).
>
> This took longer than expected. Attached patch, which targets 9.6,
> shows what I have in mind. I'm going on vacation shortly, but wanted
> to post this patch before I leave. It could definitely use some more
> testing, since it was written under time pressure. I expect to be back
> early in the new year, so if someone feels like taking this off my
> hands, they're more than welcome to do so.

Could you add that to the next commit fest registered as a bug fix? We don't want to lose track of that and this should be reviewed.
--
Michael
On Sun, Dec 17, 2017 at 2:46 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> Could you add that to the next commit fest registered as a bug fix? We
> don't want to lose track of that and this should be reviewed.

Done.
--
Peter Geoghegan