postgres crash on concurrent update of inheritance partitioned table - Mailing list pgsql-bugs
From:      Oleksii Kliukin
Subject:   postgres crash on concurrent update of inheritance partitioned table
Date:
Msg-id:    EEF05F66-2871-4786-992B-5F45C92FEE2E@hintbits.com
Responses: Re: postgres crash on concurrent update of inheritance partitioned table
List:      pgsql-bugs
Hello there,

We have experienced a few crashes a day since upgrading to pg 12.1 last week, during the process of updating an inheritance-partitioned table from several concurrent sessions. Here's a reproducer that consistently crashes pg 12.1 (and master) on my machine.

Setup:

    create table p(flag boolean);
    create table c1() inherits(p);
    create table c2() inherits(p);
    insert into c1 select false from generate_series(1,10);
    insert into c2 select false from generate_series(1,10);

session 1:

    begin;
    update p set flag = true;

session 2:

    begin;
    update p set flag = true;   -- blocks waiting for session 1

session 1:

    commit;

session 2:

    server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
    The connection to the server was lost. Attempting reset: Failed.

The original backtrace we had from a production server (built w/o assertions), where it crashed from time to time:

    #0  0x0000557a89e7b3f4 in EvalPlanQualBegin (epqstate=epqstate@entry=0x557a8c3bed50) at execMain.c:2735
    #1  0x0000557a89e7b80b in EvalPlanQual (epqstate=epqstate@entry=0x557a8c3bed50, relation=relation@entry=0x557a8bcdce98,
        rti=268, inputslot=inputslot@entry=0x557a8c55eb00) at execMain.c:2454
    #2  0x0000557a89e9d633 in ExecUpdate (mtstate=mtstate@entry=0x557a8c3bec58, tupleid=0x7ffc99962f8a, oldtuple=0x0,
        slot=<optimized out>, planSlot=0x557a8c50f4d8, epqstate=epqstate@entry=0x557a8c3bed50, estate=0x557a8c3bb0f0,
        canSetTag=true) at nodeModifyTable.c:1387
    #3  0x0000557a89e9e612 in ExecModifyTable (pstate=0x557a8c3bec58) at nodeModifyTable.c:2223
    #4  0x0000557a89e78b8b in ExecProcNode (node=0x557a8c3bec58) at ../../../src/include/executor/executor.h:239
    #5  ExecutePlan (execute_once=<optimized out>, dest=0x7f1cdbf8c5c8, direction=<optimized out>, numberTuples=0,
        sendTuples=<optimized out>, operation=CMD_UPDATE, use_parallel_mode=<optimized out>, planstate=0x557a8c3bec58,
        estate=0x557a8c3bb0f0) at execMain.c:1646
    #6  standard_ExecutorRun (queryDesc=0x557a8c3ce440, direction=<optimized out>, count=0,
        execute_once=<optimized out>) at execMain.c:364
    #7  0x00007f1ceff25035 in pgss_ExecutorRun (queryDesc=0x557a8c3ce440, direction=ForwardScanDirection, count=0,
        execute_once=<optimized out>) at pg_stat_statements.c:893
    #8  0x0000557a89fcef60 in ProcessQuery (plan=<optimized out>,
        sourceText=0x557a8afd4c40 "BEGIN;\n CREATE TEMP TABLE foobar(a x, b y, c bigint, d e, f bigint, g bigint) ON COMMIT DROP;\n COPY foobar(a, b"...,
        params=0x0, queryEnv=0x0, dest=0x7f1cdbf8c5c8, completionTag=0x7ffc999633b0 "") at pquery.c:161
    #9  0x0000557a89fcf190 in PortalRunMulti (portal=portal@entry=0x557a8b06ce60, isTopLevel=isTopLevel@entry=true,
        setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x7f1cdbf8c5c8, altdest=altdest@entry=0x7f1cdbf8c5c8,
        completionTag=completionTag@entry=0x7ffc999633b0 "") at pquery.c:1283
    #10 0x0000557a89fcfcca in PortalRun (portal=portal@entry=0x557a8b06ce60, count=count@entry=9223372036854775807,
        isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x7f1cdbf8c5c8,
        altdest=altdest@entry=0x7f1cdbf8c5c8, completionTag=0x7ffc999633b0 "") at pquery.c:796
    #11 0x0000557a89fcbd45 in exec_simple_query (
        query_string=0x557a8afd4c40 "BEGIN;\n CREATE TEMP TABLE foobar(a x, b y, c bigint, d e, f bigint, g bigint) ON COMMIT DROP;\n COPY foobar(a, b"...)
        at postgres.c:1215
    #12 0x0000557a89fcd5b1 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x557a8b0134c8, dbname=<optimized out>,
        username=<optimized out>) at postgres.c:4236
    #13 0x0000557a89f578f0 in BackendRun (port=0x557a8b011290, port=0x557a8b011290) at postmaster.c:4437
    #14 BackendStartup (port=0x557a8b011290) at postmaster.c:4128
    #15 ServerLoop () at postmaster.c:1704
    #16 0x0000557a89f58850 in PostmasterMain (argc=5, argv=0x557a8afceae0) at postmaster.c:1377
    #17 0x0000557a89cd66ba in main (argc=5, argv=0x557a8afceae0) at main.c:228

The actual line it crashes at is (my sources are a bit newer than the backtrace, so the line number doesn't match exactly):

File: src/backend/executor/execMain.c

    2721:    /*
    2722:     * Force evaluation of any InitPlan outputs that could be needed
    2723:     * by the subplan, just in case they got reset since
    2724:     * EvalPlanQualStart (see comments therein).
    2725:     */
    2726:    ExecSetParamPlanMulti(rcplanstate->plan->extParam,
    2727:                          GetPerTupleExprContext(parentestate));

rcplanstate->plan was NULL during the crash.

Poking into this, we found the following anomaly:

    (gdb) p *epqstate
    $2 = {parentestate = 0x557a8c3bb0f0, epqParam = 0, tuple_table = 0x557aa2e81918,
          relsubs_slot = 0x557a8c3ea3e8, plan = 0x7f1cdbf30798, arowMarks = 0x557a8c532208,
          origslot = 0x557a8c50f4d8, recheckestate = 0x557a8be7b7c0,
          relsubs_rowmark = 0x557a8b0ae5d0, relsubs_done = 0x557a8b53b9f8,
          recheckplanstate = 0x557a8b53b9f8}

(Notice that relsubs_done = 0x557a8b53b9f8 and recheckplanstate = 0x557a8b53b9f8 share the same address, while being completely different fields.)

It looks like

File: src/backend/executor/execMain.c

    2714:    MemSet(epqstate->relsubs_done, 0, rtsize * sizeof(bool));

zeroed out not just relsubs_done but also recheckplanstate, as it pointed to the same address, causing the segfault.
Our theory is that the underlying context the epqstate members are allocated in gets reset at some point (perhaps when finishing the update of a single partition), but the corresponding struct fields are not zeroed out, so they are reused while the memory they point to is allocated elsewhere, most likely for the next struct member, causing the condition above.

The backtrace from the attempt to reproduce it (with asserts enabled):

    (gdb) bt
    #0  0x00007f90d113df25 in raise () from /usr/lib/libc.so.6
    #1  0x00007f90d1127897 in abort () from /usr/lib/libc.so.6
    #2  0x000055ba146697f5 in ExceptionalCondition (
        conditionName=0x55ba148423e0 "!(epqstate->relsubs_rowmark[scanrelid - 1] == ((void *)0))",
        errorType=0x55ba14842399 "FailedAssertion", fileName=0x55ba148423cf "execScan.c", lineNumber=89) at assert.c:54
    #3  0x000055ba142cd24a in ExecScanFetch (node=0x55ba1661ff00, accessMtd=0x55ba14300f8c <SeqNext>,
        recheckMtd=0x55ba14301031 <SeqRecheck>) at execScan.c:89
    #4  0x000055ba142cd420 in ExecScan (node=0x55ba1661ff00, accessMtd=0x55ba14300f8c <SeqNext>,
        recheckMtd=0x55ba14301031 <SeqRecheck>) at execScan.c:200
    #5  0x000055ba1430107f in ExecSeqScan (pstate=0x55ba1661ff00) at nodeSeqscan.c:112
    #6  0x000055ba142caea0 in ExecProcNodeFirst (node=0x55ba1661ff00) at execProcnode.c:445
    #7  0x000055ba142bf78f in ExecProcNode (node=0x55ba1661ff00) at ../../../src/include/executor/executor.h:239
    #8  0x000055ba142c3ef6 in EvalPlanQualNext (epqstate=0x55ba16605638) at execMain.c:2695
    #9  0x000055ba142c38b2 in EvalPlanQual (epqstate=0x55ba16605638, relation=0x7f904c2f1918, rti=4,
        inputslot=0x55ba1661aa50) at execMain.c:2467
    #10 0x000055ba142fb8ab in ExecUpdate (mtstate=0x55ba16605540, tupleid=0x7ffe17d1d02a, oldtuple=0x0,
        slot=0x55ba16615510, planSlot=0x55ba16612df0, epqstate=0x55ba16605638, estate=0x55ba16604e68,
        canSetTag=true) at nodeModifyTable.c:1387
    #11 0x000055ba142fcee3 in ExecModifyTable (pstate=0x55ba16605540) at nodeModifyTable.c:2223
    #12 0x000055ba142caea0 in ExecProcNodeFirst (node=0x55ba16605540) at execProcnode.c:445
    #13 0x000055ba142bf78f in ExecProcNode (node=0x55ba16605540) at ../../../src/include/executor/executor.h:239
    #14 0x000055ba142c2167 in ExecutePlan (estate=0x55ba16604e68, planstate=0x55ba16605540, use_parallel_mode=false,
        operation=CMD_UPDATE, sendTuples=false, numberTuples=0, direction=ForwardScanDirection, dest=0x7f904c2c4b00,
        execute_once=true) at execMain.c:1646
    #15 0x000055ba142bfe08 in standard_ExecutorRun (queryDesc=0x55ba16604a58, direction=ForwardScanDirection, count=0,
        execute_once=true) at execMain.c:364
    #16 0x000055ba142bfc29 in ExecutorRun (queryDesc=0x55ba16604a58, direction=ForwardScanDirection, count=0,
        execute_once=true) at execMain.c:308
    #17 0x000055ba144db8c1 in ProcessQuery (plan=0x7f904c2c4a08, sourceText=0x55ba1651a6b8 "UPDATE p i\n SET flag = true;",
        params=0x0, queryEnv=0x0, dest=0x7f904c2c4b00, completionTag=0x7ffe17d1d4a0 "") at pquery.c:161
    #18 0x000055ba144dd385 in PortalRunMulti (portal=0x55ba16583028, isTopLevel=true, setHoldSnapshot=false,
        dest=0x7f904c2c4b00, altdest=0x7f904c2c4b00, completionTag=0x7ffe17d1d4a0 "") at pquery.c:1283
    #19 0x000055ba144dc8c5 in PortalRun (portal=0x55ba16583028, count=9223372036854775807, isTopLevel=true, run_once=true,
        dest=0x7f904c2c4b00, altdest=0x7f904c2c4b00, completionTag=0x7ffe17d1d4a0 "") at pquery.c:796
    #20 0x000055ba144d5f20 in exec_simple_query (query_string=0x55ba1651a6b8 "UPDATE p i\n SET flag = true;")
        at postgres.c:1215
    #21 0x000055ba144da82f in PostgresMain (argc=1, argv=0x55ba16546c20, dbname=0x55ba16546b00 "shard4",
        username=0x55ba16516d78 "ildus") at postgres.c:4236
    #22 0x000055ba14423fb7 in BackendRun (port=0x55ba16539850) at postmaster.c:4437
    #23 0x000055ba144236b6 in BackendStartup (port=0x55ba16539850) at postmaster.c:4128
    #24 0x000055ba1441f739 in ServerLoop () at postmaster.c:1704
    #25 0x000055ba1441eec5 in PostmasterMain (argc=1, argv=0x55ba16514d30) at postmaster.c:1377
    #26 0x000055ba1433588d in main (argc=1, argv=0x55ba16514d30) at main.c:228

We (Ildus Kurbengaliev
and I) came up with a tentative fix of zeroing out the epqstate members at EvalPlanQualEnd, namely:

    diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
    index ac92340af5..b37023c46e 100644
    --- a/src/backend/executor/execMain.c
    +++ b/src/backend/executor/execMain.c
    @@ -2996,4 +2996,6 @@ EvalPlanQualEnd(EPQState *epqstate)
         epqstate->recheckestate = NULL;
         epqstate->recheckplanstate = NULL;
         epqstate->origslot = NULL;
    +    epqstate->relsubs_done = NULL;
    +    epqstate->relsubs_rowmark = NULL;
     }

With the patch above I haven't experienced any crashes on my test setup.

This is likely related to the recent crash Julian Schauder reported at
https://www.postgresql.org/message-id/flat/16219-578ae955109d9fd2%40postgresql.org
and possibly to
https://www.postgresql.org/message-id/16184-2f718844be10c863%40postgresql.org

Regards,
Oleksii "Alex" Kliukin