Thread: BUG #5269: postgres backend terminates with SIGSEGV
The following bug has been logged online:

Bug reference:      5269
Logged by:          Justin Pitts
Email address:      justinpitts@gmail.com
PostgreSQL version: 8.4.2
Operating system:   Debian Lenny 2.6.30-bpo.1-amd-64 kernel
Description:        postgres backend terminates with SIGSEGV

Details:

I originally experienced this on 8.4.1. The postgres backend terminates with SIGSEGV. I can reproduce this behavior only about 30% of the time by manipulating my application. The application performs a steady stream of inserts all the time, and the provoking action is a rather complex report. I am including a backtrace. Sanitizing a simple, clear repro case will take us some time, and it may be easier for us to host a duplicate machine for someone to ssh into. Please let me know if that is reasonable, or how else I can assist.

Core was generated by `postgres: ssjiddwdbusr ssjiddwdb 192.168.20.35(46628) S'.
Program terminated with signal 11, Segmentation fault.
[New process 9564]
#0  0x00007f1c43e0eee5 in memcpy () from /lib/libc.so.6
(gdb) bt
#0  0x00007f1c43e0eee5 in memcpy () from /lib/libc.so.6
#1  0x00000000006bf259 in CopySnapshot (snapshot=0x167db40) at snapmgr.c:231
#2  0x00000000006bf36d in PushActiveSnapshot (snap=0x167db40) at snapmgr.c:276
#3  0x0000000000560443 in _SPI_execute_plan (plan=0x16bfe20, paramLI=0x16507a0, snapshot=0x0, crosscheck_snapshot=0x0, read_only=0 '\0', fire_triggers=1 '\001', tcount=2) at spi.c:1797
#4  0x0000000000560a96 in SPI_execute_plan (plan=0x16bfe20, Values=0x16cfec0, Nulls=0x16cfed8 " ", read_only=0 '\0', tcount=2) at spi.c:392
#5  0x00007f1c01dcdcda in exec_run_select (estate=0x7fffe98830e0, expr=0x1674708, maxtuples=2, portalP=0x0) at pl_exec.c:4149
#6  0x00007f1c01dcdfd2 in exec_eval_expr (estate=0x7fffe98830e0, expr=0x1674708, isNull=0x7fffe9882e6f "", rettype=0x7fffe9882e68) at pl_exec.c:4067
#7  0x00007f1c01dd0223 in exec_assign_expr (estate=0x16920f0, target=0x16cfda0, expr=0x0) at pl_exec.c:3428
#8  0x00007f1c01dd1eb2 in exec_stmts (estate=0x7fffe98830e0, stmts=<value optimized out>) at pl_exec.c:1345
#9  0x00007f1c01dd3216 in exec_stmt_block (estate=0x7fffe98830e0, block=0x1675f60) at pl_exec.c:1137
#10 0x00007f1c01dd47fc in plpgsql_exec_function (func=0x16c2b60, fcinfo=0x7fffe9883350) at pl_exec.c:315
#11 0x00007f1c01dca43e in plpgsql_call_handler (fcinfo=0x7fffe9883350) at pl_handler.c:95
#12 0x00000000005493e6 in ExecMakeFunctionResult (fcache=0x167dd00, econtext=0x167dbf0, isNull=0x167e9d8 "", isDone=0x167eaf0) at execQual.c:1685
#13 0x000000000054484e in ExecProject (projInfo=<value optimized out>, isDone=0x7fffe988383c) at execQual.c:5007
#14 0x0000000000556e49 in ExecResult (node=0x167dae0) at nodeResult.c:155
#15 0x0000000000543c5d in ExecProcNode (node=0x167dae0) at execProcnode.c:344
#16 0x00000000005418a2 in standard_ExecutorRun (queryDesc=0x16bed10, direction=ForwardScanDirection, count=0) at execMain.c:1504
#17 0x00000000005ed677 in PortalRunSelect (portal=0x164a780, forward=<value optimized out>, count=0, dest=0x1602cf0) at pquery.c:953
#18 0x00000000005eea29 in PortalRun (portal=0x164a780, count=9223372036854775807, isTopLevel=1 '\001', dest=0x1602cf0, altdest=0x1602cf0, completionTag=0x7fffe9883cd0 "") at pquery.c:779
#19 0x00000000005eb790 in PostgresMain (argc=4, argv=<value optimized out>, username=0x1563c40 "ssjiddwdbusr") at postgres.c:1928
#20 0x00000000005bfe08 in ServerLoop () at postmaster.c:3449
#21 0x00000000005c0b97 in PostmasterMain (argc=3, argv=0x1561240) at postmaster.c:1040
#22 0x000000000056a558 in main (argc=3, argv=0x1561240) at main.c:188

I notice bug #5238, but I have no idea if it is related. I'll take a hint from it however:

(gdb) p (char *) debug_query_string
$1 = 0x166e190 "SELECT \"datawarehouse\".spr_update_device($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)"

This is somewhat unexpected. I expected the 'complex report' to be here, but this is a writer process which manages a data warehouse schema. I know more information is needed. I am hoping this is helpful in the meantime.
"Justin Pitts" <justinpitts@gmail.com> writes:
> [ crash in CopySnapshot ]

This is a new one I think. Alvaro fixed some snapshot management issues back in October, but if you're seeing it on 8.4.2 then that didn't fix it :-(

> I am including a backtrace. Sanitizing a simple, clear repro case will take
> us some time, and it may be easier for us to host a duplicate machine for
> someone to ssh into. Please let me know if that is reasonable, or how else I
> can assist.

The trace is definitely not sufficient info by itself. If you'd rather make a machine available than extract a test case, we can work with that. Contact me or Alvaro offlist to work out details.

			regards, tom lane
Justin Pitts <justinpitts@gmail.com> writes:
> It went seriously heisenbug since I added %p. It took all day to crash it
> since I added that to the configuration. Except this latest crash is
> completely different - signal 6 ABRT, different place.

Specifically:

2010-01-13 01:50:42 UTC 5455 ERROR:  snapshot reference 0xba6ac0 is not owned by resource owner TopTransaction
2010-01-13 01:50:42 UTC 5455 STATEMENT:  COMMIT
2010-01-13 01:50:42 UTC 5455 WARNING:  AbortTransaction while in COMMIT state
2010-01-13 01:50:42 UTC 5455 PANIC:  cannot abort transaction 777554, it was already committed

This looks different, but it matches what I had concluded from the previous dumps: snapmgr.c thinks it has a valid serializable snapshot, but it doesn't. In your previous crashes, CurrentSnapshot was pointing at trashed memory; here, it's still a valid-looking snapshot, but it's not registered in the TopTransactionResourceOwner like it's supposed to be (it wasn't in the other cases either).

After puzzling over this for many hours, I have a theory that seems to fit the facts. What I think must be happening is that GetTransactionSnapshot is getting called outside any transaction --- that is, after AtEOXact_Snapshot but before CleanupTransaction (thus, the old TopTransactionResourceOwner is still there, so its attempt to register the snap doesn't dump core). Your build isn't assert-enabled so the extra snapshot registration wouldn't be noticed by the eventual ResourceOwnerDelete. Then, when we next start a transaction and request a serializable snapshot, GetTransactionSnapshot thinks it has nothing to do and happily hands back a pointer to garbage.

Now, how the heck did GetTransactionSnapshot get called that way?
Your log shows that you're using client software that likes to abort failed transactions with a previously prepared statement:

2010-01-13 01:50:41 UTC 5455 LOG:  execute S_2: ROLLBACK

This means that to get to CleanupTransaction, we have to look up a cached plan while in aborted transaction state. And that will call RevalidateCachedPlan, which does this if the plan is marked dead:

	/*
	 * If a snapshot is already set (the normal case), we can just use
	 * that for parsing/planning.  But if it isn't, install one.  Note: no
	 * point in checking whether parse analysis requires a snapshot;
	 * utility commands don't have invalidatable plans, so we'd not get
	 * here for such a command.
	 */
	if (!ActiveSnapshotSet())
	{
		PushActiveSnapshot(GetTransactionSnapshot());
		snapshot_set = true;
	}

As the comment indicates, it should not be possible to get here for a ROLLBACK plan. But look what got added in 8.4: ResetPlanCache(), which marks plans dead *unconditionally*. And one of the ways that that can get invoked is via a SI queue overrun event, which triggers a general cache flush and in particular calls PlanCacheSysCallback().

So what is needed to make this scenario happen is a SI queue overrun, which isn't a common event unless your application does a lot of catalog manipulations. I suppose that the reason your reporting query manages to tickle it is that it creates/deletes temp tables. And even then, you'd need an actual failure on the side receiving the overrun notification. In the cases at hand that failure is

2010-01-13 01:50:41 UTC 5455 ERROR:  could not serialize access due to concurrent update

which doesn't happen real often in your app, but does happen.

So this theory explains what I found in the core dumps, which is otherwise dang hard to explain (I spent a lot of time looking for code paths that could bypass AtEOXact_Snapshot, but there aren't any). It explains why the failure is uncommon.
It explains why the failures come in bunches (since a SI overrun is broadcast, and would thereby poison all active sessions with "dead" ROLLBACK plans; and the subsequent serialization errors could easily come in bunches too). It explains why we haven't seen similar stuff in testing --- preparing something as trivial as a ROLLBACK is pretty silly so nobody does it. And it explains why we didn't see it before 8.4 --- the bug was in fact introduced here:

http://archives.postgresql.org/pgsql-committers/2008-09/msg00084.php

So even though I haven't actually reproduced the problem, I have a fair amount of confidence in the explanation.

We could perhaps fix this by changing the above-quoted code, but I'm inclined to teach ResetPlanCache that utility statements should never be marked dead. They don't have plans in the normal sense of the word, so there's no point in forcing a revalidation cycle.

I'm kinda beat right now but will work on a patch tomorrow.

			regards, tom lane
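[Editorial note: the ordering hazard described above (GetTransactionSnapshot called after AtEOXact_Snapshot has cleared snapshot state, but before CleanupTransaction frees the memory behind it) can be illustrated with a self-contained toy model. The sketch below is not PostgreSQL code; the method names only echo snapmgr.c, and "liveMemory" is a stand-in for per-transaction memory. A flag records that the transaction's serializable snapshot was already taken; end-of-transaction cleanup resets it; a snapshot request that lands in the bug window re-arms the flag against memory that is about to vanish, so the next transaction gets a dangling pointer back.]

```java
import java.util.HashSet;
import java.util.Set;

/*
 * Toy model of the snapshot-lifetime hazard (not PostgreSQL code; names
 * only echo snapmgr.c). "liveMemory" stands in for per-transaction memory
 * that CleanupTransaction would free.
 */
public class SnapmgrModel {
    String currentSnapshot;               // stands in for CurrentSnapshot
    boolean firstSnapshotSet;             // stands in for FirstSnapshotSet
    Set<String> liveMemory = new HashSet<>();
    int counter;

    String getTransactionSnapshot() {
        if (!firstSnapshotSet) {
            currentSnapshot = "snap-" + (counter++);
            liveMemory.add(currentSnapshot);   // "allocated" in xact memory
            firstSnapshotSet = true;
        }
        return currentSnapshot;  // stale if the last call was outside a xact
    }

    void atEOXactSnapshot() {             // part of transaction shutdown
        currentSnapshot = null;
        firstSnapshotSet = false;
    }

    void cleanupTransaction() {           // frees per-transaction memory
        liveMemory.clear();
    }

    public static void main(String[] args) {
        SnapmgrModel m = new SnapmgrModel();
        m.getTransactionSnapshot();       // normal transaction takes a snapshot
        m.atEOXactSnapshot();             // transaction end: snapshot state cleared
        // Bug window: revalidating a "dead" prepared ROLLBACK requests a
        // snapshot after snapshot cleanup but before memory release.
        m.getTransactionSnapshot();
        m.cleanupTransaction();           // that snapshot's memory is now gone
        // Next transaction: the flag is still set, so the stale pointer
        // comes straight back.
        String snap = m.getTransactionSnapshot();
        System.out.println("dangling: " + !m.liveMemory.contains(snap));  // prints "dangling: true"
    }
}
```

In the real server, that dangling pointer is what CopySnapshot then reads from, matching frame #1 of the backtrace at the top of the thread.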
On Tue, 12 Jan 2010, Tom Lane wrote:
> preparing something as trivial as a ROLLBACK is pretty silly so nobody
> does it.

Actually the JDBC driver has been preparing BEGIN, COMMIT, and ROLLBACK since the 8.0 release.

http://archives.postgresql.org/pgsql-jdbc/2006-10/msg00149.php

Kris Jurka
I wrote:
> After puzzling over this for many hours, I have a theory that seems to
> fit the facts.

I think the attached patch will fix it for you --- please test.

			regards, tom lane

Index: src/backend/utils/cache/plancache.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/cache/plancache.c,v
retrieving revision 1.27.2.1
diff -c -r1.27.2.1 plancache.c
*** src/backend/utils/cache/plancache.c	14 Jul 2009 15:37:55 -0000	1.27.2.1
--- src/backend/utils/cache/plancache.c	13 Jan 2010 16:46:07 -0000
***************
*** 1050,1063 ****
  void
  ResetPlanCache(void)
  {
! 	ListCell   *lc;
  
! 	foreach(lc, cached_plans_list)
  	{
! 		CachedPlanSource *plansource = (CachedPlanSource *) lfirst(lc);
  		CachedPlan *plan = plansource->plan;
  
! 		if (plan)
! 			plan->dead = true;
  	}
  }
--- 1050,1106 ----
  void
  ResetPlanCache(void)
  {
! 	ListCell   *lc1;
  
! 	foreach(lc1, cached_plans_list)
  	{
! 		CachedPlanSource *plansource = (CachedPlanSource *) lfirst(lc1);
  		CachedPlan *plan = plansource->plan;
+ 		ListCell   *lc2;
  
! 		/* No work if it's already invalidated */
! 		if (!plan || plan->dead)
! 			continue;
! 
! 		/*
! 		 * We *must not* mark transaction control statements as dead,
! 		 * particularly not ROLLBACK, because they may need to be executed in
! 		 * aborted transactions when we can't revalidate them (cf bug #5269).
! 		 * In general there is no point in invalidating utility statements
! 		 * since they have no plans anyway.  So mark it dead only if it
! 		 * contains at least one non-utility statement.
! 		 */
! 		if (plan->fully_planned)
! 		{
! 			/* Search statement list for non-utility statements */
! 			foreach(lc2, plan->stmt_list)
! 			{
! 				PlannedStmt *plannedstmt = (PlannedStmt *) lfirst(lc2);
! 
! 				Assert(!IsA(plannedstmt, Query));
! 				if (IsA(plannedstmt, PlannedStmt))
! 				{
! 					/* non-utility statement, so invalidate */
! 					plan->dead = true;
! 					break;		/* out of stmt_list scan */
! 				}
! 			}
! 		}
! 		else
! 		{
! 			/* Search Query list for non-utility statements */
! 			foreach(lc2, plan->stmt_list)
! 			{
! 				Query	   *query = (Query *) lfirst(lc2);
! 
! 				Assert(IsA(query, Query));
! 				if (query->commandType != CMD_UTILITY)
! 				{
! 					/* non-utility statement, so invalidate */
! 					plan->dead = true;
! 					break;		/* out of stmt_list scan */
! 				}
! 			}
! 		}
  	}
  }
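[Editorial note: the decision rule in the patch can be mimicked in a few lines of standalone code. This is a sketch with hypothetical names, not the server implementation: on a cache reset, a plan is marked dead only if it contains at least one non-utility statement, so a prepared ROLLBACK or COMMIT keeps its "plan" and never needs revalidation inside an aborted transaction.]

```java
import java.util.List;

/*
 * Standalone sketch of the invalidation rule in the plancache.c patch
 * (hypothetical names; not PostgreSQL code). Utility statements such as
 * ROLLBACK have nothing to replan, so a cache reset leaves them alone.
 */
public class PlanCacheSketch {
    enum StmtKind { UTILITY, PLANNED }

    static final class CachedPlan {
        final List<StmtKind> stmtList;
        boolean dead;
        CachedPlan(StmtKind... stmts) { this.stmtList = List.of(stmts); }
    }

    static void resetPlanCache(List<CachedPlan> cache) {
        for (CachedPlan plan : cache) {
            if (plan == null || plan.dead)
                continue;                  // no work if already invalidated
            // Mark dead only if some statement actually has a plan worth
            // revalidating; pure utility plans (ROLLBACK etc.) stay valid.
            for (StmtKind stmt : plan.stmtList) {
                if (stmt == StmtKind.PLANNED) {
                    plan.dead = true;
                    break;
                }
            }
        }
    }

    public static void main(String[] args) {
        CachedPlan rollback = new CachedPlan(StmtKind.UTILITY);  // e.g. prepared ROLLBACK
        CachedPlan query = new CachedPlan(StmtKind.PLANNED);     // e.g. prepared SELECT
        resetPlanCache(List.of(rollback, query));
        System.out.println("rollback dead: " + rollback.dead);   // prints "rollback dead: false"
        System.out.println("query dead: " + query.dead);         // prints "query dead: true"
    }
}
```

Before the patch, the reset marked every cached plan dead unconditionally, which is the one-line `if (plan) plan->dead = true;` on the old side of the diff.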
Sorry for the delay.

I am attempting to construct a JDBC test case that reproduces the problem.

I have installed the patch and have not seen the crash since.

On Jan 13, 2010, at 11:58 AM, Tom Lane wrote:
> I wrote:
>> After puzzling over this for many hours, I have a theory that seems to
>> fit the facts.
>
> I think the attached patch will fix it for you --- please test.
>
> 			regards, tom lane
>
> [ plancache.c patch quoted in full ]
Justin Pitts <jpitts@bplglobal.net> writes:
> My guess is that I am not provoking a 'SI queue overrun'

The 100 temp table creations probably will do that just fine.

> Am I completely off base about how this should be reproducing?

Two points: the session you hope to have crash *must* be in serializable mode, and the crash would actually happen in the transaction after the one that's rolled back. The error doesn't have to be a serialization error, so in principle you should be able to make it fail with something as simple as

	begin;
	select 1/0;
	rollback;
	select * from foo;

as long as the ROLLBACK is done with a prepared statement and you've forced a SI overrun since the ROLLBACK was prepared.

			regards, tom lane
BTW --- from my own perspective, what is much more interesting is testing time on your real application to make sure that you don't see a crash any more. I am certain that I identified and fixed *a* bug; what I'm less certain about is whether it is actually what you were hitting. The previous advice would help you reproduce the bug that I fixed, but ...

			regards, tom lane
Apologies for over-quoting on the previous message.

My approach to a regression test was this single-threaded test (actual code at bottom):

Connection 1 - implicitly autocommitting every statement; create a test table and populate it with a single row:

	CREATE TABLE FOO ( a int not null );
	INSERT INTO FOO VALUES (1);

Set the connection autocommit to false (implied transaction begin on next statement), read:

	SELECT * FROM FOO;

then modify:

	UPDATE FOO SET a=a+1;

(transaction is still open)

Connection 2 - set the connection autocommit to false, read:

	SELECT * FROM FOO;

Connection 1:

	commit;

Connection 2 - modify:

	UPDATE FOO SET a=a+1;

throws a concurrent serialization error.

Connection 3 - do a bunch of catalog updates:

	for f in 1..???  (I tried 1, 10, 100, and 4096)
	    CREATE TEMP TABLE BAR$f (a int not null );

Connection 2:

	rollback;   <- I am expecting a server crash here.

It does not crash. My guess is that I am not provoking a 'SI queue overrun' (I am assuming this is a reference to the queue implemented in sinvaladt.c. My C skills are rusty at best.)

Am I completely off base about how this should be reproducing?
Bug5269Test.java

import java.sql.Connection;
import java.sql.Driver;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Properties;

/* Regression test for PostgreSQL Bug #5269
 * http://archives.postgresql.org/pgsql-bugs/2010-01/msg00080.php
 */
public class Bug5269Test {

    private Connection newConnection(Driver driver) {
        Properties info = new Properties();
        info.put("user", "postgres");
        try {
            return driver.connect("jdbc:postgresql://localhost:5433", info);
        } catch (SQLException e) {
            e.printStackTrace();
        }
        return null;
    }

    public void test() throws SQLException, InterruptedException {
        Driver driver = new org.postgresql.Driver();
        Connection a = newConnection(driver);
        setupTable(a);
        Connection b = newConnection(driver);
        a.setAutoCommit(false);
        a.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);
        b.setAutoCommit(false);
        b.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);
        String select = "select * from foo";
        String update = "update foo set a = a + 1";
        a.prepareStatement(select).execute();
        System.out.println("select a");
        a.prepareStatement(update).execute();
        System.out.println("update a");
        b.prepareStatement(select).execute();
        System.out.println("select b");
        a.commit();
        System.out.println("commit a");
        try {
            b.prepareStatement(update).execute();
            b.commit();
            throw new RuntimeException("Expected a serialization failure.");
        } catch (SQLException sqle) {
            System.out.println(sqle.getMessage());
            flogCatalogs(driver);
            System.out.println("flog catalogs");
            // The hope is that an unpatched system would crash here.
            b.rollback();
            System.out.println("rollback.");
        }
        a.close();
        b.close();
    }

    private void setupTable(Connection a) throws SQLException {
        try {
            a.prepareStatement("drop table foo").execute();
        } catch (Exception e) {
            e.printStackTrace();
        }
        a.prepareStatement("create table foo(a int not null)").executeUpdate();
        a.prepareStatement("insert into foo values (1);").executeUpdate();
    }

    private void flogCatalogs(final Driver driver) throws SQLException {
        final Connection conn = newConnection(driver);
        try {
            conn.setAutoCommit(false);
            Statement ps = conn.createStatement();
            String a = "create temp table bar";
            String b = " (a int not null); ";
            for (int f = 0; f < 128; f++) {
                StringBuilder sb = new StringBuilder();
                sb.append(a).append(f).append(b);
                ps.execute(sb.toString());
            }
            conn.commit();
        } catch (SQLException e) {
            e.printStackTrace();
        } finally {
            try {
                conn.close();
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }
    }

    public static void main(String... args) throws SQLException, InterruptedException {
        Bug5269Test test = new Bug5269Test();
        test.test();
    }
}

On Jan 13, 2010, at 10:22 PM, Justin Pitts wrote:
> Sorry for the delay.
>
> I am attempting to construct a JDBC test case that reproduces the problem.
>
> I have installed the patch and have not seen the crash since.
>
> On Jan 13, 2010, at 11:58 AM, Tom Lane wrote:
Justin Pitts <jpitts@bplglobal.net> writes:
> On Jan 14, 2010, at 10:16 AM, Tom Lane wrote:
>> The 100 temp table creations probably will do that just fine.

> Is there a way to verify this?

You could add an elog(LOG, "message") into ResetPlanCache so you could tell when it had been called.

> I don't follow. Are you suggesting I begin another transaction on connection 1
> with a read, and that would provoke the crash?

Yes. The rollback only sets the stage for the next transaction to try to use a snapshot that isn't there anymore.

			regards, tom lane
As difficult as it was to reproduce the bug, I'd like a clear-cut regression test. The use case where it manifested is fairly rare in normal use - server start after a prolonged outage, where the queue of inbound device events is built up enough to make serialization errors more likely.

On Jan 14, 2010, at 10:19 AM, Tom Lane wrote:
> BTW --- from my own perspective, what is much more interesting is
> testing time on your real application to make sure that you don't
> see a crash any more.  I am certain that I identified and fixed
> *a* bug, what I'm less certain about is that it is actually what
> you were hitting.  The previous advice would help you reproduce
> the bug that I fixed, but ...
>
> 			regards, tom lane
On Jan 14, 2010, at 10:16 AM, Tom Lane wrote:
> Justin Pitts <jpitts@bplglobal.net> writes:
>> My guess is that I am not provoking a 'SI queue overrun'
>
> The 100 temp table creations probably will do that just fine.

Is there a way to verify this?

>> Am I completely off base about how this should be reproducing?
>
> Two points: the session you hope to have crash *must* be in serializable
> mode,

The 2 competing sessions doing the read/modify sequence on foo are set to SERIALIZABLE.

> and the crash would actually happen in the transaction after the
> one that's rolled back.

I don't follow. Are you suggesting I begin another transaction on connection 1 with a read, and that would provoke the crash?

> The error doesn't have to be a serialization error, so in principle
> you should be able to make it fail with something as simple as
>
> 	begin;
> 	select 1/0;
> 	rollback;
> 	select * from foo;
>
> as long as the ROLLBACK is done with a prepared statement and you've
> forced a SI overrun since the ROLLBACK was prepared.
>
> 			regards, tom lane
On Jan 14, 2010, at 10:44 AM, Tom Lane wrote:
> Justin Pitts <jpitts@bplglobal.net> writes:
>> On Jan 14, 2010, at 10:16 AM, Tom Lane wrote:
>>> The 100 temp table creations probably will do that just fine.
>
>> Is there a way to verify this?
>
> You could add an elog(LOG, "message") into ResetPlanCache so you could
> tell when it had been called.

Done. Sometimes I see it, sometimes not.

>> I don't follow. Are you suggesting I begin another transaction on connection 1
>> with a read, and that would provoke the crash?
>
> Yes.  The rollback only sets the stage for the next transaction to try
> to use a snapshot that isn't there anymore.

Oh, duh. A read from the same session that rolled back. That didn't get it working (failing?) however.

Running concurrent instances of this test reliably provokes the crash on un-patched 8.4.2. They do not provoke a crash with the patch.

That's what I was looking for. Thanks!

> 			regards, tom lane