Thread: Assertion failure in AtCleanup_Portals
If I run the following sequence of commands, I get an assertion failure in current HEAD. postgres=# BEGIN; BEGIN postgres=# SELECT 1/0; ERROR: division by zero postgres=# ROLLBACK TO A; ERROR: no such savepoint postgres=# \q The process fails when the session is closed and aborted transaction gets cleaned at the proc_exit time. The stack trace is as below. (gdb) bt #0 0x00c16422 in __kernel_vsyscall () #1 0x00244651 in *__GI_raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 #2 0x00247a82 in *__GI_abort () at abort.c:92 #3 0x0844b800 in ExceptionalCondition (conditionName=0x86161e8 "!(portal->cleanup == ((void *)0))", errorType=0x8615e94 "FailedAssertion", fileName=0x8615e88 "portalmem.c", lineNumber=793) at assert.c:57 #4 0x08472a5a in AtCleanup_Portals () at portalmem.c:793 #5 0x080f2535 in CleanupTransaction () at xact.c:2373 #6 0x080f40d7 in AbortOutOfAnyTransaction () at xact.c:3855 #7 0x0845cc5e in ShutdownPostgres (code=0, arg=0) at postinit.c:966 #8 0x08332ba7 in shmem_exit (code=0) at ipc.c:221 #9 0x08332ab3 in proc_exit_prepare (code=0) at ipc.c:181 #10 0x08332a15 in proc_exit (code=0) at ipc.c:96 #11 0x0835b07c in PostgresMain (argc=2, argv=0xa12fa64, username=0xa12f958 "pavan") at postgres.c:4091 I analyzed this a bit. It seems that the first statement throws an error, runs AbortCurrentTransaction and puts the transaction block in TBLOCK_ABORT state. Any commands executed after this would usually fail with error "current transaction is aborted". But the ROLLBACK TO gets past this check because IsTransactionExitStmt() returns success (and rightly so). So we end up creating a portal to run the ROLLBACK TO command. This command fails because the SAVEPOINT A does not exist and we throw an error. But since the transaction block is already in TBLOCK_ABORT state, the AbortCurrentTransaction doesn't do any work. The cleanup routine for the portal remains unexecuted. Later when the backend exits and cleanup routines are called, the assertion inside AtCleanup_Portals fails. I am not sure what is the right fix for this. I see Tom fixed a similar complain sometime back. http://git.postgresql.org/pg/commitdiff/6252c4f9e201f619e5eebda12fa867acd4e4200e But may its not enough to cover all code paths, as demonstrated by this example. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Pavan Deolasee <pavan.deolasee@gmail.com> writes: > If I run the following sequence of commands, I get an assertion > failure in current HEAD. > postgres=# BEGIN; > BEGIN > postgres=# SELECT 1/0; > ERROR: division by zero > postgres=# ROLLBACK TO A; > ERROR: no such savepoint > postgres=# \q > The process fails when the session is closed and aborted transaction > gets cleaned at the proc_exit time. The stack trace is as below. Hmm. It works fine if you issue an actual ROLLBACK command there, so my gut reaction is that AbortOutOfAnyTransaction isn't sufficiently duplicating the full-fledged ROLLBACK code path. No time to dig further right now though. regards, tom lane
On Tue, Feb 7, 2012 at 3:03 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Pavan Deolasee <pavan.deolasee@gmail.com> writes: >> If I run the following sequence of commands, I get an assertion >> failure in current HEAD. > >> postgres=# BEGIN; >> BEGIN >> postgres=# SELECT 1/0; >> ERROR: division by zero >> postgres=# ROLLBACK TO A; >> ERROR: no such savepoint >> postgres=# \q > >> The process fails when the session is closed and aborted transaction >> gets cleaned at the proc_exit time. The stack trace is as below. > > Hmm. It works fine if you issue an actual ROLLBACK command there, > so my gut reaction is that AbortOutOfAnyTransaction isn't sufficiently > duplicating the full-fledged ROLLBACK code path. No time to dig further > right now though. > It works OK for a ROLLBACK command because we create a new unnamed portal for the ROLLBACK command, silently dropping the old one if it already exists. Since the ROLLBACK command then runs successfully, we don't see the same assertion. Would it be safe to drop FAILED unnamed portals during AbortOutAnyTransaction ? May be it is if we can do that before creating a new portal for ROLLBACK command itself. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Pavan Deolasee <pavan.deolasee@gmail.com> writes: > On Tue, Feb 7, 2012 at 3:03 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Hmm. It works fine if you issue an actual ROLLBACK command there, >> so my gut reaction is that AbortOutOfAnyTransaction isn't sufficiently >> duplicating the full-fledged ROLLBACK code path. No time to dig further >> right now though. > It works OK for a ROLLBACK command because we create a new unnamed > portal for the ROLLBACK command, silently dropping the old one if it > already exists. Since the ROLLBACK command then runs successfully, we > don't see the same assertion. Would it be safe to drop FAILED unnamed > portals during AbortOutAnyTransaction ? May be it is if we can do that > before creating a new portal for ROLLBACK command itself. I poked at this some more, and noticed another dangerous-seeming issue: at the time PortalCleanup is called, if it does get called, the portal's stmts and sourceText are already pointing at garbage. The reason is that exec_simple_query blithely does this: /* * We don't have to copy anything into the portal, because everything * we are passing here is inMessageContext, which will outlive the * portal anyway. */ PortalDefineQuery(portal, NULL, query_string, commandTag, plantree_list, NULL); That comment is a true statement for the normal successful path of control, wherein we reach the PortalDrop a few dozen lines below. But it's not true if the command suffers an error. We will mark the portal PORTAL_FAILED right away (in one of various PG_CATCH blocks) but we don't clean it up until another command arrives, so the current contents of MessageContext are long gone when portal cleanup happens. It seems to me that the most reliable fix for these issues is to institute the same policy for transitioning a portal to FAILED state as we already have for transitions to DONE state, ie, we should run the cleanup hook immediately. IOW it would be a good idea to create a function MarkPortalFailed that is just like MarkPortalDone except for the specific portal state transition, and use that instead of merely setting portal->status = PORTAL_FAILED in error cleanup situations. This would ensure that the cleanup hook gets to run before we possibly cut the portal's memory out from under it. It's more than is probably needed to resolve the specific crash you're reporting, but it seems likely to forestall future issues. I haven't actually tested such a fix yet, but will go do that now. regards, tom lane