Thread: Note about robustness of transaction-related data structures
While running a parallel regression test I saw a failure"relation deleted while still in use" out of the subtransactions test, after which the backend dumped core. The coredump was in AtSubAbort_smgr, and it was failing because upperPendingDeletes was NIL (which it should not have been inside a subtransaction, of course). The underlying problem doesn't seem very reproducible --- I tried several times without seeing it again. But what I deduce from the core dump is that we had an error during subtransaction cleanup, leading to an attempt to re-abort an already partially aborted transaction, and the system just could not cope. The problem is that the list-of-lists data structure used in smgr.c is brittle: it won't survive two executions of AtSubAbort_smgr at the same nesting level. In my local copy I've removed the list-of-lists data structure and reverted to a flat list of pending deletion requests, with the addition of a transaction nesting level field in each entry. AtSubCommit_smgr does this: int nestLevel = GetCurrentTransactionNestLevel(); PendingRelDelete *pending; for (pending = pendingDeletes; pending != NULL; pending = pending->next) { if (pending->nestLevel >= nestLevel) pending->nestLevel = nestLevel - 1; } while AtSubAbort_smgr acts only on entries with nestlevel >= transaction nest level. This makes the data structure safe against multiple execution. I think we will have to make similar adjustments in the other places where we've used list-of-lists data structures. We may need to look a little closer at the manipulations of the transaction state stack in xact.c, as well. regards, tom lane
On Thu, Jul 15, 2004 at 05:33:56PM -0400, Tom Lane wrote: > The underlying problem doesn't seem very reproducible --- I tried > several times without seeing it again. But what I deduce from the core > dump is that we had an error during subtransaction cleanup, leading to > an attempt to re-abort an already partially aborted transaction, and > the system just could not cope. The problem is that the list-of-lists > data structure used in smgr.c is brittle: it won't survive two > executions of AtSubAbort_smgr at the same nesting level. I saw the same problem while developing savepoints. > I think we will have to make similar adjustments in the other places > where we've used list-of-lists data structures. The solution seems reasonable on that list, which is expected to be short most of the time, but will it be acceptable on longer lists like the one on inval.c? I'm not sure when to start worrying about sequential walking. > We may need to look a little closer at the manipulations of the > transaction state stack in xact.c, as well. Not sure how to protect this. How about a static int with the current nesting level, and checking that against CurrentTransactionState->nestingLevel on PopTransaction() and PushTransaction()? Differences should be treated as FATAL errors, I presume. Or throw a warning and return doing nothing else. -- Alvaro Herrera (<alvherre[a]dcc.uchile.cl>) "Tiene valor aquel que admite que es un cobarde" (Fernandel)